Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-25 Thread Anton Salikhmetov
2008/1/25, Randy Dunlap <[EMAIL PROTECTED]>:
> On Wed, 23 Jan 2008 02:21:20 +0300 Anton Salikhmetov wrote:
>
> > Add a document, which describes how the POSIX requirements on updating
> > memory-mapped file times are addressed in Linux.
>
> Hi Anton,
>
> Just a few small comments below...
>
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  Documentation/vm/00-INDEX  |2 +
> >  Documentation/vm/msync.txt |  117 
> > 
> >  2 files changed, 119 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
> > index 2131b00..2726c8d 100644
> > --- a/Documentation/vm/00-INDEX
> > +++ b/Documentation/vm/00-INDEX
> > @@ -6,6 +6,8 @@ hugetlbpage.txt
> >   - a brief summary of hugetlbpage support in the Linux kernel.
> >  locking
> >   - info on how locking and synchronization is done in the Linux vm 
> > code.
> > +msync.txt
> > + - the design document for memory-mapped file times update
> >  numa
> >   - information about NUMA specific code in the Linux vm.
> >  numa_memory_policy.txt
> > diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
> > new file mode 100644
> > index 000..571a766
> > --- /dev/null
> > +++ b/Documentation/vm/msync.txt
> > @@ -0,0 +1,117 @@
> > +
> > + The msync() system call and memory-mapped file times
> > +
> > + Copyright (C) 2008 Anton Salikhmetov
> > +
> > +The POSIX standard requires that any write reference to memory-mapped file
> > +data should result in updating the ctime and mtime for that file. Moreover,
> > +the standard mandates that updated file times should become visible to the
> > +world no later than at the next call to msync().
> > +
> > +Failure to meet this requirement creates difficulties for certain classes
> > +of important applications. For instance, database backup systems fail to
> > +pick up the files modified via the mmap() interface. Also, this is a
> > +security hole, which allows forging file data in such a manner that proving
> > +the fact that file data was modified is not possible.
> > +
> > +Briefly put, this requirement can be stated as follows:
> > +
> > + once the file data has changed, the operating system
> > + should acknowledge this fact by updating file metadata.
> > +
> > +This document describes how this POSIX requirement is addressed in Linux.
> > +
> > +1. Requirements
> > +
> > +1.1) the POSIX standard requires updating ctime and mtime not later
> > +than at the call to msync() with MS_SYNC or MS_ASYNC flags;
> > +
> > +1.2) in existing POSIX implementations, ctime and mtime
> > +get updated not later than at the call to fsync();
> > +
> > +1.3) in existing POSIX implementation, ctime and mtime
> > +get updated not later than at the call to sync(), the "auto-update" 
> > feature;
> > +
> > +1.4) the customers require and the common sense suggests that
> > +ctime and mtime should be updated not later than at the call to munmap()
> > +or exit(), the latter function implying an implicit call to munmap();
> > +
> > +1.5) the (1.1) item should be satisfied if the file is a block device
> > +special file;
> > +
> > +1.6) the (1.1) item should be satisfied for files residing on
> > +memory-backed filesystems such as tmpfs, too.
> > +
> > +The following operating systems were used as the reference platforms
> > +and are referred to as the "existing implementations" above:
> > +HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
> > +
> > +2. Lazy update
> > +
> > +Many attempts before the current version implemented the "lazy update" 
> > approach
> > +to satisfying the requirements given above. Within the latter approach, 
> > ctime
> > +and mtime get updated at last moment allowable.
> > +
> > +Since we don't update the file times immediately, some Flag has to be
> > +used. When up, this Flag means that the file data was modified and
> > +the file times need to be updated as soon as possible.
>
> I would s/up/set/ above and below.
>
> > +Any existing "dirty" flag which, when up, mean that a page has been 
> > written to,
>
> s/mean/means/
>
> > +is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
> > +would have to reset this "dirty" flag after updating ctime and mtime.
> > +The sys_msync() function itself is basically a n

Re: [PATCH -v8 2/4] Update ctime and mtime for memory-mapped files

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Wed, 23 Jan 2008, Anton Salikhmetov wrote:
> >
> > Update ctime and mtime for memory-mapped files at a write access on
> > a present, read-only PTE, as well as at a write on a non-present PTE.
>
> Ok, this one I'm applying. I agree that it leaves MS_ASYNC not updating
> the file until the next sync actually happens, but I can't really bring
> myself to care at least for an imminent 2.6.24 thing. The file times are
> actually "correct" in the sense that they will now match when the IO is
> done, and my man-page says that MS_ASYNC "schedules the io to be done".
>
> And I think this is better than we have now, and I don't think this part
> is somethign that anybody really disagrees with.
>
> We can (and should) keep the MS_ASYNC issue open.

Thank you!

I have closed the bug #2645, because this patch solves the issue
originally reported.

>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Wed, 23 Jan 2008, Anton Salikhmetov wrote:
> > +
> > + if (pte_dirty(*pte) && pte_write(*pte)) {
>
> Not correct.
>
> You still need to check "pte_present()" before you can test any other
> bits. For a non-present pte, none of the other bits are defined, and for
> all we know there might be architectures out there that require them to
> be non-dirty.
>
> As it is, you just possibly randomly corrupted the pte.
>
> Yeah, on all architectures I know of, it the pte is clear, neither of
> those tests will trigger, so it just happens to work, but it's still
> wrong. And for a MAP_SHARED mapping, it should be either clear or valid,
> although I can imagine that we might do swap-cache entries for tmpfs or
> something (in which case trying to clear the write-enable bit would
> corrupt the swap entry!).
>
> So the bug might be hard or even impossible to trigger in practice, but
> it's still wrong.
>
> I realize that "page_mkclean_one()" doesn't do this very obviously, but
> it's actually there (it's just hidden in page_check_address()).
>
> Quite frankly, at this point I'm getting *very* tired of this series.
> Especially since you ignored me when I suggested you just revert the
> commit that removed the page table walking - and instead send in a buggy
> patch.
>
> Yes, the VM is hard. I agree. It's nasty. But exactly because it's nasty
> and subtle and horrid, I'm also very anal about it, and I get really
> nervous when somebody touches it without (a) knowing all the rules
> intimately and (b) listening to people who do.
>
> So here's even a patch to get you started. Do this:
>
> git revert 204ec841fbea3e5138168edbc3a76d46747cc987
>
> and then use this appended patch on top of that as a starting point for
> something that compiles and *possibly* works.
>
> And no, I do *not* guarantee that this is right either! I have not tested
> it or thought about it a lot, and S390 tends to be odd about some of these
> things. In particular, I actually suspect that we should possibly do this
> the same way we do
>
> ptep_clear_flush_young()
>
> except we would do "ptep_clear_flush_wrprotect()". So even though this is
> a revert plus a simple patch to make it compile again (we've changed how
> we do dirty bits), I think a patch like this needs testing and other
> people like Nick and Peter to ack it.

I'm very sorry for my bad code which can not pass LKML's review.

I reassigned the bug #2645 to default assignee, Andrew Morton, because
it seems that people start getting tired of my patch series.

Thanks for your support.

>
> Nick? Peter? Testing? Other comments?
>
> Linus
>
> ---
>  mm/msync.c |9 ++---
>  1 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/msync.c b/mm/msync.c
> index a30487f..9b0af8f 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -32,6 +32,7 @@ static unsigned long msync_pte_range(struct vm_area_struct 
> *vma, pmd_t *pmd,
>  again:
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> do {
> +   pte_t entry;
> struct page *page;
>
> if (progress >= 64) {
> @@ -47,9 +48,11 @@ again:
> page = vm_normal_page(vma, addr, *pte);
> if (!page)
> continue;
> -   if (ptep_clear_flush_dirty(vma, addr, pte) ||
> -   page_test_and_clear_dirty(page))
> -   ret += set_page_dirty(page);
> +   entry = ptep_clear_flush(vma, addr, pte);
> +   entry = pte_wrprotect(entry);
> +   set_pte_at(mm, address, pte, entry);
> +
> +   ret += 1;
> progress += 3;
> } while (pte++, addr += PAGE_SIZE, addr != end);
> pte_unmap_unlock(pte - 1, ptl);
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > Also, it still doesn't make sense to me why we'd not need to walk the
> > > rmap, it is all the same file after all.
> >
> > It's the same file, but not the same memory map.  It basically depends
> > on how you define msync:
> >
> >  a) sync _file_ on region defined by this mmap/start/end-address
> >  b) sync _memory_region_ defined by start/end-address
>
> My mmap/msync tester program can acually check this as well, with the
> '-f' flag.  Anton, can you try that on the reference platforms?

Here it is:

$ ./a.out file -f
begin   1201085546  1201085546  1200956936
write   1201085546  1201085546  1200956936
mmap1201085546  1201085546  1200956936
b   1201085546  1201085546  1200956936
msync b 1201085550  1201085550  1200956936
c   1201085550  1201085550  1200956936
msync c 1201085552  1201085552  1200956936
d   1201085552  1201085552  1200956936
munmap  1201085552  1201085552  1200956936
close   120108  120108  1200956936
sync120108  120108  1200956936
$ ./a.out file -sf
begin   1201085572  1201085572  1200956936
write   1201085572  1201085572  1200956936
mmap1201085572  1201085572  1200956936
b   1201085572  1201085572  1200956936
msync b 1201085576  1201085576  1200956936
c   1201085576  1201085576  1200956936
msync c 1201085578  1201085578  1200956936
d   1201085578  1201085578  1200956936
munmap  1201085578  1201085578  1200956936
close   1201085581  1201085581  1200956936
sync1201085581  1201085581  1200956936
$ uname -a
FreeBSD td152.testdrive.hp.com 6.2-RELEASE FreeBSD 6.2-RELEASE #0: Fri
Jan 12 11:05:30 UTC 2007
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/SMP  i386
$

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Wed, 2008-01-23 at 02:21 +0300, Anton Salikhmetov wrote:
>
> > +static void vma_wrprotect_pmd_range(struct vm_area_struct *vma, pmd_t *pmd,
> > + unsigned long start, unsigned long end)
> > +{
> > + while (start < end) {
> > + spinlock_t *ptl;
> > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, start, 
> > &ptl);
> > +
> > + if (pte_dirty(*pte) && pte_write(*pte)) {
> > + pte_t entry = ptep_clear_flush(vma, start, pte);
> > +
> > + entry = pte_wrprotect(entry);
> > + set_pte_at(vma->vm_mm, start, pte, entry);
> > + }
> > +
> > + pte_unmap_unlock(pte, ptl);
> > + start += PAGE_SIZE;
> > + }
> > +}
>
> You've had two examples on how to write this loop, one from git commit
> 204ec841fbea3e5138168edbc3a76d46747cc987, and one from my draft, but
> this one looks like neither and is much less efficient. Take the lock
> only once per pmd, not once per pte please.
>
> > +static void vma_wrprotect_pud_range(struct vm_area_struct *vma, pud_t *pud,
> > + unsigned long start, unsigned long end)
> > +{
> > + pmd_t *pmd = pmd_offset(pud, start);
> > +
> > + while (start < end) {
> > + unsigned long next = pmd_addr_end(start, end);
> > +
> > + if (!pmd_none_or_clear_bad(pmd))
> > + vma_wrprotect_pmd_range(vma, pmd, start, next);
> > +
> > + ++pmd;
> > + start = next;
> > + }
> > +}
> > +
> > +static void vma_wrprotect_pgd_range(struct vm_area_struct *vma, pgd_t *pgd,
> > + unsigned long start, unsigned long end)
> > +{
> > + pud_t *pud = pud_offset(pgd, start);
> > +
> > + while (start < end) {
> > + unsigned long next = pud_addr_end(start, end);
> > +
> > + if (!pud_none_or_clear_bad(pud))
> > + vma_wrprotect_pud_range(vma, pud, start, next);
> > +
> > + ++pud;
> > + start = next;
> > + }
> > +}
> > +
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > + unsigned long addr = vma->vm_start;
> > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > +
> > + while (addr < vma->vm_end) {
> > + unsigned long next = pgd_addr_end(addr, vma->vm_end);
> > +
> > + if (!pgd_none_or_clear_bad(pgd))
> > + vma_wrprotect_pgd_range(vma, pgd, addr, next);
> > +
> > + ++pgd;
> > + addr = next;
> > + }
> > +}
>
> I think you want to pass start, end here too, you might not need to
> sweep the whole vma.

Thanks for you feedback, Peter!

I will redesign the vma_wrprotect_pmd_range() routine the way it
acquires the spinlock outside of the loop. I will also rewrite the
vma_wrprotect() function to process only the specified range.

>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Miklos Szeredi <[EMAIL PROTECTED]>:
> > Ah, OK, this is becuase mmap doesn't actually set up the page tables
> > by default.   Try adding MAP_POPULATE to the flags.
>
> Here's an updated version of the program, with an added a '-r' option,
> that performs a read access on the mapping before doing the write
> (basically equivalent to MAP_POPULATE, but portable).
>
> Please try this on a tmpfs file.
>
> Thanks,
> Miklos
>
> ---
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
>
> static const char *filename;
> static int msync_flag = MS_ASYNC;
> static int msync_fork = 0;
> static int msync_read = 0;
>
> static void print_times(const char *msg)
> {
> struct stat stbuf;
> stat(filename, &stbuf);
> printf("%s\t%li\t%li\t%li\n", msg, stbuf.st_ctime, stbuf.st_mtime,
>stbuf.st_atime);
> }
>
> static void do_msync(void *addr, int len)
> {
> int res;
> if (!msync_fork) {
> res = msync(addr, len, msync_flag);
> if (res == -1) {
> perror("msync");
> exit(1);
> }
> } else {
> int pid = fork();
> if (pid == -1) {
> perror("fork");
> exit(1);
> }
> if (!pid) {
> int fd = open(filename, O_RDWR);
> if (fd == -1) {
> perror("open");
> exit(1);
> }
> addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, 
> MAP_SHARED, fd, 0);
> if (addr == MAP_FAILED) {
> perror("mmap");
> exit(1);
> }
> res = msync(addr, len, msync_flag);
> if (res == -1) {
> perror("msync");
> exit(1);
> }
> exit(0);
> }
> wait(NULL);
> }
> }
>
> static void usage(const char *progname)
> {
> fprintf(stderr, "usage: %s filename [-sfr]\n", progname);
> fprintf(stderr, " -s: use MS_SYNC instead of MS_ASYNC\n");
> fprintf(stderr, " -f: fork and perform msync in a different 
> process\n");
> fprintf(stderr, " -r: do a read access before each write access\n");
> exit(1);
> }
>
> int main(int argc, char *argv[])
> {
> int res;
> char *addr;
> char tmp[32];
> int fd;
>
> if (argc < 2)
> usage(argv[0]);
>
> filename = argv[1];
> if (argc > 2) {
> char *s;
> if (argc > 3)
> usage(argv[0]);
> s = argv[2];
> if (s[0] != '-' || !s[1])
> usage(argv[0]);
> for (s++; *s; s++) {
> switch (*s) {
> case 's':
> msync_flag = MS_SYNC;
> break;
> case 'f':
> msync_fork = 1;
> break;
> case 'r':
> msync_read = 1;
> break;
> default:
> usage(argv[0]);
> }
> }
> }
>
> fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
> if (fd == -1) {
> perror(filename);
> return 1;
> }
> print_times("begin");
> sleep(1);
> write(fd, "wxyz\n", 4);
> print_times("write");
> sleep(1);
> addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> if (addr == MAP_FAILED) {
> perror("mmap");
> return 1;
> }
> print_times("mmap");
> sleep(1);
>
> if (msync_read) {
> sprintf(tmp, "fetch %c", addr[1]);
> print_times(tmp);
> sleep(1);
> }
> addr[1] = 'b';
> print_times("store b");
> sleep(1);
> do_msync(addr, 4);
> print_times("msync");
> sleep(1);
>
> if (msync_read) {
> sprintf(tmp, "fetch %c", addr[2]);
> print_times(tmp);
> sleep(1);
> }
> addr[2] = 'c';
> print_times("store c");
> sleep(1);
> do_msync(addr, 4);
> print_times("msync");
> sleep(1);
>
> if (msync_read) {
> sprintf(tmp, "fetch %c", addr[3]);
> print_times(tmp);
> sleep(1);
> }

Re: [PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-23 Thread Anton Salikhmetov
2008/1/23, Miklos Szeredi <[EMAIL PROTECTED]>:
> I think it would be more logical to move this patch forward, before
> the two patches which this document is actually describing.
>
> > Add a document, which describes how the POSIX requirements on updating
> > memory-mapped file times are addressed in Linux.
> >
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  Documentation/vm/00-INDEX  |2 +
> >  Documentation/vm/msync.txt |  117 
> > 
> >  2 files changed, 119 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
> > index 2131b00..2726c8d 100644
> > --- a/Documentation/vm/00-INDEX
> > +++ b/Documentation/vm/00-INDEX
> > @@ -6,6 +6,8 @@ hugetlbpage.txt
> >   - a brief summary of hugetlbpage support in the Linux kernel.
> >  locking
> >   - info on how locking and synchronization is done in the Linux vm 
> > code.
> > +msync.txt
> > + - the design document for memory-mapped file times update
> >  numa
> >   - information about NUMA specific code in the Linux vm.
> >  numa_memory_policy.txt
> > diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
> > new file mode 100644
> > index 000..571a766
> > --- /dev/null
> > +++ b/Documentation/vm/msync.txt
> > @@ -0,0 +1,117 @@
> > +
> > + The msync() system call and memory-mapped file times
> > +
> > + Copyright (C) 2008 Anton Salikhmetov
> > +
> > +The POSIX standard requires that any write reference to memory-mapped file
> > +data should result in updating the ctime and mtime for that file. Moreover,
> > +the standard mandates that updated file times should become visible to the
> > +world no later than at the next call to msync().
> > +
> > +Failure to meet this requirement creates difficulties for certain classes
> > +of important applications. For instance, database backup systems fail to
> > +pick up the files modified via the mmap() interface. Also, this is a
> > +security hole, which allows forging file data in such a manner that proving
> > +the fact that file data was modified is not possible.
> > +
> > +Briefly put, this requirement can be stated as follows:
> > +
> > + once the file data has changed, the operating system
> > + should acknowledge this fact by updating file metadata.
> > +
> > +This document describes how this POSIX requirement is addressed in Linux.
> > +
> > +1. Requirements
> > +
> > +1.1) the POSIX standard requires updating ctime and mtime not later
> > +than at the call to msync() with MS_SYNC or MS_ASYNC flags;
> > +
> > +1.2) in existing POSIX implementations, ctime and mtime
> > +get updated not later than at the call to fsync();
> > +
> > +1.3) in existing POSIX implementation, ctime and mtime
> > +get updated not later than at the call to sync(), the "auto-update" 
> > feature;
> > +
> > +1.4) the customers require and the common sense suggests that
> > +ctime and mtime should be updated not later than at the call to munmap()
> > +or exit(), the latter function implying an implicit call to munmap();
> > +
> > +1.5) the (1.1) item should be satisfied if the file is a block device
> > +special file;
> > +
> > +1.6) the (1.1) item should be satisfied for files residing on
> > +memory-backed filesystems such as tmpfs, too.
> > +
> > +The following operating systems were used as the reference platforms
> > +and are referred to as the "existing implementations" above:
> > +HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
> > +
> > +2. Lazy update
> > +
> > +Many attempts before the current version implemented the "lazy update" 
> > approach
> > +to satisfying the requirements given above. Within the latter approach, 
> > ctime
> > +and mtime get updated at last moment allowable.
> > +
> > +Since we don't update the file times immediately, some Flag has to be
> > +used. When up, this Flag means that the file data was modified and
> > +the file times need to be updated as soon as possible.
> > +
> > +Any existing "dirty" flag which, when up, mean that a page has been 
> > written to,
> > +is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
> > +would have to reset this "dirty" flag after updating ctime and mtime.
> > +The sys_msync() function itself is basically a no-op in the MS_ASYNC case.
> > +Thereby, the s

[PATCH -v8 1/4] Massive code cleanup of sys_msync()

2008-01-22 Thread Anton Salikhmetov
Use the PAGE_ALIGN() macro instead of "manual" alignment.
Improve readability of the loop, which traverses the process
memory regions. Make code more symmetric and possibly boost
performance on some RISC CPUs by moving variable assignments.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   76 ---
 1 files changed, 36 insertions(+), 40 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..60efa36 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,84 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
unsigned long end;
-   struct mm_struct *mm = current->mm;
+   int error, unmapped_error;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   struct mm_struct *mm;
 
+   error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start & ~PAGE_MASK)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
+
error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
goto out;
+
error = 0;
+   unmapped_error = 0;
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
+   mm = current->mm;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
error = -ENOMEM;
if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   break;
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
-   goto out_unlock;
-   unmapped_error = -ENOMEM;
-   }
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
-   error = -EBUSY;
-   goto out_unlock;
+   break;
+   unmapped_error = error;
}
-   file = vma->vm_file;
+
+   error = -EBUSY;
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+   break;
+
+   error = 0;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+   file = vma->vm_file;
+   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +87,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-

[PATCH -v8 4/4] The design document for memory-mapped file times update

2008-01-22 Thread Anton Salikhmetov
Add a document, which describes how the POSIX requirements on updating
memory-mapped file times are addressed in Linux.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 Documentation/vm/00-INDEX  |2 +
 Documentation/vm/msync.txt |  117 
 2 files changed, 119 insertions(+), 0 deletions(-)

diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 2131b00..2726c8d 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -6,6 +6,8 @@ hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
 locking
- info on how locking and synchronization is done in the Linux vm code.
+msync.txt
+   - the design document for memory-mapped file times update
 numa
- information about NUMA specific code in the Linux vm.
 numa_memory_policy.txt
diff --git a/Documentation/vm/msync.txt b/Documentation/vm/msync.txt
new file mode 100644
index 000..571a766
--- /dev/null
+++ b/Documentation/vm/msync.txt
@@ -0,0 +1,117 @@
+
+   The msync() system call and memory-mapped file times
+
+   Copyright (C) 2008 Anton Salikhmetov
+
+The POSIX standard requires that any write reference to memory-mapped file
+data should result in updating the ctime and mtime for that file. Moreover,
+the standard mandates that updated file times should become visible to the
+world no later than at the next call to msync().
+
+Failure to meet this requirement creates difficulties for certain classes
+of important applications. For instance, database backup systems fail to
+pick up the files modified via the mmap() interface. Also, this is a
+security hole, which allows forging file data in such a manner that proving
+the fact that file data was modified is not possible.
+
+Briefly put, this requirement can be stated as follows:
+
+   once the file data has changed, the operating system
+   should acknowledge this fact by updating file metadata.
+
+This document describes how this POSIX requirement is addressed in Linux.
+
+1. Requirements
+
+1.1) the POSIX standard requires updating ctime and mtime not later
+than at the call to msync() with MS_SYNC or MS_ASYNC flags;
+
+1.2) in existing POSIX implementations, ctime and mtime
+get updated not later than at the call to fsync();
+
+1.3) in existing POSIX implementation, ctime and mtime
+get updated not later than at the call to sync(), the "auto-update" feature;
+
+1.4) the customers require and the common sense suggests that
+ctime and mtime should be updated not later than at the call to munmap()
+or exit(), the latter function implying an implicit call to munmap();
+
+1.5) the (1.1) item should be satisfied if the file is a block device
+special file;
+
+1.6) the (1.1) item should be satisfied for files residing on
+memory-backed filesystems such as tmpfs, too.
+
+The following operating systems were used as the reference platforms
+and are referred to as the "existing implementations" above:
+HP-UX B.11.31 and FreeBSD 6.2-RELEASE.
+
+2. Lazy update
+
+Many attempts before the current version implemented the "lazy update" approach
+to satisfying the requirements given above. Within the latter approach, ctime
+and mtime get updated at last moment allowable.
+
+Since we don't update the file times immediately, some Flag has to be
+used. When up, this Flag means that the file data was modified and
+the file times need to be updated as soon as possible.
+
+Any existing "dirty" flag which, when up, mean that a page has been written to,
+is not suitable for this purpose. Indeed, msync() called with MS_ASYNC
+would have to reset this "dirty" flag after updating ctime and mtime.
+The sys_msync() function itself is basically a no-op in the MS_ASYNC case.
+Thereby, the synchronization routines relying upon this "dirty" flag
+would lose data. Therefore, a new Flag has to be introduced.
+
+The (1.5) item coupled with (1.3) requirement leads to hard work with
+the block device inodes. Specifically, during writeback it is impossible to
+tell which block device file was originally mapped. Therefore, we need to
+traverse the list of "active" devices associated with the block device inode.
+This would lead to updating file times for block device files, which were not
+taking part in the data transfer.
+
+Also all versions prior to version 6 failed to correctly process ctime and
+mtime for files on the memory-backed filesystems such as tmpfs. So the (1.6)
+requirement was not satisfied.
+
+If a write reference has occurred between two consecutive calls to msync()
+with MS_ASYNC, the second call to the latter function should take into
+account the last write reference. The last write reference can not be caught
+if no pagefault occurs. Hence a pagefault needs to be forced. This can be done
+using two different approaches. The first one is to synchronize data even when
+msync() was called w

[PATCH -v8 2/4] Update ctime and mtime for memory-mapped files

2008-01-22 Thread Anton Salikhmetov
Update ctime and mtime for memory-mapped files at a write access on
a present, read-only PTE, as well as at a write on a non-present PTE.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/memory.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6dd1cd8..4b0144b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,6 +1670,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2343,6 +2346,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v8 0/4] Fixing the issue with memory-mapped file times

2008-01-22 Thread Anton Salikhmetov
This is the eighth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Since the previous version, the following has changed:

1) based on Linus' comment, a more efficient PTE walker implemented;

2) the design document added to the kernel documentation.

Functional tests successfully passed.

Please comment.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v8 3/4] Enable the MS_ASYNC functionality in sys_msync()

2008-01-22 Thread Anton Salikhmetov
Force file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   92 +--
 1 files changed, 82 insertions(+), 10 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 60efa36..87f990e 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -12,6 +13,73 @@
 #include 
 #include 
 
+static void vma_wrprotect_pmd_range(struct vm_area_struct *vma, pmd_t *pmd,
+   unsigned long start, unsigned long end)
+{
+   while (start < end) {
+   spinlock_t *ptl;
+   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+
+   if (pte_dirty(*pte) && pte_write(*pte)) {
+   pte_t entry = ptep_clear_flush(vma, start, pte);
+
+   entry = pte_wrprotect(entry);
+   set_pte_at(vma->vm_mm, start, pte, entry);
+   }
+
+   pte_unmap_unlock(pte, ptl);
+   start += PAGE_SIZE;
+   }
+}
+
+static void vma_wrprotect_pud_range(struct vm_area_struct *vma, pud_t *pud,
+   unsigned long start, unsigned long end)
+{
+   pmd_t *pmd = pmd_offset(pud, start);
+
+   while (start < end) {
+   unsigned long next = pmd_addr_end(start, end);
+
+   if (!pmd_none_or_clear_bad(pmd))
+   vma_wrprotect_pmd_range(vma, pmd, start, next);
+
+   ++pmd;
+   start = next;
+   }
+}
+
+static void vma_wrprotect_pgd_range(struct vm_area_struct *vma, pgd_t *pgd,
+   unsigned long start, unsigned long end)
+{
+   pud_t *pud = pud_offset(pgd, start);
+
+   while (start < end) {
+   unsigned long next = pud_addr_end(start, end);
+
+   if (!pud_none_or_clear_bad(pud))
+   vma_wrprotect_pud_range(vma, pud, start, next);
+
+   ++pud;
+   start = next;
+   }
+}
+
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr = vma->vm_start;
+   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+
+   while (addr < vma->vm_end) {
+   unsigned long next = pgd_addr_end(addr, vma->vm_end);
+
+   if (!pgd_none_or_clear_bad(pgd))
+   vma_wrprotect_pgd_range(vma, pgd, addr, next);
+
+   ++pgd;
+   addr = next;
+   }
+}
+
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
@@ -78,16 +146,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
error = 0;
start = vma->vm_end;
file = vma->vm_file;
-   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
-   get_file(file);
-   up_read(&mm->mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start >= end)
-   goto out;
-   down_read(&mm->mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   if ((flags & MS_ASYNC))
+   vma_wrprotect(vma);
+   if (flags & MS_SYNC) {
+   get_file(file);
+   up_read(&mm->mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start >= end)
+   goto out;
+   down_read(&mm->mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma->vm_next;
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Tue, 22 Jan 2008, Anton Salikhmetov wrote:
> >
> >  /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > + unsigned long addr;
> > +
> > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > + spinlock_t *ptl;
> > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > + pud_t *pud = pud_offset(pgd, addr);
> > + pmd_t *pmd = pmd_offset(pud, addr);
> > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>
> This is extremely expensive over bigger areas, especially sparsely mapped
> ones (it does all the lookups for all four levels over and over and over
> again for eachg page).
>
> I think Peter Zijlstra posted a version that uses the regular kind of
> nested loop (with inline functions to keep the thing nice and clean),
> which gets rid of that.

Thanks for your feedback, Linus!

I will use Peter Zijlstra's version of such an operation in my next
patch series.

>
> [ The sad/funny part is that this is all how we *used* to do msync(), back
>   in the days: we're literally going back to the "pre-cleanup" logic. See
>   commit 204ec841fbea3e5138168edbc3a76d46747cc987: "mm: msync() cleanup"
>   for details ]
>
> Quite frankly, I really think you might be better off just doing a
>
> git revert 204ec841fbea3e5138168edbc3a76d46747cc987
>
> and working from there! I just checked, and it still reverts cleanly, and
> you'd end up with a nice code-base that (a) has gotten years of testing
> and (b) already has the looping-over-the-pagetables code.
>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Anton Salikhmetov <[EMAIL PROTECTED]>:
> 2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> > Some very pedantic nitpicking below;
> >
> > On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
> > >
> > > Update file times at write references to memory-mapped files.
> > > Force file times update at the next write reference after
> > > calling the msync() system call with the MS_ASYNC flag.
> > >
> > > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > > ---
> > >  mm/memory.c |6 ++
> > >  mm/msync.c  |   57 
> > > -
> > >  2 files changed, 50 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6dd1cd8..4b0144b 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1670,6 +1670,9 @@ gotten:
> > >  unlock:
> > > pte_unmap_unlock(page_table, ptl);
> > > if (dirty_page) {
> > > +   if (vma->vm_file)
> > > +   file_update_time(vma->vm_file);
> > > +
> > > /*
> > >  * Yes, Virginia, this is actually required to prevent a 
> > > race
> > >  * with clear_page_dirty_for_io() from clearing the page 
> > > dirty
> > > @@ -2343,6 +2346,9 @@ out_unlocked:
> > > if (anon)
> > > page_cache_release(vmf.page);
> > > else if (dirty_page) {
> > > +   if (vma->vm_file)
> > > +       file_update_time(vma->vm_file);
> > > +
> > > set_page_dirty_balance(dirty_page, page_mkwrite);
> > > put_page(dirty_page);
> > > }
> > > diff --git a/mm/msync.c b/mm/msync.c
> > > index a4de868..394130d 100644
> > > --- a/mm/msync.c
> > > +++ b/mm/msync.c
> > > @@ -5,6 +5,7 @@
> > >   * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
> > >   */
> > >
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -13,11 +14,37 @@
> > >  #include 
> > >
> > >  /*
> > > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > > + * It will force a pagefault on the next write access.
> > > + */
> > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > +{
> > > +   unsigned long addr;
> > > +
> > > +   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) 
> > > {
> >
> > I know it's not the common "Linux Kernel way", but 'addr' could be
> > made to have just 'for' scope here according to C99;
>
> I believe that the C89 style is more common for the Linux kernel, so
> I've used the out-of-scope variable declaration.
>
> >
> >for (unsigned long addr = vma->vm_start; addr < vma->vm_end;
> > addr += PAGE_SIZE) {

By the way, if we're talking "pedantic", then:

>>>

debian:/tmp$ cat c.c
void f()
{
   for (unsigned long i = 0; i < 10; i++)
   continue;
}
debian:/tmp$ gcc -c -pedantic c.c
c.c: In function 'f':
c.c:3: error: 'for' loop initial declaration used outside C99 mode
debian:/tmp$

<<<

No pun intended :)

> >
> >
> > > +   spinlock_t *ptl;
> > > +   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > +   pud_t *pud = pud_offset(pgd, addr);
> > > +   pmd_t *pmd = pmd_offset(pud, addr);
> > > +   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, 
> > > &ptl);
> > > +
> > > +   if (pte_dirty(*pte) && pte_write(*pte)) {
> > > +   pte_t entry = ptep_clear_flush(vma, addr, pte);
> > > +
> > > +   entry = pte_wrprotect(entry);
> > > +   set_pte_at(vma->vm_mm, addr, pte, entry);
> > > +   }
> > > +   pte_unmap_unlock(pte, ptl);
> > > +   }
> > > +}
> > > +
> > > +/*
> > >   * MS_SYNC syncs the entire file - including mappings.
> > >   *
> > > - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> >

Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> > 2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> > > Some very pedantic nitpicking below;
> > >
> > > On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> ...
> > > > +   if (file && (vma->vm_flags & VM_SHARED)) {
> > > > +   if (flags & MS_ASYNC)
> > > > +   vma_wrprotect(vma);
> > > > +   if (flags & MS_SYNC) {
> > >
> > > "else if" ??
> >
> > The MS_ASYNC and MS_SYNC flags are mutually exclusive, that is why I
> > did not use the "else-if" here. Moreover, this function itself checks
> > that they never come together.
> >
>
> I would say that them being mutually exclusive would be a reason *for*
> using "else-if" here.

This check is performed by the sys_msync() function itself in its very
beginning.

We don't need to check it later.

>
> --
> Jesper Juhl <[EMAIL PROTECTED]>
> Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please  http://www.expita.com/nomime.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> Some very pedantic nitpicking below;
>
> On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
> >
> > Update file times at write references to memory-mapped files.
> > Force file times update at the next write reference after
> > calling the msync() system call with the MS_ASYNC flag.
> >
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  mm/memory.c |6 ++
> >  mm/msync.c  |   57 
> > -
> >  2 files changed, 50 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6dd1cd8..4b0144b 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1670,6 +1670,9 @@ gotten:
> >  unlock:
> > pte_unmap_unlock(page_table, ptl);
> > if (dirty_page) {
> > +   if (vma->vm_file)
> > +   file_update_time(vma->vm_file);
> > +
> > /*
> >  * Yes, Virginia, this is actually required to prevent a 
> > race
> >  * with clear_page_dirty_for_io() from clearing the page 
> > dirty
> > @@ -2343,6 +2346,9 @@ out_unlocked:
> > if (anon)
> > page_cache_release(vmf.page);
> > else if (dirty_page) {
> > +   if (vma->vm_file)
> > +   file_update_time(vma->vm_file);
> > +
> > set_page_dirty_balance(dirty_page, page_mkwrite);
> > put_page(dirty_page);
> > }
> > diff --git a/mm/msync.c b/mm/msync.c
> > index a4de868..394130d 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -5,6 +5,7 @@
> >   * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
> >   */
> >
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -13,11 +14,37 @@
> >  #include 
> >
> >  /*
> > + * Scan the PTEs for pages belonging to the VMA and mark them read-only.
> > + * It will force a pagefault on the next write access.
> > + */
> > +static void vma_wrprotect(struct vm_area_struct *vma)
> > +{
> > +   unsigned long addr;
> > +
> > +   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
>
> I know it's not the common "Linux Kernel way", but 'addr' could be
> made to have just 'for' scope here according to C99;

I believe that the C89 style is more common for the Linux kernel, so
I've used the out-of-scope variable declaration.

>
>for (unsigned long addr = vma->vm_start; addr < vma->vm_end;
> addr += PAGE_SIZE) {
>
>
> > +   spinlock_t *ptl;
> > +   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > +   pud_t *pud = pud_offset(pgd, addr);
> > +   pmd_t *pmd = pmd_offset(pud, addr);
> > +   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, 
> > &ptl);
> > +
> > +   if (pte_dirty(*pte) && pte_write(*pte)) {
> > +   pte_t entry = ptep_clear_flush(vma, addr, pte);
> > +
> > +   entry = pte_wrprotect(entry);
> > +   set_pte_at(vma->vm_mm, addr, pte, entry);
> > +   }
> > +   pte_unmap_unlock(pte, ptl);
> > +   }
> > +}
> > +
> > +/*
> >   * MS_SYNC syncs the entire file - including mappings.
> >   *
> > - * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > - * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> > - * Now it doesn't do anything, since dirty pages are properly tracked.
>
> I think keeping some version of the "up to ..." comments makes sense.
> It documents that we previously had different behaviour.

Earlier I had a request to remove any "changelog-style" comments from the code.

>
> > + * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
> > + * read-only by calling vma_wrprotect(). This is needed to catch the next
> > + * write reference to the mapped region and update the file times
> > + * accordingly.
> >   *
> >   * The application may now run fsync() to write out the dirty pages and
> >   * wait on the writeout and check the result. Or the application may run
> > @@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> 

Re: [PATCH -v7 0/2] Fixing the issue with memory-mapped file times

2008-01-21 Thread Anton Salikhmetov
2008/1/22, Jesper Juhl <[EMAIL PROTECTED]>:
> On 22/01/2008, Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
> > This is the seventh version of my solution for the bug #2645:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > Since the previous version, the following has changed: based on
> > Linus' comment, SMP-safe PTE update implemented.
> >
> > Discussions, which followed my past submissions, showed that it was
> > tempting to approach this problem using very different assumptions.
> > However, many such designs have proved to be incomplete or inefficient.
> >
> > Taking into account the obvious complexity of this problem, I prepared a
> > design document, the purpose of which is twofold. First, it summarizes
> > previous attempts to resolve the ctime/mtime issue. Second, it attempts
> > to prove that what the patches do is necessary and sufficient for mtime
> > and ctime update provided that we start from a certain sane set of
> > requirements. The design document is available via the following link:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40
> >
> > For the seventh version, comprehensive performance testing was performed.
> > The results of performance tests, including numbers, are available here:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
> >
>
> Hi Anton,
>
> I applied your patches here and as far as my own test programs go,
> these patches solve the previously observed problems I saw with mtime
> not getting updated.
>
> Thank you very much for so persistently working on these long standing issues.

Thank you very much for testing these patches!

>
> --
> Jesper Juhl <[EMAIL PROTECTED]>
> Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
> Plain text mails only, please  http://www.expita.com/nomime.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v7 0/2] Fixing the issue with memory-mapped file times

2008-01-21 Thread Anton Salikhmetov
This is the seventh version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Since the previous version, the following has changed: based on
Linus' comment, SMP-safe PTE update implemented.

Discussions, which followed my past submissions, showed that it was
tempting to approach this problem using very different assumptions.
However, many such designs have proved to be incomplete or inefficient.

Taking into account the obvious complexity of this problem, I prepared a
design document, the purpose of which is twofold. First, it summarizes
previous attempts to resolve the ctime/mtime issue. Second, it attempts
to prove that what the patches do is necessary and sufficient for mtime
and ctime update provided that we start from a certain sane set of
requirements. The design document is available via the following link:

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40

For the seventh version, comprehensive performance testing was performed.
The results of performance tests, including numbers, are available here:

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v7 1/2] Massive code cleanup of sys_msync()

2008-01-21 Thread Anton Salikhmetov
Use the PAGE_ALIGN() macro instead of "manual" alignment.
Improve readability of the loop, which traverses the process
memory regions. Make code more symmetric and possibly boost
performance on some RISC CPUs by moving variable assignments.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   77 
 1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
unsigned long end;
-   struct mm_struct *mm = current->mm;
+   int error, unmapped_error;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   struct mm_struct *mm;
 
+   error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start & ~PAGE_MASK)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
+
error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
goto out;
-   error = 0;
+
+   error = unmapped_error = 0;
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
+   mm = current->mm;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
error = -ENOMEM;
if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   break;
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
-   goto out_unlock;
-   unmapped_error = -ENOMEM;
-   }
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
-   error = -EBUSY;
-   goto out_unlock;
+   break;
+   unmapped_error = error;
}
-   file = vma->vm_file;
+
+   error = -EBUSY;
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+   break;
+
+   error = 0;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+   file = vma->vm_file;
+   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   

[PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40

Update file times at write references to memory-mapped files.
Force file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/memory.c |6 ++
 mm/msync.c  |   57 -
 2 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6dd1cd8..4b0144b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,6 +1670,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2343,6 +2346,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..394130d 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
  * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -13,11 +14,37 @@
 #include 
 
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr;
+
+   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+   spinlock_t *ptl;
+   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+   pud_t *pud = pud_offset(pgd, addr);
+   pmd_t *pmd = pmd_offset(pud, addr);
+   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+   if (pte_dirty(*pte) && pte_write(*pte)) {
+   pte_t entry = ptep_clear_flush(vma, addr, pte);
+
+   entry = pte_wrprotect(entry);
+   set_pte_at(vma->vm_mm, addr, pte, entry);
+   }
+   pte_unmap_unlock(pte, ptl);
+   }
+}
+
+/*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
  *
  * The application may now run fsync() to write out the dirty pages and
  * wait on the writeout and check the result. Or the application may run
@@ -77,16 +104,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
error = 0;
start = vma->vm_end;
file = vma->vm_file;
-   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
-   get_file(file);
-   up_read(&mm->mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start >= end)
-   goto out;
-   down_read(&mm->mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   if (flags & MS_ASYNC)
+   vma_wrprotect(vma);
+   if (flags & MS_SYNC) {
+   get_file(file);
+   up_read(&mm->mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start >= end)
+   goto out;
+   down_read(&mm->mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma->vm_next;
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-21 Thread Anton Salikhmetov
2008/1/21, Peter Staubach <[EMAIL PROTECTED]>:
> Linus Torvalds wrote:
> > On Fri, 18 Jan 2008, Ingo Oeser wrote:
> >
> >> Can we get "if the write to the page hits the disk, the mtime has hit the 
> >> disk
> >> already no less than SOME_GRANULARITY before"?
> >>
> >> That is very important for computer forensics. Esp. in saving your ass!
> >>
> >> Ok, now back again to making that fast :-)
> >>
> >
> > I certainly don't mind it if we have some tighter guarantees, but what I'd
> > want is:
> >
> >  - keep it simple. Let's face it, Linux has never ever given those
> >guarantees before, and it's not is if anybody has really cared. Even
> >now, the issue seems to be more about paper standards conformance than
> >anything else.
> >
> >
>
> I have been working on getting something supported here for
> because I have some very large Wall Street customers who do
> care about getting the mtime updated because their backups
> are getting corrupted.  They are incomplete because although
> their applications update files, they don't get backed up
> because the mtime never changes.
>
> >  - I get worried about people playing around with the dirty bit in
> >particular. We have had some really rather nasty bugs here. Most of
> >which are totally impossible to trigger under normal loads (for
> >example the old random-access utorrent writable mmap issue from about
> >a year ago).
> >
> > So these two issues - the big red danger signs flashing in my brain,
> > coupled with the fact that no application has apparently ever really
> > noticed in the last 15 years - just makes it a case where I'd like each
> > step of the way to be obvious and simple and no larger than really
> > absolutely necessary.
>
> Simple is good.  However, too simple is not good.  I would suggest
> that we implement file time updates which make sense and if they
> happen to follow POSIX, then nifty, otherwise, oh well.

Thank you very much for your support, Peter!

I'm going to submit the design document, the next version of the patch
series, and the performance tests results soon.

>
> Thanx...
>
>ps
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > The page_check_address() function is called from the
> > page_mkclean_one() routine as follows:
>
> .. and the page_mkclean_one() function is totally different.
>
> Lookie here, this is the correct and complex sequence:
>
> > entry = ptep_clear_flush(vma, address, pte);
> > entry = pte_wrprotect(entry);
> > entry = pte_mkclean(entry);
> > set_pte_at(mm, address, pte, entry);
>
> That's a rather expensive sequence, but it's done exactly because it has
> to be done that way. What it does is to
>
>  - *atomically* load the pte entry _and_ clear the old one in memory.
>
>That's the
>
> entry = ptep_clear_flush(vma, address, pte);
>
>thing, and it basically means that it's doing some
>architecture-specific magic to make sure that another CPU that accesses
>the PTE at the same time will never actually modify the pte (because
>it's clear and not valid)
>
>  - it then - while the page table is actually clear and invalid - takes
>the old value and turns it into the new one:
>
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
>
>  - and finally, it replaces the entry with the new one:
>
> set_pte_at(mm, address, pte, entry);
>
>which takes care to write the new entry in some specific way that is
>atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
>entry it writes the high word first, see the write barriers in
>"native_set_pte()" in include/asm-x86/pgtable-3level.h
>
> Now, compare that subtle and correct thing with what is *not* correct:
>
> if (pte_dirty(*pte) && pte_write(*pte))
> *pte = pte_wrprotect(*pte);
>
> which makes no effort at all to make sure that it's safe in case another
> CPU updates the accessed bit.
>
> Now, arguably it's unlikely to cause horrible problems at least on x86,
> because:
>
>  - we only do this if the pte is already marked dirty, so while we can
>lose the accessed bit, we can *not* lose the dirty bit. And the
>accessed bit isn't such a big deal.
>
>  - it's not doing any of the "be careful about" ordering things, but since
>the really important bits aren't changing, ordering probably won't
>practically matter.
>
> But the problem is that we have something like 24 different architectures,
> it's hard to make sure that none of them have issues.
>
> In other words: it may well work in practice. But when these things go
> subtly wrong, they are *really* nasty to find, and the unsafe sequence is
> really not how it's supposed to be done. For example, you don't even flush
> the TLB, so even if there are no cross-CPU issues, there's probably going
> to be writable entries in the TLB that now don't match the page tables.
>
> Will it matter? Again, probably impossible to see in practice. But ...

Linus, I am very grateful to you for your extremely clear explanation
of the issue I have overlooked!

Back to the msync() issue, I'm going to come back with a new design
for the bug fix.

Thank you once again.

Anton

>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/19, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Sat, 19 Jan 2008, Anton Salikhmetov wrote:
> >
> > Before using pte_wrprotect() the vma_wrprotect() routine uses the
> > pte_offset_map_lock() macro to get the PTE and to acquire the ptl
> > spinlock. Why did you say that this code was not SMP-safe? It should
> > be atomic, I think.
>
> It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.
>
> Guess how much another x86 CPU cares when it sets the accessed bit in
> hardware?

Thank you very much for taking part in this discussion. Personally,
it's very important to me.  But I'm not sure that I understand which
bit can be lost.

Please let me explain.

The logic for my vma_wrprotect() routine was taken from the
page_check_address() function in mm/rmap.c. Here is a code snippet of
the latter function:

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
return NULL;

pud = pud_offset(pgd, address);
if (!pud_present(*pud))
return NULL;

pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return NULL;

pte = pte_offset_map(pmd, address);
/* Make a quick check before getting the lock */
if (!pte_present(*pte)) {
pte_unmap(pte);
return NULL;
}

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
return pte;
}
pte_unmap_unlock(pte, ptl);

The page_check_address() function is called from the
page_mkclean_one() routine as follows:

pte = page_check_address(page, mm, address, &ptl);
if (!pte)
goto out;

if (pte_dirty(*pte) || pte_write(*pte)) {
pte_t entry;

flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
ret = 1;
}

pte_unmap_unlock(pte, ptl);

The write-protection of the PTE is done using the pte_wrprotect()
entity. I intended to do the same during msync() with MS_ASYNC. I
understand that I'm taking a risk of looking a complete idiot now,
however I don't see any difference between the two situations.

I presumed that the code in mm/rmap.c was absolutely correct, that's
why I basically reused the design.

>
> > The POSIX standard requires the ctime and mtime stamps to be updated
> > not later than at the second call to msync() with the MS_ASYNC flag.
>
> .. and that is no excuse for bad code.
>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Fri, 18 Jan 2008, Anton Salikhmetov wrote:
> >
> > The current solution doesn't hit the performance at all when compared to
> > the competitor POSIX-compliant systems. It is faster and does even more
> > than the POSIX standard requires.
>
> Your current patches have two problems:
>  - they are simply unnecessarily invasive for a relatively simple issue
>  - all versions I've looked at closer are buggy too
>
> Example:
>
> +   if (pte_dirty(*pte) && pte_write(*pte))
> +   *pte = pte_wrprotect(*pte);
>
> Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
> other CPU's may be accessing it and updating it from their hw page table
> walkers. What will happen? Who knows? I can see lost access bits at a
> minimum.
>
> IOW, this isn't simple code. It's code that it is simple to screw up. In
> this case, you really need to use ptep_set_wrprotect(), for example.

Before using pte_wrprotect() the vma_wrprotect() routine uses the
pte_offset_map_lock() macro to get the PTE and to acquire the ptl
spinlock. Why did you say that this code was not SMP-safe? It should
be atomic, I think.


>
> So why not do it in many fewer lines with that simpler vma->dirty flag?

Neither the dirty flag you suggest, nor the AS_MCTIME flag I've
introduced in my previous solutions solve the following problem:

- mmap()
- a write reference
- msync() with MS_ASYNC
- a write reference
- msync() with MS_ASYNC

The POSIX standard requires the ctime and mtime stamps to be updated
not later than at the second call to msync() with the MS_ASYNC flag.

Some other POSIX-compliant operating system such as HP-UX and FreeBSD
satisfy this POSIX requirement. Linux does not.

>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Linus Torvalds <[EMAIL PROTECTED]>:
>
>
> On Fri, 18 Jan 2008, Miklos Szeredi wrote:
> >
> > What I'm saying is that the times could be left un-updated for a long
> > time if program doesn't do munmap() or msync(MS_SYNC) for a long time.
>
> Sure.
>
> But in those circumstances, the programmer cannot depend on the mtime
> *anyway* (because there is no synchronization), so what's the downside?
>
> Let's face it, there's exactly three possible solutions:
>
>  - the insane one: trap EVERY SINGLE instruction that does a write to the
>page, and update mtime each and every time.
>
>This one is so obviously STUPID that it's not even worth discussing
>further, except to say that "yes, there is an 'exact' algorithm, but
>no, we are never EVER going to use it".
>
>  - the non-exact solutions that don't give you mtime updates every time
>a write to the page happens, but give *some* guarantees for things that
>will update it.
>
>This is the one I think we can do, and the only things a programmer can
>impact using it is "msync()" and "munmap()", since no other operations
>really have any thing to do with it in a programmer-visible way (ie a
>normal "sync" operation may happen in the background and has no
>progam-relevant timing information)
>
>Other things *may* or may not update mtime (some filesystems - take
>most networked one as an example - will *always* update mtime on the
>server on writeback, so we cannot ever guarantee that nothing but
>msync/munmap does so), but at least we'll have a minimum set of things
>that people can depend on.
>
>  - the "we don't care at all solutions".
>
>mmap(MAP_WRITE) doesn't really update times reliably after the write
>has happened (but might do it *before* - maybe the mmap() itself does).
>
> Those are the three choices, I think. We currently approximate #3. We
> *can* do #2 (and there are various flavors of it). And even *aiming* for
> #1 is totally insane and stupid.

The current solution doesn't hit the performance at all when compared to
the competitor POSIX-compliant systems. It is faster and does even more
than the POSIX standard requires.

Please see the test results I've sent into the thread "-v6 0/2":

http://lkml.org/lkml/2008/1/18/447

I guess, the current solution is ready to use.

>
> Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Miklos Szeredi <[EMAIL PROTECTED]>:
> > 4. Performance test was done using the program available from the
> > following link:
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=14493
> >
> > Result: the impact of the changes was negligible for files of a few
> > hundred megabytes.
>
> Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
> nanosecond timestamps, so the time updating code would be exercised
> more during the page faults.
>
> What about performance impact on msync(MS_ASYNC)?  Could you please do
> some measurment of that as well?

Did a quick test on an ext4 partition. This is how it looks like:

debian:~/miklos# ./miklos_test /mnt/file
begin   1200662360  1200662360  1200662353
write   1200662361  1200662361  1200662353
mmap1200662361  1200662361  1200662362
b   1200662363  1200662363  1200662362
msync b 1200662363  1200662363  1200662362
c   1200662365  1200662365  1200662362
msync c 1200662365  1200662365  1200662362
d   1200662367  1200662367  1200662362
munmap  1200662367  1200662367  1200662362
close   1200662367  1200662367  1200662362
sync1200662367  1200662367  1200662362
debian:~/miklos# mount | grep /mnt
/root/image.ext4 on /mnt type ext4dev (rw,loop=/dev/loop0)

> What about performance impact on msync(MS_ASYNC)?  Could you please do
> some measurment of that as well?

Following are the results of the measurements. Here are the relevant
portions of the test program:

>>>

#define FILE_SIZE (1024 * 1024 * 512)

p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Bring the pages in */
for (i = 0; i < FILE_SIZE; i += 4096)
tmp = p[i];

/* Dirty the pages */
for (i = 0; i < FILE_SIZE; i += 4096)
p[i] = i;

/* How long did we spend in msync(MS_ASYNC)? */
gettimeofday(&tv_start, NULL);
msync(p, FILE_SIZE, MS_ASYNC);
gettimeofday(&tv_stop, NULL);

<<<

For reference tests, the following platforms were used:

1. HP-UX B.11.31, PA-RISC 8800 processor (800 MHz, 64 MB), Memory: 4 GB.

2. HP-UX B.11.31, 2 Intel(R) Itanium 2 9000 series processors (1.59 GHz, 18 MB),
   Memory: 15.98 GB.

3. FreeBSD 6.2-RELEASE, Intel(R) Pentium(R) III CPU family 1400MHz, 2 CPUs.
   Memory: 4G.

The tests of my solution were performed using the following platform:

A KVM x86_64 guest OS, current Git kernel. Host system: Intel(R) Core(TM)2
Duo CPU T7300 @ 2.00GHz. Further referred to as "the first case".

The following tables give the time difference between the two calls
to gettimeofday(). The test program was run three times in a raw
with a delay of one second between consecutive runs. On Linux
systems, the following commands were issued prior to running the
tests:

echo 80 >/proc/sys/vm/dirty_ratio
echo 80 >/proc/sys/vm/dirty_background_ratio
echo 3 >/proc/sys/vm/dirty_expire_centisecs
sync

Table 1. Reference platforms.


| | HP-UX/PA-RISC | HP-UX/Itanium | FreeBSD|

| First run   | 263405 usec   | 202283 usec   | 90 SECONDS |

| Second run  | 262253 usec   | 172837 usec   | 90 SECONDS |

| Third run   | 238465 usec   | 238465 usec   | 90 SECONDS |


It looks like FreeBSD is a clear outsider here. Note that FreeBSD
showed an almost liner depencence of the time spent in the
msync(MS_ASYNC) call on the file size.

Table 2. The Qemu system. File size is 512M.

---
|| Before the patch | After the patch |
---
| First run  | 35 usec  |   5852 usec |
---
| Second run | 35 usec  |    usec |
---
| Third run  | 35 usec  |   6330 usec |
---

I think that the data above prove the viability of the solution I
presented. Also, I guess that this bug fix is most probably ready
for getting upstream.

Please apply the sixth version of my solution.

>
> Thanks,
> Miklos
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Fri, 2008-01-18 at 11:15 +0100, Peter Zijlstra wrote:
> > On Fri, 2008-01-18 at 10:51 +0100, Miklos Szeredi wrote:
> >
> > > > diff --git a/mm/msync.c b/mm/msync.c
> > > > index a4de868..a49af28 100644
> > > > --- a/mm/msync.c
> > > > +++ b/mm/msync.c
> > > > @@ -13,11 +13,33 @@
> > > >  #include 
> > > >
> > > >  /*
> > > > + * Scan the PTEs for pages belonging to the VMA and mark them 
> > > > read-only.
> > > > + * It will force a pagefault on the next write access.
> > > > + */
> > > > +static void vma_wrprotect(struct vm_area_struct *vma)
> > > > +{
> > > > + unsigned long addr;
> > > > +
> > > > + for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> > > > + spinlock_t *ptl;
> > > > + pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > > + pud_t *pud = pud_offset(pgd, addr);
> > > > + pmd_t *pmd = pmd_offset(pud, addr);
> > > > + pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > +
> > > > + if (pte_dirty(*pte) && pte_write(*pte))
> > > > + *pte = pte_wrprotect(*pte);
> > > > + pte_unmap_unlock(pte, ptl);
> > > > + }
> > > > +}
> > >
> > > What about ram based filesystems?  They don't start out with read-only
> > > pte's, so I think they don't want them read-protected now either.
> > > Unless this is essential for correct mtime/ctime accounting on these
> > > filesystems (I don't think it really is).  But then the mapping should
> > > start out read-only as well, otherwise the time update will only work
> > > after an msync(MS_ASYNC).
> >
> > page_mkclean() has all the needed logic for this, it also walks the rmap
> > and cleans out all other users, which I think is needed too for
> > consistencies sake:
> >
> > Process A Process B
> >
> > mmap(foo.txt) mmap(foo.txt)
> >
> > dirty page
> >   dirty page
> >
> > msync(MS_ASYNC)
> >
> >   dirty page
> >
> > msync(MS_ASYNC) <--- now what?!
> >
> >
> > So what I would suggest is using the page table walkers from mm, and
> > walks the page range, obtain the page using vm_normal_page() and call
> > page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)
> >
> > All in all, that sounds rather expensive..
>
> Bah, and will break on s390... so we'd need a page_mkclean() variant
> that doesn't actually clear dirty.

So the current version of the functional changes patch takes this into account.

>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Miklos Szeredi <[EMAIL PROTECTED]>:
> > 4. Performance test was done using the program available from the
> > following link:
> >
> > http://bugzilla.kernel.org/attachment.cgi?id=14493
> >
> > Result: the impact of the changes was negligible for files of a few
> > hundred megabytes.
>
> Could you also test with ext4 and post some numbers?  Afaik, ext4 uses
> nanosecond timestamps, so the time updating code would be exercised
> more during the page faults.
>
> What about performance impact on msync(MS_ASYNC)?  Could you please do
> some measurment of that as well?

I'll do the measurements for the MS_ASYNC case and for the Ext4 filesystem.

>
> Thanks,
> Miklos
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v6 1/2] Massive code cleanup of sys_msync()

2008-01-18 Thread Anton Salikhmetov
2008/1/18, Miklos Szeredi <[EMAIL PROTECTED]>:
> >   unsigned long end;
> > - struct mm_struct *mm = current->mm;
> > + int error, unmapped_error;
> >   struct vm_area_struct *vma;
> > - int unmapped_error = 0;
> > - int error = -EINVAL;
> > + struct mm_struct *mm;
> >
> > + error = -EINVAL;
>
> I think you may have misunderstood my last comment.  These are OK:
>
> struct mm_struct *mm = current->mm;
> int unmapped_error = 0;
> int error = -EINVAL;
>
> This is not so good:
>
> int error, unmapped_error;
>
> This is the worst:
>
> int error = -EINVAL, unmapped_error = 0;
>
> So I think the original code is fine as it is.
>
> Othewise patch looks OK now.

I moved the initialization of the variables to the code where they are needed.

I don't agree that "int a; int b;" is better than "int a, b".

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

2008-01-17 Thread Anton Salikhmetov
Updating file times at write references to memory-mapped files and
forcing file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/memory.c |6 ++
 mm/msync.c  |   52 +++-
 2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4bf0b6d..13d5bbf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1668,6 +1668,9 @@ gotten:
 unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
/*
 * Yes, Virginia, this is actually required to prevent a race
 * with clear_page_dirty_for_io() from clearing the page dirty
@@ -2341,6 +2344,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+   if (vma->vm_file)
+   file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
 #include 
 
 /*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+   unsigned long addr;
+
+   for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+   spinlock_t *ptl;
+   pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+   pud_t *pud = pud_offset(pgd, addr);
+   pmd_t *pmd = pmd_offset(pud, addr);
+   pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+   if (pte_dirty(*pte) && pte_write(*pte))
+   *pte = pte_wrprotect(*pte);
+   pte_unmap_unlock(pte, ptl);
+   }
+}
+
+/*
  * MS_SYNC syncs the entire file - including mappings.
  *
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
- * Now it doesn't do anything, since dirty pages are properly tracked.
+ * MS_ASYNC does not start I/O. Instead, it marks the relevant pages
+ * read-only by calling vma_wrprotect(). This is needed to catch the next
+ * write reference to the mapped region and update the file times
+ * accordingly.
  *
  * The application may now run fsync() to write out the dirty pages and
  * wait on the writeout and check the result. Or the application may run
@@ -77,16 +99,20 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
error = 0;
start = vma->vm_end;
file = vma->vm_file;
-   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
-   get_file(file);
-   up_read(&mm->mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error || start >= end)
-   goto out;
-   down_read(&mm->mmap_sem);
-   vma = find_vma(mm, start);
-   continue;
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   if (flags & MS_ASYNC)
+   vma_wrprotect(vma);
+   if (flags & MS_SYNC) {
+   get_file(file);
+   up_read(&mm->mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error || start >= end)
+   goto out;
+   down_read(&mm->mmap_sem);
+   vma = find_vma(mm, start);
+   continue;
+   }
}
 
vma = vma->vm_next;
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v6 0/2] Fixing the issue with memory-mapped file times

2008-01-17 Thread Anton Salikhmetov
This is the sixth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) a few cosmetic changes according to the latest feedback
   for the cleanup patch;

2) implementation of the following suggestion by Miklos Szeredi:

http://lkml.org/lkml/2008/1/17/158

These changes were tested as explained below. Please note that all
tests were performed with all recommended kernel debug options
enabled. Also note that the tests were performed on regular files
residing on both an ext3 partition and a tmpfs filesystem. I also
checked the block device case, which worked for me as well.

1. My own unit test:

http://bugzilla.kernel.org/attachment.cgi?id=14430

Result: all test cases passed successfully.

2. Unit test provided by Miklos Szeredi:

http://lkml.org/lkml/2008/1/14/104

Result: this test produced the following output:

debian-64:~# ./miklos_test test
begin   1200598736  1200598736  1200598617
write   1200598737  1200598737  1200598617
mmap1200598737  1200598737  1200598738
b   1200598739  1200598739  1200598738
msync b 1200598739  1200598739  1200598738
c   1200598741  1200598741  1200598738
msync c 1200598741  1200598741  1200598738
d   1200598743  1200598743  1200598738
munmap  1200598743  1200598743  1200598738
close   1200598743  1200598743  1200598738
sync1200598743  1200598743  1200598738
debian-64:~#

3. Regression tests were performed using the following test cases from
the LTP test suite:

msync01
msync02
msync03
msync04
msync05
mmapstress01
mmapstress09
mmapstress10

Result: no regressions were found while running these test cases.

4. Performance test was done using the program available from the
following link:

http://bugzilla.kernel.org/attachment.cgi?id=14493

Result: the impact of the changes was negligible for files of a few
hundred megabytes.

I wonder if these changes can be applied now.

These patches are the result of many fruitful discussions with
Miklos Szeredi, Peter Zijlstra, Rik van Riel, Peter Staubach,
and Jacob Oestergaard. I am grateful to you all for your support
during the days I was working on this long-standing nasty bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH -v6 1/2] Massive code cleanup of sys_msync()

2008-01-17 Thread Anton Salikhmetov
Using the PAGE_ALIGN() macro instead of "manual" alignment and
improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   77 
 1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
  * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
  * applications.
  */
 asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
 {
unsigned long end;
-   struct mm_struct *mm = current->mm;
+   int error, unmapped_error;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   struct mm_struct *mm;
 
+   error = -EINVAL;
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
if (start & ~PAGE_MASK)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
+
error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
goto out;
-   error = 0;
+
+   error = unmapped_error = 0;
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
+   mm = current->mm;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
error = -ENOMEM;
if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   break;
if (start < vma->vm_start) {
start = vma->vm_start;
if (start >= end)
-   goto out_unlock;
-   unmapped_error = -ENOMEM;
-   }
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
-   error = -EBUSY;
-   goto out_unlock;
+   break;
+   unmapped_error = error;
}
-   file = vma->vm_file;
+
+   error = -EBUSY;
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
+   break;
+
+   error = 0;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+   file = vma->vm_file;
+   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
@@ -88,16 +86,13 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma->vm_next

Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > The do_wp_page() function is called in mm/memory.c after locking PTE.
> > And the file_update_time() routine calls the filesystem operation that can
> > sleep. It's not accepted, I guess.
>
> do_wp_page() is called with the pte lock but drops it, so that's fine.

OK, I agree.

I'll take into account your suggestion to move updating time stamps from
the __set_page_dirty() and __set_page_dirty_nobuffers() routines to
do_wp_page(). Thank you!

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > I'm not sure this auto-updating is really needed (POSIX doesn't
> > > mandate it).
> >
> > Peter Shtaubach, author of the first solution for this bug,
> > and Jacob Ostergaard, the reporter of this bug, insist the "auto-update"
> > feature to be implemented.
>
> Can they state their reasons for the insistence?
>
> >  1) a base patch: update time just from fsync() and remove_vma()
> >  2) update time on sync(2) as well
> >  3) update time on MS_ASYNC as well
>
> Oh, and the four-liner I posted the other day will give you 1) + 2) +
> even more at a small fraction of the complexity.  And tacking on the
> reprotect code will solve the MS_ASYNC issue just the same.
>
> I agree, that having the timestamp updated on sync() is nice, and that
> trivial patch will give you that, and will also update the timestamp
> at least each 30 seconds if the file is being constantly modified,
> even if no explicit syncing is done.
>
> So maybe it's worth a little effort benchmarking how much that patch
> affects the cost of writing to a page.
>
> You could write a little test program like this (if somebody hasn't
> yet done so):
>
>  - do some preparation:
>
>echo 80 > dirty_ratio
>echo 80 > dirty_background_ratio
>echo 3 > dirty_expire_centisecs
>sync
>
>  - map a large file, one that fits comfortably into free memory
>  - bring the whole file in, by reading a byte from each page
>  - start the timer
>  - write a byte to each page
>  - stop the timer
>
> It would be most interesting to try this on a filesystem supporting
> nanosecond timestamps.  Anyone know which these are?

The do_wp_page() function is called in mm/memory.c after locking PTE.
And the file_update_time() routine calls the filesystem operation that can
sleep. It's not accepted, I guess.

>
> Miklos
> 
>
> Index: linux/mm/memory.c
> ===
> --- linux.orig/mm/memory.c  2008-01-09 21:16:30.0 +0100
> +++ linux/mm/memory.c   2008-01-15 21:16:14.0 +0100
> @@ -1680,6 +1680,8 @@ gotten:
>  unlock:
> pte_unmap_unlock(page_table, ptl);
> if (dirty_page) {
> +   if (vma->vm_file)
> +   file_update_time(vma->vm_file);
> /*
>  * Yes, Virginia, this is actually required to prevent a race
>  * with clear_page_dirty_for_io() from clearing the page dirty
> @@ -2313,6 +2315,8 @@ out_unlocked:
> if (anon)
> page_cache_release(vmf.page);
> else if (dirty_page) {
> +   if (vma->vm_file)
> +   file_update_time(vma->vm_file);
> set_page_dirty_balance(dirty_page, page_mkwrite);
> put_page(dirty_page);
> }
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > 2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > > > 4. Recording the time was the file data changed
> > > > >
> > > > > Finally, I noticed yet another issue with the previous version of my 
> > > > > patch.
> > > > > Specifically, the time stamps were set to the current time of the 
> > > > > moment
> > > > > when syncing but not the write reference was being done. This led to 
> > > > > the
> > > > > following adverse effect on my development system:
> > > > >
> > > > > 1) a text file A was updated by process B;
> > > > > 2) process B exits without calling any of the *sync() functions;
> > > > > 3) vi editor opens the file A;
> > > > > 4) file data synced, file times updated;
> > > > > 5) vi is confused by "thinking" that the file was changed after 3).
> > >
> > > Updating the time in remove_vma() would fix this, no?
> >
> > We need to save modification time. Otherwise, updating time stamps
> > will be confusing the vi editor.
>
> remove_vma() will be called when process B exits, so if the times are
> updated there, and the flag is cleared, the times won't be updated
> later.
>
> > >
> > > > > All these changes to inode.c are unnecessary, I think.
> > > >
> > > > The first part is necessary to account for "remembering" the 
> > > > modification time.
> > > >
> > > > The second part is for handling block device files. I cannot see any 
> > > > other
> > > > sane way to update file times for them.
> > >
> > > Use file_update_time(), which will do the right thing.  It will in
> > > fact do the same thing as write(2) on the device, which is really what
> > > we want.
> > >
> > > Block devices being mapped for write through different device
> > > nodes..., well, I don't think we really need to handle such weird
> > > corner cases 100% acurately.
> >
> > The file_update_time() cannot be used for implementing
> > the "auto-update" feature, because the sync() system call
> > doesn't "know" about the file which was memory-mapped.
>
> I'm not sure this auto-updating is really needed (POSIX doesn't
> mandate it).

Peter Shtaubach, author of the first solution for this bug,
and Jacob Ostergaard, the reporter of this bug, insist the "auto-update"
feature to be implemented.

>
> At least split it out into a separate patch, so it can be considered
> separately on it's own merit.
>
> I think doing the same with the page-table reprotecting in MS_ASYNC is
> also a good idea.  That will leave us with
>
>  1) a base patch: update time just from fsync() and remove_vma()
>  2) update time on sync(2) as well
>  3) update time on MS_ASYNC as well
>
> I'd happily ack the first one, which would solve the most serious
> issues, but have some reservations about the other two.
>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Rogier Wolff <[EMAIL PROTECTED]>:
> On Thu, Jan 17, 2008 at 04:16:47PM +0300, Anton Salikhmetov wrote:
> > 2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > > > 4. Recording the time was the file data changed
> > > > >
> > > > > Finally, I noticed yet another issue with the previous version of my 
> > > > > patch.
> > > > > Specifically, the time stamps were set to the current time of the 
> > > > > moment
> > > > > when syncing but not the write reference was being done. This led to 
> > > > > the
> > > > > following adverse effect on my development system:
> > > > >
> > > > > 1) a text file A was updated by process B;
> > > > > 2) process B exits without calling any of the *sync() functions;
> > > > > 3) vi editor opens the file A;
> > > > > 4) file data synced, file times updated;
> > > > > 5) vi is confused by "thinking" that the file was changed after 3).
> > >
> > > Updating the time in remove_vma() would fix this, no?
> >
> > We need to save modification time. Otherwise, updating time stamps
> > will be confusing the vi editor.
>
> If process B exits before vi opens the file, the timestamp should at
> the latest be the time that process B exits. There is no excuse for
> setting the timestamp later than the time that B exits.
>
> If process B no longer modifies the file, but still keeps it mapped
> until after vi starts, then the system can't help the
> situation. Wether or not B acesses those pages is unknown to the
> system. So you get what you deserve.

The exit() system call closes the file memory-mapped file. Therefore,
it's better to catch the close() system call. However, the close() system call
does not trigger unmapping the file. The mapped area for the file may be
used after closing the file. So, we should catch only the unmap() call,
not close() or exit().

>
> Roger.
>
> --
> ** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> **Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233**
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
> Does it sit on the couch all day? Is it unemployed? Please be specific!
> Define 'it' and what it isn't doing. - Adapted from lxrbot FAQ
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > 4. Recording the time was the file data changed
> > >
> > > Finally, I noticed yet another issue with the previous version of my 
> > > patch.
> > > Specifically, the time stamps were set to the current time of the moment
> > > when syncing but not the write reference was being done. This led to the
> > > following adverse effect on my development system:
> > >
> > > 1) a text file A was updated by process B;
> > > 2) process B exits without calling any of the *sync() functions;
> > > 3) vi editor opens the file A;
> > > 4) file data synced, file times updated;
> > > 5) vi is confused by "thinking" that the file was changed after 3).
>
> Updating the time in remove_vma() would fix this, no?

We need to save modification time. Otherwise, updating time stamps
will be confusing the vi editor.

>
> > > All these changes to inode.c are unnecessary, I think.
> >
> > The first part is necessary to account for "remembering" the modification 
> > time.
> >
> > The second part is for handling block device files. I cannot see any other
> > sane way to update file times for them.
>
> Use file_update_time(), which will do the right thing.  It will in
> fact do the same thing as write(2) on the device, which is really what
> we want.
>
> Block devices being mapped for write through different device
> nodes..., well, I don't think we really need to handle such weird
> corner cases 100% acurately.

The file_update_time() cannot be used for implementing
the "auto-update" feature, because the sync() system call
doesn't "know" about the file which was memory-mapped.

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > Changes for updating the ctime and mtime fields for memory-mapped files:
> >
> > 1) a new flag triggering update of the inode data;
> > 2) a new field in the address_space structure for saving modification time;
> > 3) a new helper function to update ctime and mtime when needed;
> > 4) updating time stamps for mapped files in sys_msync() and do_fsync();
> > 5) implementing lazy ctime and mtime update.
>
> OK, the functionality seems to be there now.  As a next step, I think
> you should try to simplify the patch, removing everything, that is not
> strictly necessary.
>
> >
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  fs/buffer.c |3 ++
> >  fs/fs-writeback.c   |2 +
> >  fs/inode.c  |   43 +++--
> >  fs/sync.c   |2 +
> >  include/linux/fs.h  |   13 +-
> >  include/linux/pagemap.h |3 +-
> >  mm/msync.c  |   61 
> > +-
> >  mm/page-writeback.c |   54 ++---
> >  8 files changed, 124 insertions(+), 57 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 7249e01..3967aa7 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
> >   if (unlikely(!mapping))
> >   return !TestSetPageDirty(page);
> >
> > + mapping->mtime = CURRENT_TIME;
>
> Why is this needed?  POSIX explicitly states, that the modification
> time can be set to anywhere between the first write and the msync.

I've already described this change in one of my previous emails:

> 4. Recording the time was the file data changed
>
> Finally, I noticed yet another issue with the previous version of my patch.
> Specifically, the time stamps were set to the current time of the moment
> when syncing but not the write reference was being done. This led to the
> following adverse effect on my development system:
>
> 1) a text file A was updated by process B;
> 2) process B exits without calling any of the *sync() functions;
> 3) vi editor opens the file A;
> 4) file data synced, file times updated;
> 5) vi is confused by "thinking" that the file was changed after 3).
>
> This version overcomes this problem by introducing another field into the
> address_space structure. This field is used to "remember" the time of
> dirtying, and then this time value is propagated to the file metadata.
>
> This approach is based upon the following suggestion given by Peter
> Staubach during one of our previous discussions:
>
> http://lkml.org/lkml/2008/1/9/267
>
> > A better architecture would be to arrange for the file times
> > to be updated when the page makes the transition from being
> > unmodified to modified.  This is not straightforward due to
> > the current locking, but should be doable, I think.  Perhaps
> > recording the current time and then using it to update the
> > file times at a more suitable time (no pun intended) might
> > work.
>
> The solution I propose now proves the viability of the latter
> approach.

See:

http://lkml.org/lkml/2008/1/15/202

>
> > + set_bit(AS_MCTIME, &mapping->flags);
>
> A bigger problem is that doing this in __set_page_dirty() and friends
> will mean, that the flag will be set for non-mapped writes as well,
> which we definitely don't want.
>
> A better place to put it is do_wp_page and __do_fault, where
> set_page_dirty_balance() is called.

Thanks for your suggestion, I'll consider how I can apply it.

>
> > +
> >   if (TestSetPageDirty(page))
> >   return 0;
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 300324b..affd291 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
> > writeback_control *wbc)
> >
> >   spin_unlock(&inode_lock);
> >
> > + mapping_update_time(mapping);
> > +
>
> I think this is unnecessary.  Rather put the update into remove_vma().

Thanks for your suggestion, I'll consider how I can apply it.

However, I suspect that this is a bit problematic. Let me check it out.

>
> >   ret = do_writepages(mapping, wbc);
> >
> >   /* Don't write the inode if only I_DIRTY_PAGES was set */
> > diff --git a/fs/ino

Re: [PATCH -v5 1/2] Massive code cleanup of sys_msync()

2008-01-17 Thread Anton Salikhmetov
2008/1/17, Miklos Szeredi <[EMAIL PROTECTED]>:
> > Substantial code cleanup of the sys_msync() function:
> >
> > 1) using the PAGE_ALIGN() macro instead of "manual" alignment;
> > 2) improved readability of the loop traversing the process memory regions.
> >
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  mm/msync.c |   74 
> > +--
> >  1 files changed, 36 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/msync.c b/mm/msync.c
> > index 144a757..44997bf 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -1,24 +1,22 @@
> >  /*
> > - *   linux/mm/msync.c
> > + * The msync() system call.
> >   *
> > - * Copyright (C) 1994-1999  Linus Torvalds
> > + * Copyright (C) 1994-1999 Linus Torvalds
> > + * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
> >   */
> >
> > -/*
> > - * The msync() system call.
> > - */
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> > -#include 
> >  #include 
> > +#include 
> >
> >  /*
> >   * MS_SYNC syncs the entire file - including mappings.
> >   *
> >   * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> > + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> >   * Now it doesn't do anything, since dirty pages are properly tracked.
> >   *
> >   * The application may now run fsync() to
> > @@ -33,8 +31,7 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> > len, int flags)
> >   unsigned long end;
> >   struct mm_struct *mm = current->mm;
> >   struct vm_area_struct *vma;
> > - int unmapped_error = 0;
> > - int error = -EINVAL;
> > + int error = -EINVAL, unmapped_error = 0;
>
> I prefer multi-line variable declarations, especially for ones with an
> initializer.
>
> >
> >   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> >   goto out;
> > @@ -42,62 +39,63 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> > len, int flags)
> >   goto out;
> >   if ((flags & MS_ASYNC) && (flags & MS_SYNC))
> >   goto out;
> > - error = -ENOMEM;
> > - len = (len + ~PAGE_MASK) & PAGE_MASK;
> > +
> > + len = PAGE_ALIGN(len);
> >   end = start + len;
> > - if (end < start)
> > + if (end < start) {
> > + error = -ENOMEM;
>
> The usual style is to have the error assignment outside the
> conditional.  That way is shorter, clearer, as well as possibly
> generating better code.
>
> >   goto out;
> > + }
> > +
> >   error = 0;
> > +
>
> Unnecessary empty line here, these two statements actually belong
> together.
>
> >   if (end == start)
> >   goto out;
> > +
> >   /*
> >* If the interval [start,end) covers some unmapped address ranges,
> >* just ignore them, but return -ENOMEM at the end.
> >*/
> >   down_read(&mm->mmap_sem);
> >   vma = find_vma(mm, start);
> > - for (;;) {
> > + do {
> >   struct file *file;
> >
> > - /* Still start < end. */
> > - error = -ENOMEM;
> > - if (!vma)
> > - goto out_unlock;
> > - /* Here start < vma->vm_end. */
> > + if (!vma) {
> > + error = -ENOMEM;
> > + break;
> > + }
>
> Again, error asignment should be outside the conditional.  This of
> course means, you'll have to set the error back to zero at the end of
> the loop, but that's fine.
>
> >   if (start < vma->vm_start) {
> >   start = vma->vm_start;
> > - if (start >= end)
> > - goto out_unlock;
> > + if (start >= end) {
> > + error = -ENOMEM;
> > + break;
> > + }
>
> Ditto.
>
> >   unmapped_error = -ENOMEM;
> >   }
> > - /* Here vma->vm_start <= start < vma->vm_end. */
> > -   

[PATCH -v5 2/2] Updating ctime and mtime at syncing

2008-01-16 Thread Anton Salikhmetov
http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes for updating the ctime and mtime fields for memory-mapped files:

1) a new flag triggering update of the inode data;
2) a new field in the address_space structure for saving modification time;
3) a new helper function to update ctime and mtime when needed;
4) updating time stamps for mapped files in sys_msync() and do_fsync();
5) implementing lazy ctime and mtime update.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 fs/buffer.c |3 ++
 fs/fs-writeback.c   |2 +
 fs/inode.c  |   43 +++--
 fs/sync.c   |2 +
 include/linux/fs.h  |   13 +-
 include/linux/pagemap.h |3 +-
 mm/msync.c  |   61 +-
 mm/page-writeback.c |   54 ++---
 8 files changed, 124 insertions(+), 57 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3967aa7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
+   mapping->mtime = CURRENT_TIME;
+   set_bit(AS_MCTIME, &mapping->flags);
+
if (TestSetPageDirty(page))
return 0;
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 300324b..affd291 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
writeback_control *wbc)
 
spin_unlock(&inode_lock);
 
+   mapping_update_time(mapping);
+
ret = do_writepages(mapping, wbc);
 
/* Don't write the inode if only I_DIRTY_PAGES was set */
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..edd5bf4 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,8 +1243,10 @@ void touch_atime(struct vfsmount *mnt, struct dentry 
*dentry)
 EXPORT_SYMBOL(touch_atime);
 
 /**
- * file_update_time-   update mtime and ctime time
- * @file: file accessed
+ * inode_update_time   -   update mtime and ctime time
+ * @inode: inode accessed
+ * @ts: time when inode was accessed
+ * @sync: whether to do synchronous update
  *
  * Update the mtime and ctime members of an inode and mark the inode
  * for writeback.  Note that this function is meant exclusively for
@@ -1253,11 +1255,8 @@ EXPORT_SYMBOL(touch_atime);
  * S_NOCTIME inode flag, e.g. for network filesystem where these
  * timestamps are handled by the server.
  */
-
-void file_update_time(struct file *file)
+void inode_update_time(struct inode *inode, struct timespec *ts)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
-   struct timespec now;
int sync_it = 0;
 
if (IS_NOCMTIME(inode))
@@ -1265,22 +1264,41 @@ void file_update_time(struct file *file)
if (IS_RDONLY(inode))
return;
 
-   now = current_fs_time(inode->i_sb);
-   if (!timespec_equal(&inode->i_mtime, &now)) {
-   inode->i_mtime = now;
+   if (timespec_compare(&inode->i_mtime, ts) < 0) {
+   inode->i_mtime = *ts;
sync_it = 1;
}
 
-   if (!timespec_equal(&inode->i_ctime, &now)) {
-   inode->i_ctime = now;
+   if (timespec_compare(&inode->i_ctime, ts) < 0) {
+   inode->i_ctime = *ts;
sync_it = 1;
}
 
if (sync_it)
mark_inode_dirty_sync(inode);
 }
+EXPORT_SYMBOL(inode_update_time);
 
-EXPORT_SYMBOL(file_update_time);
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapping_update_time(struct address_space *mapping)
+{
+   if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
+   struct inode *inode = mapping->host;
+   struct timespec *ts = &mapping->mtime;
+
+   if (S_ISBLK(inode->i_mode)) {
+   struct block_device *bdev = inode->i_bdev;
+
+   mutex_lock(&bdev->bd_mutex);
+   list_for_each_entry(inode, &bdev->bd_inodes, i_devices)
+   inode_update_time(inode, ts);
+   mutex_unlock(&bdev->bd_mutex);
+   } else
+   inode_update_time(inode, ts);
+   }
+}
 
 int inode_needs_sync(struct inode *inode)
 {
@@ -1290,7 +1308,6 @@ int inode_needs_sync(struct inode *inode)
return 1;
return 0;
 }
-
 EXPORT_SYMBOL(inode_needs_sync);
 
 int inode_wait(void *word)
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..5561464 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
goto out;
}
 
+   mapping_update_time(mapping);
+
ret 

[PATCH -v5 1/2] Massive code cleanup of sys_msync()

2008-01-16 Thread Anton Salikhmetov
Substantial code cleanup of the sys_msync() function:

1) using the PAGE_ALIGN() macro instead of "manual" alignment;
2) improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   74 +--
 1 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..44997bf 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,22 @@
 /*
- * linux/mm/msync.c
+ * The msync() system call.
  *
- * Copyright (C) 1994-1999  Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
  * The application may now run fsync() to
@@ -33,8 +31,7 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = -EINVAL, unmapped_error = 0;
 
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
goto out;
@@ -42,62 +39,63 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+
+   len = PAGE_ALIGN(len);
end = start + len;
-   if (end < start)
+   if (end < start) {
+   error = -ENOMEM;
goto out;
+   }
+
error = 0;
+
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
-   error = -ENOMEM;
-   if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   if (!vma) {
+   error = -ENOMEM;
+   break;
+   }
if (start < vma->vm_start) {
start = vma->vm_start;
-   if (start >= end)
-   goto out_unlock;
+   if (start >= end) {
+   error = -ENOMEM;
+   break;
+   }
unmapped_error = -ENOMEM;
}
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
-   goto out_unlock;
+   break;
}
-   file = vma->vm_file;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+
+   file = vma->vm_file;
+   if (file && (vma->vm_flags & VM_SHARED) && (flags & MS_SYNC)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
fput(file);
-   if (error || start >= end)
+   if (error)
goto out;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma->vm_next;
+   continue;
}
-   }
-out_unlock:
+
+   vma = vma->vm_next;
+   } while (start < end);
up_read(&mm->mmap_sem);
+
 out:
-   return error ? : unmapped_error;
+   return error ? error : unmapped

[PATCH -v5 0/2] Updating ctime and mtime for memory-mapped files

2008-01-16 Thread Anton Salikhmetov
This is the fifth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) the case of retouching an already-dirty page pointed out
   by Miklos Szeredi has been correctly addressed;

2) a few cosmetic changes according to the latest feedback;

3) fixed the error of calling a possibly sleeping function
   from an atomic context.

The design for the first item above was suggested by Peter Zijlstra:

> It would require scanning the PTEs and marking them read-only again on
> MS_ASYNC, and some more logic in set_page_dirty() because that currently
> bails out early if the page in question is already dirty.

Miklos' test program now produces the following output for
the repeated calls to msync() with the MS_ASYNC flag:

debian:~/miklos# ./miklos_test file
begin   1200529196  1200529196  1200528798
write   1200529197  1200529197  1200528798
mmap1200529197  1200529197  1200529198
b   1200529197  1200529197  1200529198
msync b 1200529199  1200529199  1200529198
c   1200529199  1200529199  1200529198
msync c 1200529201  1200529201  1200529198
d   1200529201  1200529201  1200529198
munmap  1200529201  1200529201  1200529198
close   1200529201  1200529201  1200529198
sync1200529204  1200529204  1200529198
debian:~/miklos#

Miklos' test program can be found using the following link:

http://lkml.org/lkml/2008/1/14/104
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: mtime updates for mmapped files.

2008-01-16 Thread Anton Salikhmetov
2008/1/16, Rogier Wolff <[EMAIL PROTECTED]>:
>
> Hi,
>
> I  wrote a small app yesterday that updates a file by mmapping the
> file (RW), changing the thing around, and then exiting.
>
> This did not trigger a change in the mtime of the file. Thus rsync
> didn't pick up that the file had changed.
>
> I understand that tracking every change to a RW mmapped file is
> costly, and thus unfeasable, but shouldn't then the close cause a
> mtime update?
>
> The server where this happened is running 2.6.21, so my apologies if
> this has already been corrected.

Unfortunately, this issue has not been fully fixed yet.

My last attempt (http://lkml.org/lkml/2008/1/15/202) to solve
this problem has a couple of drawbacks:

1) calling a possibly sleeping function from atomic context -
I've already corrected this;

2) there's a very special case with retouching the memory-mapped data
after a call to msync() with MS_ASYNC.

I'm still working on the latter case, but I guess that I have found a solution.

If you badly need a quick fix, I can send my working unreleased patch to you.
I reckon that your particular problem will be fixed by this patch.
Let me know if you want that.

However, if your application calls msync() with the MS_ASYNC flag,
it's better to wait a little bit more - I'll release the next version
of my solution
shortly.

>
> Roger.
>
> --
> ** [EMAIL PROTECTED] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
> **Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233**
> *-- BitWizard writes Linux device drivers for any device you may have! --*
> Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
> Does it sit on the couch all day? Is it unemployed? Please be specific!
> Define 'it' and what it isn't doing. - Adapted from lxrbot FAQ
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/2] Updating ctime and mtime for memory-mapped files [try #4]

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Miklos Szeredi <[EMAIL PROTECTED]>:
> > 1. Introduction
> >
> > This is the fourth version of my solution for the bug #2645:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > Changes since the previous version:
> >
> > 1) the case of retouching an already-dirty page pointed out
> >   by Miklos Szeredi has been addressed;
>
> I'm a bit sceptical, as we've also pointed out, that this is not
> possible without messing with the page tables.
>
> Did you try my test program on the patched kernel?

I just tried your test program. Alas, my assumption appears to be wrong.

Thank you for your comment!

Now I start thinking that it is better not to care about
the MS_ASYNC case whatsoever.

>
> I've refreshed the patch, where we left this issue last time.  It
> should basically have equivalent functionality to your patch, and is a
> lot simpler.  There might be performance issues with it, but it's a
> good starting point.
>
> Miklos
> 
>
> Index: linux/mm/memory.c
> ===
> --- linux.orig/mm/memory.c  2008-01-09 21:16:30.0 +0100
> +++ linux/mm/memory.c   2008-01-15 21:16:14.0 +0100
> @@ -1680,6 +1680,8 @@ gotten:
>  unlock:
> pte_unmap_unlock(page_table, ptl);
> if (dirty_page) {
> +   if (vma->vm_file)
> +   file_update_time(vma->vm_file);
> /*
>  * Yes, Virginia, this is actually required to prevent a race
>  * with clear_page_dirty_for_io() from clearing the page dirty
> @@ -2313,6 +2315,8 @@ out_unlocked:
> if (anon)
> page_cache_release(vmf.page);
> else if (dirty_page) {
> +   if (vma->vm_file)
> +   file_update_time(vma->vm_file);
> set_page_dirty_balance(dirty_page, page_mkwrite);
> put_page(dirty_page);
> }
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Christoph Hellwig <[EMAIL PROTECTED]>:
> On Tue, Jan 15, 2008 at 08:28:48PM +0100, Peter Zijlstra wrote:
> > Notice that error is already -EINVAL, so a simple goto should suffice.
>
> Yes, for the start of the function you can basically leave it as-is.
>

OK, I will do as you suggest. Thank you for your answers.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Randy Dunlap <[EMAIL PROTECTED]>:
> On Tue, 15 Jan 2008 22:02:54 +0300 Anton Salikhmetov wrote:
>
> > 2008/1/15, Christoph Hellwig <[EMAIL PROTECTED]>:
> > > On Tue, Jan 15, 2008 at 07:02:44PM +0300, Anton Salikhmetov wrote:
>
> > > > @@ -33,71 +34,65 @@ asmlinkage long sys_msync(unsigned long start, 
> > > > size_t len, int flags)
> > > >   unsigned long end;
> > > >   struct mm_struct *mm = current->mm;
> > > >   struct vm_area_struct *vma;
> > > > - int unmapped_error = 0;
> > > > - int error = -EINVAL;
> > > > + int error = 0, unmapped_error = 0;
> > > >
> > > >   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> > > > - goto out;
> > > > + return -EINVAL;
> > > >   if (start & ~PAGE_MASK)
> > > > - goto out;
> > > > + return -EINVAL;
> > >
> > > The goto out for a simple return style is used quite commonly in kernel
> > > code to have a single return statement which makes code maintaince, e.g.
> > > adding locks or allocations simpler.  Not sure that getting rid of it
> > > makes a lot of sense.
> >
> > Sorry, I can't agree. That's what is written in the CodingStyle document:
> >
> > The goto statement comes in handy when a function exits from multiple
> > locations and some common work such as cleanup has to be done.
>
> CodingStyle does not try to cover Everything.  Nor do we want it to.
>
> At any rate, there is a desire for functions to have a single point
> of return, regardless of the amount of cleanup to be done, so I agree
> with Christoph's comments.

Should I replace "return -EINVAL;" statement with the following?

{
error = -EINVAL;
goto out;
}

>
>
> > The second part of requirement does not hold true for the sys_msync() 
> > routine.
> >
> > >
> > > > + file = vma->vm_file;
> > > > + if ((flags & MS_SYNC) && file && (vma->vm_flags & 
> > > > VM_SHARED)) {
> > >
> > > Given that file is assigned just above it would be more readable to test
> > > it first.
> >
> > The second part of my solution changes this code, anyway.
> >
> > >
> > > > + if (error)
> > > > + return error;
> > >
> > > This should be an goto out, returns out of the middle of the function
> > > make reading and maintaining the code not so nice.
> >
> > Sorry, I don't think so. No "common cleanup" is needed here.
>
>
> ---
> ~Randy
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Christoph Hellwig <[EMAIL PROTECTED]>:
> On Tue, Jan 15, 2008 at 07:02:44PM +0300, Anton Salikhmetov wrote:
> > +++ b/mm/msync.c
> > @@ -1,24 +1,25 @@
> >  /*
> >   *   linux/mm/msync.c
> >   *
> > + * The msync() system call.
> >   * Copyright (C) 1994-1999  Linus Torvalds
> > + *
> > + * Massive code cleanup.
> > + * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
>
> Please don't put the changelog in here, that's what the log in the SCM
> is for.  And while you're at it remove the confusing file name comment.
> This should now look something like:
>
> /*
>  * The msync() system call.
>  *
>  * Copyright (C) 1994-1999  Linus Torvalds
>  * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
>  */

Thanks!

I'll take into account your recommendation.

>
> > @@ -33,71 +34,65 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> > len, int flags)
> >   unsigned long end;
> >   struct mm_struct *mm = current->mm;
> >   struct vm_area_struct *vma;
> > - int unmapped_error = 0;
> > - int error = -EINVAL;
> > + int error = 0, unmapped_error = 0;
> >
> >   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> > - goto out;
> > + return -EINVAL;
> >   if (start & ~PAGE_MASK)
> > - goto out;
> > + return -EINVAL;
>
> The goto out for a simple return style is used quite commonly in kernel
> code to have a single return statement which makes code maintaince, e.g.
> adding locks or allocations simpler.  Not sure that getting rid of it
> makes a lot of sense.

Sorry, I can't agree. That's what is written in the CodingStyle document:

The goto statement comes in handy when a function exits from multiple
locations and some common work such as cleanup has to be done.

The second part of requirement does not hold true for the sys_msync() routine.

>
> > + file = vma->vm_file;
> > + if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) 
> > {
>
> Given that file is assigned just above it would be more readable to test
> it first.

The second part of my solution changes this code, anyway.

>
> > + if (error)
> > + return error;
>
> This should be an goto out, returns out of the middle of the function
> make reading and maintaining the code not so nice.

Sorry, I don't think so. No "common cleanup" is needed here.

>
> >   return error ? : unmapped_error;
>
> Care to get rid of this odd GNU extension while you're at it and use
> the proper

I do also think that this GNU extension is not readable,
so I'll take your recommentation into account.

>
> return error ? error : unmapped_error;
>
> ?
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Christoph Hellwig <[EMAIL PROTECTED]>:
> On Tue, Jan 15, 2008 at 07:02:45PM +0300, Anton Salikhmetov wrote:
> > +/*
> > + * Update the ctime and mtime stamps for memory-mapped block device files.
> > + */
> > +static void bd_inode_update_time(struct inode *inode, struct timespec *ts)
> > +{
> > + struct block_device *bdev = inode->i_bdev;
> > + struct list_head *p;
> > +
> > + if (bdev == NULL)
> > + return;
>
> inode->i_bdev is never NULL for inodes currently beeing written to.
>
> > +
> > + mutex_lock(&bdev->bd_mutex);
> > + list_for_each(p, &bdev->bd_inodes) {
> > + inode = list_entry(p, struct inode, i_devices);
>
> this should use list_for_each_entry.
>
>

Thank you very much for your recommenations. I'll take them into account.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] Updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Tue, 2008-01-15 at 19:02 +0300, Anton Salikhmetov wrote:
>
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 3d3848f..53d0e34 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -997,35 +997,39 @@ int __set_page_dirty_no_writeback(struct page *page)
> >   */
> >  int __set_page_dirty_nobuffers(struct page *page)
> >  {
> > - if (!TestSetPageDirty(page)) {
> > - struct address_space *mapping = page_mapping(page);
> > - struct address_space *mapping2;
> > + struct address_space *mapping = page_mapping(page);
> > + struct address_space *mapping2;
> >
> > - if (!mapping)
> > - return 1;
> > + if (!mapping)
> > + return 1;
> >
> > - write_lock_irq(&mapping->tree_lock);
> > - mapping2 = page_mapping(page);
> > - if (mapping2) { /* Race with truncate? */
> > - BUG_ON(mapping2 != mapping);
> > - WARN_ON_ONCE(!PagePrivate(page) && 
> > !PageUptodate(page));
> > - if (mapping_cap_account_dirty(mapping)) {
> > - __inc_zone_page_state(page, NR_FILE_DIRTY);
> > - __inc_bdi_stat(mapping->backing_dev_info,
> > - BDI_RECLAIMABLE);
> > - task_io_account_write(PAGE_CACHE_SIZE);
> > - }
> > - radix_tree_tag_set(&mapping->page_tree,
> > - page_index(page), PAGECACHE_TAG_DIRTY);
> > - }
> > - write_unlock_irq(&mapping->tree_lock);
> > - if (mapping->host) {
> > - /* !PageAnon && !swapper_space */
> > - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > + mapping->mtime = CURRENT_TIME;
> > + set_bit(AS_MCTIME, &mapping->flags);
>
> This seems vulnerable to the race we have against truncate, handled by
> the mapping2 magic below. Do we care?
>
> > +
> > + if (TestSetPageDirty(page))
> > + return 0;
> > +
> > + write_lock_irq(&mapping->tree_lock);
> > + mapping2 = page_mapping(page);
> > + if (mapping2) {
> > + /* Race with truncate? */
> > + BUG_ON(mapping2 != mapping);
> > + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> > + if (mapping_cap_account_dirty(mapping)) {
> > + __inc_zone_page_state(page, NR_FILE_DIRTY);
> > + __inc_bdi_stat(mapping->backing_dev_info,
> > + BDI_RECLAIMABLE);
> > + task_io_account_write(PAGE_CACHE_SIZE);
> >   }
> > - return 1;
> > + radix_tree_tag_set(&mapping->page_tree,
> > + page_index(page), PAGECACHE_TAG_DIRTY);
> >   }
> > - return 0;
> > + write_unlock_irq(&mapping->tree_lock);
> > +
> > + if (mapping->host)
> > + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

The inode gets marked dirty using the same "mapping" variable
as my code does. So, AFAIU, my change does not introduce any new
vulnerabilities. I would nevertherless be grateful to you for a scenario
where the race would be triggered.

> > +
> > + return 1;
> >  }
> >  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
> >
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/2] Updating ctime and mtime for memory-mapped files [try #4]

2008-01-15 Thread Anton Salikhmetov
1. Introduction

This is the fourth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes since the previous version:

1) the case of retouching an already-dirty page pointed out
  by Miklos Szeredi has been addressed;

2) the file metadata are updated using the page modification time
  instead of the time of syncing data;

3) a few small corrections according to the latest feedback.

Brief explanation of these changes as well as some design considerations
are given below.

2. The case of retouching an already-dirtied page

Miklos Szeredi gave the following feedback on the previous version:

> I suspect your patch is ignoring writes after the first msync, but
> then why care about msync at all?  What's so special about the _first_
> msync?  Is it just that most test programs only check this, and not
> what happens if msync is called more than once?  That would be a bug
> in the test cases.

This version adds handling of the case of multiple msync() calls. Before
going on with the explanaion, I'll quote a remark by Peter Zijlstra:

> I must agree, doing the mmap dirty, MS_ASYNC, mmap retouch, MS_ASYNC
> case correctly would need a lot more code which I doubt is worth the
> effort.
>
> It would require scanning the PTEs and marking them read-only again on
> MS_ASYNC, and some more logic in set_page_dirty() because that currently
> bails out early if the page in question is already dirty.

Indeed, the following logic of the __set_pages_dirty_nobuffers() function:

if (!TestSetPageDirty(page)) {
   mapping = page_mapping(page);

   if (!mapping)
   return 1;

   /* critical section */

   if (mapping->host) {
   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
   set_bit(AS_MCTIME, &mapping->flags);
   }
   return 1;
}
return 0;

made it difficult to account for the case of the already-dirty page
retouch after the call to msync(MS_ASYNC).

In this version of my solution, I redesigned the logic of the same
function as follows:

mapping = page_mapping(page);

if (!mapping)
   return 1;

set_bit(AS_MCTIME, &mapping->flags);

if (TestSetPageDirty(page))
   return 0;

/* critical section */

if (mapping->host) {
   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

return 1;

This allows us to set the AS_MCTIME bit independently of whether the page
had already been dirtied or not. Besides, such change makes the logic of
the topmost "if" in this function straight thus improving readability.
Finally, we already have the __set_page_dirty() routine with almost
identical functionality. My redesign of __set_pages_dirty_nobuffers()
is based on how the __set_page_dirty() routine is implemented.

Miklos gave an example of a scenario, where the previous version of
my solution would fail:

http://lkml.org/lkml/2008/1/14/100

Here is how it looks in the version I am sending now:

 1 page is dirtied through mapping
   => the AS_MCTIME bit turned on
 2 app calls msync(MS_ASYNC)
   => inode's times updated, the AS_MCTIME bit turned off
 3 page is written again through mapping
   => the AS_MCTIME bit turned on again
 4 app calls msync(MS_ASYNC)
   => inode's times updated, the AS_MCTIME bit turned off
 5 ...
 6 page is written back
   => ... by this moment, the either the msync(MS_ASYNC) has
  taken care of updating the file times, or the AS_MCTIME
  bit is on.

I think that the feedback about writes after the first msync(MS_ASYNC)
has thereby been addressed.

3. Updating the time stamps of the block device special files

As for the block device case, let's start from the following assumption:

if the block device data changes, we should do our best to tell the world
that this has happened.

This is how I approach this requirement:

1) if the block device is active, this is done at next *sync() through
  calling the bd_inode_update_time() helper function.

2) if the block device is not active, this is done during the block
  device file deactivation in the unlink_file_vma() routine.

Removing either of these actions would leave a possibility of losing
information about the block device data update. That is why I am keeping
both.

4. Recording the time was the file data changed

Finally, I noticed yet another issue with the previous version of my patch.
Specifically, the time stamps were set to the current time of the moment
when syncing but not the write reference was being done. This led to the
following adverse effect on my development system:

1) a text file A was updated by process B;
2) process B exits without calling any of the *sync() functions;
3) vi editor opens the file A;
4) file data synced, file times updated;
5) vi is confused by "thinking" that the file was changed after 3).

This version overcomes this problem by introducing another field into the
address_space structure. This field is used to "remember" the time of
dirtying, and then this time value is propagate

[PATCH 1/2] Massive code cleanup of sys_msync()

2008-01-15 Thread Anton Salikhmetov
Substantial code cleanup of the sys_msync() function:

1) using the PAGE_ALIGN() macro instead of "manual" alignment;
2) improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   79 
 1 files changed, 37 insertions(+), 42 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..3270caa 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,25 @@
 /*
  * linux/mm/msync.c
  *
+ * The msync() system call.
  * Copyright (C) 1994-1999  Linus Torvalds
+ *
+ * Massive code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
  * The application may now run fsync() to
@@ -33,71 +34,65 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = 0, unmapped_error = 0;
 
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
-   goto out;
+   return -EINVAL;
if (start & ~PAGE_MASK)
-   goto out;
+   return -EINVAL;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
-   goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   return -EINVAL;
+
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
-   goto out;
-   error = 0;
+   return -ENOMEM;
if (end == start)
-   goto out;
+   return 0;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
-   error = -ENOMEM;
-   if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   if (!vma) {
+   error = -ENOMEM;
+   break;
+   }
if (start < vma->vm_start) {
start = vma->vm_start;
-   if (start >= end)
-   goto out_unlock;
+   if (start >= end) {
+   error = -ENOMEM;
+   break;
+   }
unmapped_error = -ENOMEM;
}
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
-   goto out_unlock;
+   break;
}
-   file = vma->vm_file;
start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+
+   file = vma->vm_file;
+   if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
fput(file);
-   if (error || start >= end)
-   goto out;
+   if (error)
+   return error;
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma->vm_next;
+   continue;
}
-   }
-out_unlock:
+
+   vma = vma->vm_next;
+   } while (start < end);
up_read(&mm->mmap_sem);
-out:
+
return error ? : u

[PATCH 2/2] Updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes for updating the ctime and mtime fields for memory-mapped files:

1) new flag triggering update of the inode data;
2) new field in the address_space structure for saving modification time;
3) new function to update ctime and mtime for block device files;
4) new helper function to update ctime and mtime when needed;
5) updating time stamps for mapped files in sys_msync() and do_fsync();
6) implementing the feature of auto-updating ctime and mtime;
7) account for the case of retouching an already-dirtied page.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 fs/buffer.c |3 ++
 fs/fs-writeback.c   |2 +
 fs/inode.c  |   53 ++---
 fs/sync.c   |2 +
 include/linux/fs.h  |   13 ++-
 include/linux/pagemap.h |3 +-
 mm/mmap.c   |3 ++
 mm/msync.c  |   29 
 mm/page-writeback.c |   54 +-
 9 files changed, 112 insertions(+), 50 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..3967aa7 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -701,6 +701,9 @@ static int __set_page_dirty(struct page *page,
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
+   mapping->mtime = CURRENT_TIME;
+   set_bit(AS_MCTIME, &mapping->flags);
+
if (TestSetPageDirty(page))
return 0;
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 300324b..affd291 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
writeback_control *wbc)
 
spin_unlock(&inode_lock);
 
+   mapping_update_time(mapping);
+
ret = do_writepages(mapping, wbc);
 
/* Don't write the inode if only I_DIRTY_PAGES was set */
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..5997046 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,8 +1243,9 @@ void touch_atime(struct vfsmount *mnt, struct dentry 
*dentry)
 EXPORT_SYMBOL(touch_atime);
 
 /**
- * file_update_time-   update mtime and ctime time
- * @file: file accessed
+ * inode_update_time   -   update mtime and ctime time
+ * @inode: inode accessed
+ * @ts: time when inode was accessed
  *
  * Update the mtime and ctime members of an inode and mark the inode
  * for writeback.  Note that this function is meant exclusively for
@@ -1253,11 +1254,8 @@ EXPORT_SYMBOL(touch_atime);
  * S_NOCTIME inode flag, e.g. for network filesystem where these
  * timestamps are handled by the server.
  */
-
-void file_update_time(struct file *file)
+void inode_update_time(struct inode *inode, struct timespec *ts)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
-   struct timespec now;
int sync_it = 0;
 
if (IS_NOCMTIME(inode))
@@ -1265,22 +1263,52 @@ void file_update_time(struct file *file)
if (IS_RDONLY(inode))
return;
 
-   now = current_fs_time(inode->i_sb);
-   if (!timespec_equal(&inode->i_mtime, &now)) {
-   inode->i_mtime = now;
+   if (timespec_compare(&inode->i_mtime, ts) < 0) {
+   inode->i_mtime = *ts;
sync_it = 1;
}
 
-   if (!timespec_equal(&inode->i_ctime, &now)) {
-   inode->i_ctime = now;
+   if (timespec_compare(&inode->i_ctime, ts) < 0) {
+   inode->i_ctime = *ts;
sync_it = 1;
}
 
if (sync_it)
mark_inode_dirty_sync(inode);
 }
+EXPORT_SYMBOL(inode_update_time);
+
+/*
+ * Update the ctime and mtime stamps for memory-mapped block device files.
+ */
+static void bd_inode_update_time(struct inode *inode, struct timespec *ts)
+{
+   struct block_device *bdev = inode->i_bdev;
+   struct list_head *p;
+
+   if (bdev == NULL)
+   return;
+
+   mutex_lock(&bdev->bd_mutex);
+   list_for_each(p, &bdev->bd_inodes) {
+   inode = list_entry(p, struct inode, i_devices);
+   inode_update_time(inode, ts);
+   }
+   mutex_unlock(&bdev->bd_mutex);
+}
 
-EXPORT_SYMBOL(file_update_time);
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapping_update_time(struct address_space *mapping)
+{
+   if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
+   if (S_ISBLK(mapping->host->i_mode))
+   bd_inode_update_time(mapping->host, &mapping->mtime);
+   else
+   inode_update_time(mapping->host, &mapping->mtime);
+   }
+}
 
 int inode_needs_sync(struct inode *inode)
 {
@@ -1290,7 +1318,6 @@ int inode_needs_sync(struct inode *inode)

Re: [PATCH 2/2] updating ctime and mtime at syncing

2008-01-15 Thread Anton Salikhmetov
2008/1/15, Miklos Szeredi <[EMAIL PROTECTED]>:
> > Thanks for your review, Peter and Miklos!
> >
> > I overlooked this case when AS_MCTIME flag has been turned off and the
> > page is still dirty.
> >
> > On the other hand, the words "shall be marked for update" may be
> > considered as just setting the AS_MCTIME flag, not updating the time
> > stamps.
> >
> > What do you think about calling mapping_update_time() inside of "if
> > (MS_SYNC & flags)"? I suggest such change because the code for
> > analysis of the case you've mentioned above seems impossible to me.
>
> I think that's a good idea.  As a first iteration, just updating the
> mtime/ctime in msync(MS_SYNC) and remove_vma() (called at munmap time)
> would be a big improvement over what we currently have.
>
> I would also recommend, that you drop mapping_update_time() and the
> related functions from the patch, and just use file_update_time()
> instead.

Thank you for your recommendations. I will submit my new solution shortly.

By the way, I've already changed the unlink_file_vma() function.

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] updating ctime and mtime at syncing

2008-01-14 Thread Anton Salikhmetov
2008/1/14, Miklos Szeredi <[EMAIL PROTECTED]>:
> > 2008/1/14, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> > > >
> > > > Changes for updating the ctime and mtime fields for memory-mapped files:
> > > >
> > > > 1) new flag triggering update of the inode data;
> > > > 2) new function to update ctime and mtime for block device files;
> > > > 3) new helper function to update ctime and mtime when needed;
> > > > 4) updating time stamps for mapped files in sys_msync() and do_fsync();
> > > > 5) implementing the feature of auto-updating ctime and mtime.
> > >
> > > How exactly is this done?
> > >
> > > Is this catering for this case:
> > >
> > >  1 page is dirtied through mapping
> > >  2 app calls msync(MS_ASYNC)
> > >  3 page is written again through mapping
> > >  4 app calls msync(MS_ASYNC)
> > >  5 ...
> > >  6 page is written back
> > >
> > > What happens at 4?  Do we care about this one at all?
> >
> > The POSIX standard requires updating the file times every time when msync()
> > is called with MS_ASYNC. I.e. the time stamps should be updated even
> > when no physical synchronization is being done immediately.
>
> Yes.  However, on linux MS_ASYNC is basically a no-op, and without
> doing _something_ with the dirty pages (which afaics your patch
> doesn't do), it's impossible to observe later writes to the same page.
>
> I don't advocate full POSIX conformance anymore, because it's probably
> too expensive to do (I've tried).  Rather than that, we should
> probably find some sane compromise, that just fixes the real life
> issue.  Here's a pointer to the thread about this:
>
> http://lkml.org/lkml/2007/3/27/55
>
> Your patch may be a good soultion, but you should describe in detail
> what it does when pages are dirtied, and when msync/fsync are called,
> and what happens with multiple msync calls that I've asked about.
>
> I suspect your patch is ignoring writes after the first msync, but
> then why care about msync at all?  What's so special about the _first_
> msync?  Is it just that most test programs only check this, and not
> what happens if msync is called more than once?  That would be a bug
> in the test cases.
>
> > > > +/*
> > > > + * Update the ctime and mtime stamps for memory-mapped block device 
> > > > files.
> > > > + */
> > > > +static void bd_inode_update_time(struct inode *inode)
> > > > +{
> > > > + struct block_device *bdev = inode->i_bdev;
> > > > + struct list_head *p;
> > > > +
> > > > + if (bdev == NULL)
> > > > + return;
> > > > +
> > > > + mutex_lock(&bdev->bd_mutex);
> > > > + list_for_each(p, &bdev->bd_inodes) {
> > > > + inode = list_entry(p, struct inode, i_devices);
> > > > + inode_update_time(inode);
> > > > + }
> > > > + mutex_unlock(&bdev->bd_mutex);
> > > > +}
> > >
> > > Umm, why not just call with file->f_dentry->d_inode, so that you don't
> > > need to do this ugly search for the physical inode?  The file pointer
> > > is available in both msync and fsync.
> >
> > I'm not sure if I undestood your question. I see two possible
> > interpretations for this question, and I'm answering both.
> >
> > The intention was to make the data changes in the block device data
> > visible to all device files associated with the block device. Hence
> > the search, because the time stamps for all such device files should
> > be updated as well.
>
> Ahh, but it will only update "active" devices, which are currently
> open, no?  Is that what we want?
>
> > Not only the sys_msync() and do_fsync() routines call the helper
> > function mapping_update_time().
>
> Ah yes, __sync_single_inode() calls it as well.  Why?

The __sync_single_inode() function calls mapping_update_time()
to enable the "auto-updating" feature discussed earlier.

>
> Miklos
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] updating ctime and mtime at syncing

2008-01-14 Thread Anton Salikhmetov
2008/1/14, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Mon, 2008-01-14 at 14:14 +0100, Miklos Szeredi wrote:
> > > 2008/1/14, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > > > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> > > > >
> > > > > Changes for updating the ctime and mtime fields for memory-mapped 
> > > > > files:
> > > > >
> > > > > 1) new flag triggering update of the inode data;
> > > > > 2) new function to update ctime and mtime for block device files;
> > > > > 3) new helper function to update ctime and mtime when needed;
> > > > > 4) updating time stamps for mapped files in sys_msync() and 
> > > > > do_fsync();
> > > > > 5) implementing the feature of auto-updating ctime and mtime.
> > > >
> > > > How exactly is this done?
> > > >
> > > > Is this catering for this case:
> > > >
> > > >  1 page is dirtied through mapping
> > > >  2 app calls msync(MS_ASYNC)
> > > >  3 page is written again through mapping
> > > >  4 app calls msync(MS_ASYNC)
> > > >  5 ...
> > > >  6 page is written back
> > > >
> > > > What happens at 4?  Do we care about this one at all?
> > >
> > > The POSIX standard requires updating the file times every time when 
> > > msync()
> > > is called with MS_ASYNC. I.e. the time stamps should be updated even
> > > when no physical synchronization is being done immediately.
> >
> > Yes.  However, on linux MS_ASYNC is basically a no-op, and without
> > doing _something_ with the dirty pages (which afaics your patch
> > doesn't do), it's impossible to observe later writes to the same page.
> >
> > I don't advocate full POSIX conformance anymore, because it's probably
> > too expensive to do (I've tried).  Rather than that, we should
> > probably find some sane compromise, that just fixes the real life
> > issue.  Here's a pointer to the thread about this:
> >
> > http://lkml.org/lkml/2007/3/27/55
> >
> > Your patch may be a good soultion, but you should describe in detail
> > what it does when pages are dirtied, and when msync/fsync are called,
> > and what happens with multiple msync calls that I've asked about.
> >
> > I suspect your patch is ignoring writes after the first msync, but
> > then why care about msync at all?  What's so special about the _first_
> > msync?  Is it just that most test programs only check this, and not
> > what happens if msync is called more than once?  That would be a bug
> > in the test cases.
>
> I must agree, doing the mmap dirty, MS_ASYNC, mmap retouch, MS_ASYNC
> case correctly would need a lot more code which I doubt is worth the
> effort.
>
> It would require scanning the PTEs and marking them read-only again on
> MS_ASYNC, and some more logic in set_page_dirty() because that currently
> bails out early if the page in question is already dirty.

Thanks for your review, Peter and Miklos!

I overlooked this case when AS_MCTIME flag has been turned off and the
page is still dirty.

On the other hand, the words "shall be marked for update" may be
considered as just setting the AS_MCTIME flag, not updating the time
stamps.

What do you think about calling mapping_update_time() inside of "if
(MS_SYNC & flags)"? I suggest such change because the code for
analysis of the case you've mentioned above seems impossible to me.

>
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] updating ctime and mtime at syncing

2008-01-14 Thread Anton Salikhmetov
2008/1/14, Miklos Szeredi <[EMAIL PROTECTED]>:
> > > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> > >
> > > Changes for updating the ctime and mtime fields for memory-mapped files:
> > >
> > > 1) new flag triggering update of the inode data;
> > > 2) new function to update ctime and mtime for block device files;
> > > 3) new helper function to update ctime and mtime when needed;
> > > 4) updating time stamps for mapped files in sys_msync() and do_fsync();
> > > 5) implementing the feature of auto-updating ctime and mtime.
> >
> > How exactly is this done?
> >
> > Is this catering for this case:
> >
> >  1 page is dirtied through mapping
> >  2 app calls msync(MS_ASYNC)
> >  3 page is written again through mapping
> >  4 app calls msync(MS_ASYNC)
> >  5 ...
> >  6 page is written back
> >
> > What happens at 4?  Do we care about this one at all?
>
> Oh, and here's a test program I wrote, that can be used to check this
> behavior.   It has two options:
>
>  -s   use MS_SYNC instead of MS_ASYNC
>  -f   fork and do the msync on a different mapping
>
> Back then I haven't found a single OS, that fully conformed to all the
> stupid POSIX rules regarding mmaps and ctime/mtime.

Thank you very much for sharing your code.

I'll integrate the MS_ASYNC and fork() test cases into my own unit test.

>
> Miklos
> 
>
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
>
> static const char *filename;
> static int msync_flag = MS_ASYNC;
> static int msync_fork = 0;
>
> static void print_times(const char *msg)
> {
> struct stat stbuf;
> stat(filename, &stbuf);
> printf("%s\t%li\t%li\t%li\n", msg, stbuf.st_ctime, stbuf.st_mtime,
>stbuf.st_atime);
> }
>
> static void do_msync(void *addr, int len)
> {
> int res;
> if (!msync_fork) {
> res = msync(addr, len, msync_flag);
> if (res == -1) {
> perror("msync");
> exit(1);
> }
> } else {
> int pid = fork();
> if (pid == -1) {
> perror("fork");
> exit(1);
> }
> if (!pid) {
> int fd = open(filename, O_RDWR);
> if (fd == -1) {
> perror("open");
> exit(1);
> }
> addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 
> 0);
> if (addr == MAP_FAILED) {
> perror("mmap");
> exit(1);
> }
> res = msync(addr, len, msync_flag);
> if (res == -1) {
> perror("msync");
> exit(1);
> }
> exit(0);
> }
> wait(NULL);
> }
> }
>
> static void usage(const char *progname)
> {
> fprintf(stderr, "usage: %s filename [-sf]\n", progname);
> exit(1);
> }
>
> int main(int argc, char *argv[])
> {
> int res;
> char *addr;
> int fd;
>
> if (argc < 2)
> usage(argv[0]);
>
> filename = argv[1];
> if (argc > 2) {
> if (argc > 3)
> usage(argv[0]);
> if (strcmp(argv[2], "-s") == 0)
> msync_flag = MS_SYNC;
> else if (strcmp(argv[2], "-f") == 0)
> msync_fork = 1;
> else if (strcmp(argv[2], "-sf") == 0 || strcmp(argv[2], "-fs") == 0) {
> msync_flag = MS_SYNC;
> msync_fork = 1;
> } else
> usage(argv[0]);
> }
>
> fd = open(filename, O_RDWR | O_TRUNC | O_CREAT, 0666);
> if (fd == -1) {
> perror(filename);
> return 1;
> }
> print_times("begin");
> sleep(1);
> write(fd, "\n", 4);
> print_times("write");
> sleep(1);
> addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> if (addr == MAP_FAILED) {
> perror("mmap");
> return 1;
> }
> print_times("mmap");
> sleep(1);
>
> addr[1] = 'b';
> print_times("b");
> sleep(1);
> do_msync(addr, 4);
> print_times("msync b");
> sleep(1);
>
> addr[2] = 'c';
> print_times("c");
> sleep(1);
> do_msync(addr, 4);
> print_times("msync c");
> sleep(1);
>
> addr[3] = 'd';
> print_times("d");
> sleep(1);
> res = munmap(addr, 4);
> if (res == -1) {
> perror("munmap");
> return 1;
> }
> print_times("munmap");
> sleep(1);
>
> res = close(fd);
> if (res == -1) {
> perror("close");
> return 1;
> }
> print_times("close");
> sleep(1);
> sync();
> print_times("sync");
>
> return 0;
> }
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] updating ctime and mtime at syncing

2008-01-14 Thread Anton Salikhmetov
2008/1/14, Miklos Szeredi <[EMAIL PROTECTED]>:
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > Changes for updating the ctime and mtime fields for memory-mapped files:
> >
> > 1) new flag triggering update of the inode data;
> > 2) new function to update ctime and mtime for block device files;
> > 3) new helper function to update ctime and mtime when needed;
> > 4) updating time stamps for mapped files in sys_msync() and do_fsync();
> > 5) implementing the feature of auto-updating ctime and mtime.
>
> How exactly is this done?
>
> Is this catering for this case:
>
>  1 page is dirtied through mapping
>  2 app calls msync(MS_ASYNC)
>  3 page is written again through mapping
>  4 app calls msync(MS_ASYNC)
>  5 ...
>  6 page is written back
>
> What happens at 4?  Do we care about this one at all?

The POSIX standard requires updating the file times every time when msync()
is called with MS_ASYNC. I.e. the time stamps should be updated even
when no physical synchronization is being done immediately.

At least, this is how I undestand the standard. Please tell me if I am wrong.

>
> More comments inline.
>
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  fs/buffer.c |1 +
> >  fs/fs-writeback.c   |2 ++
> >  fs/inode.c  |   42 +++---
> >  fs/sync.c   |2 ++
> >  include/linux/fs.h  |9 -
> >  include/linux/pagemap.h |3 ++-
> >  mm/msync.c  |   24 
> >  mm/page-writeback.c |1 +
> >  8 files changed, 67 insertions(+), 17 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 7249e01..09adf7e 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -719,6 +719,7 @@ static int __set_page_dirty(struct page *page,
> >   }
> >   write_unlock_irq(&mapping->tree_lock);
> >   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > + set_bit(AS_MCTIME, &mapping->flags);
> >
> >   return 1;
> >  }
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 0fca820..c25ebd5 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
> > writeback_control *wbc)
> >
> >   spin_unlock(&inode_lock);
> >
> > + mapping_update_time(mapping);
> > +
> >   ret = do_writepages(mapping, wbc);
> >
> >   /* Don't write the inode if only I_DIRTY_PAGES was set */
> > diff --git a/fs/inode.c b/fs/inode.c
> > index ed35383..c02bfab 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1243,8 +1243,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry 
> > *dentry)
> >  EXPORT_SYMBOL(touch_atime);
> >
> >  /**
> > - *   file_update_time-   update mtime and ctime time
> > - *   @file: file accessed
> > + *   inode_update_time   -   update mtime and ctime time
> > + *   @inode: inode accessed
> >   *
> >   *   Update the mtime and ctime members of an inode and mark the inode
> >   *   for writeback.  Note that this function is meant exclusively for
> > @@ -1253,10 +1253,8 @@ EXPORT_SYMBOL(touch_atime);
> >   *   S_NOCTIME inode flag, e.g. for network filesystem where these
> >   *   timestamps are handled by the server.
> >   */
> > -
> > -void file_update_time(struct file *file)
> > +void inode_update_time(struct inode *inode)
> >  {
> > - struct inode *inode = file->f_path.dentry->d_inode;
> >   struct timespec now;
> >   int sync_it = 0;
> >
> > @@ -1279,8 +1277,39 @@ void file_update_time(struct file *file)
> >   if (sync_it)
> >   mark_inode_dirty_sync(inode);
> >  }
> > +EXPORT_SYMBOL(inode_update_time);
> > +
> > +/*
> > + * Update the ctime and mtime stamps for memory-mapped block device files.
> > + */
> > +static void bd_inode_update_time(struct inode *inode)
> > +{
> > + struct block_device *bdev = inode->i_bdev;
> > + struct list_head *p;
> > +
> > + if (bdev == NULL)
> > + return;
> > +
> > + mutex_lock(&bdev->bd_mutex);
> > + list_for_each(p, &bdev->bd_inodes) {
> > + inode = list_entry(p, struct inode, i_devices);
> > + inode_update_time(inode);
> > + }
> > + mutex_unlock(&bdev->bd_mutex);
> > +}
>
> 

Re: [PATCH 1/2] massive code cleanup of sys_msync()

2008-01-14 Thread Anton Salikhmetov
2008/1/14, Miklos Szeredi <[EMAIL PROTECTED]>:
> > Substantial code cleanup of the sys_msync() function:
> >
> > 1) using the PAGE_ALIGN() macro instead of "manual" alignment;
> > 2) improved readability of the loop traversing the process memory regions.
>
> Thanks for doing this.  See comments below.
>
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> > ---
> >  mm/msync.c |   78 
> > +++-
> >  1 files changed, 35 insertions(+), 43 deletions(-)
> >
> > diff --git a/mm/msync.c b/mm/msync.c
> > index 144a757..ff654c9 100644
> > --- a/mm/msync.c
> > +++ b/mm/msync.c
> > @@ -1,24 +1,25 @@
> >  /*
> >   *   linux/mm/msync.c
> >   *
> > + * The msync() system call.
> >   * Copyright (C) 1994-1999  Linus Torvalds
> > + *
> > + * Substantial code cleanup.
> > + * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
> >   */
> >
> > -/*
> > - * The msync() system call.
> > - */
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> > -#include 
> >  #include 
> > +#include 
> >
> >  /*
> >   * MS_SYNC syncs the entire file - including mappings.
> >   *
> >   * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> > - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> > + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
> >   * Now it doesn't do anything, since dirty pages are properly tracked.
> >   *
> >   * The application may now run fsync() to
> > @@ -33,71 +34,62 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> > len, int flags)
> >   unsigned long end;
> >   struct mm_struct *mm = current->mm;
> >   struct vm_area_struct *vma;
> > - int unmapped_error = 0;
> > - int error = -EINVAL;
> > + int error = 0, unmapped_error = 0;
> >
> >   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
> > - goto out;
> > + return -EINVAL;
> >   if (start & ~PAGE_MASK)
> > - goto out;
> > + return -EINVAL;
> >   if ((flags & MS_ASYNC) && (flags & MS_SYNC))
> > - goto out;
> > - error = -ENOMEM;
> > - len = (len + ~PAGE_MASK) & PAGE_MASK;
> > + return -EINVAL;
> > +
> > + len = PAGE_ALIGN(len);
> >   end = start + len;
> >   if (end < start)
> > - goto out;
> > - error = 0;
> > + return -ENOMEM;
> >   if (end == start)
> > - goto out;
> > + return 0;
> > +
> >   /*
> >* If the interval [start,end) covers some unmapped address ranges,
> >* just ignore them, but return -ENOMEM at the end.
> >*/
> >   down_read(&mm->mmap_sem);
> >   vma = find_vma(mm, start);
> > - for (;;) {
> > + do {
> >   struct file *file;
> >
> > - /* Still start < end. */
> > - error = -ENOMEM;
> > - if (!vma)
> > - goto out_unlock;
> > - /* Here start < vma->vm_end. */
> > + if (!vma) {
> > + error = -ENOMEM;
> > + break;
> > + }
> >   if (start < vma->vm_start) {
> >   start = vma->vm_start;
> > - if (start >= end)
> > - goto out_unlock;
> > + if (start >= end) {
> > + error = -ENOMEM;
> > + break;
> > + }
> >   unmapped_error = -ENOMEM;
> >   }
> > - /* Here vma->vm_start <= start < vma->vm_end. */
> > - if ((flags & MS_INVALIDATE) &&
> > - (vma->vm_flags & VM_LOCKED)) {
> > + if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
> >   error = -EBUSY;
> > - goto out_unlock;
> > + break;
> >   }
> >   file = vma->vm_file;
> > - start = vma->vm_end;
> > - if ((flags & MS_SYNC) && file &&
>

[PATCH 2/2] updating ctime and mtime at syncing

2008-01-12 Thread Anton Salikhmetov
http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes for updating the ctime and mtime fields for memory-mapped files:

1) new flag triggering update of the inode data;
2) new function to update ctime and mtime for block device files;
3) new helper function to update ctime and mtime when needed;
4) updating time stamps for mapped files in sys_msync() and do_fsync();
5) implementing the feature of auto-updating ctime and mtime.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 fs/buffer.c |1 +
 fs/fs-writeback.c   |2 ++
 fs/inode.c  |   42 +++---
 fs/sync.c   |2 ++
 include/linux/fs.h  |9 -
 include/linux/pagemap.h |3 ++-
 mm/msync.c  |   24 
 mm/page-writeback.c |1 +
 8 files changed, 67 insertions(+), 17 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..09adf7e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -719,6 +719,7 @@ static int __set_page_dirty(struct page *page,
}
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+   set_bit(AS_MCTIME, &mapping->flags);
 
return 1;
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0fca820..c25ebd5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct 
writeback_control *wbc)
 
spin_unlock(&inode_lock);
 
+   mapping_update_time(mapping);
+
ret = do_writepages(mapping, wbc);
 
/* Don't write the inode if only I_DIRTY_PAGES was set */
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..c02bfab 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,8 +1243,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry 
*dentry)
 EXPORT_SYMBOL(touch_atime);
 
 /**
- * file_update_time-   update mtime and ctime time
- * @file: file accessed
+ * inode_update_time   -   update mtime and ctime time
+ * @inode: inode accessed
  *
  * Update the mtime and ctime members of an inode and mark the inode
  * for writeback.  Note that this function is meant exclusively for
@@ -1253,10 +1253,8 @@ EXPORT_SYMBOL(touch_atime);
  * S_NOCTIME inode flag, e.g. for network filesystem where these
  * timestamps are handled by the server.
  */
-
-void file_update_time(struct file *file)
+void inode_update_time(struct inode *inode)
 {
-   struct inode *inode = file->f_path.dentry->d_inode;
struct timespec now;
int sync_it = 0;
 
@@ -1279,8 +1277,39 @@ void file_update_time(struct file *file)
if (sync_it)
mark_inode_dirty_sync(inode);
 }
+EXPORT_SYMBOL(inode_update_time);
+
+/*
+ * Update the ctime and mtime stamps for memory-mapped block device files.
+ */
+static void bd_inode_update_time(struct inode *inode)
+{
+   struct block_device *bdev = inode->i_bdev;
+   struct list_head *p;
+
+   if (bdev == NULL)
+   return;
+
+   mutex_lock(&bdev->bd_mutex);
+   list_for_each(p, &bdev->bd_inodes) {
+   inode = list_entry(p, struct inode, i_devices);
+   inode_update_time(inode);
+   }
+   mutex_unlock(&bdev->bd_mutex);
+}
 
-EXPORT_SYMBOL(file_update_time);
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapping_update_time(struct address_space *mapping)
+{
+   if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
+   if (S_ISBLK(mapping->host->i_mode))
+   bd_inode_update_time(mapping->host);
+   else
+   inode_update_time(mapping->host);
+   }
+}
 
 int inode_needs_sync(struct inode *inode)
 {
@@ -1290,7 +1319,6 @@ int inode_needs_sync(struct inode *inode)
return 1;
return 0;
 }
-
 EXPORT_SYMBOL(inode_needs_sync);
 
 int inode_wait(void *word)
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..5561464 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
goto out;
}
 
+   mapping_update_time(mapping);
+
ret = filemap_fdatawrite(mapping);
 
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..1dccd4b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1977,7 +1977,14 @@ extern int buffer_migrate_page(struct address_space *,
 extern int inode_change_ok(struct inode *, struct iattr *);
 extern int __must_check inode_setattr(struct inode *, struct iattr *);
 
-extern void file_update_time(struct file *file);
+extern void inode_update_time(struct inode *);
+
+static inline void file_update_time(struct file *file)
+{
+   inode_update_time(file->f_path.dentry->d_inode);
+}
+
+extern void mapping_update_time(struct address_space *);
 
 static i

[PATCH 0/2] yet another attempt to fix the ctime and mtime issue

2008-01-12 Thread Anton Salikhmetov
The POSIX standard requires that the ctime and mtime fields
for memory-mapped files should be updated after a write
reference to the memory region where the file data is mapped.
At least FreeBSD 6.2 and HP-UX 11i implement this properly.
Linux does not, which leads to data loss problems in database
backup applications.

Kernel Bug Tracker contains more information about the problem:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

There have been several attempts in the past to address this
issue. Following are a few links to LKML discussions related
to this bug:

http://lkml.org/lkml/2006/5/17/138
http://lkml.org/lkml/2007/2/21/242
http://lkml.org/lkml/2008/1/7/234

All earlier solutions were criticized. Some solutions did not
handle memory-mapped block devices properly. Some led to forcing
applications to explicitly call msync() to update file metadata.
Some contained errors in using kernel synchronization primitives.

In the two patches that follow, I would like to propose a new
solution.

This is the third version of my changes. This version takes
into account all feedback I received for the two previous versions.
The overall design remains basically the same as the one that
was acked by Rick van Riel:

http://lkml.org/lkml/2008/1/11/208

To the best of my knowledge, these patches are free of all the
drawbacks found during previous attempts by Peter Staubach,
Miklos Szeredi and myself.

New since the previous version:

1) no need to explicitly call msync() to update file times;
2) changing block device data is visible to all device files
   associated with the block device;
3) in the cleanup part, the error checks are separated out as
   suggested by Rik van Riel;
4) some small refinements accodring to the LKML comments.

This is how I tested the patches.

1. To test the features mentioned above, I wrote a unit test
   available from

   http://bugzilla.kernel.org/attachment.cgi?id=14430

   I verified that the unit test passed successfully for both
   regular files and block device files. For the unit test I
   used the following architectures: 32-bit x86, x86_64 and
   MIPS32 (cross-compiled from x86_64).

2. I did build tests with allmodconfig and allyesconfig on x86_64.

3. I ran the following test cases from the LTP test suite:

   msync01
   msync02
   msync03
   msync04
   msync05
   mmapstress01
   mmapstress09
   mmapstress10

   No regressions were found by these test cases.

I think that the bug #2645 is resolved by these patches.

Please apply.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] massive code cleanup of sys_msync()

2008-01-12 Thread Anton Salikhmetov
Substantial code cleanup of the sys_msync() function:

1) using the PAGE_ALIGN() macro instead of "manual" alignment;
2) improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
 mm/msync.c |   78 +++-
 1 files changed, 35 insertions(+), 43 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..ff654c9 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,25 @@
 /*
  * linux/mm/msync.c
  *
+ * The msync() system call.
  * Copyright (C) 1994-1999  Linus Torvalds
+ *
+ * Substantial code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
  * The application may now run fsync() to
@@ -33,71 +34,62 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = 0, unmapped_error = 0;
 
if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
-   goto out;
+   return -EINVAL;
if (start & ~PAGE_MASK)
-   goto out;
+   return -EINVAL;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
-   goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   return -EINVAL;
+
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
-   goto out;
-   error = 0;
+   return -ENOMEM;
if (end == start)
-   goto out;
+   return 0;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
-   error = -ENOMEM;
-   if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   if (!vma) {
+   error = -ENOMEM;
+   break;
+   }
if (start < vma->vm_start) {
start = vma->vm_start;
-   if (start >= end)
-   goto out_unlock;
+   if (start >= end) {
+   error = -ENOMEM;
+   break;
+   }
unmapped_error = -ENOMEM;
}
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
-   goto out_unlock;
+   break;
}
file = vma->vm_file;
-   start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+   if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
fput(file);
-   if (error || start >= end)
-   goto out;
+   if (error)
+   return error;
down_read(&mm->mmap_sem);
-   vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma->vm_next;
}
-   }
-out_unlock:
+
+   start = vma->vm_end;
+   vma = vma->vm_next;
+   } while (start < end);
up_read(&mm->mmap_sem);
-out:
+
return error ? : unmapped_error;
 }
-- 
1.4.4.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-12 Thread Anton Salikhmetov
2008/1/12, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Sat, 2008-01-12 at 15:38 +0300, Anton Salikhmetov wrote:
> > 2008/1/12, Peter Zijlstra <[EMAIL PROTECTED]>:
> > >
> > > On Sat, 2008-01-12 at 10:36 +0100, Peter Zijlstra wrote:
> > > > On Fri, 2008-01-11 at 03:44 +0300, Anton Salikhmetov wrote:
> > > >
> > > > > +/*
> > > > > + * Update the ctime and mtime stamps after checking if they are to 
> > > > > be updated.
> > > > > + */
> > > > > +void mapped_file_update_time(struct file *file)
> > > > > +{
> > > > > +   if (test_and_clear_bit(AS_MCTIME, &file->f_mapping->flags)) {
> > > > > +   get_file(file);
> > > > > +   file_update_time(file);
> > > > > +   fput(file);
> > > > > +   }
> > > > > +}
> > > > > +
> > > >
> > > > I don't think you need the get/put file stuff here, because
> > >
> > > BTW, the reason for me noticing this is that if it would be needed there
> > > is a race condition right there, who is to say that the file pointer
> > > you're deref'ing in your test condition isn't a dead one already.
> >
> > So, in your opinion, is it at all needed here to play with the file 
> > reference
> > counter? May I drop the get_file() and fput() calls from the
> > sys_msync() function?
>
> No, the ones in sys_msync() around calling do_fsync() are most
> definately needed because we release mmap_sem there.
>
> What I'm saying is that you can remove the get_file()/fput() calls from
> your new mapped_file_update_time() function.

OK, thank you very much for your answer. I'm planning to submit my
next solution which is going to take your suggestion into account.

But I'm not sure how to process memory-mapped block device files.
Staubach's approach did not work for me after adapting it to my
solution.

>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-12 Thread Anton Salikhmetov
2008/1/12, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Sat, 2008-01-12 at 10:36 +0100, Peter Zijlstra wrote:
> > On Fri, 2008-01-11 at 03:44 +0300, Anton Salikhmetov wrote:
> >
> > > +/*
> > > + * Update the ctime and mtime stamps after checking if they are to be 
> > > updated.
> > > + */
> > > +void mapped_file_update_time(struct file *file)
> > > +{
> > > +   if (test_and_clear_bit(AS_MCTIME, &file->f_mapping->flags)) {
> > > +   get_file(file);
> > > +   file_update_time(file);
> > > +   fput(file);
> > > +   }
> > > +}
> > > +
> >
> > I don't think you need the get/put file stuff here, because
>
> BTW, the reason for me noticing this is that if it would be needed there
> is a race condition right there, who is to say that the file pointer
> you're deref'ing in your test condition isn't a dead one already.

So, in your opinion, is it at all needed here to play with the file reference
counter? May I drop the get_file() and fput() calls from the
sys_msync() function?

>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-12 Thread Anton Salikhmetov
2008/1/12, Peter Zijlstra <[EMAIL PROTECTED]>:
>
> On Fri, 2008-01-11 at 03:44 +0300, Anton Salikhmetov wrote:
>
> > +/*
> > + * Update the ctime and mtime stamps after checking if they are to be 
> > updated.
> > + */
> > +void mapped_file_update_time(struct file *file)
> > +{
> > + if (test_and_clear_bit(AS_MCTIME, &file->f_mapping->flags)) {
> > + get_file(file);
> > + file_update_time(file);
> > + fput(file);
> > + }
> > +}
> > +
>
> I don't think you need the get/put file stuff here, because
>
> > @@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
> >   goto out;
> >   }
> >
> > + mapped_file_update_time(file);
> > +
> >   ret = filemap_fdatawrite(mapping);
> >
> >   /*
>
> at this call-site we already hold an extra reference on the file, and
>
> > @@ -74,14 +79,17 @@ asmlinkage long sys_msync(unsigned long start, size_t 
> > len, int flags)
> >   break;
> >   }
> >   file = vma->vm_file;
> > - if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) 
> > {
> > - get_file(file);
> > - up_read(&mm->mmap_sem);
> > - error = do_fsync(file, 0);
> > - fput(file);
> > - if (error)
> > - return error;
> > - down_read(&mm->mmap_sem);
> > + if (file && (vma->vm_flags & VM_SHARED)) {
> > + mapped_file_update_time(file);
> > + if (flags & MS_SYNC) {
> > + get_file(file);
> > + up_read(&mm->mmap_sem);
> > + error = do_fsync(file, 0);
> > + fput(file);
> > + if (error)
> > + return error;
> > + down_read(&mm->mmap_sem);
> > + }
> >   }
> >
> >   start = vma->vm_end;
>
> here we hold the mmap_sem so the vma reference on the file can't go
> away.

These get_file() and fput() calls were in the sys_msync() function form before
all my changes. I did not add them here.

>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-11 Thread Anton Salikhmetov
2008/1/11, Peter Staubach <[EMAIL PROTECTED]>:
> Anton Salikhmetov wrote:
> > From: Anton Salikhmetov <[EMAIL PROTECTED]>
> >
> > The patch contains changes for updating the ctime and mtime fields for 
> > memory mapped files:
> >
> > 1) adding a new flag triggering update of the inode data;
> > 2) implementing a helper function for checking that flag and updating ctime 
> > and mtime;
> > 3) updating time stamps for mapped files in sys_msync() and do_fsync().
>
> Sorry, one other issue to throw out too -- an mmap'd block device
> should also have its inode time fields updated.  This is a little
> tricky because the inode referenced via mapping->host isn't the
> one that needs to have the time fields updated on.
>
> I have attached the patch that I submitted last.  It is quite out
> of date, but does show my attempt to resolve some of these issues.

It looks very strange to me that your patch received no reaction whatsoever:

http://lkml.org/lkml/2007/2/20/255

I've just tried to adapt your patch to my solution. I'm afraid ctime
and mtime stamps
for memory-mapped block device files are still not updated with your
code integrated
into what I'm doing. I set up the loopback device /dev/loop0, then ran
my test program:

http://bugzilla.kernel.org/attachment.cgi?id=14398

The unit test shows that ctime and mtime are not updated.
However, regular files are updated properly.

Also I have a couple of questions about your patch. Please see below.

>
> Thanx...
>
>ps
>
> --- linux-2.6.20.i686/fs/buffer.c.org
> +++ linux-2.6.20.i686/fs/buffer.c
> @@ -710,6 +710,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty_buffers(struct page *page)
>  {
> struct address_space * const mapping = page_mapping(page);
> +   int ret = 0;
>
> if (unlikely(!mapping))
> return !TestSetPageDirty(page);
> @@ -727,7 +728,7 @@ int __set_page_dirty_buffers(struct page
> spin_unlock(&mapping->private_lock);
>
> if (TestSetPageDirty(page))
> -   return 0;
> +   goto out;
>
> write_lock_irq(&mapping->tree_lock);
> if (page->mapping) {/* Race with truncate? */
> @@ -740,7 +741,11 @@ int __set_page_dirty_buffers(struct page
> }
> write_unlock_irq(&mapping->tree_lock);
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -   return 1;
> +   ret = 1;
> +out:
> +   if (page_mapped(page))
> +   set_bit(AS_MCTIME, &mapping->flags);
> +   return ret;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_buffers);
>
> --- linux-2.6.20.i686/fs/fs-writeback.c.org
> +++ linux-2.6.20.i686/fs/fs-writeback.c
> @@ -167,6 +167,13 @@ __sync_single_inode(struct inode *inode,
>
> spin_unlock(&inode_lock);
>
> +   if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
> +   if (S_ISBLK(inode->i_mode))
> +   bd_inode_update_time(inode);
> +   else
> +   inode_update_time(inode);
> +   }
> +

Why the S_ISBLK check is done only here? I think sys_msync() should contain
the same logic.

> ret = do_writepages(mapping, wbc);
>
> /* Don't write the inode if only I_DIRTY_PAGES was set */
> --- linux-2.6.20.i686/fs/inode.c.org
> +++ linux-2.6.20.i686/fs/inode.c
> @@ -1201,8 +1201,8 @@ void touch_atime(struct vfsmount *mnt, s
>  EXPORT_SYMBOL(touch_atime);
>
>  /**
> - * file_update_time-   update mtime and ctime time
> - * @file: file accessed
> + * inode_update_time   -   update mtime and ctime time
> + * @inode: file accessed
>   *
>   * Update the mtime and ctime members of an inode and mark the inode
>   * for writeback.  Note that this function is meant exclusively for
> @@ -1212,9 +1212,8 @@ EXPORT_SYMBOL(touch_atime);
>   * timestamps are handled by the server.
>   */
>
> -void file_update_time(struct file *file)
> +void inode_update_time(struct inode *inode)
>  {
> -   struct inode *inode = file->f_path.dentry->d_inode;
> struct timespec now;
> int sync_it = 0;
>
> @@ -1238,7 +1237,7 @@ void file_update_time(struct file *file)
> mark_inode_dirty_sync(inode);
>  }
>
> -EXPORT_SYMBOL(file_update_time);
> +EXPORT_SYMBOL(inode_update_time);
>
>  int inode_needs_sync(struct inode *inode)
>  {
> --- linux-2.6.20.i686/fs/block_dev.c.org
> +++ linux-2.6.20.i686/fs/block_dev.c
> @@ -608,6 +608,22 @@ void bdput(struct block_device *bdev)
>
>  EXPORT_SYMBOL(bdput);
>
> +void bd_inode_

Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-11 Thread Anton Salikhmetov
2008/1/12, Peter Staubach <[EMAIL PROTECTED]>:
> Anton Salikhmetov wrote:
> > 2008/1/11, Peter Staubach <[EMAIL PROTECTED]>:
> >
> >> Anton Salikhmetov wrote:
> >>
> >>> From: Anton Salikhmetov <[EMAIL PROTECTED]>
> >>>
> >>> The patch contains changes for updating the ctime and mtime fields for 
> >>> memory mapped files:
> >>>
> >>> 1) adding a new flag triggering update of the inode data;
> >>> 2) implementing a helper function for checking that flag and updating 
> >>> ctime and mtime;
> >>> 3) updating time stamps for mapped files in sys_msync() and do_fsync().
> >>>
> >> Sorry, one other issue to throw out too -- an mmap'd block device
> >> should also have its inode time fields updated.  This is a little
> >> tricky because the inode referenced via mapping->host isn't the
> >> one that needs to have the time fields updated on.
> >>
> >> I have attached the patch that I submitted last.  It is quite out
> >> of date, but does show my attempt to resolve some of these issues.
> >>
> >
> > Thanks for your feedback!
> >
> > Now I'm looking at your solution and thinking about which parts of it
> > I could adapt to the infrastructure I'm trying to develop.
> >
> > However, I would like to address the block device case within
> > a separate project. But for now, I want the msync() and fsync()
> > system calls to update ctime and mtime at least for memory-mapped
> > regular files properly. I feel that even this little improvement could 
> > address
> > many customer's troubles such as the one Jacob Oestergaard reported
> > in the bug #2645.
>
> Not that I disagree and I also have customers who would really like
> to see this situation addressed so that I can then fix it in RHEL,
> but the block device issue was raised by Andrew Morton during my
> first attempt to get a patch integrated.
>
> Just so that you are aware of who has raised which issues...  :-)

Yes, I remember that email by Andrew Morton (http://lkml.org/lkml/2006/6/19/6).
In fact, I went over that thread many times while working on my
solution for this bug.

Nevertheless, I presume the block device case to be addressed in a
separate patch
series, just like the "auto-updating" feature.

>
> Thanx...
>
>ps
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-11 Thread Anton Salikhmetov
2008/1/11, Peter Staubach <[EMAIL PROTECTED]>:
> Anton Salikhmetov wrote:
> > From: Anton Salikhmetov <[EMAIL PROTECTED]>
> >
> > The patch contains changes for updating the ctime and mtime fields for 
> > memory mapped files:
> >
> > 1) adding a new flag triggering update of the inode data;
> > 2) implementing a helper function for checking that flag and updating ctime 
> > and mtime;
> > 3) updating time stamps for mapped files in sys_msync() and do_fsync().
>
> Sorry, one other issue to throw out too -- an mmap'd block device
> should also have its inode time fields updated.  This is a little
> tricky because the inode referenced via mapping->host isn't the
> one that needs to have the time fields updated on.
>
> I have attached the patch that I submitted last.  It is quite out
> of date, but does show my attempt to resolve some of these issues.

Thanks for your feedback!

Now I'm looking at your solution and thinking about which parts of it
I could adapt to the infrastructure I'm trying to develop.

However, I would like to address the block device case within
a separate project. But for now, I want the msync() and fsync()
system calls to update ctime and mtime at least for memory-mapped
regular files properly. I feel that even this little improvement could address
many customer's troubles such as the one Jacob Oestergaard reported
in the bug #2645.

>
> Thanx...
>
>ps
>
> --- linux-2.6.20.i686/fs/buffer.c.org
> +++ linux-2.6.20.i686/fs/buffer.c
> @@ -710,6 +710,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty_buffers(struct page *page)
>  {
> struct address_space * const mapping = page_mapping(page);
> +   int ret = 0;
>
> if (unlikely(!mapping))
> return !TestSetPageDirty(page);
> @@ -727,7 +728,7 @@ int __set_page_dirty_buffers(struct page
> spin_unlock(&mapping->private_lock);
>
> if (TestSetPageDirty(page))
> -   return 0;
> +   goto out;
>
> write_lock_irq(&mapping->tree_lock);
> if (page->mapping) {/* Race with truncate? */
> @@ -740,7 +741,11 @@ int __set_page_dirty_buffers(struct page
> }
> write_unlock_irq(&mapping->tree_lock);
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -   return 1;
> +   ret = 1;
> +out:
> +   if (page_mapped(page))
> +   set_bit(AS_MCTIME, &mapping->flags);
> +   return ret;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_buffers);
>
> --- linux-2.6.20.i686/fs/fs-writeback.c.org
> +++ linux-2.6.20.i686/fs/fs-writeback.c
> @@ -167,6 +167,13 @@ __sync_single_inode(struct inode *inode,
>
> spin_unlock(&inode_lock);
>
> +   if (test_and_clear_bit(AS_MCTIME, &mapping->flags)) {
> +   if (S_ISBLK(inode->i_mode))
> +   bd_inode_update_time(inode);
> +   else
> +   inode_update_time(inode);
> +   }
> +
> ret = do_writepages(mapping, wbc);
>
> /* Don't write the inode if only I_DIRTY_PAGES was set */
> --- linux-2.6.20.i686/fs/inode.c.org
> +++ linux-2.6.20.i686/fs/inode.c
> @@ -1201,8 +1201,8 @@ void touch_atime(struct vfsmount *mnt, s
>  EXPORT_SYMBOL(touch_atime);
>
>  /**
> - * file_update_time-   update mtime and ctime time
> - * @file: file accessed
> + * inode_update_time   -   update mtime and ctime time
> + * @inode: file accessed
>   *
>   * Update the mtime and ctime members of an inode and mark the inode
>   * for writeback.  Note that this function is meant exclusively for
> @@ -1212,9 +1212,8 @@ EXPORT_SYMBOL(touch_atime);
>   * timestamps are handled by the server.
>   */
>
> -void file_update_time(struct file *file)
> +void inode_update_time(struct inode *inode)
>  {
> -   struct inode *inode = file->f_path.dentry->d_inode;
> struct timespec now;
> int sync_it = 0;
>
> @@ -1238,7 +1237,7 @@ void file_update_time(struct file *file)
> mark_inode_dirty_sync(inode);
>  }
>
> -EXPORT_SYMBOL(file_update_time);
> +EXPORT_SYMBOL(inode_update_time);
>
>  int inode_needs_sync(struct inode *inode)
>  {
> --- linux-2.6.20.i686/fs/block_dev.c.org
> +++ linux-2.6.20.i686/fs/block_dev.c
> @@ -608,6 +608,22 @@ void bdput(struct block_device *bdev)
>
>  EXPORT_SYMBOL(bdput);
>
> +void bd_inode_update_time(struct inode *inode)
> +{
> +   struct block_device *bdev = inode->i_bdev;
> +   struct list_head *p;
> +
> +   if (bdev == NULL)
> +

Re: [PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-11 Thread Anton Salikhmetov
2008/1/11, Peter Staubach <[EMAIL PROTECTED]>:
> Anton Salikhmetov wrote:
> > From: Anton Salikhmetov <[EMAIL PROTECTED]>
> >
> > The patch contains changes for updating the ctime and mtime fields for 
> > memory mapped files:
> >
> > 1) adding a new flag triggering update of the inode data;
> > 2) implementing a helper function for checking that flag and updating ctime 
> > and mtime;
> > 3) updating time stamps for mapped files in sys_msync() and do_fsync().
> >
> >
>
> What happens if the application does not issue either an msync
> or an fsync call, but either just munmap's the region or just
> keeps on manipulating it?  It appears to me that the file times
> will never be updated in these cases.
>
> It seems to me that the file times should be updated eventually,
> and perhaps even regularly if the file is being constantly
> updated via the mmap'd region.

Indeed, FreeBSD, for example, implements updating ctime and mtime
in exactly the way you described. Many people I've spoken with do
interpret the POSIX requirement the same way as you do.

I had thoroughly investigated the possibility of implementing
the feature you are talking about, and came to the conclusion
that it would lead to quite massive changes and would require
a very complicated work with locks. At least, within
the current kernel design for the memory management.
So, I believe that the "auto-updating" feature should be
implemented later and outside of the bug #2645.

Finally, my solution puts the Linux kernel much closer to
the POSIX standard (the msync() and fsync() requirements), anyway.
And the changes which my patch contains, to the best of my knowledge,
do not intersect with the possible implementation of
the "auto-updating" feature.

I'm planning on adding the "auto-updating" feature in
the nearest future, but it looks like a separate project to me.

>
> Thanx...
>
>ps
>
> > Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
> >
> > ---
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 7249e01..09adf7e 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -719,6 +719,7 @@ static int __set_page_dirty(struct page *page,
> >   }
> >   write_unlock_irq(&mapping->tree_lock);
> >   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > + set_bit(AS_MCTIME, &mapping->flags);
> >
> >   return 1;
> >  }
> > diff --git a/fs/inode.c b/fs/inode.c
> > index ed35383..c5b954e 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -22,6 +22,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  /*
> >   * This is needed for the following functions:
> > @@ -1282,6 +1283,18 @@ void file_update_time(struct file *file)
> >
> >  EXPORT_SYMBOL(file_update_time);
> >
> > +/*
> > + * Update the ctime and mtime stamps after checking if they are to be 
> > updated.
> > + */
> > +void mapped_file_update_time(struct file *file)
> > +{
> > + if (test_and_clear_bit(AS_MCTIME, &file->f_mapping->flags)) {
> > + get_file(file);
> > + file_update_time(file);
> > + fput(file);
> > + }
> > +}
> > +
> >  int inode_needs_sync(struct inode *inode)
> >  {
> >   if (IS_SYNC(inode))
> > diff --git a/fs/sync.c b/fs/sync.c
> > index 7cd005e..df57507 100644
> > --- a/fs/sync.c
> > +++ b/fs/sync.c
> > @@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
> >   goto out;
> >   }
> >
> > + mapped_file_update_time(file);
> > +
> >   ret = filemap_fdatawrite(mapping);
> >
> >   /*
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index b3ec4a4..0b05118 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1978,6 +1978,7 @@ extern int inode_change_ok(struct inode *, struct 
> > iattr *);
> >  extern int __must_check inode_setattr(struct inode *, struct iattr *);
> >
> >  extern void file_update_time(struct file *file);
> > +extern void mapped_file_update_time(struct file *file);
> >
> >  static inline ino_t parent_ino(struct dentry *dentry)
> >  {
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index db8a410..bf0f9e7 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -17,8 +17,9 @@
> >   * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
> >   * 

[PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-10 Thread Anton Salikhmetov
From: Anton Salikhmetov <[EMAIL PROTECTED]>

The patch contains changes for updating the ctime and mtime fields for memory 
mapped files:

1) adding a new flag triggering update of the inode data;
2) implementing a helper function for checking that flag and updating ctime and 
mtime;
3) updating time stamps for mapped files in sys_msync() and do_fsync().

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>

---

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..09adf7e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -719,6 +719,7 @@ static int __set_page_dirty(struct page *page,
}
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+   set_bit(AS_MCTIME, &mapping->flags);
 
return 1;
 }
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..c5b954e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * This is needed for the following functions:
@@ -1282,6 +1283,18 @@ void file_update_time(struct file *file)
 
 EXPORT_SYMBOL(file_update_time);
 
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapped_file_update_time(struct file *file)
+{
+   if (test_and_clear_bit(AS_MCTIME, &file->f_mapping->flags)) {
+   get_file(file);
+   file_update_time(file);
+   fput(file);
+   }
+}
+
 int inode_needs_sync(struct inode *inode)
 {
if (IS_SYNC(inode))
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..df57507 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
goto out;
}
 
+   mapped_file_update_time(file);
+
ret = filemap_fdatawrite(mapping);
 
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..0b05118 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1978,6 +1978,7 @@ extern int inode_change_ok(struct inode *, struct iattr 
*);
 extern int __must_check inode_setattr(struct inode *, struct iattr *);
 
 extern void file_update_time(struct file *file);
+extern void mapped_file_update_time(struct file *file);
 
 static inline ino_t parent_ino(struct dentry *dentry)
 {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index db8a410..bf0f9e7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -17,8 +17,9 @@
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
  * allocation mode flags.
  */
-#defineAS_EIO  (__GFP_BITS_SHIFT + 0)  /* IO error on async 
write */
+#define AS_EIO (__GFP_BITS_SHIFT + 0)  /* IO error on async write */
 #define AS_ENOSPC  (__GFP_BITS_SHIFT + 1)  /* ENOSPC on async write */
+#define AS_MCTIME  (__GFP_BITS_SHIFT + 2)  /* mtime and ctime to update */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
diff --git a/mm/msync.c b/mm/msync.c
index e788f7b..9d0a8f9 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
  * Copyright (C) 1994-1999  Linus Torvalds
  *
  * Substantial code cleanup.
+ * Updating the ctime and mtime stamps for memory mapped files.
  * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
@@ -22,6 +23,10 @@
  * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
+ * The msync() system call updates the ctime and mtime fields for
+ * the mapped file when called with the MS_SYNC or MS_ASYNC flags
+ * according to the POSIX standard.
+ *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
@@ -74,14 +79,17 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
break;
}
file = vma->vm_file;
-   if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) {
-   get_file(file);
-   up_read(&mm->mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error)
-   return error;
-   down_read(&mm->mmap_sem);
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   mapped_file_update_time(file);
+   if (flags & MS_SYNC) {
+   get_file(file);
+   up_read(&mm->mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error)
+   return error;
+   down_read(&mm-

[PATCH 2/2][RFC][BUG] msync: updating ctime and mtime at syncing

2008-01-10 Thread Anton Salikhmetov
From: Anton Salikhmetov <[EMAIL PROTECTED]>

The patch contains changes for updating the ctime and mtime fields for memory 
mapped files:

1) adding a new flag triggering update of the inode data;
2) implementing a helper function for checking that flag and updating ctime and 
mtime;
3) updating time stamps for mapped files in sys_msync() and do_fsync().

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>

---

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..09adf7e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -719,6 +719,7 @@ static int __set_page_dirty(struct page *page,
}
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+   set_bit(AS_MCTIME, &mapping->flags);
 
return 1;
 }
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..c5b954e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * This is needed for the following functions:
@@ -1282,6 +1283,18 @@ void file_update_time(struct file *file)
 
 EXPORT_SYMBOL(file_update_time);
 
+/*
+ * Update the ctime and mtime stamps after checking if they are to be updated.
+ */
+void mapped_file_update_time(struct file *file)
+{
+   if (test_and_clear_bit(AS_MCTIME, &file->f_mapping->flags)) {
+   get_file(file);
+   file_update_time(file);
+   fput(file);
+   }
+}
+
 int inode_needs_sync(struct inode *inode)
 {
if (IS_SYNC(inode))
diff --git a/fs/sync.c b/fs/sync.c
index 7cd005e..df57507 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -87,6 +87,8 @@ long do_fsync(struct file *file, int datasync)
goto out;
}
 
+   mapped_file_update_time(file);
+
ret = filemap_fdatawrite(mapping);
 
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b3ec4a4..0b05118 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1978,6 +1978,7 @@ extern int inode_change_ok(struct inode *, struct iattr 
*);
 extern int __must_check inode_setattr(struct inode *, struct iattr *);
 
 extern void file_update_time(struct file *file);
+extern void mapped_file_update_time(struct file *file);
 
 static inline ino_t parent_ino(struct dentry *dentry)
 {
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index db8a410..bf0f9e7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -17,8 +17,9 @@
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
  * allocation mode flags.
  */
-#defineAS_EIO  (__GFP_BITS_SHIFT + 0)  /* IO error on async 
write */
+#define AS_EIO (__GFP_BITS_SHIFT + 0)  /* IO error on async write */
 #define AS_ENOSPC  (__GFP_BITS_SHIFT + 1)  /* ENOSPC on async write */
+#define AS_MCTIME  (__GFP_BITS_SHIFT + 2)  /* mtime and ctime to update */
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
diff --git a/mm/msync.c b/mm/msync.c
index e788f7b..9d0a8f9 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
  * Copyright (C) 1994-1999  Linus Torvalds
  *
  * Substantial code cleanup.
+ * Updating the ctime and mtime stamps for memory mapped files.
  * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
@@ -22,6 +23,10 @@
  * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
+ * The msync() system call updates the ctime and mtime fields for
+ * the mapped file when called with the MS_SYNC or MS_ASYNC flags
+ * according to the POSIX standard.
+ *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
@@ -74,14 +79,17 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
break;
}
file = vma->vm_file;
-   if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) {
-   get_file(file);
-   up_read(&mm->mmap_sem);
-   error = do_fsync(file, 0);
-   fput(file);
-   if (error)
-   return error;
-   down_read(&mm->mmap_sem);
+   if (file && (vma->vm_flags & VM_SHARED)) {
+   mapped_file_update_time(file);
+   if (flags & MS_SYNC) {
+   get_file(file);
+   up_read(&mm->mmap_sem);
+   error = do_fsync(file, 0);
+   fput(file);
+   if (error)
+   return error;
+   down_read(&mm-

[PATCH 0/2][RFC][BUG] msync: another attempt to fix the ctime/mtime issue

2008-01-10 Thread Anton Salikhmetov
From: Anton Salikhmetov <[EMAIL PROTECTED]>

I would like to propose my second solution for the bug #2645 from the
kernel bug tracker:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

You may find the relevant background information as well as an extensive
discussion of my previous solution using the following link:

http://lkml.org/lkml/2008/1/9/387

The short change list:

1) taking into account the intervening sync() call which Peter Staubach
has mentioned (http://lkml.org/lkml/2008/1/9/267);
2) splitting the solution into two patches: code cleanup and functional
changes;
3) updating ctime and mtime in do_fsync(), due to that the file time stamps
get updated even without any explicit call to msync().

Please note that the second patch (functional changes) should be applied
on top of the first one (code cleanup).

Also I changed my unit test due to Peter's remark:

http://lkml.org/lkml/2008/1/9/267

The new version of the unit test can be found here:

http://bugzilla.kernel.org/attachment.cgi?id=14398&action=view

No regression was found when I ran the test cases for the msync() system
call from the LTP test suite (msync01 - msync05, mmapstress01,
mmapstress09, and mmapstress10).

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2][RFC][BUG] msync: massive code cleanup of sys_msync()

2008-01-10 Thread Anton Salikhmetov
From: Anton Salikhmetov <[EMAIL PROTECTED]>

The patch contains substantial code cleanup of the sys_msync() function:

1) consolidated error check for function parameters;
2) using the PAGE_ALIGN() macro instead of "manual" alignment;
3) improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>

---

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..e788f7b 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,25 @@
 /*
  * linux/mm/msync.c
  *
+ * The msync() system call.
  * Copyright (C) 1994-1999  Linus Torvalds
+ *
+ * Substantial code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
  * The application may now run fsync() to
@@ -33,71 +34,60 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = 0, unmapped_error = 0;
+
+   if ((flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC)) ||
+   (start & ~PAGE_MASK) ||
+   ((flags & MS_ASYNC) && (flags & MS_SYNC)))
+   return -EINVAL;
 
-   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
-   goto out;
-   if (start & ~PAGE_MASK)
-   goto out;
-   if ((flags & MS_ASYNC) && (flags & MS_SYNC))
-   goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   len = PAGE_ALIGN(len);
end = start + len;
if (end < start)
-   goto out;
-   error = 0;
+   return -ENOMEM;
if (end == start)
-   goto out;
+   return 0;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do {
struct file *file;
 
-   /* Still start < end. */
-   error = -ENOMEM;
-   if (!vma)
-   goto out_unlock;
-   /* Here start < vma->vm_end. */
+   if (!vma) {
+   error = -ENOMEM;
+   break;
+   }
if (start < vma->vm_start) {
start = vma->vm_start;
-   if (start >= end)
-   goto out_unlock;
+   if (start >= end) {
+   error = -ENOMEM;
+   break;
+   }
unmapped_error = -ENOMEM;
}
-   /* Here vma->vm_start <= start < vma->vm_end. */
-   if ((flags & MS_INVALIDATE) &&
-   (vma->vm_flags & VM_LOCKED)) {
+   if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED)) {
error = -EBUSY;
-   goto out_unlock;
+   break;
}
file = vma->vm_file;
-   start = vma->vm_end;
-   if ((flags & MS_SYNC) && file &&
-   (vma->vm_flags & VM_SHARED)) {
+   if ((flags & MS_SYNC) && file && (vma->vm_flags & VM_SHARED)) {
get_file(file);
up_read(&mm->mmap_sem);
error = do_fsync(file, 0);
fput(file);
-   if (error || start >= end)
-   goto out;
+   if (error)
+   return error;
down_read(&mm->mmap_sem);
-   vma = find_vma(mm, start);
-   } else {
-   if (start >= end) {
-   error = 0;
-   goto out_unlock;
-   }
-   vma = vma->vm_next;
}
-   }
-out_unlock:
+
+   start = vma->vm_end;
+   vma = vma->vm_next;
+   

Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-10 Thread Anton Salikhmetov
2008/1/10, Rik van Riel <[EMAIL PROTECTED]>:
> On Thu, 10 Jan 2008 18:56:07 +0300
> "Anton Salikhmetov" <[EMAIL PROTECTED]> wrote:
>
> > However, I don't see how they will work if there has been
> > something like a sync(2) done after the mmap'd region is
> > modified and the msync call.  When the inode is written out
> > as part of the sync process, I_DIRTY_PAGES will be cleared,
> > thus causing a miss in this code.
> >
> > The I_DIRTY_PAGES check here is good, but I think that there
> > needs to be some code elsewhere too, to catch the case where
> > I_DIRTY_PAGES is being cleared, but the time fields still need
> > to be updated.
>
> Agreed. The mtime and ctime should probably also be updated
> when I_DIRTY_PAGES is cleared.
>
> The alternative would be to remember that the inode had been
> dirty in the past, and have the mtime and ctime updated on
> msync or close - which would be more complex.

Adding the new flag (AS_MCTIME) has been already suggested by Peter
Staubach in his first solution for this bug. Now I understand that the
AS_MCTIME flag is required for fixing the bug.

>
> --
> All rights reversed.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-10 Thread Anton Salikhmetov
2008/1/10, Rik van Riel <[EMAIL PROTECTED]>:
> On Thu, 10 Jan 2008 13:53:59 +0300
> "Anton Salikhmetov" <[EMAIL PROTECTED]> wrote:
>
> > Indeed, if msync() is called with MS_SYNC an explicit sync is
> > triggered, and Rik's suggestion would work. However, the POSIX
> > standard requires a call to msync() with MS_ASYNC to update the
> > st_ctime and st_mtime stamps too. No explicit sync of the inode data
> > is triggered in the current implementation of msync(). Hence Rik's
> > suggestion would fail to satisfy POSIX in the latter case.
>
> Since your patch is already changing msync(), has it occurred
> to you that your patch could change msync() to do the right
> thing?

No, not quite. Peter Staubach mentioned an issue in my solution:

>>>

> The patch adds a call to the file_update_time() function to change
> the file metadata before syncing. The patch also contains
> substantial code cleanup: consolidated error check
> for function parameters, using the PAGE_ALIGN() macro instead of
> "manual" alignment check, improved readability of the loop,
> which traverses the process memory regions, updated comments.
>
>

These changes catch the simple case, where the file is mmap'd,
modified via the mmap'd region, and then an msync is done,
all on a mostly quiet system.

However, I don't see how they will work if there has been
something like a sync(2) done after the mmap'd region is
modified and the msync call.  When the inode is written out
as part of the sync process, I_DIRTY_PAGES will be cleared,
thus causing a miss in this code.

The I_DIRTY_PAGES check here is good, but I think that there
needs to be some code elsewhere too, to catch the case where
I_DIRTY_PAGES is being cleared, but the time fields still need
to be updated.

<<<

So I'm working on my next solution for this bug now.

>
> --
> All rights reversed.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-10 Thread Anton Salikhmetov
2008/1/10, Jakob Oestergaard <[EMAIL PROTECTED]>:
> On Thu, Jan 10, 2008 at 03:03:03AM +0300, Anton Salikhmetov wrote:
> ...
> > > I guess a third possible time (if we want to minimize the number of
> > > updates) would be when natural syncing of the file data to disk, by
> > > other things in the VM, would be about to clear the I_DIRTY_PAGES
> > > flag on the inode.  That way we do not need to remember any special
> > > "we already flushed all dirty data, but we have not updated the mtime
> > > and ctime yet" state.
> > >
> > > Does this sound reasonable?
> >
> > No, it doesn't. The msync() system call called with the MS_ASYNC flag
> > should (the POSIX standard requires that) update the st_ctime and
> > st_mtime stamps in the same manner as for the MS_SYNC flag. However,
> > the current implementation of msync() doesn't call the do_fsync()
> > function for the MS_ASYNC case. The msync() function may be called
> > with the MS_ASYNC flag before "natural syncing".
>
> If the update was done as Rik suggested, with the addition that msync()
> triggered an explicit sync of the inode data, then everything would be ok,
> right?

Indeed, if msync() is called with MS_SYNC an explicit sync is
triggered, and Rik's suggestion would work. However, the POSIX
standard requires a call to msync() with MS_ASYNC to update the
st_ctime and st_mtime stamps too. No explicit sync of the inode data
is triggered in the current implementation of msync(). Hence Rik's
suggestion would fail to satisfy POSIX in the latter case.

>
> --
>
>  / jakob
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Anton Salikhmetov
2008/1/10, Rik van Riel <[EMAIL PROTECTED]>:
> On Wed, 09 Jan 2008 16:06:17 -0500
> [EMAIL PROTECTED] wrote:
> > On Wed, 09 Jan 2008 15:50:15 EST, Rik van Riel said:
> >
> > > Could you explain (using short words and simple sentences) what the
> > > exact problem is?
> >
> > It's like this:
> >
> > Monday  9:04AM:  System boots, database server starts up, mmaps file
> > Monday  9:06AM:  Database server writes to mmap area, updates mtime/ctime
> > Monday  Database server writes to mmap area, no further update..
> > Monday 11:45PM:  Backup sees "file modified 9:06AM, let's back it up"
> > Tuesday 9:00AM-5:00PM: Database server touches it another 5,398 times, no 
> > mtime
> > Tuesday 11:45PM: Backup sees "file modified back on Monday, we backed this 
> > up..
> > Wed  9:00AM-5:00PM: More updates, more not touching the mtime
> > Wed  11:45PM: *yawn* It hasn't been touched in 2 days, no sense in backing 
> > it up..
> >
> > Lather, rinse, repeat
>
> On the other hand, updating the mtime and ctime whenever a page is dirtied
> also does not work right.  Apparently that can break mutt.

Please tell why you think that can break mutt? Such approach was
suggested by Peter once and looks reasonable to me too.

>
> Calling msync() every once in a while with Anton's patch does not look like a
> fool proof method to me either, because the VM can write all the dirty pages
> to disk by itself, leaving nothing for msync() to detect.  (I think...)
>
> Can we get by with simply updating the ctime and mtime every time msync()
> is called, regardless of whether or not the mmaped pages were still dirty
> by the time we called msync() ?
>
> --
> All Rights Reversed
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Anton Salikhmetov
2008/1/9, Rik van Riel <[EMAIL PROTECTED]>:
> On Mon, 07 Jan 2008 20:54:19 +0300
> Anton Salikhmetov <[EMAIL PROTECTED]> wrote:
>
> > This program showed that the msync() function had a bug:
> > it did not update the st_mtime and st_ctime fields.
> >
> > The program shows appropriate behavior of the msync()
> > function using the kernel with the proposed patch applied.
> > Specifically, the ctime and mtime time stamps do change
> > when modifying the mapped memory and do not change when
> > there have been no write references between the mmap()
> > and msync() system calls.
>
> As long as the ctime and mtime stamps change when the memory is
> written to, what exactly is the problem?
>
> Is it that the ctime and mtime does not change again when the memory
> is written to again?
>
> Is there a way for backup programs to miss file modification times?
>
> Could you explain (using short words and simple sentences) what the
> exact problem is?
>
> Eg.
>
> 1) program mmaps file
> 2) program writes to mmaped area
> 3) ???   <=== this part, in equally simple words :)
> 4) data loss
>
> An explanation like that will help people understand exactly what the
> bug is, and why the patch should be applied ASAP.
>
> > The patch adds a call to the file_update_time() function to change
> > the file metadata before syncing. The patch also contains
> > substantial code cleanup: consolidated error check
> > for function parameters, using the PAGE_ALIGN() macro instead of
> > "manual" alignment, improved readability of the loop,
> > which traverses the process memory regions, updated comments.
>
> Due to the various cleanups all being in one file, it took me a while
> to understand the patch.  In an area of code this subtle, it may be
> better to submit the cleanups in (a) separate patch(es) from the patch
> that adds the call to file_update_time().

Now I'm working on my next solution for this bug. It will probably
modify more than one file and be split into several parts.

>
> > - (vma->vm_flags & VM_SHARED)) {
> > + if (file && (vma->vm_flags & VM_SHARED)) {
> >   get_file(file);
> > - up_read(&mm->mmap_sem);
> > - error = do_fsync(file, 0);
> > - fput(file);
> > - if (error || start >= end)
> > - goto out;
> > - down_read(&mm->mmap_sem);
> > - vma = find_vma(mm, start);
> > - } else {
> > - if (start >= end) {
> > - error = 0;
> > - goto out_unlock;
> > + if (file->f_mapping->host->i_state & I_DIRTY_PAGES)
> > + file_update_time(file);
>
> I wonder if calling file_update_time() from inside the loop is the
> best idea.  Why not call that function just once, after msync breaks
> from the loop?

That function should be called inside of the loop because the memory
region, which msync() is called with, may contain pages mapped for
different files.

>
> thanks,
>
> Rik
> --
> All Rights Reversed
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Anton Salikhmetov
2008/1/9, Peter Staubach <[EMAIL PROTECTED]>:
> Anton Salikhmetov wrote:
> > From: Anton Salikhmetov <[EMAIL PROTECTED]>
> >
> > I would like to propose my solution for the bug #2645 from the kernel bug 
> > tracker:
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=2645
> >
> > The Open Group defines the behavior of the mmap() function as follows.
> >
> > The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> > and PROT_WRITE shall be marked for update at some point in the interval
> > between a write reference to the mapped region and the next call to msync()
> > with MS_ASYNC or MS_SYNC for that portion of the file by any process.
> > If there is no such call and if the underlying file is modified as a result
> > of a write reference, then these fields shall be marked for update at some
> > time after the write reference.
> >
> > The above citation was taken from the following link:
> >
> > http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
> >
> > Therefore, the msync() function should be called before verifying the time
> > stamps st_mtime and st_ctime in the test program Badari wrote in the context
> > of the bug #2645. Otherwise, the time stamps may be updated
> > at some unspecified moment according to the POSIX standard.
> >
> > I changed his test program a little. The changed unit test can be downloaded
> > using the following link:
> >
> > http://pygx.sourceforge.net/mmap.c
> >
> > This program showed that the msync() function had a bug:
> > it did not update the st_mtime and st_ctime fields.
> >
> > The program shows the appropriate behavior of the msync()
> > function using the kernel with the proposed patch applied.
> > Specifically, the ctime and mtime time stamps do change
> > when modifying the mapped memory and do not change when
> > there have been no write references between the mmap()
> > and msync() system calls.
> >
> >
>
> Sorry, I don't see where the test program shows that the file
> times did not change if there had not been an intervening
> modification to the mmap'd region.  It appears to me that it
> just shows the file times changing or not when there has been
> intervening modification after the mmap call and before the
> fstat call.
>
> Or am I looking in the wrong place?  :-)

No, you are looking at the right place, but there was one thing missing
from my unit test: namely, that the "no write reference" case was
tested by commenting out the "for" loop. I should have written this
explicitly, sorry for its not having been done. The next version of
the unit test will be cleaner, I hope.

>
> > Additionally, the test cases for the msync() system call from
> > the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
> > and mmapstress10) successfully passed using the kernel
> > with the patch included into this email.
> >
> > The patch adds a call to the file_update_time() function to change
> > the file metadata before syncing. The patch also contains
> > substantial code cleanup: consolidated error check
> > for function parameters, using the PAGE_ALIGN() macro instead of
> > "manual" alignment check, improved readability of the loop,
> > which traverses the process memory regions, updated comments.
> >
> >
>
> These changes catch the simple case, where the file is mmap'd,
> modified via the mmap'd region, and then an msync is done,
> all on a mostly quiet system.
>
> However, I don't see how they will work if there has been
> something like a sync(2) done after the mmap'd region is
> modified and the msync call.  When the inode is written out
> as part of the sync process, I_DIRTY_PAGES will be cleared,
> thus causing a miss in this code.

Good catch, thanks! I totally missed the intervening sync() case.
Now I understand why the I_DIRTY_PAGES flag is not a reliable way
to detect whether there have been any write references in the
mapped area.

>
> The I_DIRTY_PAGES check here is good, but I think that there
> needs to be some code elsewhere too, to catch the case where
> I_DIRTY_PAGES is being cleared, but the time fields still need
> to be updated.
>
> --
>
> A better architecture would be to arrange for the file times
> to be updated when the page makes the transition from being
> unmodified to modified.  This is not straightforward due to
> the current locking, but should be doable, I think.  Perhaps
> recording the current time and then using it to update the
> file times at a more suitable time (no pun intend

Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Anton Salikhmetov
2008/1/10, Rik van Riel <[EMAIL PROTECTED]>:
> On Wed, 9 Jan 2008 23:33:40 +0100
> Jakob Oestergaard <[EMAIL PROTECTED]> wrote:
> > On Wed, Jan 09, 2008 at 05:06:33PM -0500, Rik van Riel wrote:
>
> > > Can we get by with simply updating the ctime and mtime every time msync()
> > > is called, regardless of whether or not the mmaped pages were still dirty
> > > by the time we called msync() ?
> >
> > The update must still happen, eventually, after a write to the mapped region
> > followed by an unmap/close even if no msync is ever called.
> >
> > The msync only serves as a "no later than" deadline. The write to the region
> > triggers the need for the update.
> >
> > At least this is how I read the standard - please feel free to correct me 
> > if I
> > am mistaken.
>
> You are absolutely right.  If we wrote dirty pages to disk, the ctime
> and mtime updates must happen no later than msync or close time.
>
> I guess a third possible time (if we want to minimize the number of
> updates) would be when natural syncing of the file data to disk, by
> other things in the VM, would be about to clear the I_DIRTY_PAGES
> flag on the inode.  That way we do not need to remember any special
> "we already flushed all dirty data, but we have not updated the mtime
> and ctime yet" state.
>
> Does this sound reasonable?

No, it doesn't. The msync() system call called with the MS_ASYNC flag
should (the POSIX standard requires that) update the st_ctime and
st_mtime stamps in the same manner as for the MS_SYNC flag. However,
the current implementation of msync() doesn't call the do_fsync()
function for the MS_ASYNC case. The msync() function may be called
with the MS_ASYNC flag before "natural syncing".
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Anton Salikhmetov
2008/1/9, Jesper Juhl <[EMAIL PROTECTED]>:
> I've only looked briefly at your patch but it seems resonable. I'll
> try to do some testing with it later.

Jesper, thank you very much for your answer!

In fact, I tested my change quite extensively using test cases for the
mmap() and msync() system calls from the LTP test suite. Please note
that I did mention that in my previous message:

>>>

Additionally, the test cases for the msync() system call from
the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
and mmapstress10) successfully passed using the kernel
with the patch included into this email.

<<<
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-09 Thread Anton Salikhmetov
Since no reaction in LKML was recieved for this message it seemed
logical to suggest closing the bug #2645 as "WONTFIX":

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c15

However, the reporter of the bug, Jacob Oestergaard, insisted the
solution to be resubmitted once more:

>>>

Please re-submit to LKML.

This bug causes backup systems to *miss* changed files.

This bug does cause data loss in common real-world deployments (I gave an
example with a database when posting the bug, but this affects the data from
all mmap using applications with common backup systems).

Silent exclusion from backups is very very nasty.

<<<

Please comment on my solution or commit it if it's acceptable in its
present form.

2008/1/7, Anton Salikhmetov <[EMAIL PROTECTED]>:
> From: Anton Salikhmetov <[EMAIL PROTECTED]>
>
> Due to the lack of reaction in LKML I presume the message was lost
> in the high traffic of that list. Resending it now with the addressee changed
> to the memory management mailing list.
>
> I would like to propose my solution for the bug #2645 from the kernel bug 
> tracker:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645
>
> The Open Group defines the behavior of the mmap() function as follows.
>
> The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> and PROT_WRITE shall be marked for update at some point in the interval
> between a write reference to the mapped region and the next call to msync()
> with MS_ASYNC or MS_SYNC for that portion of the file by any process.
> If there is no such call and if the underlying file is modified as a result
> of a write reference, then these fields shall be marked for update at some
> time after the write reference.
>
> The above citation was taken from the following link:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
>
> Therefore, the msync() function should be called before verifying the time
> stamps st_mtime and st_ctime in the test program Badari wrote in the context
> of the bug #2645. Otherwise, the time stamps may be updated
> at some unspecified moment according to the POSIX standard.
>
> I changed his test program a little. The changed unit test can be downloaded
> using the following link:
>
> http://pygx.sourceforge.net/mmap.c
>
> This program showed that the msync() function had a bug:
> it did not update the st_mtime and st_ctime fields.
>
> The program shows appropriate behavior of the msync()
> function using the kernel with the proposed patch applied.
> Specifically, the ctime and mtime time stamps do change
> when modifying the mapped memory and do not change when
> there have been no write references between the mmap()
> and msync() system calls.
>
> Additionally, the test cases for the msync() system call from
> the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
> and mmapstress10) successfully passed using the kernel
> with the patch included into this email.
>
> The patch adds a call to the file_update_time() function to change
> the file metadata before syncing. The patch also contains
> substantial code cleanup: consolidated error check
> for function parameters, using the PAGE_ALIGN() macro instead of
> "manual" alignment, improved readability of the loop,
> which traverses the process memory regions, updated comments.
>
> Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
>
> ---
>
> diff --git a/mm/msync.c b/mm/msync.c
> index 144a757..cb973eb 100644
> --- a/mm/msync.c
> +++ b/mm/msync.c
> @@ -1,26 +1,32 @@
>  /*
>   * linux/mm/msync.c
>   *
> + * The msync() system call.
>   * Copyright (C) 1994-1999  Linus Torvalds
> + *
> + * Updating the mtime and ctime stamps for mapped files
> + * and code cleanup.
> + * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
>   */
>
> -/*
> - * The msync() system call.
> - */
> +#include 
>  #include 
>  #include 
>  #include 
> -#include 
> -#include 
>  #include 
> +#include 
>
>  /*
>   * MS_SYNC syncs the entire file - including mappings.
>   *
>   * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
> - * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
> + * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
>   * Now it doesn't do anything, since dirty pages are properly tracked.
>   *
> + * The msync() system call updates the ctime and mtime fields for
> + * the mapped file when called with the MS_SYNC or MS_ASYNC flags
> + * according to the POSIX standard.
> + *
>   * The application may now run fsync() to
>   * write out the dirty pages and wait on the writeout and check the result.
>   * Or

[PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

2008-01-07 Thread Anton Salikhmetov
From: Anton Salikhmetov <[EMAIL PROTECTED]>

Due to the lack of reaction in LKML I presume the message was lost
in the high traffic of that list. Resending it now with the addressee changed
to the memory management mailing list.

I would like to propose my solution for the bug #2645 from the kernel bug 
tracker:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

The Open Group defines the behavior of the mmap() function as follows.

The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
and PROT_WRITE shall be marked for update at some point in the interval
between a write reference to the mapped region and the next call to msync()
with MS_ASYNC or MS_SYNC for that portion of the file by any process.
If there is no such call and if the underlying file is modified as a result
of a write reference, then these fields shall be marked for update at some
time after the write reference.

The above citation was taken from the following link:

http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html

Therefore, the msync() function should be called before verifying the time
stamps st_mtime and st_ctime in the test program Badari wrote in the context
of the bug #2645. Otherwise, the time stamps may be updated
at some unspecified moment according to the POSIX standard.

I changed his test program a little. The changed unit test can be downloaded
using the following link:

http://pygx.sourceforge.net/mmap.c

This program showed that the msync() function had a bug:
it did not update the st_mtime and st_ctime fields.

The program shows appropriate behavior of the msync()
function using the kernel with the proposed patch applied.
Specifically, the ctime and mtime time stamps do change
when modifying the mapped memory and do not change when
there have been no write references between the mmap()
and msync() system calls.

Additionally, the test cases for the msync() system call from
the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
and mmapstress10) successfully passed using the kernel
with the patch included into this email.

The patch adds a call to the file_update_time() function to change
the file metadata before syncing. The patch also contains
substantial code cleanup: consolidated error check
for function parameters, using the PAGE_ALIGN() macro instead of
"manual" alignment, improved readability of the loop,
which traverses the process memory regions, updated comments.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>

---

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..cb973eb 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,26 +1,32 @@
 /*
  * linux/mm/msync.c
  *
+ * The msync() system call.
  * Copyright (C) 1994-1999  Linus Torvalds
+ *
+ * Updating the mtime and ctime stamps for mapped files
+ * and code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
+ * The msync() system call updates the ctime and mtime fields for
+ * the mapped file when called with the MS_SYNC or MS_ASYNC flags
+ * according to the POSIX standard.
+ *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
@@ -33,70 +39,68 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = 0, unmapped_error = 0;
 
-   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
-   goto out;
-   if (start & ~PAGE_MASK)
+   if ((flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC)) ||
+   (start & ~PAGE_MASK) ||
+   ((flags & MS_ASYNC) && (flags & MS_SYNC))) {
+   error = -EINVAL;
goto out;
-   if ((flags & MS_ASYNC) && (flags & MS_SYNC))
-   goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   }
+
+   len = PAGE_ALIGN(len);
end = start + len;
-   if (end < start)
+   if (end < start) {
+   error = -ENOMEM;
goto out;
-   error = 0;
+   }
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 

[PATCH] updating the ctime and mtime time stamps in msync()

2008-01-04 Thread Anton Salikhmetov
From: Anton Salikhmetov <[EMAIL PROTECTED]>

I would like to propose my solution for the bug #2645 from the kernel bug 
tracker:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

The Open Group defines the behavior of the mmap() function as follows.

The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
and PROT_WRITE shall be marked for update at some point in the interval
between a write reference to the mapped region and the next call to msync()
with MS_ASYNC or MS_SYNC for that portion of the file by any process.
If there is no such call and if the underlying file is modified as a result
of a write reference, then these fields shall be marked for update at some
time after the write reference.

The above citation was taken from the following link:

http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html

Therefore, the msync() function should be called before verifying the time
stamps st_mtime and st_ctime in the test program Badari wrote in the context
of the bug #2645. Otherwise, the time stamps may be updated
at some unspecified moment according to the POSIX standard.

I changed his test program a little. The changed unit test can be downloaded
using the following link:

http://pygx.sourceforge.net/mmap.c

This program showed that the msync() function had a bug:
it did not update the st_mtime and st_ctime fields.

The program shows the appropriate behavior of the msync()
function using the kernel with the proposed patch applied.
Specifically, the ctime and mtime time stamps do change
when modifying the mapped memory and do not change when
there have been no write references between the mmap()
and msync() system calls.

Additionally, the test cases for the msync() system call from
the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
and mmapstress10) successfully passed using the kernel
with the patch included into this email.

The patch adds a call to the file_update_time() function to change
the file metadata before syncing. The patch also contains
substantial code cleanup: consolidated error check
for function parameters, using the PAGE_ALIGN() macro instead of
"manual" alignment check, improved readability of the loop,
which traverses the process memory regions, updated comments.

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>

---

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..cb973eb 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,26 +1,32 @@
 /*
  * linux/mm/msync.c
  *
+ * The msync() system call.
  * Copyright (C) 1994-1999  Linus Torvalds
+ *
+ * Updating the mtime and ctime stamps for mapped files
+ * and code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <[EMAIL PROTECTED]>
  */
 
-/*
- * The msync() system call.
- */
+#include 
 #include 
 #include 
 #include 
-#include 
-#include 
 #include 
+#include 
 
 /*
  * MS_SYNC syncs the entire file - including mappings.
  *
  * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
  * Now it doesn't do anything, since dirty pages are properly tracked.
  *
+ * The msync() system call updates the ctime and mtime fields for
+ * the mapped file when called with the MS_SYNC or MS_ASYNC flags
+ * according to the POSIX standard.
+ *
  * The application may now run fsync() to
  * write out the dirty pages and wait on the writeout and check the result.
  * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
@@ -33,70 +39,68 @@ asmlinkage long sys_msync(unsigned long start, size_t len, 
int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
-   int unmapped_error = 0;
-   int error = -EINVAL;
+   int error = 0, unmapped_error = 0;
 
-   if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
-   goto out;
-   if (start & ~PAGE_MASK)
+   if ((flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC)) ||
+   (start & ~PAGE_MASK) ||
+   ((flags & MS_ASYNC) && (flags & MS_SYNC))) {
+   error = -EINVAL;
goto out;
-   if ((flags & MS_ASYNC) && (flags & MS_SYNC))
-   goto out;
-   error = -ENOMEM;
-   len = (len + ~PAGE_MASK) & PAGE_MASK;
+   }
+
+   len = PAGE_ALIGN(len);
end = start + len;
-   if (end < start)
+   if (end < start) {
+   error = -ENOMEM;
goto out;
-   error = 0;
+   }
if (end == start)
goto out;
+
/*
 * If the interval [start,end) covers some unmapped address ranges,
 * just ignore them, but return -ENOMEM at the end.
 */
down_read(&mm->mmap_sem);
vma = find_vma(mm, start);
-   for (;;) {
+   do 

Re: [PATCH] signals: real-time signals delivery order

2007-07-11 Thread Anton Salikhmetov
В Срд, 11/07/2007 в 07:34 +, Matthieu CASTET пишет:
> Anton Salikhmetov  gmail.com> writes:
> 
> > 
> > From: Anton Salikhmetov  gmail.com>
> > 
> > According to the POSIX standard, multiple real-time signals
> > pending to a process should be delivered in a strict order.
> > Specifically, the lowest-numbered signal should be delivered
> > first and multiple occurrences of signals with the same number
> > should be delivered in FIFO order.
> > 
> > Current Linux kernel delivers the highest-numbered signals
> > pending to a process first, not the lowest-numbered ones. This
> > contradicts to the requirement explained above. The problem can
> > be demonstrated by the following test program:
> I believe you should check that you mask or signal in your signal handler.
> If you don't the high-prio handler will be prempted by low-prio, and they will
> be executed in the reverse order.
> 

Thanks, I have modified the test program blocking the signals for the
duration of the signal handler as follows:

#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define SIG_RAND (SIGRTMIN + rand() % (SIGRTMAX - SIGRTMIN + 1))
#define NUMSIG 7

struct {
int sig, num;
} seq[NUMSIG];

void sh(int sig, siginfo_t *info, void *unused)
{
static int cnt;
seq[cnt].num = info->si_value.sival_int;
seq[cnt++].sig = sig;
}

int main(int argc, char *argv[])
{
int i;
pid_t p = getpid();
struct sigaction sa;
sigset_t ss, ssold;
union sigval sv;

sigemptyset(&ss);
for (i = SIGRTMIN; i <= SIGRTMAX; i++) {
sa.sa_sigaction = sh;
sa.sa_flags = SA_SIGINFO;
sigfillset(&sa.sa_mask);
sigaction(i, &sa, NULL);
sigaddset(&ss, i);
}

sigprocmask(SIG_BLOCK, &ss, &ssold);
switch (fork()) {
case -1:
perror("fork()");
return 1;
case 0:
srand(time(NULL));
for (i = 0; i < NUMSIG; i++) {
sv.sival_int = i;
sigqueue(p, SIG_RAND, sv);
}
return 0;
default:
wait(NULL);
}
sigprocmask(SIG_SETMASK, &ssold, NULL);

for (i = 0; i < NUMSIG; i++)
printf("SIGRTMIN + %2d, sent with number %d\n",
seq[i].sig - SIGRTMIN, seq[i].num);
return 0;
}

This modified program works correctly with the original kernel:

SIGRTMIN +  2, sent with number 5
SIGRTMIN +  3, sent with number 2
SIGRTMIN + 10, sent with number 6
SIGRTMIN + 16, sent with number 3
SIGRTMIN + 17, sent with number 4
SIGRTMIN + 18, sent with number 0
SIGRTMIN + 29, sent with number 1

I would like to explain why I got an impression the next_signal()
funtion did not work correctly. When I ran the old test program using a
vendor-specific heavily patched kernel the signals order was as the
POSIX standard specified. Another kernel, which was closer to the
vanilla kernel, did not show the expected behavior. Instead, the signals
were handled in the reversed order.

Sorry for making noise in the mailing list.

Anton

> 
> Matthieu 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] signals: real-time signals delivery order

2007-07-10 Thread Anton Salikhmetov
From: Anton Salikhmetov <[EMAIL PROTECTED]>

According to the POSIX standard, multiple real-time signals
pending to a process should be delivered in a strict order.
Specifically, the lowest-numbered signal should be delivered
first and multiple occurrences of signals with the same number
should be delivered in FIFO order.

Current Linux kernel delivers the highest-numbered signals
pending to a process first, not the lowest-numbered ones. This
contradicts to the requirement explained above. The problem can
be demonstrated by the following test program:

#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define SIG_RAND (SIGRTMIN + rand() % (SIGRTMAX - SIGRTMIN + 1))
#define NUMSIG 7

struct {
int sig, num;
} seq[NUMSIG];

void sh(int sig, siginfo_t *info, void *unused)
{
static int cnt;
seq[cnt].num = info->si_value.sival_int;
seq[cnt++].sig = sig;
}

int main(int argc, char *argv[])
{
int i;
pid_t p = getpid();
struct sigaction sa;
sigset_t ss, ssold;
union sigval sv;

sigemptyset(&ss);
for (i = SIGRTMIN; i <= SIGRTMAX; i++) {
sa.sa_sigaction = sh;
sa.sa_flags = SA_SIGINFO;
sigemptyset(&sa.sa_mask);
sigaction(i, &sa, NULL);
sigaddset(&ss, i);
}

sigprocmask(SIG_BLOCK, &ss, &ssold);
switch (fork()) {
case -1:
perror("fork()");
return 1;
case 0:
srand(time(NULL));
for (i = 0; i < NUMSIG; i++) {
sv.sival_int = i;
sigqueue(p, SIG_RAND, sv);
}
return 0;
default:
wait(NULL);
}
sigprocmask(SIG_SETMASK, &ssold, NULL);

for (i = 0; i < NUMSIG; i++)
printf("SIGRTMIN + %2d, sent with number %d\n",
seq[i].sig - SIGRTMIN, seq[i].num);
return 0;
}

Example of the program output for the original kernel:

SIGRTMIN + 23, sent with number 0
SIGRTMIN + 21, sent with number 1
SIGRTMIN + 21, sent with number 6
SIGRTMIN + 16, sent with number 2
SIGRTMIN + 15, sent with number 3
SIGRTMIN + 14, sent with number 4
SIGRTMIN +  5, sent with number 5

Example of the program output for the patched kernel:

SIGRTMIN + 2, sent with number 1
SIGRTMIN + 3, sent with number 0
SIGRTMIN + 17, sent with number 3
SIGRTMIN + 23, sent with number 5
SIGRTMIN + 25, sent with number 2
SIGRTMIN + 29, sent with number 6
SIGRTMIN + 30, sent with number 4

Patch for the Linux 2.6.22.1 kernel:

Signed-off-by: Anton Salikhmetov <[EMAIL PROTECTED]>
---
--- linux-2.6.22.1.orig/kernel/signal.c 2007-07-10 22:56:30.0
+0400
+++ linux-2.6.22.1/kernel/signal.c  2007-07-11 04:42:17.0 +0400
@@ -134,34 +134,11 @@
 
 int next_signal(struct sigpending *pending, sigset_t *mask)
 {
-   unsigned long i, *s, *m, x;
-   int sig = 0;
-   
-   s = pending->signal.sig;
-   m = mask->sig;
-   switch (_NSIG_WORDS) {
-   default:
-   for (i = 0; i < _NSIG_WORDS; ++i, ++s, ++m)
-   if ((x = *s &~ *m) != 0) {
-   sig = ffz(~x) + i*_NSIG_BPW + 1;
-   break;
-   }
-   break;
-
-   case 2: if ((x = s[0] &~ m[0]) != 0)
-   sig = 1;
-   else if ((x = s[1] &~ m[1]) != 0)
-   sig = _NSIG_BPW + 1;
-   else
-   break;
-   sig += ffz(~x);
-   break;
-
-   case 1: if ((x = *s &~ *m) != 0)
-   sig = ffz(~x) + 1;
-   break;
-   }
-   
+   unsigned long *s = pending->signal.sig, *m = mask->sig, x;
+   int sig = 0, i;
+   for (i = 0; i < _NSIG_WORDS; i++, s++, m++)
+   if ((x = (*s & ~*m)))
+   sig = fls(x) + i * _NSIG_BPW;
return sig;
 }

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/