On Thu, Dec 15, 2011 at 03:51:02PM +0400, Andrey Zonov wrote:
> On Thu, Dec 15, 2011 at 12:42 AM, Jeremy Chadwick
> <free...@jdc.parodius.com>wrote:
> 
> > On Wed, Dec 14, 2011 at 11:47:10PM +0400, Andrey Zonov wrote:
> > > On 14.12.2011 22:22, Jeremy Chadwick wrote:
> > > >On Wed, Dec 14, 2011 at 10:11:47PM +0400, Andrey Zonov wrote:
> > > >>Hi Jeremy,
> > > >>
> > > >>This is not hardware problem, I've already checked that. I also ran
> > > >>fsck today and got no errors.
> > > >>
> > > >>After some more exploration of how mongodb works, I found that when
> > > >>the listing hangs, one of the mongodb threads is in "biowr" state for
> > > >>a long time.  It periodically calls msync(MS_SYNC), according to the
> > > >>ktrace output.
> > > >>
> > > >>If I remove the msync() calls from mongodb, how often will the data
> > > >>be synced by the OS?
> > > >>
> > > >>--
> > > >>Andrey Zonov
> > > >>
> > > >>On 14.12.2011 2:15, Jeremy Chadwick wrote:
> > > >>>On Wed, Dec 14, 2011 at 01:11:19AM +0400, Andrey Zonov wrote:
> > > >>>>
> > > >>>>Have you any ideas what is going on? or how to catch the problem?
> > > >>>
> > > >>>Assuming this isn't a file on the root filesystem, try booting the
> > > >>>machine in single-user mode and using "fsck -f" on the filesystem in
> > > >>>question.
> > > >>>
> > > >>>Can you verify there's no problems with the disk this file lives on as
> > > >>>well (smartctl -a /dev/disk)?  I'm doubting this is the problem, but
> > > >>>thought I'd mention it.
> > > >
> > > >I have no real answer, I'm sorry.  msync(2) indicates it's effectively
> > > >deprecated (see BUGS).  It looks like it's essentially an mmap version
> > > >of fsync(2).
> > >
> > > I replaced msync(2) with fsync(2).  Unfortunately, it is not obvious
> > > from the man pages that I can do this.  Anyway, thanks.
> >
> > Sorry, that wasn't what I was implying.  Let me try to explain
> > differently.
> >
> > msync(2) looks, to me, like an mmap-specific version of fsync(2).  Based
> > on the man page, it seems that with msync() you can effectively
> > guarantee flushing of certain pages within an mmap()'d region to disk,
> > while fsync() flushes **all** buffers/internal pages to disk.
> >
> > One would need to look at the code to mongodb to find out what it's
> > actually doing with msync().  That is to say, if it's doing something
> > like this (I probably have the semantics wrong -- I've never spent much
> > time with mmap()):
> >
> > fd = open("/some/file", O_RDWR);
> > ptr = mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> > ret = msync(ptr, 65536, MS_SYNC);
> > /* or alternatively, this:
> > ret = msync(ptr, 0, MS_SYNC);
> > */
> >
> > Then this, to me, would be mostly the equivalent to:
> >
> > fd = open("/some/file", O_RDWR);
> > ret = fsync(fd);
> >
> > Otherwise, if it's calling msync() only on an address/location within
> > the region ptr points to, then that may be more efficient (less pages to
> > flush).
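The two cases above can be put into a minimal compilable sketch (the path, file size, and function name here are illustrative, and error handling is condensed):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a scratch file, dirty it through a shared mapping, then flush
 * first a single page and then the whole region.  Returns 0 on success. */
int msync_demo(const char *path, size_t len)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd == -1)
		return (-1);
	if (ftruncate(fd, (off_t)len) == -1) {
		close(fd);
		return (-1);
	}
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return (-1);
	}
	memset(p, 'x', len);			/* dirty every page */

	long pg = sysconf(_SC_PAGESIZE);
	int rc = msync(p, (size_t)pg, MS_SYNC);	/* flush only the first page */
	if (rc == 0)
		rc = msync(p, len, MS_SYNC);	/* flush the whole mapping,
						 * roughly like fsync(fd) */
	munmap(p, len);
	close(fd);
	unlink(path);
	return (rc);
}
```

The sub-range call only has one dirty page to write out; the whole-mapping call has to visit every page, which is where it converges with fsync() on the backing file.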
> >
> 
> They call msync() for the whole file.  So, there will not be any difference.
> 
> 
> > The mmap() arguments -- specifically flags (see man page) -- also play
> > a role here.  The one that catches my attention is MAP_NOSYNC.  So you
> > may need to look at the mongodb code to figure out what its mmap()
> > call looks like.
> >
> > One might wonder why they don't just use open() with O_SYNC.  I
> > imagine that has to do with, again, performance; possibly they don't
> > want all I/O synchronous, and would rather flush certain pages in the
> > mmap'd region to disk as needed.  I see the legitimacy in that
> > approach (vs. just using O_SYNC).
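As a sketch of that O_SYNC alternative (path and function name are illustrative): with O_SYNC, every write(2) blocks until the data is on stable storage, so there is no way to pick which ranges get flushed -- presumably why an mmap()+msync() scheme looks attractive instead.

```c
#include <fcntl.h>
#include <unistd.h>

/* Write a buffer through an O_SYNC descriptor.  Each write(2) here
 * returns only after the data has reached stable storage; *all* I/O on
 * the descriptor is synchronous, not just chosen ranges. */
int write_osync(const char *path, const char *buf, size_t n)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
	if (fd == -1)
		return (-1);
	ssize_t w = write(fd, buf, n);	/* blocks until the I/O completes */
	close(fd);
	unlink(path);
	return (w == (ssize_t)n ? 0 : -1);
}
```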
> >
> > There's really no easy way for me to tell you which is more efficient,
> > better, blah blah without spending a lot of time with a benchmarking
> > program that tests all of this, *plus* an entire system (world) built
> > with profiling.
> >
> 
> I ran mongodb with fsync() for two hours and got the following:
> STARTED                      INBLK OUBLK MAJFLT MINFLT
> Thu Dec 15 10:34:52 2011         3 192744    314 3080182
> 
> This is output of `ps -o lstart,inblock,oublock,majflt,minflt -U mongodb'.
> 
> Then I ran it with default msync():
> STARTED                      INBLK OUBLK MAJFLT MINFLT
> Thu Dec 15 12:34:53 2011         0 7241555     79 5401945
> 
> There are also two graphs of disk busyness [1] [2].
> 
> The difference is significant: a factor of 37!  That's what I expected to get.
> 
> In commentaries for vm_object_page_clean() I found this:
> 
>  *      When stuffing pages asynchronously, allow clustering.  XXX we need a
>  *      synchronous clustering mode implementation.
> 
> It means to me that msync(MS_SYNC) flushes every page to disk in its own
> I/O transaction.  If we multiply 4K by 37 we get ~150K.  That number
> matches the size of a single clustered transaction in my experience.
> 
> +alc@, kib@
> 
> Am I right? Is there any plan to implement this?
The current buffer clustering code can only do async writes.  In fact, I
am not quite sure what would constitute sync clustering, because the
ability to delay a write is essential to being able to cluster at all.

Also, I am not sure that the lack of clustering is the biggest problem.
IMO, the fact that each write is synchronous is the first problem there.
It would be quite a bit of work to add tracking of the issued writes to
vm_object_page_clean() and down the stack, especially due to the custom
page-write VOPs in several filesystems.

The only guarantee that POSIX requires from msync(MS_SYNC) is that the
writes are finished when the syscall returns, not that the writes are
done synchronously.  Below is a hack which should help if the msync()ed
region contains a mapping of the whole file, since it is then possible to
fsync() the file after all the writes have been scheduled asynchronously.
It will cause an unneeded metadata update, but I think it would still be
much faster.
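For reference, the same schedule-async-then-wait idea can be sketched from userland, assuming the mapping covers the whole file (function names, the driver, and the scratch path are illustrative, not part of the patch):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Schedule the writes for the whole mapping asynchronously (async writes
 * may be clustered), then wait for them with fsync().  For a whole-file
 * mapping this should satisfy the msync(MS_SYNC) guarantee -- data on
 * disk by the time we return -- at the cost of an extra metadata update
 * from fsync(). */
int msync_via_fsync(void *base, size_t len, int fd)
{
	if (msync(base, len, MS_ASYNC) == -1)
		return (-1);
	return (fsync(fd));
}

/* Small driver: map a scratch file, dirty it, flush it the fast way. */
int msync_via_fsync_demo(const char *path, size_t len)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd == -1)
		return (-1);
	if (ftruncate(fd, (off_t)len) == -1) {
		close(fd);
		return (-1);
	}
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return (-1);
	}
	memset(p, 'x', len);
	int rc = msync_via_fsync(p, len, fd);
	munmap(p, len);
	close(fd);
	unlink(path);
	return (rc);
}
```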


diff --git a/sys/vm/vm_object.c b/sys/vm/vm_object.c
index 250b769..a9de554 100644
--- a/sys/vm/vm_object.c
+++ b/sys/vm/vm_object.c
@@ -938,7 +938,7 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
        vm_object_t backing_object;
        struct vnode *vp;
        struct mount *mp;
-       int flags;
+       int flags, fsync_after;
 
        if (object == NULL)
                return;
@@ -971,11 +971,26 @@ vm_object_sync(vm_object_t object, vm_ooffset_t offset, vm_size_t size,
                (void) vn_start_write(vp, &mp, V_WAIT);
                vfslocked = VFS_LOCK_GIANT(vp->v_mount);
                vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
-               flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
-               flags |= invalidate ? OBJPC_INVAL : 0;
+               if (syncio && !invalidate && offset == 0 &&
+                   OFF_TO_IDX(size) == object->size) {
+                       /*
+                        * If syncing the whole mapping of the file,
+                        * it is faster to schedule all the writes in
+                        * async mode, also allowing the clustering,
+                        * and then wait for i/o to complete.
+                        */
+                       flags = 0;
+                       fsync_after = TRUE;
+               } else {
+                       flags = (syncio || invalidate) ? OBJPC_SYNC : 0;
+                       flags |= invalidate ? (OBJPC_SYNC | OBJPC_INVAL) : 0;
+                       fsync_after = FALSE;
+               }
                VM_OBJECT_LOCK(object);
                vm_object_page_clean(object, offset, offset + size, flags);
                VM_OBJECT_UNLOCK(object);
+               if (fsync_after)
+                       (void) VOP_FSYNC(vp, MNT_WAIT, curthread);
                VOP_UNLOCK(vp, 0);
                VFS_UNLOCK_GIANT(vfslocked);
                vn_finished_write(mp);
