Re: Questions regarding logging upon fsync in btrfs

Josef Bacik Wed, 02 Oct 2013 04:52:39 -0700

On Tue, Oct 01, 2013 at 10:13:25PM +0200, Aastha Mehta wrote:
> On 1 October 2013 21:42, Aastha Mehta <aasth...@gmail.com> wrote:
> > On 1 October 2013 21:40, Aastha Mehta <aasth...@gmail.com> wrote:
> >> On 1 October 2013 19:34, Josef Bacik <jba...@fusionio.com> wrote:
> >>> On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote:
> >>>> On 30 September 2013 22:47, Josef Bacik <jba...@fusionio.com> wrote:
> >>>> > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
> >>>> >> On 30 September 2013 22:11, Josef Bacik <jba...@fusionio.com> wrote:
> >>>> >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
> >>>> >> >> On 29 September 2013 15:12, Josef Bacik <jba...@fusionio.com> 
> >>>> >> >> wrote:
> >>>> >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
> >>>> >> >> >> Thank you very much for the reply. That clarifies a lot of 
> >>>> >> >> >> things.
> >>>> >> >> >>
> >>>> >> >> >> I was trying a small test case that opens a file, writes a 
> >>>> >> >> >> block of
> >>>> >> >> >> data, calls fsync and then closes the file. If I understand 
> >>>> >> >> >> correctly,
> >>>> >> >> >> fsync would return only after all in-memory buffers have been
> >>>> >> >> >> committed to disk. I have added few print statements in the
> >>>> >> >> >> __extent_writepage function, and I notice that the function gets
> >>>> >> >> >> called a bit later after fsync returns. It seems that I am not
> >>>> >> >> >> guaranteed to see the data going to disk by the time fsync 
> >>>> >> >> >> returns.
> >>>> >> >> >>
> >>>> >> >> >> Am I doing something wrong, or am I looking at the wrong place 
> >>>> >> >> >> for
> >>>> >> >> >> disk write? This happens both with tree logging enabled as well 
> >>>> >> >> >> as
> >>>> >> >> >> with notreelog.
> >>>> >> >> >>
> >>>> >> >> >
> >>>> >> >> > So 3.1 was a long time ago and to be sure it had issues I don't 
> >>>> >> >> > think it was
> >>>> >> >> > _that_ broken.  You are probably better off instrumenting a 
> >>>> >> >> > recent kernel, 3.11
> >>>> >> >> > or just build btrfs-next from git.  But if I were to make a 
> >>>> >> >> > guess I'd say that
> >>>> >> >> > __extent_writepage was how both data and metadata was written 
> >>>> >> >> > out at the time (I
> >>>> >> >> > don't think I changed it until 3.2 or something later) so what 
> >>>> >> >> > you are likely
> >>>> >> >> > seeing is the normal transaction commit after the fsync.  In the 
> >>>> >> >> > case of
> >>>> >> >> > notreelog we are likely starting another transaction and you are 
> >>>> >> >> > seeing that
> >>>> >> >> > commit (at the time the transaction kthread would start a 
> >>>> >> >> > transaction even if
> >>>> >> >> > none had been started yet.)  Thanks,
> >>>> >> >> >
> >>>> >> >> > Josef
> >>>> >> >>
> >>>> >> >> Is there any special handling for very small file write, less than 
> >>>> >> >> 4K? As
> >>>> >> >> I understand there is an optimization to inline the first extent 
> >>>> >> >> in a file if
> >>>> >> >> it is smaller than 4K, does it affect the writeback on fsync as 
> >>>> >> >> well? I did
> >>>> >> >> set the max_inline mount option to 0, but even then it seems there 
> >>>> >> >> is
> >>>> >> >> some difference in fsync behaviour for writing first extent of 
> >>>> >> >> less than 4K
> >>>> >> >> size and writing 4K or more.
> >>>> >> >>
> >>>> >> >
> >>>> >> > Yeah if the file is an inline extent then it will be copied into 
> >>>> >> > the log
> >>>> >> > directly and the log will be written out, no going through the data 
> >>>> >> > write path
> >>>> >> > at all.  Max inline == 0 should make it so we don't inline, so if 
> >>>> >> > it isn't
> >>>> >> > honoring that then that may be a bug.  Thanks,
> >>>> >> >
> >>>> >> > Josef
> >>>> >>
> >>>> >> I tried it on 3.12-rc2 release, and it seems there is a bug then.
> >>>> >> Please find attached logs to confirm.
> >>>> >> Also, probably on the older release.
> >>>> >>
> >>>> >
> >>>> > Oooh ok I understand, you have your printk's in the wrong place ;).
> >>>> > do_writepages doesn't necessarily mean you are writing something.  If 
> >>>> > you want
> >>>> > to see if stuff got written to the disk I'd put a printk at 
> >>>> > run_delalloc_range
> >>>> > and have it spit out the range it is writing out since thats what we 
> >>>> > think is
> >>>> > actually dirty.  Thanks,
> >>>> >
> >>>> > Josef
> >>>>
> >>>> No, but I also placed dump_stack() in the beginning of
> >>>> __extent_writepage. run_delalloc_range is being called only from
> >>>> __extent_writepage, if it were to be called, the dump_stack() at the
> >>>> top of __extent_writepage would have printed as well, no?
> >>>>
> >>>
> >>> Ok I've done the same thing and I'm not seeing what you are seeing.  Are 
> >>> you
> >>> using any mount options other than notreelog and max_inline=0?  Could you 
> >>> adjust
> >>> your printk to print out the root objectid for the inode as well?  It 
> >>> could be
> >>> possible that this is the writeout for the space cache or inode cache.  
> >>> Thanks,
> >>>
> >>> Josef
> >>
> >> I actually printed the stack only when the root objectid is 5. I have
> >> attached another log for writing the first 500 bytes in a file. I also
> >> print the root objectid for the inode in run_delalloc and
> >> __extent_writepage.
> >>
> >> Thanks
> >>
> >
> > Just to clarify, in the latest logs, I allowed printing of debug
> > printk's and stack dump for all root objectid's.
> 
> Actually, it is the same behaviour when I write anything less than 4K
> long, no matter what offset, except if I straddle the page boundary.
> To summarise:
> 1. write 4K -> write in the fsync path
> 2. write less than 4K, within a single page -> bdi_writeback by flush worker
> 3. small write that straddles a page boundary or write 4K+delta -> the
> first page gets written in the fsync path, the remaining length that
> straddles the page boundary is written in the bdi_writeback path
> 
> Please let me know, if I am trying out incorrect cases.
> 
> Sorry for too many mails.
>


This has been bugging me so much I was dreaming about it and now here I am
writing an email at 4:45 in the morning ;).  So I couldn't reproduce earlier
with any of these scenarios and then I realized something, I'm doing something
like this

xfs_io -f -c "pwrite 0 54" -c "fsync" /mnt/btrfs-test/foo

and it is working perfectly.  But I bet what you are doing is something like
this

file = fopen("/mnt/btrfs-test/foo");
fwrite(buf, 54, 1, file);
fsync(fileno(file));
fclose(file);

right?  Please say yes :).  If this is the case then it is likely that these
small writes are getting buffered in the userspace buffering that comes with
fwrite, and so when you fsync it is only flushing the data that is actually in
the kernel, not what is buffered in userspace.  Then when you fclose it flushes
what is in the userspace buffers out to the kernel and then later on the
background writer comes in and writes out the dirty data.  To fix this you want
to do fflush() and then fsync().  Hopefully that is what you are doing and I can
go back to sleep, thanks,

Josef 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Questions regarding logging upon fsync in btrfs

Reply via email to