Re: Questions regarding logging upon fsync in btrfs

Aastha Mehta Wed, 02 Oct 2013 13:13:43 -0700

On 2 October 2013 13:52, Josef Bacik <jba...@fusionio.com> wrote:
> On Tue, Oct 01, 2013 at 10:13:25PM +0200, Aastha Mehta wrote:
>> On 1 October 2013 21:42, Aastha Mehta <aasth...@gmail.com> wrote:
>> > On 1 October 2013 21:40, Aastha Mehta <aasth...@gmail.com> wrote:
>> >> On 1 October 2013 19:34, Josef Bacik <jba...@fusionio.com> wrote:
>> >>> On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote:
>> >>>> On 30 September 2013 22:47, Josef Bacik <jba...@fusionio.com> wrote:
>> >>>> > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
>> >>>> >> On 30 September 2013 22:11, Josef Bacik <jba...@fusionio.com> wrote:
>> >>>> >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
>> >>>> >> >> On 29 September 2013 15:12, Josef Bacik <jba...@fusionio.com> 
>> >>>> >> >> wrote:
>> >>>> >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
>> >>>> >> >> >> Thank you very much for the reply. That clarifies a lot of 
>> >>>> >> >> >> things.
>> >>>> >> >> >>
>> >>>> >> >> >> I was trying a small test case that opens a file, writes a 
>> >>>> >> >> >> block of
>> >>>> >> >> >> data, calls fsync and then closes the file. If I understand 
>> >>>> >> >> >> correctly,
>> >>>> >> >> >> fsync would return only after all in-memory buffers have been
>> >>>> >> >> >> committed to disk. I have added few print statements in the
>> >>>> >> >> >> __extent_writepage function, and I notice that the function 
>> >>>> >> >> >> gets
>> >>>> >> >> >> called a bit later after fsync returns. It seems that I am not
>> >>>> >> >> >> guaranteed to see the data going to disk by the time fsync 
>> >>>> >> >> >> returns.
>> >>>> >> >> >>
>> >>>> >> >> >> Am I doing something wrong, or am I looking at the wrong place 
>> >>>> >> >> >> for
>> >>>> >> >> >> disk write? This happens both with tree logging enabled as 
>> >>>> >> >> >> well as
>> >>>> >> >> >> with notreelog.
>> >>>> >> >> >>
>> >>>> >> >> >
>> >>>> >> >> > So 3.1 was a long time ago and to be sure it had issues I don't 
>> >>>> >> >> > think it was
>> >>>> >> >> > _that_ broken.  You are probably better off instrumenting a 
>> >>>> >> >> > recent kernel, 3.11
>> >>>> >> >> > or just build btrfs-next from git.  But if I were to make a 
>> >>>> >> >> > guess I'd say that
>> >>>> >> >> > __extent_writepage was how both data and metadata was written 
>> >>>> >> >> > out at the time (I
>> >>>> >> >> > don't think I changed it until 3.2 or something later) so what 
>> >>>> >> >> > you are likely
>> >>>> >> >> > seeing is the normal transaction commit after the fsync.  In 
>> >>>> >> >> > the case of
>> >>>> >> >> > notreelog we are likely starting another transaction and you 
>> >>>> >> >> > are seeing that
>> >>>> >> >> > commit (at the time the transaction kthread would start a 
>> >>>> >> >> > transaction even if
>> >>>> >> >> > none had been started yet.)  Thanks,
>> >>>> >> >> >
>> >>>> >> >> > Josef
>> >>>> >> >>
>> >>>> >> >> Is there any special handling for very small file write, less 
>> >>>> >> >> than 4K? As
>> >>>> >> >> I understand there is an optimization to inline the first extent 
>> >>>> >> >> in a file if
>> >>>> >> >> it is smaller than 4K, does it affect the writeback on fsync as 
>> >>>> >> >> well? I did
>> >>>> >> >> set the max_inline mount option to 0, but even then it seems 
>> >>>> >> >> there is
>> >>>> >> >> some difference in fsync behaviour for writing first extent of 
>> >>>> >> >> less than 4K
>> >>>> >> >> size and writing 4K or more.
>> >>>> >> >>
>> >>>> >> >
>> >>>> >> > Yeah if the file is an inline extent then it will be copied into 
>> >>>> >> > the log
>> >>>> >> > directly and the log will be written out, no going through the 
>> >>>> >> > data write path
>> >>>> >> > at all.  Max inline == 0 should make it so we don't inline, so if 
>> >>>> >> > it isn't
>> >>>> >> > honoring that then that may be a bug.  Thanks,
>> >>>> >> >
>> >>>> >> > Josef
>> >>>> >>
>> >>>> >> I tried it on 3.12-rc2 release, and it seems there is a bug then.
>> >>>> >> Please find attached logs to confirm.
>> >>>> >> Also, probably on the older release.
>> >>>> >>
>> >>>> >
>> >>>> > Oooh ok I understand, you have your printk's in the wrong place ;).
>> >>>> > do_writepages doesn't necessarily mean you are writing something.  If 
>> >>>> > you want
>> >>>> > to see if stuff got written to the disk I'd put a printk at 
>> >>>> > run_delalloc_range
>> >>>> > and have it spit out the range it is writing out since thats what we 
>> >>>> > think is
>> >>>> > actually dirty.  Thanks,
>> >>>> >
>> >>>> > Josef
>> >>>>
>> >>>> No, but I also placed dump_stack() in the beginning of
>> >>>> __extent_writepage. run_delalloc_range is being called only from
>> >>>> __extent_writepage, if it were to be called, the dump_stack() at the
>> >>>> top of __extent_writepage would have printed as well, no?
>> >>>>
>> >>>
>> >>> Ok I've done the same thing and I'm not seeing what you are seeing.  Are 
>> >>> you
>> >>> using any mount options other than notreelog and max_inline=0?  Could 
>> >>> you adjust
>> >>> your printk to print out the root objectid for the inode as well?  It 
>> >>> could be
>> >>> possible that this is the writeout for the space cache or inode cache.  
>> >>> Thanks,
>> >>>
>> >>> Josef
>> >>
>> >> I actually printed the stack only when the root objectid is 5. I have
>> >> attached another log for writing the first 500 bytes in a file. I also
>> >> print the root objectid for the inode in run_delalloc and
>> >> __extent_writepage.
>> >>
>> >> Thanks
>> >>
>> >
>> > Just to clarify, in the latest logs, I allowed printing of debug
>> > printk's and stack dump for all root objectid's.
>>
>> Actually, it is the same behaviour when I write anything less than 4K
>> long, no matter what offset, except if I straddle the page boundary.
>> To summarise:
>> 1. write 4K -> write in the fsync path
>> 2. write less than 4K, within a single page -> bdi_writeback by flush worker
>> 3. small write that straddles a page boundary or write 4K+delta -> the
>> first page gets written in the fsync path, the remaining length that
>> straddles the page boundary is written in the bdi_writeback path
>>
>> Please let me know, if I am trying out incorrect cases.
>>
>> Sorry for too many mails.
>>
>
> This has been bugging me so much I was dreaming about it and now here I am
> writing an email at 4:45 in the morning ;).  So I couldn't reproduce earlier
> with any of these scenarios and then I realized something, I'm doing something
> like this
>
> xfs_io -f -c "pwrite 0 54" -c "fsync" /mnt/btrfs-test/foo
>
> and it is working perfectly.  But I bet what you are doing is something like
> this
>
> file = fopen("/mnt/btrfs-test/foo");
> fwrite(buf, 54, 1, file);
> fsync(fileno(file));
> fclose(file);
>
> right?  Please say yes :).  If this is the case then it is likely that these
> small writes are getting buffered in the userspace buffering that comes with
> fwrite, and so when you fsync it is only flushing the data that is actually in
> the kernel, not what is buffered in userspace.  Then when you fclose it 
> flushes
> what is in the userspace buffers out to the kernel and then later on the
> background writer comes in and writes out the dirty data.  To fix this you 
> want
> to do fflush() and then fsync().  Hopefully that is what you are doing and I 
> can
> go back to sleep, thanks,
>
> Josef


Indeed!! :)

I did mention I am using f* version of the POSIX API. Sorry for the
confusion. Calling fflush before fsync seems to write everything
perfectly. It works even without notreelog option, as it should have.
I was under the misconception that fflush and fsync do the same thing.

Thanks a lot for your quick help.

Regards,

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Questions regarding logging upon fsync in btrfs

Reply via email to