On 1 October 2013 21:42, Aastha Mehta <aasth...@gmail.com> wrote:
> On 1 October 2013 21:40, Aastha Mehta <aasth...@gmail.com> wrote:
>> On 1 October 2013 19:34, Josef Bacik <jba...@fusionio.com> wrote:
>>> On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote:
>>>> On 30 September 2013 22:47, Josef Bacik <jba...@fusionio.com> wrote:
>>>> > On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
>>>> >> On 30 September 2013 22:11, Josef Bacik <jba...@fusionio.com> wrote:
>>>> >> > On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
>>>> >> >> On 29 September 2013 15:12, Josef Bacik <jba...@fusionio.com> wrote:
>>>> >> >> > On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
>>>> >> >> >> Thank you very much for the reply. That clarifies a lot of things.
>>>> >> >> >>
>>>> >> >> >> I was trying a small test case that opens a file, writes a block of data, calls fsync and then closes the file. If I understand correctly, fsync would return only after all in-memory buffers have been committed to disk. I have added few print statements in the __extent_writepage function, and I notice that the function gets called a bit later after fsync returns. It seems that I am not guaranteed to see the data going to disk by the time fsync returns.
>>>> >> >> >>
>>>> >> >> >> Am I doing something wrong, or am I looking at the wrong place for disk write? This happens both with tree logging enabled as well as with notreelog.
>>>> >> >> >>
>>>> >> >> >
>>>> >> >> > So 3.1 was a long time ago and to be sure it had issues I don't think it was _that_ broken. You are probably better off instrumenting a recent kernel, 3.11 or just build btrfs-next from git. But if I were to make a guess I'd say that __extent_writepage was how both data and metadata was written out at the time (I don't think I changed it until 3.2 or something later) so what you are likely seeing is the normal transaction commit after the fsync. In the case of notreelog we are likely starting another transaction and you are seeing that commit (at the time the transaction kthread would start a transaction even if none had been started yet.) Thanks,
>>>> >> >> >
>>>> >> >> > Josef
>>>> >> >>
>>>> >> >> Is there any special handling for very small file write, less than 4K? As I understand there is an optimization to inline the first extent in a file if it is smaller than 4K, does it affect the writeback on fsync as well? I did set the max_inline mount option to 0, but even then it seems there is some difference in fsync behaviour for writing first extent of less than 4K size and writing 4K or more.
>>>> >> >>
>>>> >> >
>>>> >> > Yeah if the file is an inline extent then it will be copied into the log directly and the log will be written out, no going through the data write path at all. Max inline == 0 should make it so we don't inline, so if it isn't honoring that then that may be a bug. Thanks,
>>>> >> >
>>>> >> > Josef
>>>> >>
>>>> >> I tried it on 3.12-rc2 release, and it seems there is a bug then.
>>>> >> Please find attached logs to confirm.
>>>> >> Also, probably on the older release.
>>>> >>
>>>> >
>>>> > Oooh ok I understand, you have your printk's in the wrong place ;). do_writepages doesn't necessarily mean you are writing something. If you want to see if stuff got written to the disk I'd put a printk at run_delalloc_range and have it spit out the range it is writing out since thats what we think is actually dirty. Thanks,
>>>> >
>>>> > Josef
>>>>
>>>> No, but I also placed dump_stack() in the beginning of __extent_writepage. run_delalloc_range is being called only from __extent_writepage, if it were to be called, the dump_stack() at the top of __extent_writepage would have printed as well, no?
>>>>
>>>
>>> Ok I've done the same thing and I'm not seeing what you are seeing. Are you using any mount options other than notreelog and max_inline=0? Could you adjust your printk to print out the root objectid for the inode as well? It could be possible that this is the writeout for the space cache or inode cache. Thanks,
>>>
>>> Josef
>>
>> I actually printed the stack only when the root objectid is 5. I have attached another log for writing the first 500 bytes in a file. I also print the root objectid for the inode in run_delalloc and __extent_writepage.
>>
>> Thanks
>>
>
> Just to clarify, in the latest logs, I allowed printing of debug printk's and stack dump for all root objectid's.
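The test case described at the start of the thread (open a file, write a block of data, call fsync, close) can be sketched as the userspace reproducer below. It is a hypothetical sketch, not the program used in the thread: the function name `write_and_fsync`, the file names and the temporary directory are assumptions. To exercise btrfs, the directory would have to live on a btrfs mount (for example one mounted with `notreelog,max_inline=0`, the options discussed above).

```python
import os
import tempfile


def write_and_fsync(path, offset, length, fill=b"a"):
    """Open `path`, write `length` bytes at `offset`, fsync, then close.

    Returns the file size observed after fsync() has returned.
    Hypothetical helper for illustration only.
    """
    data = fill * length
    # No O_SYNC or O_DIRECT: the write lands in the page cache, so it is
    # fsync() that has to push it (and the metadata to find it) to disk.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        written = os.pwrite(fd, data, offset)
        if written != length:
            raise OSError(f"short write: {written} of {length} bytes")
        # fsync() should not return until the data is on stable storage.
        os.fsync(fd)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical location; point this at a btrfs mount to test btrfs.
    directory = tempfile.mkdtemp()
    # The two sizes compared in the thread: a full 4K block and 500 bytes.
    for name, offset, length in [("full-block", 0, 4096),
                                 ("small-write", 0, 500)]:
        size = write_and_fsync(os.path.join(directory, name), offset, length)
        print(f"{name}: offset={offset} length={length} size={size}")
```

The reproducer only drives the system calls; which kernel path actually writes the pages still has to be observed with the printk/dump_stack() instrumentation discussed in the thread.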
Actually, it is the same behaviour whenever I write less than 4K, at any offset, unless the write straddles a page boundary. To summarise:

1. write 4K -> written in the fsync path
2. write less than 4K, within a single page -> written later by the flush worker (bdi_writeback)
3. small write that straddles a page boundary, or a write of 4K+delta -> the first page is written in the fsync path; the remainder past the page boundary is written in the bdi_writeback path

Please let me know if I am testing the wrong cases. Sorry for the many mails.

Thanks
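To make the page-boundary distinction in the summary concrete, the helper below splits a write into the pages it touches and marks whether each page is fully covered. It is plain page arithmetic under an assumed 4096-byte page size; `page_spans` is a hypothetical name, and the helper does not model which btrfs path (fsync or bdi_writeback) ends up writing each page.

```python
PAGE_SIZE = 4096  # assumed page size (4 KiB)


def page_spans(offset, length, page_size=PAGE_SIZE):
    """Split the byte range [offset, offset + length) into per-page pieces.

    Each piece is (page_index, start_in_page, end_in_page, full), where
    `full` is True when the write covers the entire page.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    spans = []
    pos, end = offset, offset + length
    while pos < end:
        index = pos // page_size
        page_start = index * page_size
        # This piece stops at the end of the page or the end of the write,
        # whichever comes first.
        piece_end = min(end, page_start + page_size)
        start_in_page = pos - page_start
        end_in_page = piece_end - page_start
        full = start_in_page == 0 and end_in_page == page_size
        spans.append((index, start_in_page, end_in_page, full))
        pos = piece_end
    return spans
```

For example, `page_spans(0, 4096 + 100)` gives `[(0, 0, 4096, True), (1, 0, 100, False)]`, one full page plus a partial tail page as in the 4K+delta case, while a small write at offset 4000 of 200 bytes gives two partial pieces, `(0, 4000, 4096, False)` and `(1, 0, 104, False)`.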