How do I find the physical block number?
Hello, I have created a 500GB partition on my HDD and formatted it for btrfs. I created a file on it:

  # echo "tmp data in the tmp file.." > /mnt/btrfs/tmp-file
  # umount /mnt/btrfs

Next I want to know the blocks allocated for the file, so I used filefrag. I get the following information:

  # mount -o max_inline=0 /dev/sdc2 /mnt/btrfs
  # filefrag -v /mnt/btrfs/tmp-file
  Filesystem type is: 9123683e
  File size of /mnt/btrfs/tmp-file is 27 (1 block, blocksize 4096)
   ext  logical  physical  expected  length  flags
     0        0  65924123                 1  eof
  /mnt/btrfs/tmp-file: 1 extent found

Now I want to read the same data from the disk directly. I tried the following:

  block 65924123 = byte (65924123 * 4096) = 270025207808

  # dd if=/dev/sdc2 of=tmp-file skip=270025207808 bs=1 count=4096
  # cat tmp-file

I cannot read the file's contents, only garbage. I read somewhere that the physical block number shown by filefrag may actually be a logical block number for the file system, with an additional translation to a physical block number. So next I tried the following:

  # btrfs-map-logical -l 65924123 /dev/sdc2
  mirror 1 logical 65924123 physical 74312731 device /dev/sdc2
  mirror 2 logical 65924123 physical 1148054555 device /dev/sdc2

I again tried reading block 74312731 using dd as above, but it is still not the right block. I want to know what the physical block number returned by filefrag means, why there are two mappings for the above block number, and how I can find the exact physical disk block number the file system actually writes to. My sdc has the following partitions:

  Device     Boot      Start        End     Blocks  Id  System
  /dev/sdc1             2048  419432447  209715200  83  Linux
  /dev/sdc2       1468008448 2516584447  524288000  83  Linux (BTRFS)
  /dev/sdc3        419432448 1468008447  524288000  83  Linux

Thanks, Aastha.
Re: How do I find the physical block number?
On 16 April 2014 15:27, Aastha Mehta aasth...@gmail.com wrote:
> Hello, I have created a 500GB partition on my HDD and formatted it for
> btrfs. [snip -- full question quoted above]

I realized my mistake in using the btrfs-map-logical command. It should have been:

  # btrfs-map-logical -l 270025207808 /dev/sdc2

Now everything works fine. Please ignore my post, except that it may be useful for somebody else needing this information in the future.

Thanks, Aastha.
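For future readers, the working recipe in one place. A minimal sketch using the numbers and the btrfs-map-logical output format shown in this thread; the physical offset it extracts is whatever your own run prints for mirror 1, and the division assumes a 4 KiB-aligned extent:

  # filefrag reports the extent in 4096-byte filesystem blocks;
  # btrfs-map-logical expects a byte offset, so convert first.
  LOGICAL=$((65924123 * 4096))    # = 270025207808 bytes

  # Map the filesystem-logical byte offset to a physical byte offset
  # on the device (field 6 of the "mirror 1 ..." output line).
  PHYSICAL=$(btrfs-map-logical -l "$LOGICAL" /dev/sdc2 | awk '/mirror 1/ {print $6}')

  # Read one 4 KiB block at that physical offset. bs=4096 with skip in
  # blocks is equivalent to bs=1 with skip in bytes, just much faster.
  dd if=/dev/sdc2 of=tmp-file bs=4096 skip=$((PHYSICAL / 4096)) count=1
  cat tmp-file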
Re: questions regarding fsync in btrfs
On 25 January 2014 16:21, Josef Bacik jba...@fb.com wrote:
> On 01/24/2014 07:09 PM, Aastha Mehta wrote:
>> Hello, I would like to clarify a bit on how fsync works in btrfs. The
>> log tree journals only the metadata of the files that have been
>> modified prior to the fsync, correct? It does not log the data extents
>> of files, which are directly sync'ed to the disk. Also, if I understand
>> correctly, fsync and fdatasync are the same thing in btrfs currently.
>> Is it more like fsync or fdatasync?
>
> More like fsync. Because we cow we always are updating metadata so
> there is no fdatasync, we can't get away with just flushing the data.
>
>> What exactly happens once a file inode is in the tree log? Does it mean
>> it is guaranteed to be persisted on disk, or is it already on disk? I
>> see two flags in btrfs_sync_file - BTRFS_INODE_HAS_ASYNC_EXTENT and
>> BTRFS_INODE_NEEDS_FULL_SYNC. I do not fully understand them. After full
>> sync, what do log_dentry_safe and sync_log do?
>
> It is guaranteed to be on disk. We copy all of the inode metadata to
> the log, sync the log and the data and the super block that points to
> the tree log. HAS_ASYNC_EXTENT is for compression where we will return
> to writepages without actually having marked the page as writeback, so
> we need to go back and re-lock the pages to make sure it has passed
> through the async compression threads and the pages have been properly
> marked writeback so we can wait on them properly. NEEDS_FULL_SYNC means
> we can't do our fancy tricks of only updating some of the metadata, we
> have to go and copy all of the inode metadata (the inode, its
> references, its xattrs) and all of its extents. log_dentry_safe copies
> all the info into the tree log, and sync_log syncs the tree log to disk
> and writes out a super that points to the tree log.
>
>> Finally, Wikipedia says that the items in the log tree are replayed and
>> deleted at the next full tree commit or (if there was a system crash)
>> at the next remount. Even if there is no crash, why is there a need to
>> replay the log?
>
> There isn't. Once we commit a transaction we commit a super that
> doesn't point to the tree log, and we free up the blocks we used for
> the tree log. The tree log only exists for one transaction; if we crash
> before a transaction commits we will see that there is a tree log on
> the next mount and replay it. If we commit the transaction we simply
> free the tree log and carry on.
>
> Thanks, Josef

Thank you for your response. I ran a few small experiments and I see that fsync on average leads to the writing of about 30-40KB of metadata, irrespective of the amount of data changed. I wonder why it is so much? Besides the superblocks and a couple of blocks in the tree log, what else may be updated? Also, why does it seem to be independent of the amount of writes?

Thanks, Aastha.
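For anyone wanting to reproduce the measurement described above, here is a rough sketch. The device name and mount point are assumptions, and the /sys/block counters include all traffic to the disk, so this only makes sense on an otherwise idle device:

  DEV=sdc   # assumed: the disk backing the btrfs mount

  # Field 7 of /sys/block/<dev>/stat is the number of sectors written.
  before=$(awk '{print $7}' /sys/block/$DEV/stat)

  # A tiny write followed by fsync (conv=fsync makes dd call fsync).
  dd if=/dev/zero of=/mnt/btrfs/tmp-file bs=100 count=1 conv=notrunc,fsync 2>/dev/null

  after=$(awk '{print $7}' /sys/block/$DEV/stat)
  echo "$(( (after - before) * 512 )) bytes hit the disk for a 100-byte write"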
questions regarding fsync in btrfs
Hello, I would like to clarify a bit on how fsync works in btrfs. The log tree journals only the metadata of the files that have been modified prior to the fsync, correct? It does not log the data extents of files, which are directly sync'ed to the disk. Also, if I understand correctly, fsync and fdatasync are the same thing in btrfs currently. Is it more like fsync or fdatasync?

What exactly happens once a file inode is in the tree log? Does it mean it is guaranteed to be persisted on disk, or is it already on disk? I see two flags in btrfs_sync_file - BTRFS_INODE_HAS_ASYNC_EXTENT and BTRFS_INODE_NEEDS_FULL_SYNC. I do not fully understand them. After full sync, what do log_dentry_safe and sync_log do?

Finally, Wikipedia says that the items in the log tree are replayed and deleted at the next full tree commit or (if there was a system crash) at the next remount. Even if there is no crash, why is there a need to replay the log?

Thanks, Aastha.
question regarding caching
Hello, I have some questions regarding caching in BTRFS. When a file system is unmounted and mounted again, would all the previously cached content be removed from the cache after flushing to disk? After remounting, would the initial requests always be fetched from the disk?

Rather than a local disk, I have a remote device to which my IO requests are sent and from which the data is fetched. I need certain data to be fetched from the remote device after a remount, but somehow I do not see any request appearing at the device. I even tried drop_caches after remounting the file system, but that does not seem to help.

I guess my problem is not related to BTRFS, but since I am working with BTRFS, I wanted to ask here for help. Could anyone tell me how I can ensure that requests are fetched from the (remote) device, especially after a file system remount, without having to use drop_caches? Please let me know if I described the problem too vaguely and should give more details.

Wishing everyone a happy new year.

Thanks and regards, Aastha.

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
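Not an answer to the remote-device part, but for reference, these are the usual knobs for forcing reads to go to the device again (a sketch; device and mount point are assumptions):

  # A full unmount/mount drops the cached pages for that filesystem's
  # files; note that "mount -o remount" alone does not.
  umount /mnt/btrfs
  mount /dev/sdc2 /mnt/btrfs

  # Or drop clean page cache, dentries and inodes system-wide.
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # Or bypass the page cache entirely for one read with O_DIRECT.
  dd if=/mnt/btrfs/tmp-file of=/dev/null bs=4096 iflag=direct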
Re: Questions regarding logging upon fsync in btrfs
On 2 October 2013 13:52, Josef Bacik jba...@fusionio.com wrote:
> On Tue, Oct 01, 2013 at 10:13:25PM +0200, Aastha Mehta wrote:
>> [snip -- earlier exchange quoted in full in the messages below]
>
> Ok I've done the same thing and I'm not seeing what you are seeing. Are
> you using any mount options other than notreelog and max_inline=0?
> Could you adjust your printk to print out the root objectid for the
> inode as well? It could be possible that this is the writeout for the
> space cache or inode cache.
>
> Thanks, Josef

I actually printed the stack only when the root objectid is 5. I have attached another log for writing the first 500 bytes in a file. I also print the root objectid for the inode in run_delalloc and __extent_writepage.

Just to clarify, in the latest logs I allowed printing of debug printk's and stack dumps for all root objectid's. Actually, it is the same behaviour whenever I write anything less than 4K long, no matter at what offset, except if I straddle the page boundary. To summarise:

1. write 4K - write in the fsync path
2. write less than 4K, within a single page - bdi_writeback by the flush worker
3. small write that straddles a page boundary, or write 4K+delta - the first page gets written in the fsync path, the remaining length that straddles the page boundary is written in the bdi_writeback path

Please let me know if I am trying out incorrect cases. Sorry for too many mails.

Thanks
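The three cases above can be reproduced with dd; a sketch, with the file path an assumption and conv=fsync supplying the fsync after each write:

  F=/mnt/btrfs/tmp-file

  # Case 1: a full 4K block -- observed going out in the fsync path.
  dd if=/dev/zero of=$F bs=4096 count=1 conv=notrunc,fsync

  # Case 2: less than 4K within a single page -- observed being written
  # later by the bdi flush worker.
  dd if=/dev/zero of=$F bs=500 count=1 conv=notrunc,fsync

  # Case 3: 4K plus a little, straddling a page boundary -- first page
  # in the fsync path, the tail via bdi writeback.
  dd if=/dev/zero of=$F bs=4196 count=1 conv=notrunc,fsync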
Re: Questions regarding logging upon fsync in btrfs
On 1 October 2013 19:34, Josef Bacik jba...@fusionio.com wrote:
> On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote:
>> [snip -- earlier exchange quoted in full in the messages below]
>>
>> No, but I also placed dump_stack() in the beginning of
>> __extent_writepage. run_delalloc_range is being called only from
>> __extent_writepage; if it were to be called, the dump_stack() at the
>> top of __extent_writepage would have printed as well, no?
>
> Ok I've done the same thing and I'm not seeing what you are seeing. Are
> you using any mount options other than notreelog and max_inline=0?
> Could you adjust your printk to print out the root objectid for the
> inode as well? It could be possible that this is the writeout for the
> space cache or inode cache.
>
> Thanks, Josef

I actually printed the stack only when the root objectid is 5. I have attached another log for writing the first 500 bytes in a file. I also print the root objectid for the inode in run_delalloc and __extent_writepage.

Thanks

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org

write success, size: 500
fsync:: Success
file size: 500
fclose:: Success
[  829.847483] btrfs: device fsid 4d2767ea-85cc-443c-88bd-2d8c693f7c7c devid 1 transid 2201 /dev/sdd
[  829.848199] btrfs: max_inline at 0
[  829.848201] btrfs: disabling tree log
[  829.848202] btrfs: disk space caching is enabled
[  835.266483] [do_writepages] inum: 258, nr_to_write: 9223372036854775807, writepages fn: 8129ce40
[  835.266492] CPU: 0 PID: 10021 Comm: bb Not tainted 3.12.0-rc2-latest+ #7
[  835.266494] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[  835.266496] 88003cdc5e60 88003cdc5e28 817a8acf 88003fc0eb48
[  835.266511] 88001f478b28 88003cdc5e48 81105c1f 88001f4789e0
[  835.266515] 880037557400 88003cdc5e88 810fb689 88003fc12c00
Re: Questions regarding logging upon fsync in btrfs
On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
> On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
>> Thank you very much for the reply. That clarifies a lot of things. I
>> was trying a small test case that opens a file, writes a block of
>> data, calls fsync and then closes the file. If I understand correctly,
>> fsync would return only after all in-memory buffers have been
>> committed to disk. I have added a few print statements in the
>> __extent_writepage function, and I notice that the function gets
>> called a bit later, after fsync returns. It seems that I am not
>> guaranteed to see the data going to disk by the time fsync returns. Am
>> I doing something wrong, or am I looking at the wrong place for the
>> disk write? This happens both with tree logging enabled as well as
>> with notreelog.
>
> So 3.1 was a long time ago and to be sure it had issues I don't think
> it was _that_ broken. You are probably better off instrumenting a
> recent kernel, 3.11, or just build btrfs-next from git. But if I were
> to make a guess I'd say that __extent_writepage was how both data and
> metadata was written out at the time (I don't think I changed it until
> 3.2 or something later), so what you are likely seeing is the normal
> transaction commit after the fsync. In the case of notreelog we are
> likely starting another transaction and you are seeing that commit (at
> the time the transaction kthread would start a transaction even if
> none had been started yet.)
>
> Thanks, Josef

Is there any special handling for very small file writes, less than 4K? As I understand, there is an optimization to inline the first extent of a file if it is smaller than 4K; does it affect the writeback on fsync as well? I did set the max_inline mount option to 0, but even then it seems there is some difference in fsync behaviour between writing a first extent of less than 4K and writing 4K or more.

Thanks, Aastha.

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
Re: Questions regarding logging upon fsync in btrfs
On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
> On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
>> [snip -- quoted exchange as in the previous message]
>>
>> Is there any special handling for very small file writes, less than
>> 4K? [snip]
>
> Yeah if the file is an inline extent then it will be copied into the
> log directly and the log will be written out, no going through the
> data write path at all. Max inline == 0 should make it so we don't
> inline, so if it isn't honoring that then that may be a bug.
>
> Thanks, Josef

I tried it on the 3.12-rc2 release, and it seems there is a bug then. Please find the attached logs to confirm. Also, probably on the older release.

Thanks, Aastha.

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org

[ 2808.125838] [do_writepages] inum: 260, nr_to_write: 9223372036854775807, writepages fn: 8129ce40
[ 2808.125977] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4
[ 2808.125980] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[ 2808.125983] 88003d6b5e60 88003d6b5e28 817a8a6f 88003fc0eb48
[ 2808.125987] 88002e325b48 88003d6b5e48 81105c1f 88002e325a00
[ 2808.125990] 880037b1f300 88003d6b5e88 810fb689 88003fc12c00
[ 2808.125993] Call Trace:
[ 2808.126003] [817a8a6f] dump_stack+0x46/0x58
[ 2808.126010] [81105c1f] do_writepages+0x4f/0x90
[ 2808.126013] [810fb689] __filemap_fdatawrite_range+0x49/0x50
[ 2808.126013] [810fc41e] filemap_fdatawrite_range+0xe/0x10
[ 2808.126013] [812aafdc] btrfs_sync_file+0xac/0x380
[ 2808.126098] [817aedd2] ? __schedule+0x692/0x730
[ 2808.126115] [811752d8] do_fsync+0x58/0x90
[ 2808.126149] [81148a6d] ? SyS_write+0x4d/0xa0
[ 2808.126165] [8117565b] SyS_fsync+0xb/0x10
[ 2808.126182] [817b83a2] system_call_fastpath+0x16/0x1b
[ 2808.127447] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4
[ 2808.127450] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[ 2808.127452] ea802440 88003d6b5ba8 817a8a6f 88003d6b5de0
[ 2808.127456] 88002e325870 88003d6b5cb8 812b7c9f 88003e2083d0
[ 2808.127459] 88003d6b5c48 817aedd2 88003c3707b0
[ 2808.127462] Call Trace:
[ 2808.127467] [817a8a6f] dump_stack+0x46/0x58
[ 2808.127473] [812b7c9f] __extent_writepage+0x6f/0x750
[ 2808.127479] [817aedd2] ? __schedule+0x692/0x730
[ 2808.127483] [810fac71] ? find_get_pages_tag+0x121/0x160
[ 2808.127487] [812b8552] extent_write_cache_pages.isra.24.constprop.31+0x1d2/0x380
[ 2808.127492] [81005877] ? show_trace_log_lvl+0x57/0x70
[ 2808.127496] [812b89d9] extent_writepages+0x49/0x60
[ 2808.127501] [8129f160] ? btrfs_submit_direct+0x6c0/0x6c0
[ 2808.127505] [8129ce63] btrfs_writepages+0x23/0x30
[ 2808.127510] [81105c3e] do_writepages+0x6e/0x90
[ 2808.127513] [810fb689] __filemap_fdatawrite_range+0x49/0x50
[ 2808.127517] [810fc41e] filemap_fdatawrite_range+0xe/0x10
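As a side note, whether a small file actually ended up inline can be checked from userspace with filefrag, which flags inline extents in its output; a sketch reusing the mount options from this thread (device and path are assumptions):

  mount -o max_inline=0,notreelog /dev/sdd /mnt/btrfs
  echo "small" > /mnt/btrfs/small-file
  sync

  # An inline extent shows up in the flags column as "inline"; with
  # max_inline=0 honoured, this should instead report an ordinary
  # 1-block extent.
  filefrag -v /mnt/btrfs/small-file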
Re: Questions regarding logging upon fsync in btrfs
On 30 September 2013 22:47, Josef Bacik jba...@fusionio.com wrote:
> On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
>> [snip -- quoted exchange as in the previous messages]
>>
>> I tried it on the 3.12-rc2 release, and it seems there is a bug then.
>> Please find the attached logs to confirm. Also, probably on the older
>> release.
>
> Oooh ok I understand, you have your printk's in the wrong place ;).
> do_writepages doesn't necessarily mean you are writing something. If
> you want to see if stuff got written to the disk I'd put a printk at
> run_delalloc_range and have it spit out the range it is writing out,
> since thats what we think is actually dirty.
>
> Thanks, Josef

No, but I also placed dump_stack() in the beginning of __extent_writepage. run_delalloc_range is being called only from __extent_writepage; if it were to be called, the dump_stack() at the top of __extent_writepage would have printed as well, no?

Thanks

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
Re: Questions regarding logging upon fsync in btrfs
Thank you very much for the reply. That clarifies a lot of things.

I was trying a small test case that opens a file, writes a block of data, calls fsync and then closes the file. If I understand correctly, fsync would return only after all in-memory buffers have been committed to disk. I have added a few print statements in the __extent_writepage function, and I notice that the function gets called a bit later, after fsync returns. It seems that I am not guaranteed to see the data going to disk by the time fsync returns. Am I doing something wrong, or am I looking at the wrong place for the disk write? This happens both with tree logging enabled as well as with notreelog.

Thanks

On 29 September 2013 02:42, Josef Bacik jba...@fusionio.com wrote:
> On Sun, Sep 29, 2013 at 01:35:15AM +0200, Aastha Mehta wrote:
>> Hi, I have a few questions regarding logging triggered by calling
>> fsync in BTRFS:
>>
>> 1. If I understand correctly, fsync will log the entire inode in the
>> log tree. Does this mean that the data extents are also logged into
>> the log tree? Are they copied into the log tree, or just referenced?
>> Are they copied into the subvolume's extent tree again upon replay?
>
> The data extents are copied as well, as in the metadata that points to
> the data, not the actual data itself. For 3.1 it's all of the extents
> in the inode; from 3.8 on it's only the extents that have changed this
> transaction.
>
>> 2. During replay, when the extents are added into the extent
>> allocation tree, do they acquire the physical extent number during
>> replay? Does the physical extent allocated to the data in the log
>> tree differ from that in the subvolume?
>
> No, the physical location was picked when we wrote the data out during
> fsync. If we crash and re-mount, the replay will just insert the ref
> into the extent tree for the disk offset as it replays the extents.
>
>> 3. I see there is a mount option of notreelog available. After
>> disabling tree logging, does fsync still lead to flushing of buffers
>> to the disk directly?
>
> notreelog just means that we write the data and wait on the ordered
> data extents and then commit the transaction. So you get the data for
> the inode you are fsyncing and all of the metadata for the entire file
> system that has changed in that transaction.
>
>> 4. Is it possible to selectively identify certain files in the log
>> tree and flush them to disk directly, without waiting for the replay
>> to do it?
>
> I don't understand this question. Replay only happens on mount after a
> crash/power loss, and everything is replayed that is in the log; there
> is no way to select which inode is replayed.
>
> Thanks, Josef

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
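For completeness, the notreelog behaviour described above is just a mount option, so the two fsync paths can be compared directly; a sketch with an assumed device and mount point:

  # Default: fsync goes through the tree log.
  mount /dev/sdc2 /mnt/btrfs

  # With the tree log disabled, fsync falls back to a full transaction
  # commit: the fsynced data plus all metadata changed in that
  # transaction.
  umount /mnt/btrfs
  mount -o notreelog /dev/sdc2 /mnt/btrfs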
Questions regarding logging upon fsync in btrfs
Hi, I have a few questions regarding logging triggered by calling fsync in BTRFS:

1. If I understand correctly, fsync will log the entire inode in the log tree. Does this mean that the data extents are also logged into the log tree? Are they copied into the log tree, or just referenced? Are they copied into the subvolume's extent tree again upon replay?

2. During replay, when the extents are added into the extent allocation tree, do they acquire the physical extent number during replay? Does the physical extent allocated to the data in the log tree differ from that in the subvolume?

3. I see there is a mount option of notreelog available. After disabling tree logging, does fsync still lead to flushing of buffers to the disk directly?

4. Is it possible to selectively identify certain files in the log tree and flush them to disk directly, without waiting for the replay to do it?

Thanks

--
Aastha Mehta
Re: Questions regarding logging upon fsync in btrfs
I am using linux kernel 3.1.10-1.16, just to let you know.

Thanks

On 29 September 2013 01:35, Aastha Mehta aasth...@gmail.com wrote:
> Hi, I have a few questions regarding logging triggered by calling
> fsync in BTRFS: [snip -- quoted in full in the message above]

--
Aastha Mehta
Re: basic questions regarding COW in Btrfs
I must admit, it is quite convoluted :-)

Please tell me if I understand this. A file system tree (containing the inodes, the extents of all the inodes, etc.) is itself laid out in the leaf extents of another big tree, which is the root tree. This is why you say that inode and other such metadata may be lying in the leaf nodes. Correct?

I did not completely understand what you meant when you said that the metadata (the file extent items and such) for the inodes are stored inside the same tree that the inode resides in. I thought the btrfs_file_extent_item associated with EXTENT_DATA_KEY corresponds to the actual data of a file.

Okay, now I am not even sure if in btrfs there is something like an indirect block for a huge file. In file systems with a fixed block size, one can hold only so many pointers to data blocks, and hence when the file size grows, indirects are added to the file's tree. Is there any equivalent indirect extent required for huge files in btrfs, or do all files fit within one level? If there are indirects, what item type do they have? Would something like btrfs_get_extent() be useful to get the indirect extents of a file?

Too many questions, sorry :(

Thanks.

On 4 March 2013 00:52, Josef Bacik jo...@toxicpanda.com wrote:
> On Sun, Mar 3, 2013 at 10:41 AM, Aastha Mehta aasth...@gmail.com wrote:
>> Hi Josef, I have some more questions following up on my previous
>> e-mails. I now do somewhat understand the place where extent entries
>> get cow'ed. But I am unclear about the order of operations. Is it
>> correct that the data extent is written first, then the pointer in
>> the indirect block needs to be updated, so then it is cowed and
>> written to disk, and so on recursively up the tree? Or is the entire
>> path from leaf to node that is going to be affected by the write
>> cowed first, and then all the cowed extents are written to the disk,
>> and then the rest of the metadata pointers (for example, in the
>> checksum tree, extent tree, etc., I am not sure about this)?
>
> The second one. We COW the entire path from root to leaf as things
> need COW'ing. We start a transaction, we insert the file extent
> entries, we add the checksums, and we add the delayed ref updates to
> the extent tree. The delayed things are guaranteed to happen in that
> transaction so we have consistency there. The COW'ing from top to
> bottom works like that for all trees.
>
>> Also, I need to understand specifically how the data (leaf nodes) of
>> a file is written to disk vs. the metadata, including the indirect
>> nodes of the file. In extent_writepage I only know the pages of a
>> file that are to be written. I guess I can identify metadata pages
>> based on the inode of the page's owner. But is it possible to
>> distinguish the pages available in the extent_writepage path as
>> belonging to a leaf node or an internal node for a file? If it cannot
>> be identified at this point, where earlier in the path can this be
>> decided?
>
> So they are different things, and they could change from the time we
> write to the time that the write completes because of COW. Also keep
> in mind that the metadata (the file extent items and such) for the
> inodes are not stored specifically within the inode; they're stored
> inside the same tree that the inode resides in. So you can have a leaf
> node with multiple inodes and extents for those different inodes. And
> so any sort of random things can happen: other inodes can be deleted
> and this inode's metadata will be shifted into a new leaf, or another
> inode could be added and this inode's data could be pushed off into an
> adjacent leaf.
>
> The only way to know which leaf/page the inode is associated with is
> to search for whatever you are looking for in the tree, and then while
> you are holding all of the locks and reference counting you can be
> sure that those pages contain the metadata you are looking for, but
> once you let that go there are no guarantees.
>
> So as far as how it is written to disk, that is where transactions
> come in. We track all the dirty metadata pages we have per
> transaction, and then at transaction commit time we make sure that all
> of those pages are written to disk, and then we commit our super to
> point to the new root of the tree root, which in turn points at all of
> our new roots because of COW. These pages can be written before the
> commit though because of memory pressure, and if they are written and
> then modified again within the same transaction we will re-cow them to
> make sure we don't have any partial-page updates.
>
> Keeping track of where a specific inode's metadata is contained is a
> tricky business. Let me know if that helped.
>
> Thanks, Josef
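As an aside, the transaction commit Josef describes can also be forced from userspace, which is handy when instrumenting the commit path (mount point assumed):

  # Force a transaction commit now instead of waiting for the periodic
  # commit; dirty metadata pages are written and the super is updated.
  btrfs filesystem sync /mnt/btrfs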
Re: basic questions regarding COW in Btrfs
Okay, that makes a lot more sense to me now. Thank you very much.

Regards, Aastha.

On 5 March 2013 02:51, Josef Bacik jo...@toxicpanda.com wrote:
> On Mon, Mar 4, 2013 at 7:57 PM, Aastha Mehta aasth...@gmail.com wrote:
>> I must admit, it is quite convoluted :-) Please tell me if I
>> understand this. A file system tree (containing the inodes, the
>> extents of all the inodes, etc.) is itself laid out in the leaf
>> extents of another big tree, which is the root tree. This is why you
>> say that inode and other such metadata may be lying in the leaf
>> nodes. Correct?
>
> Sort of. We have lots of trees, but the inode data is laid out in what
> we refer to as fs trees. All these trees are just b-trees that have
> different data in them. In the fs trees they will hold inode items,
> directory items, file extent items, xattr items and orphan items. So
> any given leaf in this tree could have any number of those items in
> them, referring to any number of inodes. You could have
>
>   [inode item for inode 1][file extent item for inode 1]
>   [inode item for inode 2][xattr for inode 2][file extent item for inode 2]
>
> all contained within one leaf. Does that make sense?
>
>> I did not completely understand what you meant when you said that the
>> metadata (the file extent items and such) for the inodes are stored
>> inside the same tree that the inode resides in. I thought the
>> btrfs_file_extent_item associated with EXTENT_DATA_KEY corresponds to
>> the actual data of a file.
>
> Yes, the btrfs_file_extent_item points to an [offset, size] pair that
> describes a data extent.
>
>> Okay, now I am not even sure if in btrfs there is something like an
>> indirect block for a huge file. [snip] If there are indirects, what
>> item type do they have? Would something like btrfs_get_extent() be
>> useful to get the indirect extents of a file?
>
> So there are no indirects; there are just btrfs_file_extent_items that
> are held within the btree and that describe all of the extents that
> relate to a particular file. So you can have (in the case of large
> fragmented files) hundreds of leaves within the btree that just
> contain btrfs_file_extent_items for all the ranges of a file.
> btrfs_get_extent() just looks up the relevant btrfs_file_extent_item
> for the range that you are wondering about, and maps it to our
> extent_map structure internally.
>
> Hth, Josef
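The leaf layout Josef sketches can be seen directly with btrfs-debug-tree from btrfs-progs (a sketch; the device is an assumption, and the filesystem should be unmounted or quiescent while dumping):

  # Dump all trees; in the fs tree output, a single leaf typically
  # carries INODE_ITEM, INODE_REF, XATTR_ITEM and EXTENT_DATA items
  # for several different inodes, packed back to back.
  btrfs-debug-tree /dev/sdc2 | less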
Re: basic questions regarding COW in Btrfs
Hi Josef, I have some more questions following up on my previous e-mails.

I now do somewhat understand the place where extent entries get cow'ed. But I am unclear about the order of operations. Is it correct that the data extent is written first, then the pointer in the indirect block needs to be updated, so then it is cowed and written to disk, and so on recursively up the tree? Or is the entire path from leaf to node that is going to be affected by the write cowed first, and then all the cowed extents are written to the disk, and then the rest of the metadata pointers (for example, in the checksum tree, extent tree, etc., I am not sure about this)?

Also, I need to understand specifically how the data (leaf nodes) of a file is written to disk vs. the metadata, including the indirect nodes of the file. In extent_writepage I only know the pages of a file that are to be written. I guess I can identify metadata pages based on the inode of the page's owner. But is it possible to distinguish the pages available in the extent_writepage path as belonging to a leaf node or an internal node for a file? If it cannot be identified at this point, where earlier in the path can this be decided?

Many thanks, Aastha.

On 25 February 2013 20:00, Aastha Mehta aasth...@gmail.com wrote:
> Ah okay, I now see how it works. Thanks a lot for your response.
> [snip -- the 25 February exchange is quoted in full in the messages below]
Re: basic questions regarding COW in Btrfs
Thanks again Josef. I understood that cow_file_range is called for a regular file. Just to clarify: in cow_file_range, is cow done at the time of reserving extents in the extent btree for the IO to be done in this delalloc? I see the following comment above find_free_extent(), which is called while trying to reserve extents:

  /*
   * walks the btree of allocated extents and find a hole of a given size.
   * The key ins is changed to record the hole:
   * ins->objectid == block start
   * ins->flags = BTRFS_EXTENT_ITEM_KEY
   * ins->offset == number of blocks
   * Any available blocks before search_start are skipped.
   */

This seems to be the only place where a cow might be done, because a key is being inserted into an extent, which modifies it.

Thanks, Aastha.

On 24 February 2013 02:39, Josef Bacik jo...@toxicpanda.com wrote:
> On Thu, Feb 21, 2013 at 12:32 PM, Aastha Mehta aasth...@gmail.com wrote:
>> Thanks a lot for the prompt response. I had seen that, but I am still
>> not sure of where it really happens within fill_delalloc. Could you
>> help me a little further in that path?
>
> So we check the properties of the inode and do one of 3 things: either
> we call btrfs_cow_file_range directly in the case of a normal file,
> run_delalloc_nocow in the case of a file with prealloc extents or
> NOCOW, or we do the compression dance. We make an ordered extent for
> this range and return. And then the normal io path happens.
>
>> Secondly, now I am confused between the btree_writepages and
>> btrfs_writepages/btrfs_writepage methods. I thought btrfs_writepages
>> was for writing the pages holding inodes and btree_writepages for
>> writing the other indirect and leaf extents of the btree. Then, it
>> seems that the write operations lead to updates of the file system
>> data structures in a top-down manner, i.e. first changing the inode
>> and then the data extents. Is that correct?
>
> You are right that btrfs_writepages/writepage are for normal files and
> btree_writepages is for the metadata. The write operations do start in
> data and then modify metadata later down the line, if that is what you
> are getting at.
>
>> Thirdly, it seems that the old extents may be dropped before the new
>> extents are flushed to the disk. What would happen if the write fails
>> before the disk commit? What am I missing here?
>
> Yeah, the metadata isn't updated until the data is on the disk. In
> ->fill_delalloc we set up a btrfs_ordered_extent that describes the
> range of the dirty pages we are writing. When we've written all these
> pages we run btrfs_finish_ordered_io, which will drop the old extent
> entries if there are any and then add the new extent entries and
> update the references and such. So if something fails we just continue
> to point to the original file extent entries and return an EIO; we
> maintain consistency by making sure the metadata is updated only after
> the data is written out. I hope that helps.
>
> Thanks, Josef

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
Re: basic questions regarding COW in Btrfs
Ah okay, I now see how it works. Thanks a lot for your response.

Regards, Aastha.

On 25 February 2013 18:27, Josef Bacik jba...@fusionio.com wrote:
> On Mon, Feb 25, 2013 at 08:15:40AM -0700, Aastha Mehta wrote:
>> Thanks again Josef. I understood that cow_file_range is called for a
>> regular file. Just to clarify: in cow_file_range, is cow done at the
>> time of reserving extents in the extent btree for the IO to be done
>> in this delalloc? [snip -- find_free_extent comment quoted in full
>> above]
>
> The key isn't inserted at this time, it's just returned with those
> values for us to do as we please. There is no update of the btree
> until insert_reserved_extent/btrfs_mark_extent_written in
> btrfs_finish_ordered_io.
>
> Thanks, Josef
Re: basic questions regarding COW in Btrfs
A gentle reminder on this one.

Thanks, Aastha.

On 21 February 2013 18:32, Aastha Mehta aasth...@gmail.com wrote:
> Thanks a lot for the prompt response. I had seen that, but I am still
> not sure of where it really happens within fill_delalloc. Could you
> help me a little further in that path?
> [snip -- the full 20-21 February exchange is quoted in the message below]

--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
Re: basic questions regarding some btrfs features
Thanks a lot for the explanation.

Regards,
Aastha.

On 3 December 2012 13:02, Hugo Mills h...@carfax.org.uk wrote:
> On Mon, Dec 03, 2012 at 10:52:41AM +0100, Aastha Mehta wrote:
>> On 2 December 2012 23:46, Hugo Mills h...@carfax.org.uk wrote:
>>> On Sun, Dec 02, 2012 at 11:17:26PM +0100, Aastha Mehta wrote:
>>>> I am looking at btrfs to understand some of its features. One of them
>>>> is the snapshot feature. Please tell me if my following understanding
>>>> about snapshots in btrfs is correct. Btrfs supports both read-only and
>>>> writable snapshots. Writable snapshots are like clone volumes (or
>>>> subvolumes, as in btrfs); we get a point-in-time copy of the subvolume
>>>> in both cases. I looked through the kernel code, and it seems that
>>>> creating a subvolume and taking a snapshot (read-only or writable)
>>>> share a common ioctl interface. What I am not completely clear about
>>>> is whether snapshots get the same fsid as the source subvolume or a
>>>> different one.
>>>
>>> Yes, it's the same UUID, because they're all part of the same
>>> filesystem.
>>
>> Just to clarify: apart from the UUID, is the FSID in the fs_info of the
>> root also the same for all snapshots of a subvolume?
>>
>>>> Also, I do not understand what it means to be able to take a snapshot
>>>> of a snapshot.
>>>
>>> Snapshots are completely equal partners with their original
>>> subvolumes. This is not the case in, say, LVM.
>>
>>>> What are the benefits compared to, say, being able to take snapshots
>>>> only of the active subvolume and not of the snapshots?
>>>
>>> Let's say you take a snapshot (B) of your root filesystem (A). Then
>>> you decide to roll back to using the old version, so you mount B as
>>> root instead of A. Later that night, your backup process starts up and
>>> tries to take a temporary read-only snapshot (C) of your root
>>> filesystem (which is now B) so that it can make a stable backup.
>>> That's a snapshot of a snapshot.
>>
>> Okay, but the snapshot can still be taken only of a subvolume in use.
>> Is that correct?
>
> Well, it depends on what you mean by "in use". You can't snapshot
> something which doesn't appear somewhere in your directory hierarchy.
>
>> In your example, C is taken on B after the file system was rolled back
>> to version B. What happens when the file system version mounted is A
>> (which contains snapshot B) and we take another snapshot D of this
>> mounted version? Does the snapshot D contain B, or only the active
>> contents of A?
>
> Snapshots are not recursive. If you have a subvolume inside another:
>
>    subv1/subv2
>
> and then snapshot that:
>
>    # btrfs sub snap subv1 subv1-A
>
> you will end up with a subvolume subv1-A containing an empty directory
> called subv2.
>
> Note that you don't have to have subvolumes inside subvolumes at all, if
> you don't use the top level of your filesystem as anything other than a
> place to store and manage subvolumes. Consider this btrfs filesystem
> layout:
>
>    /            subvolid=5 (=0)
>      root       subvolid=256 (default subvol)
>      home       subvolid=257
>      snapshots  (directory)
>
> With this in fstab:
>
>    /dev/sda  /             btrfs  subvolid=256
>    /dev/sda  /home         btrfs  subvolid=257
>    /dev/sda  /media/btrfs  btrfs  subvolid=5,noauto
>
> we get this filesystem hierarchy:
>
>    /          subvolid=256
>      home     subvolid=257
>      media    (directory)
>        btrfs  subvolid=5
>
> Note that the mount of the full filesystem on /media/btrfs isn't done
> automatically -- it only needs to be done when you're managing
> subvolumes.
> So, we can take a snapshot of /, for example:
>
>    # mount /media/btrfs
>    # btrfs sub snap /media/btrfs/root /media/btrfs/snapshots/root.2012-12-03
>    # umount /media/btrfs
>
> The FS (from its top level) now looks like this:
>
>    /            subvolid=5 (=0)
>      root       subvolid=256 (default subvol)
>      home       subvolid=257
>      snapshots  (directory)
>        root.2012-12-03  subvolid=258
>
> To roll back root temporarily to the earlier version, you can edit your
> boot manager's config to supply subvolid=258 as a mount parameter. To do
> so permanently, you can set the default subvolume to 258, and optionally
> move the snapshot to /root within the btrfs filesystem:
>
>    # mount /media/btrfs
>    # mv /media/btrfs/root /media/btrfs/root.old
>    # mv /media/btrfs/snapshots/root.2012-12-03 /media/btrfs/root
>    # umount /media/btrfs
>
> Hugo.
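[Editorial note: the whole sequence above, including a snapshot of a
snapshot, can be replayed safely on a loopback file rather than a real
disk. A throwaway sketch with illustrative paths; the id passed to
set-default must be taken from your actual list output, 258 here is only an
example:

# dd if=/dev/zero of=/tmp/btrfs-play.img bs=1M count=512
# mkfs.btrfs /tmp/btrfs-play.img
# mkdir -p /mnt/test
# mount -o loop /tmp/btrfs-play.img /mnt/test
# btrfs subvolume create /mnt/test/root
# btrfs subvolume snapshot /mnt/test/root /mnt/test/snap-B
# btrfs subvolume snapshot /mnt/test/snap-B /mnt/test/snap-C
# btrfs subvolume list /mnt/test
# btrfs subvolume set-default 258 /mnt/test    (use the id shown for snap-B)
# umount /mnt/test
# mount -o loop /tmp/btrfs-play.img /mnt/test  (now mounts snap-B by default)

snap-C is a writable snapshot taken from snap-B, exactly the backup
scenario Hugo describes, and set-default is the permanent-rollback path.]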
Re: basic questions regarding some btrfs features
On Mon, Dec 03, 2012 at 10:52:41AM +0100, Aastha Mehta wrote:
> Hello,
>
> Thank you so much for your prompt response. A few more questions inline.
>
> On 2 December 2012 23:46, Hugo Mills h...@carfax.org.uk wrote:
>> On Sun, Dec 02, 2012 at 11:17:26PM +0100, Aastha Mehta wrote:
>>> I am looking at btrfs to understand some of its features. One of them
>>> is the snapshot feature. Please tell me if my following understanding
>>> about snapshots in btrfs is correct. Btrfs supports both read-only and
>>> writable snapshots. Writable snapshots are like clone volumes (or
>>> subvolumes, as in btrfs); we get a point-in-time copy of the subvolume
>>> in both cases. I looked through the kernel code, and it seems that
>>> creating a subvolume and taking a snapshot (read-only or writable)
>>> share a common ioctl interface. What I am not completely clear about
>>> is whether snapshots get the same fsid as the source subvolume or a
>>> different one.
>>
>> Yes, it's the same UUID, because they're all part of the same
>> filesystem.
>
> Just to clarify: apart from the UUID, is the FSID in the fs_info of the
> root also the same for all snapshots of a subvolume?
>
>>> Also, I do not understand what it means to be able to take a snapshot
>>> of a snapshot.
>>
>> Snapshots are completely equal partners with their original subvolumes.
>> This is not the case in, say, LVM.
>
>>> What are the benefits compared to, say, being able to take snapshots
>>> only of the active subvolume and not of the snapshots?
>>
>> Let's say you take a snapshot (B) of your root filesystem (A). Then you
>> decide to roll back to using the old version, so you mount B as root
>> instead of A. Later that night, your backup process starts up and tries
>> to take a temporary read-only snapshot (C) of your root filesystem
>> (which is now B) so that it can make a stable backup. That's a snapshot
>> of a snapshot.
>
> Okay, but the snapshot can still be taken only of a subvolume in use. Is
> that correct?
>
> In your example, C is taken on B after the file system was rolled back
> to version B. What happens when the file system version mounted is A
> (which contains snapshot B) and we take another snapshot D of this
> mounted version? Does the snapshot D contain B, or only the active
> contents of A?
>
> Probably before that, I need some clarity on why a subvolume always
> belongs in the directory of some parent subvolume. Is it possible to
> have more than one root subvolume, or more than one subvolume in the
> same parent subvolume directory?
>
> Thanks again,
> Aastha.

No, there's precisely one top-level subvolume (subvolid=5). Everything
else in the filesystem lives within that. However, you can have as many
subvolumes as you like below that, in whatever directories or subvolumes
you want.

Hugo.
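[Editorial note: the "same UUID for every snapshot" answer is easy to
verify directly. A short sketch, assuming a btrfs filesystem on /dev/sdc2
mounted at /mnt/btrfs and a btrfs-progs recent enough for 'filesystem show'
to accept a mounted path; subvolume names are arbitrary:

# mount /dev/sdc2 /mnt/btrfs
# btrfs filesystem show /mnt/btrfs       (one uuid for the whole filesystem)
# btrfs subvolume create /mnt/btrfs/subv1
# btrfs subvolume snapshot /mnt/btrfs/subv1 /mnt/btrfs/subv1-snap
# btrfs subvolume list /mnt/btrfs        (each subvolume gets its own
                                          subvolid: 256, 257, ...)
# btrfs filesystem show /mnt/btrfs       (the filesystem uuid is unchanged)

Subvolumes and snapshots are distinguished by their subvolids within the
one filesystem; the filesystem-level UUID that mount and blkid report stays
the same for all of them.]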