How do I find the physical block number?

2014-04-16 Thread Aastha Mehta
Hello,

I have created a 500GB partition on my HDD and formatted it for btrfs.
I created a file on it.
# echo tmp data in the tmp file.. > /mnt/btrfs/tmp-file
# umount /mnt/btrfs

Next I want to know which blocks are allocated to the file, so I used
filefrag. I get the following information -

# mount -o max_inline=0 /dev/sdc2 /mnt/btrfs
# filefrag -v /mnt/btrfs/tmp-file
Filesystem type is: 9123683e
File size of /mnt/btrfs/tmp-file is 27 (1 block, blocksize 4096)
 ext     logical     physical    expected   length   flags
   0           0     65924123                     1   eof
/mnt/btrfs/tmp-file: 1 extent found

Now, I want to read the same data from the disk directly. I tried the
following -

block 65924123 = byte (65924123*4096) = 270025207808

# dd if=/dev/sdc2 of=tmp-file skip=270025207808 bs=1 count=4096
# cat tmp-file
I cannot read the file's contents; I see only garbage.

I read somewhere that the physical block number shown by filefrag may
actually be a logical block number for the file system, and that there is
an additional translation to the physical block number. So next I tried
the following -

# btrfs-map-logical -l 65924123 /dev/sdc2
mirror 1 logical 65924123 physical 74312731 device /dev/sdc2
mirror 2 logical 65924123 physical 1148054555 device /dev/sdc2

I again tried reading the block 74312731 using the dd command as
above, but it is still not the right block.

I want to know what the physical block number returned by filefrag
means, why there are two mappings for the above block number, and how I
can find the exact physical disk block that the file system actually
writes to.

My sdc has the following partitions:
   Device Boot       Start         End      Blocks   Id  System
/dev/sdc1             2048   419432447   209715200   83  Linux
/dev/sdc2       1468008448  2516584447   524288000   83  Linux (BTRFS)
/dev/sdc3        419432448  1468008447   524288000   83  Linux

Thanks,
Aastha.


Re: How do I find the physical block number?

2014-04-16 Thread Aastha Mehta
On 16 April 2014 15:27, Aastha Mehta aasth...@gmail.com wrote:
 Hello,

 I have created a 500GB partition on my HDD and formatted it for btrfs.
 I created a file on it.
 # echo tmp data in the tmp file.. > /mnt/btrfs/tmp-file
 # umount /mnt/btrfs

 Next I want to know the blocks allocated for the file and I used
 filefrag for it. I get some information as follows -

 # mount -o max_inline=0 /dev/sdc2 /mnt/btrfs
 # filefrag -v /mnt/btrfs/tmp-file
 Filesystem type is: 9123683e
 File size of /mnt/btrfs/tmp-file is 27 (1 block, blocksize 4096)
  ext logical physical expected length flags
0   0 65924123   1 eof
 /mnt/btrfs/tmp-file: 1 extent found

 Now, I want to read the same data from the disk directly. I tried the
 following -

 block 65924123 = byte (65924123*4096) = 270025207808

 # dd if=/dev/sdc2 of=tmp-file skip=270025207808 bs=1 count=4096
 # cat tmp-file
 I cannot read the file's contents but some garbage.

 I read somewhere that the physical block number shown in filefrag may
 actually be a logical block for the file system and it has an
 additional translation to physical block number. So next I tried the
 following -

 # btrfs-map-logical -l 65924123 /dev/sdc2
 mirror 1 logical 65924123 physical 74312731 device /dev/sdc2
 mirror 2 logical 65924123 physical 1148054555 device /dev/sdc2

 I again tried reading the block 74312731 using the dd command as
 above, but it is still not the right block.

 I want to know what does the physical block number returned by
 filefrag mean, why there are two mappings for the above block number
 and how I can find the exact physical disk block number the file
 system actually writes to.

 My sdc has the following partitions:
Device Boot  Start End  Blocks   Id  System
 /dev/sdc12048   419432447   209715200   83  Linux
 /dev/sdc2  1468008448  2516584447   524288000   83  Linux (BTRFS)
 /dev/sdc3   419432448  1468008447   524288000   83  Linux

 Thanks,
 Aastha.

I realized my mistake in using the btrfs-map-logical command. It
should have been
# btrfs-map-logical -l 270025207808 /dev/sdc2

Now, everything works fine. Please ignore my post, except that it may be
useful to somebody else needing this information in the future.
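
For anyone hitting the same thing, the working sequence is roughly the
following (a sketch; as I understand it, both the -l argument and the
printed physical value of btrfs-map-logical are byte offsets, and P below
stands for the physical offset printed for one of the mirrors):

# filefrag -v /mnt/btrfs/tmp-file                        # physical block B = 65924123
# btrfs-map-logical -l $((65924123 * 4096)) /dev/sdc2    # note the physical byte offset P
# dd if=/dev/sdc2 of=tmp-file skip=$P bs=1 count=4096    # bs=1 makes skip= a byte offset
# cat tmp-file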

Thanks,
Aastha.


Re: questions regarding fsync in btrfs

2014-01-29 Thread Aastha Mehta
On 25 January 2014 16:21, Josef Bacik jba...@fb.com wrote:

 On 01/24/2014 07:09 PM, Aastha Mehta wrote:

 Hello,

 I would like to clarify a bit on how the fsync works in btrfs. The log
 tree journals only the metadata of the files that have been modified
 prior to the fsync, correct? It does not log the data extents of
 files, which are directly sync'ed to the disk. Also, if I understand
 correctly, fsync and fdatasync are the same thing in btrfs currently.
 Is it more like fsync or fdatasync?


 More like fsync.  Because we cow we always are updating metadata so there is
 no fdatasync, we can't get away with just flushing the data.


 What exactly happens once a file inode is in the tree log? Does it
 mean it is guaranteed to be persisted on disk, or is it already on
 disk? I see two flags in btrfs_sync_file -
 BTRFS_INODE_HAS_ASYNC_EXTENT and BTRFS_INODE_NEEDS_FULL_SYNC. I do not
 fully understand them. After full sync, what does log_dentry_safe and
 sync_log do?


 It is guaranteed to be on disk.  We copy all of the inode metadata to the
 log, sync the log and the data and the super block that points to the tree
 log.  HAS_ASYNC_EXTENT is for compression where we will return to writepages
 without actually having marked the page as writeback, so we need to go back
 and re-lock the pages to make sure it has passed through the async
 compression threads and the pages have been properly marked writeback so we
 can wait on them properly.  NEEDS_FULL_SYNC means we can't do our fancy
 tricks of only updating some of the metadata, we have to go and copy all of
 the inode metadata (the inode, its references, its xattrs) and all of its
 extents.  log_dentry_safe copies all the info into the tree log and sync_log
 syncs the tree log to disk and writes out a super that points to the tree
 log.

 Finally, Wikipedia says that the items in the log tree are replayed
 and deleted at the next full tree commit or (if there was a system
 crash) at the next remount. Even if there is no crash, why is there a
 need to replay the log?

 There isn't, once we commit a transaction we commit a super that doesn't
 point to the tree log and we free up the blocks we used for the tree log.
 The tree log only exists for one transaction, if we crash before a
 transaction commits we will see that there is a tree log on the next mount
 and replay it.  If we commit the transaction we simply free the tree log and
 carry on.  Thanks,

 Josef


Thank you for your response. I ran a few small experiments and I see
that fsync on average leads to writing about 30-40KB of metadata,
irrespective of the amount of data changed. I wonder why it is so much.
Besides the superblocks and a couple of blocks in the tree log, what else
may be updated? And why does it seem to be independent of the amount of
writes?
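
(For reference, one simple way to observe the volume is to sample the
block device's write-sector counter around the test; 'sdc' and the test
program name below are placeholders:)

# cat /sys/block/sdc/stat     # field 7 = sectors written (512 bytes each)
# ./fsync-test                # hypothetical: open, write, fsync, close
# cat /sys/block/sdc/stat     # delta of field 7, times 512, = bytes written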

Thanks,
Aastha.


questions regarding fsync in btrfs

2014-01-24 Thread Aastha Mehta
Hello,

I would like to clarify a bit how fsync works in btrfs. The log
tree journals only the metadata of the files that have been modified
prior to the fsync, correct? It does not log the data extents of
files, which are directly sync'ed to the disk. Also, if I understand
correctly, fsync and fdatasync are the same thing in btrfs currently.
Is it more like fsync or fdatasync?

What exactly happens once a file inode is in the tree log? Does it
mean it is guaranteed to be persisted on disk, or is it already on
disk? I see two flags in btrfs_sync_file -
BTRFS_INODE_HAS_ASYNC_EXTENT and BTRFS_INODE_NEEDS_FULL_SYNC. I do not
fully understand them. After a full sync, what do log_dentry_safe and
sync_log do?

Finally, Wikipedia says that the items in the log tree are replayed
and deleted at the next full tree commit or (if there was a system
crash) at the next remount. Even if there is no crash, why is there a
need to replay the log?

Thanks,
Aastha.


question regarding caching

2013-12-30 Thread Aastha Mehta
Hello,

I have some questions regarding caching in BTRFS. When a file system
is unmounted and mounted again, would all the previously cached
content be removed from the cache after flushing to disk? After
remounting, would the initial requests always be fetched from the
disk?

Rather than a local disk, I have a remote device to which my IO
requests are sent and from which the data is fetched. I need certain
data to be fetched from the remote device after a remount. But somehow
I do not see any request appearing at the device. I even tried to do
drop_caches after remounting the file system, but that does not seem
to help.
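
For reference, the sequence I tried looks roughly like the following
(device and mount point names are placeholders for my local setup, and
writing to drop_caches needs root):

# umount /mnt/btrfs
# mount /dev/sdX /mnt/btrfs
# echo 3 > /proc/sys/vm/drop_caches   # drop clean page cache plus dentries and inodes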

I guess my problem is not related to BTRFS, but since I am working
with BTRFS, I wanted to ask here for help. Could anyone tell me how I
can ensure that requests are fetched from the (remote) device,
especially after file system remount, without having to use
drop_caches?

Please let me know if I described the problem too vaguely and should
give some more details.

Wishing everyone a happy new year.

Thanks and regards,
Aastha.

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org


Re: Questions regarding logging upon fsync in btrfs

2013-10-02 Thread Aastha Mehta
On 2 October 2013 13:52, Josef Bacik jba...@fusionio.com wrote:
 On Tue, Oct 01, 2013 at 10:13:25PM +0200, Aastha Mehta wrote:
 On 1 October 2013 21:42, Aastha Mehta aasth...@gmail.com wrote:
  On 1 October 2013 21:40, Aastha Mehta aasth...@gmail.com wrote:
  On 1 October 2013 19:34, Josef Bacik jba...@fusionio.com wrote:
  On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote:
  On 30 September 2013 22:47, Josef Bacik jba...@fusionio.com wrote:
   On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
   On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com 
wrote:
 On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
 Thank you very much for the reply. That clarifies a lot of 
 things.

 I was trying a small test case that opens a file, writes a 
 block of
 data, calls fsync and then closes the file. If I understand 
 correctly,
 fsync would return only after all in-memory buffers have been
 committed to disk. I have added few print statements in the
 __extent_writepage function, and I notice that the function 
 gets
 called a bit later after fsync returns. It seems that I am not
 guaranteed to see the data going to disk by the time fsync 
 returns.

 Am I doing something wrong, or am I looking at the wrong place 
 for
 disk write? This happens both with tree logging enabled as 
 well as
 with notreelog.


 So 3.1 was a long time ago and to be sure it had issues I don't 
 think it was
 _that_ broken.  You are probably better off instrumenting a 
 recent kernel, 3.11
 or just build btrfs-next from git.  But if I were to make a 
 guess I'd say that
 __extent_writepage was how both data and metadata was written 
 out at the time (I
 don't think I changed it until 3.2 or something later) so what 
 you are likely
 seeing is the normal transaction commit after the fsync.  In 
 the case of
 notreelog we are likely starting another transaction and you 
 are seeing that
 commit (at the time the transaction kthread would start a 
 transaction even if
 none had been started yet.)  Thanks,

 Josef
   
Is there any special handling for very small file write, less 
than 4K? As
I understand there is an optimization to inline the first extent 
in a file if
it is smaller than 4K, does it affect the writeback on fsync as 
well? I did
set the max_inline mount option to 0, but even then it seems 
there is
some difference in fsync behaviour for writing first extent of 
less than 4K
size and writing 4K or more.
   
   
Yeah if the file is an inline extent then it will be copied into 
the log
directly and the log will be written out, no going through the 
data write path
at all.  Max inline == 0 should make it so we don't inline, so if 
it isn't
honoring that then that may be a bug.  Thanks,
   
Josef
  
   I tried it on 3.12-rc2 release, and it seems there is a bug then.
   Please find attached logs to confirm.
   Also, probably on the older release.
  
  
   Oooh ok I understand, you have your printk's in the wrong place ;).
   do_writepages doesn't necessarily mean you are writing something.  If 
   you want
   to see if stuff got written to the disk I'd put a printk at 
   run_delalloc_range
   and have it spit out the range it is writing out since thats what we 
   think is
   actually dirty.  Thanks,
  
   Josef
 
  No, but I also placed dump_stack() in the beginning of
  __extent_writepage. run_delalloc_range is being called only from
  __extent_writepage, if it were to be called, the dump_stack() at the
  top of __extent_writepage would have printed as well, no?
 
 
  Ok I've done the same thing and I'm not seeing what you are seeing.  Are 
  you
  using any mount options other than notreelog and max_inline=0?  Could 
  you adjust
  your printk to print out the root objectid for the inode as well?  It 
  could be
  possible that this is the writeout for the space cache or inode cache.  
  Thanks,
 
  Josef
 
  I actually printed the stack only when the root objectid is 5. I have
  attached another log for writing the first 500 bytes in a file. I also
  print the root objectid for the inode in run_delalloc and
  __extent_writepage.
 
  Thanks
 
 
  Just to clarify, in the latest logs, I allowed printing of debug
  printk's and stack dump for all root objectid's.

 Actually, it is the same behaviour when I write anything less than 4K
 long, no matter what offset, except if I straddle the page boundary.
 To summarise:
 1. write 4K - write in the fsync path
 2. write less than 4K, within a single page - bdi_writeback by flush worker
 3. small write that straddles a page boundary or write 4K+delta - the
 first page gets written in the fsync path, the remaining length that
 straddles the page boundary is written in the bdi_writeback path

Re: Questions regarding logging upon fsync in btrfs

2013-10-01 Thread Aastha Mehta
On 1 October 2013 19:34, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote:
 On 30 September 2013 22:47, Josef Bacik jba...@fusionio.com wrote:
  On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
  On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
   On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
   On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
Thank you very much for the reply. That clarifies a lot of things.
   
I was trying a small test case that opens a file, writes a block of
data, calls fsync and then closes the file. If I understand 
correctly,
fsync would return only after all in-memory buffers have been
committed to disk. I have added few print statements in the
__extent_writepage function, and I notice that the function gets
called a bit later after fsync returns. It seems that I am not
guaranteed to see the data going to disk by the time fsync returns.
   
Am I doing something wrong, or am I looking at the wrong place for
disk write? This happens both with tree logging enabled as well as
with notreelog.
   
   
So 3.1 was a long time ago and to be sure it had issues I don't 
think it was
_that_ broken.  You are probably better off instrumenting a recent 
kernel, 3.11
or just build btrfs-next from git.  But if I were to make a guess 
I'd say that
__extent_writepage was how both data and metadata was written out at 
the time (I
don't think I changed it until 3.2 or something later) so what you 
are likely
seeing is the normal transaction commit after the fsync.  In the 
case of
notreelog we are likely starting another transaction and you are 
seeing that
commit (at the time the transaction kthread would start a 
transaction even if
none had been started yet.)  Thanks,
   
Josef
  
   Is there any special handling for very small file write, less than 4K? 
   As
   I understand there is an optimization to inline the first extent in a 
   file if
   it is smaller than 4K, does it affect the writeback on fsync as well? 
   I did
   set the max_inline mount option to 0, but even then it seems there is
   some difference in fsync behaviour for writing first extent of less 
   than 4K
   size and writing 4K or more.
  
  
   Yeah if the file is an inline extent then it will be copied into the log
   directly and the log will be written out, no going through the data 
   write path
   at all.  Max inline == 0 should make it so we don't inline, so if it 
   isn't
   honoring that then that may be a bug.  Thanks,
  
   Josef
 
  I tried it on 3.12-rc2 release, and it seems there is a bug then.
  Please find attached logs to confirm.
  Also, probably on the older release.
 
 
  Oooh ok I understand, you have your printk's in the wrong place ;).
  do_writepages doesn't necessarily mean you are writing something.  If you 
  want
  to see if stuff got written to the disk I'd put a printk at 
  run_delalloc_range
  and have it spit out the range it is writing out since thats what we think 
  is
  actually dirty.  Thanks,
 
  Josef

 No, but I also placed dump_stack() in the beginning of
 __extent_writepage. run_delalloc_range is being called only from
 __extent_writepage, if it were to be called, the dump_stack() at the
 top of __extent_writepage would have printed as well, no?


 Ok I've done the same thing and I'm not seeing what you are seeing.  Are you
 using any mount options other than notreelog and max_inline=0?  Could you 
 adjust
 your printk to print out the root objectid for the inode as well?  It could be
 possible that this is the writeout for the space cache or inode cache.  
 Thanks,

 Josef

I actually printed the stack only when the root objectid is 5. I have
attached another log for writing the first 500 bytes in a file. I also
print the root objectid for the inode in run_delalloc and
__extent_writepage.
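
The instrumentation is roughly the following at the top of
__extent_writepage (a sketch written from memory; the exact field access
may differ slightly between kernel versions):

	struct inode *inode = page->mapping->host;

	/* print the inode number and the objectid of the root it belongs to */
	printk(KERN_DEBUG "__extent_writepage: ino %lu root %llu\n",
	       inode->i_ino,
	       (unsigned long long)BTRFS_I(inode)->root->root_key.objectid);
	dump_stack();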

Thanks

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
write success, size: 500
fsync:: Success
file size: 500
fclose:: Success


[  829.847483] btrfs: device fsid 4d2767ea-85cc-443c-88bd-2d8c693f7c7c devid 1 transid 2201 /dev/sdd
[  829.848199] btrfs: max_inline at 0
[  829.848201] btrfs: disabling tree log
[  829.848202] btrfs: disk space caching is enabled
[  835.266483] [do_writepages] inum: 258, nr_to_write: 9223372036854775807, writepages fn: 8129ce40
[  835.266492] CPU: 0 PID: 10021 Comm: bb Not tainted 3.12.0-rc2-latest+ #7
[  835.266494] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[  835.266496]  88003cdc5e60 88003cdc5e28 817a8acf 88003fc0eb48
[  835.266511]  88001f478b28 88003cdc5e48 81105c1f 88001f4789e0
[  835.266515]  880037557400 88003cdc5e88 810fb689 88003fc12c00

Re: Questions regarding logging upon fsync in btrfs

2013-10-01 Thread Aastha Mehta
On 1 October 2013 21:42, Aastha Mehta aasth...@gmail.com wrote:
 On 1 October 2013 21:40, Aastha Mehta aasth...@gmail.com wrote:
 On 1 October 2013 19:34, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Sep 30, 2013 at 11:07:20PM +0200, Aastha Mehta wrote:
 On 30 September 2013 22:47, Josef Bacik jba...@fusionio.com wrote:
  On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
  On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
   On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
   On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
Thank you very much for the reply. That clarifies a lot of things.
   
I was trying a small test case that opens a file, writes a block 
of
data, calls fsync and then closes the file. If I understand 
correctly,
fsync would return only after all in-memory buffers have been
committed to disk. I have added few print statements in the
__extent_writepage function, and I notice that the function gets
called a bit later after fsync returns. It seems that I am not
guaranteed to see the data going to disk by the time fsync 
returns.
   
Am I doing something wrong, or am I looking at the wrong place for
disk write? This happens both with tree logging enabled as well as
with notreelog.
   
   
So 3.1 was a long time ago and to be sure it had issues I don't 
think it was
_that_ broken.  You are probably better off instrumenting a recent 
kernel, 3.11
or just build btrfs-next from git.  But if I were to make a guess 
I'd say that
__extent_writepage was how both data and metadata was written out 
at the time (I
don't think I changed it until 3.2 or something later) so what you 
are likely
seeing is the normal transaction commit after the fsync.  In the 
case of
notreelog we are likely starting another transaction and you are 
seeing that
commit (at the time the transaction kthread would start a 
transaction even if
none had been started yet.)  Thanks,
   
Josef
  
   Is there any special handling for very small file write, less than 
   4K? As
   I understand there is an optimization to inline the first extent in 
   a file if
   it is smaller than 4K, does it affect the writeback on fsync as 
   well? I did
   set the max_inline mount option to 0, but even then it seems there is
   some difference in fsync behaviour for writing first extent of less 
   than 4K
   size and writing 4K or more.
  
  
   Yeah if the file is an inline extent then it will be copied into the 
   log
   directly and the log will be written out, no going through the data 
   write path
   at all.  Max inline == 0 should make it so we don't inline, so if it 
   isn't
   honoring that then that may be a bug.  Thanks,
  
   Josef
 
  I tried it on 3.12-rc2 release, and it seems there is a bug then.
  Please find attached logs to confirm.
  Also, probably on the older release.
 
 
  Oooh ok I understand, you have your printk's in the wrong place ;).
  do_writepages doesn't necessarily mean you are writing something.  If 
  you want
  to see if stuff got written to the disk I'd put a printk at 
  run_delalloc_range
  and have it spit out the range it is writing out since thats what we 
  think is
  actually dirty.  Thanks,
 
  Josef

 No, but I also placed dump_stack() in the beginning of
 __extent_writepage. run_delalloc_range is being called only from
 __extent_writepage, if it were to be called, the dump_stack() at the
 top of __extent_writepage would have printed as well, no?


 Ok I've done the same thing and I'm not seeing what you are seeing.  Are you
 using any mount options other than notreelog and max_inline=0?  Could you 
 adjust
 your printk to print out the root objectid for the inode as well?  It could 
 be
 possible that this is the writeout for the space cache or inode cache.  
 Thanks,

 Josef

 I actually printed the stack only when the root objectid is 5. I have
 attached another log for writing the first 500 bytes in a file. I also
 print the root objectid for the inode in run_delalloc and
 __extent_writepage.

 Thanks


 Just to clarify, in the latest logs, I allowed printing of debug
 printk's and stack dump for all root objectid's.

Actually, it is the same behaviour when I write anything less than 4K
long, no matter what offset, except if I straddle the page boundary.
To summarise (the corresponding test writes are sketched in code below):
1. write 4K - write in the fsync path
2. write less than 4K, within a single page - bdi_writeback by flush worker
3. small write that straddles a page boundary or write 4K+delta - the
first page gets written in the fsync path, the remaining length that
straddles the page boundary is written in the bdi_writeback path
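
The three cases correspond to writes like the following (a minimal sketch
with hypothetical sizes and offsets and no error handling; in my runs each
case was a separate run on a fresh file, they are only shown back to back
here for brevity):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[8192];
	memset(buf, 'a', sizeof(buf));

	int fd = open("/mnt/btrfs/tmp-file", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	pwrite(fd, buf, 4096, 0);        /* case 1: exactly one 4K block         */
	fsync(fd);

	pwrite(fd, buf, 500, 0);         /* case 2: < 4K, within a single page   */
	fsync(fd);

	pwrite(fd, buf, 4096 + 100, 0);  /* case 3: 4K+delta, straddles a page   */
	fsync(fd);

	close(fd);
	return 0;
}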

Please let me know if I am trying out incorrect cases.

Sorry for too many mails.

Thanks

Re: Questions regarding logging upon fsync in btrfs

2013-09-30 Thread Aastha Mehta
On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
 On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
 Thank you very much for the reply. That clarifies a lot of things.

 I was trying a small test case that opens a file, writes a block of
 data, calls fsync and then closes the file. If I understand correctly,
 fsync would return only after all in-memory buffers have been
 committed to disk. I have added few print statements in the
 __extent_writepage function, and I notice that the function gets
 called a bit later after fsync returns. It seems that I am not
 guaranteed to see the data going to disk by the time fsync returns.

 Am I doing something wrong, or am I looking at the wrong place for
 disk write? This happens both with tree logging enabled as well as
 with notreelog.


 So 3.1 was a long time ago and to be sure it had issues I don't think it was
 _that_ broken.  You are probably better off instrumenting a recent kernel, 
 3.11
 or just build btrfs-next from git.  But if I were to make a guess I'd say that
 __extent_writepage was how both data and metadata was written out at the time 
 (I
 don't think I changed it until 3.2 or something later) so what you are likely
 seeing is the normal transaction commit after the fsync.  In the case of
 notreelog we are likely starting another transaction and you are seeing that
 commit (at the time the transaction kthread would start a transaction even if
 none had been started yet.)  Thanks,

 Josef

Is there any special handling for very small file writes, less than 4K? As
I understand it, there is an optimization to inline the first extent of a file
if it is smaller than 4K; does it affect the writeback on fsync as well? I did
set the max_inline mount option to 0, but even then there seems to be some
difference in fsync behaviour between writing a first extent of less than 4K
and writing 4K or more.

Thanks,
Aastha.


-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org


Re: Questions regarding logging upon fsync in btrfs

2013-09-30 Thread Aastha Mehta
On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
 On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
  On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
  Thank you very much for the reply. That clarifies a lot of things.
 
  I was trying a small test case that opens a file, writes a block of
  data, calls fsync and then closes the file. If I understand correctly,
  fsync would return only after all in-memory buffers have been
  committed to disk. I have added few print statements in the
  __extent_writepage function, and I notice that the function gets
  called a bit later after fsync returns. It seems that I am not
  guaranteed to see the data going to disk by the time fsync returns.
 
  Am I doing something wrong, or am I looking at the wrong place for
  disk write? This happens both with tree logging enabled as well as
  with notreelog.
 
 
  So 3.1 was a long time ago and to be sure it had issues I don't think it 
  was
  _that_ broken.  You are probably better off instrumenting a recent kernel, 
  3.11
  or just build btrfs-next from git.  But if I were to make a guess I'd say 
  that
  __extent_writepage was how both data and metadata was written out at the 
  time (I
  don't think I changed it until 3.2 or something later) so what you are 
  likely
  seeing is the normal transaction commit after the fsync.  In the case of
  notreelog we are likely starting another transaction and you are seeing 
  that
  commit (at the time the transaction kthread would start a transaction even 
  if
  none had been started yet.)  Thanks,
 
  Josef

 Is there any special handling for very small file write, less than 4K? As
 I understand there is an optimization to inline the first extent in a file if
 it is smaller than 4K, does it affect the writeback on fsync as well? I did
 set the max_inline mount option to 0, but even then it seems there is
 some difference in fsync behaviour for writing first extent of less than 4K
 size and writing 4K or more.


 Yeah if the file is an inline extent then it will be copied into the log
 directly and the log will be written out, no going through the data write path
 at all.  Max inline == 0 should make it so we don't inline, so if it isn't
 honoring that then that may be a bug.  Thanks,

 Josef

I tried it on the 3.12-rc2 release, and it seems there is a bug then.
Please find the attached logs to confirm.
It is probably also present on the older release.

Thanks,
Aastha.

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org
[ 2808.125838] [do_writepages] inum: 260, nr_to_write: 9223372036854775807, writepages fn: 8129ce40
[ 2808.125977] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4
[ 2808.125980] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[ 2808.125983]  88003d6b5e60 88003d6b5e28 817a8a6f 88003fc0eb48
[ 2808.125987]  88002e325b48 88003d6b5e48 81105c1f 88002e325a00
[ 2808.125990]  880037b1f300 88003d6b5e88 810fb689 88003fc12c00
[ 2808.125993] Call Trace:
[ 2808.126003]  [817a8a6f] dump_stack+0x46/0x58
[ 2808.126010]  [81105c1f] do_writepages+0x4f/0x90
[ 2808.126013]  [810fb689] __filemap_fdatawrite_range+0x49/0x50
[ 2808.126013]  [810fc41e] filemap_fdatawrite_range+0xe/0x10
[ 2808.126013]  [812aafdc] btrfs_sync_file+0xac/0x380
[ 2808.126098]  [817aedd2] ? __schedule+0x692/0x730
[ 2808.126115]  [811752d8] do_fsync+0x58/0x90
[ 2808.126149]  [81148a6d] ? SyS_write+0x4d/0xa0
[ 2808.126165]  [8117565b] SyS_fsync+0xb/0x10
[ 2808.126182]  [817b83a2] system_call_fastpath+0x16/0x1b
[ 2808.127447] CPU: 0 PID: 10215 Comm: bb Not tainted 3.12.0-rc2-latest+ #4
[ 2808.127450] Hardware name: innotek GmbH VirtualBox, BIOS VirtualBox 12/01/2006
[ 2808.127452]  ea802440 88003d6b5ba8 817a8a6f 88003d6b5de0
[ 2808.127456]  88002e325870 88003d6b5cb8 812b7c9f 88003e2083d0
[ 2808.127459]   88003d6b5c48 817aedd2 88003c3707b0
[ 2808.127462] Call Trace:
[ 2808.127467]  [817a8a6f] dump_stack+0x46/0x58
[ 2808.127473]  [812b7c9f] __extent_writepage+0x6f/0x750
[ 2808.127479]  [817aedd2] ? __schedule+0x692/0x730
[ 2808.127483]  [810fac71] ? find_get_pages_tag+0x121/0x160
[ 2808.127487]  [812b8552] extent_write_cache_pages.isra.24.constprop.31+0x1d2/0x380
[ 2808.127492]  [81005877] ? show_trace_log_lvl+0x57/0x70
[ 2808.127496]  [812b89d9] extent_writepages+0x49/0x60
[ 2808.127501]  [8129f160] ? btrfs_submit_direct+0x6c0/0x6c0
[ 2808.127505]  [8129ce63] btrfs_writepages+0x23/0x30
[ 2808.127510]  [81105c3e] do_writepages+0x6e/0x90
[ 2808.127513]  [810fb689] __filemap_fdatawrite_range+0x49/0x50
[ 2808.127517]  [810fc41e] filemap_fdatawrite_range+0xe/0x10

Re: Questions regarding logging upon fsync in btrfs

2013-09-30 Thread Aastha Mehta
On 30 September 2013 22:47, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Sep 30, 2013 at 10:30:59PM +0200, Aastha Mehta wrote:
 On 30 September 2013 22:11, Josef Bacik jba...@fusionio.com wrote:
  On Mon, Sep 30, 2013 at 09:32:54PM +0200, Aastha Mehta wrote:
  On 29 September 2013 15:12, Josef Bacik jba...@fusionio.com wrote:
   On Sun, Sep 29, 2013 at 11:22:36AM +0200, Aastha Mehta wrote:
   Thank you very much for the reply. That clarifies a lot of things.
  
   I was trying a small test case that opens a file, writes a block of
   data, calls fsync and then closes the file. If I understand correctly,
   fsync would return only after all in-memory buffers have been
   committed to disk. I have added few print statements in the
   __extent_writepage function, and I notice that the function gets
   called a bit later after fsync returns. It seems that I am not
   guaranteed to see the data going to disk by the time fsync returns.
  
   Am I doing something wrong, or am I looking at the wrong place for
   disk write? This happens both with tree logging enabled as well as
   with notreelog.
  
  
   So 3.1 was a long time ago and to be sure it had issues I don't think 
   it was
   _that_ broken.  You are probably better off instrumenting a recent 
   kernel, 3.11
   or just build btrfs-next from git.  But if I were to make a guess I'd 
   say that
   __extent_writepage was how both data and metadata was written out at 
   the time (I
   don't think I changed it until 3.2 or something later) so what you are 
   likely
   seeing is the normal transaction commit after the fsync.  In the case of
   notreelog we are likely starting another transaction and you are seeing 
   that
   commit (at the time the transaction kthread would start a transaction 
   even if
   none had been started yet.)  Thanks,
  
   Josef
 
  Is there any special handling for very small file write, less than 4K? As
  I understand there is an optimization to inline the first extent in a 
  file if
  it is smaller than 4K, does it affect the writeback on fsync as well? I 
  did
  set the max_inline mount option to 0, but even then it seems there is
  some difference in fsync behaviour for writing first extent of less than 
  4K
  size and writing 4K or more.
 
 
  Yeah if the file is an inline extent then it will be copied into the log
  directly and the log will be written out, no going through the data write 
  path
  at all.  Max inline == 0 should make it so we don't inline, so if it isn't
  honoring that then that may be a bug.  Thanks,
 
  Josef

 I tried it on 3.12-rc2 release, and it seems there is a bug then.
 Please find attached logs to confirm.
 Also, probably on the older release.


 Oooh ok I understand, you have your printk's in the wrong place ;).
 do_writepages doesn't necessarily mean you are writing something.  If you want
 to see if stuff got written to the disk I'd put a printk at run_delalloc_range
 and have it spit out the range it is writing out since thats what we think is
 actually dirty.  Thanks,

 Josef

No, but I also placed dump_stack() at the beginning of
__extent_writepage. run_delalloc_range is called only from
__extent_writepage; if it were called, the dump_stack() at the
top of __extent_writepage would have printed as well, no?

Thanks

-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org


Re: Questions regarding logging upon fsync in btrfs

2013-09-29 Thread Aastha Mehta
Thank you very much for the reply. That clarifies a lot of things.

I was trying a small test case that opens a file, writes a block of
data, calls fsync and then closes the file. If I understand correctly,
fsync would return only after all in-memory buffers have been
committed to disk. I have added a few print statements in the
__extent_writepage function, and I notice that the function gets
called only a while after fsync returns. It seems that I am not
guaranteed to see the data going to disk by the time fsync returns.
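
Concretely, the test is just something like the following (a minimal
sketch; the path and write size are placeholders and error handling is
trimmed):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
	char buf[4096];
	memset(buf, 'a', sizeof(buf));

	int fd = open("/mnt/btrfs/tmp-file", O_CREAT | O_WRONLY | O_TRUNC, 0644);
	ssize_t n = write(fd, buf, sizeof(buf));   /* one block of data */
	printf("write success, size: %zd\n", n);

	if (fsync(fd) == 0)                        /* should persist data and metadata */
		printf("fsync:: Success\n");

	close(fd);
	return 0;
}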

Am I doing something wrong, or am I looking in the wrong place for the
disk write? This happens both with tree logging enabled and with
notreelog.

Thanks

On 29 September 2013 02:42, Josef Bacik jba...@fusionio.com wrote:
 On Sun, Sep 29, 2013 at 01:35:15AM +0200, Aastha Mehta wrote:
 Hi,

 I have few questions regarding logging triggered by calling fsync in BTRFS:

 1. If I understand correctly, fsync will call to log entire inode in
 the log tree. Does this mean that the data extents are also logged
 into the log tree? Are they copied into the log tree, or just
 referenced? Are they copied into the subvolume's extent tree again
 upon replay?


 The data extents are copied as well, as in the metadata that points to the 
 data,
  not the actual data itself.  For 3.1 it's all of the extents in the inode; from
  3.8 on it's only the extents that have changed this transaction.

 2. During replay, when the extents are added into the extent
 allocation tree, do they acquire the physical extent number during
 replay? Does they physical extent allocated to the data in the log
 tree differ from that in the subvolume?


 No the physical location was picked when we wrote the data out during fsync.  
 If
 we crash and re-mount the replay will just insert the ref into the extent tree
 for the disk offset as it replays the extents.

 3. I see there is a mount option of notreelog available. After
 disabling tree logging, does fsync still lead to flushing of buffers
 to the disk directly?


 notreelog just means that we write the data and wait on the ordered data 
 extents
 and then commit the transaction.  So you get the data for the inode you are
  fsyncing and all of the metadata for the entire file system that has changed
 in
 that transaction.

 4. Is it possible to selectively identify certain files in the log
 tree and flush them to disk directly, without waiting for the replay
 to do it?


 I don't understand this question, replay only happens on mount after a
 crash/power loss, and everything is replayed that is in the log, there is no 
 way
 to select which inode is replayed.  Thanks,

 Josef



-- 
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org


Questions regarding logging upon fsync in btrfs

2013-09-28 Thread Aastha Mehta
Hi,

I have a few questions regarding logging triggered by calling fsync in BTRFS:

1. If I understand correctly, fsync will log the entire inode in
the log tree. Does this mean that the data extents are also logged
into the log tree? Are they copied into the log tree, or just
referenced? Are they copied into the subvolume's extent tree again
upon replay?

2. During replay, when the extents are added into the extent
allocation tree, do they acquire the physical extent number during
replay? Does the physical extent allocated to the data in the log
tree differ from that in the subvolume?

3. I see there is a mount option of notreelog available. After
disabling tree logging, does fsync still lead to flushing of buffers
to the disk directly?

4. Is it possible to selectively identify certain files in the log
tree and flush them to disk directly, without waiting for the replay
to do it?

Thanks

-- 
Aastha Mehta


Re: Questions regarding logging upon fsync in btrfs

2013-09-28 Thread Aastha Mehta
I am using Linux kernel 3.1.10-1.16, just to let you know.

Thanks

On 29 September 2013 01:35, Aastha Mehta aasth...@gmail.com wrote:
 Hi,

 I have few questions regarding logging triggered by calling fsync in BTRFS:

 1. If I understand correctly, fsync will call to log entire inode in
 the log tree. Does this mean that the data extents are also logged
 into the log tree? Are they copied into the log tree, or just
 referenced? Are they copied into the subvolume's extent tree again
 upon replay?

 2. During replay, when the extents are added into the extent
 allocation tree, do they acquire the physical extent number during
 replay? Does they physical extent allocated to the data in the log
 tree differ from that in the subvolume?

 3. I see there is a mount option of notreelog available. After
 disabling tree logging, does fsync still lead to flushing of buffers
 to the disk directly?

 4. Is it possible to selectively identify certain files in the log
 tree and flush them to disk directly, without waiting for the replay
 to do it?

 Thanks

 --
 Aastha Mehta



-- 
Aastha Mehta


Re: basic questions regarding COW in Btrfs

2013-03-04 Thread Aastha Mehta
I must admit, it is quite convoluted :-)

Please tell me if I understand this. A file system tree (containing
the inodes, the extents of all the inodes, etc.) is itself laid out in
the leaf extents of another big tree, which is the root tree. This is
why you say that inode and other such metadata may be lying in the
leaf nodes. Correct?

I did not completely understand what you meant when you said that the
metadata (the file extent items and such) for the inodes are stored
inside the same tree that the inode resides in. I thought the
btrfs_file_extent_item associated with EXTENT_DATA_KEY corresponds to
the actual data of a file.

Okay, now I am not even sure whether in btrfs there is something like an
indirect block for a huge file. In file systems with a fixed block size, a
block can hold only so many pointers to data blocks, and hence as the file
size grows, indirect blocks are added to the file's tree. Is there any
equivalent indirect extent required for huge files in btrfs, or do all
the files fit within one level? If there are indirects, what item type
do they have? Would something like btrfs_get_extent() be useful to get
the indirect extents of a file?

Too many questions, sorry :(

Thanks.

On 4 March 2013 00:52, Josef Bacik jo...@toxicpanda.com wrote:
 On Sun, Mar 3, 2013 at 10:41 AM, Aastha Mehta aasth...@gmail.com wrote:
 Hi Josef,

 I have some more questions following up on my previous e-mails.
 I now do somewhat understand the place where extent entries get
 cow'ed. But I am unclear about the order of operations.

 Is it correct that the data extent written first, then the pointer in
 the indirect block needs to be updated, so then it is cowed and
 written to disk and so on recursively up the tree? Or is the entire
 path from leaf to node that is going to be affected by the write cowed
 first and then all the cowed extents are written to the disk and then
 the rest of the metadata pointers, (for example, in checksum tree,
 extent tree, etc., I am not sure about this)?

 The second one.  We COW the entire path from root to leaf as things
 need COW'ing.  We start a transaction, we insert the file extent
 entries, we add the checksums, and we add the delayed ref updates to
 the extent tree.  The delayed things are guaranteed to happen in that
 transaction so we have consistency there.  The COW'ing from top to
 bottom works like that for all trees.


 Also, I need to understand specifically how the data (leaf nodes) of a
 file is written to disk v/s the metadata including the indirect nodes
 of the file. In extent_writepage I only know the pages of a file that
 are to be written. I guess, I can identify metadata pages based on the
 inode of the page's owner. But is it possible to distinguish the pages
 available in extent_writepage path as belonging to the leaf node or
 internal node for a file? If it cannot be identified at this point,
 where earlier in the path can this be decided?


 So they are different things, and they could change from the time we
 write to the time that the write completes because of COW.  Also keep
 in mind that the metadata (the file extent items and such) for the
 inodes are not stored specifically within the inode, they're stored
 inside the same tree that the inode resides in.  So you can have a
 leaf node with multiple inodes and extents for those different inodes.
  And so any sort of random things can happen, other inodes can be
 deleted and this inode's metadata will be shifted into a new leaf, or
 another inode could be added and this inode's data could be pushed off
 into an adjacent leaf.  The only way to know which leaf/page the inode
 is associated with is to search for whatever you are looking for in
 the tree, and then while you are holding all of the locks and
 reference counting you can be sure that those pages contain the
 metadata you are looking for, but once you let that go there are no
 guarantees.

 So as far as how it is written to disk, that is where transactions
 come in.  We track all the dirty metadata pages we have per
 transaction, and then at transaction commit time we make sure that all
 of those pages are written to disk and then we commit our super to
 point to the new root of the tree root, which in turn points at all of
 our new roots because of COW.  These pages can be written before the
 commit though because of memory pressure, and if they are written and
 then modified again within in the same transaction we will re-cow them
 to make sure we don't have any partial-page updates.  Keeping track of
 where a specific inodes metadata is contained is a tricky business.
 Let me know if that helped.  Thanks,

 Josef


Re: basic questions regarding COW in Btrfs

2013-03-04 Thread Aastha Mehta
Okay, that makes a lot more sense to me now.

Thank you very much.

Regards,
Aastha.

On 5 March 2013 02:51, Josef Bacik jo...@toxicpanda.com wrote:
 On Mon, Mar 4, 2013 at 7:57 PM, Aastha Mehta aasth...@gmail.com wrote:
 I must admit, it is quite convoluted :-)

 Please tell me if I understand this. A file system tree (containing
 the inodes, the extents of all the inodes, etc.) is itself laid out in
 the leaf extents of another big tree, which is the root tree. This is
 why you say that inode and other such metadata may be lying in the
 leaf nodes. Correct?


 Sort of.  We have lot's of tree's, but the inode data is laid out in
 what we refer to as fs trees.  All these trees are just b-trees that
 have different data in them.  In the fs-trees they will hold inode
 items, directory items, file extent items, xattr items and orphan
 items.  So any given leaf in this tree could have any number of those
 items in them referring to any number of inodes.  You could have

 [inode item for inode 1][file extent item for inode 1][inode item for
 inode 2][xattr for inode 2][file extent item for inode 2]

 all contained within one leaf.  Does that make sense?

 I did not completely understand what you meant when you said that the
 metadata (the file extent items and such) for the inodes are stored
 inside the same tree that the inode resides in. I thought the
 btrfs_file_extent_item associated with EXTENT_DATA_KEY corresponds to
 the actual data of a file.


 Yes the btrfs_file_extent_item points to a [offset, size] pair that
 describes a data extent.

 Okay, now I am not even sure if in btrfs there is something like an
 indirect block for a huge file. In file systems with fixed block size,
 one can hold only as many pointers to data blocks and hence when the
 file size grows indirects are added in the file's tree. Is there any
 equivalent indirect extent required for huge files in btrfs, or do all
 the files fit within one level? If there are indirects, what item type
 do they have? Would something like btrfs_get_extent() be useful to get
 the indirect extents of a file?


 So there are no indirects, there are just btrfs_file_extent_items that
 are held within the btree that describe all of the extents that relate
 to a particular file.  So you can have (in the case of large
 fragmented files) hundreds of leaves within the btree that just
 contain btrfs_file_extent_items for all the ranges for a file.
 btrfs_get_extent() just looks up the relevant btrfs_file_extent_item
 for the range that you are wondering about, and maps it to our
 extent_map structure internally.  Hth,

 Josef


Re: basic questions regarding COW in Btrfs

2013-03-03 Thread Aastha Mehta
Hi Josef,

I have some more questions following up on my previous e-mails.
I now somewhat understand the place where extent entries get
cow'ed. But I am unclear about the order of operations.

Is it correct that the data extent is written first, then the pointer in
the indirect block needs to be updated, so it is cowed and
written to disk, and so on recursively up the tree? Or is the entire
path from leaf to root that is going to be affected by the write cowed
first, and then all the cowed extents are written to the disk, followed by
the rest of the metadata pointers (for example, in the checksum tree,
extent tree, etc.; I am not sure about this)?

Also, I need to understand specifically how the data (leaf nodes) of a
file is written to disk v/s the metadata including the indirect nodes
of the file. In extent_writepage I only know the pages of a file that
are to be written. I guess, I can identify metadata pages based on the
inode of the page's owner. But is it possible to distinguish the pages
available in extent_writepage path as belonging to the leaf node or
internal node for a file? If it cannot be identified at this point,
where earlier in the path can this be decided?

Many thanks,
Aastha.

On 25 February 2013 20:00, Aastha Mehta aasth...@gmail.com wrote:
 Ah okay, I now see how it works. Thanks a lot for your response.

 Regards,
 Aastha.


 On 25 February 2013 18:27, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Feb 25, 2013 at 08:15:40AM -0700, Aastha Mehta wrote:
 Thanks again Josef.

 I understood that cow_file_range is called for a regular file. Just to
 clarify, in cow_file_range is cow done at the time of reserving
 extents in the extent btree for the io to be done in this delalloc? I
 see the following comment above find_free_extent() which is called
 while trying to reserve extents:

 /*
  * walks the btree of allocated extents and find a hole of a given size.
  * The key ins is changed to record the hole:
  * ins->objectid == block start
  * ins->flags = BTRFS_EXTENT_ITEM_KEY
  * ins->offset == number of blocks
  * Any available blocks before search_start are skipped.
  */

 This seems to be the only place where a cow might be done, because a
 key is being inserted into an extent which modifies it.


 The key isn't inserted at this time, it's just returned with those values 
 for us
 to do as we please.  There is no update of the btree until
 insert_reserved_extent/btrfs_mark_extent_written in btrfs_finish_ordered_io.
 Thanks,

 Josef


Re: basic questions regarding COW in Btrfs

2013-02-25 Thread Aastha Mehta
Thanks again Josef.

I understood that cow_file_range is called for a regular file. Just to
clarify: in cow_file_range, is the cow done at the time of reserving
extents in the extent btree for the io to be done in this delalloc? I
see the following comment above find_free_extent(), which is called
while trying to reserve extents:

/*
 * walks the btree of allocated extents and find a hole of a given size.
 * The key ins is changed to record the hole:
 * ins->objectid == block start
 * ins->flags = BTRFS_EXTENT_ITEM_KEY
 * ins->offset == number of blocks
 * Any available blocks before search_start are skipped.
 */

This seems to be the only place where a cow might be done, because a
key is being inserted into an extent which modifies it.

Thanks,
Aastha.

On 24 February 2013 02:39, Josef Bacik jo...@toxicpanda.com wrote:
 On Thu, Feb 21, 2013 at 12:32 PM, Aastha Mehta aasth...@gmail.com wrote:

 Thanks a lot for the prompt response. I had seen that, but I am still
 not sure of where it really
 happens within fill_delalloc. Could you help me a little further in that
 path?


 So we check the properties of the inode and do one of 3 things, either we
 call btrfs_cow_file_range directly in the case of a normal file,
 run_delalloc_nocow in the case of a file with prealloc extents or NOCOW, or
 we do the compression dance.  We make an ordered extent for this range and
 return.  And then the normal io path happens.


 Secondly, now I am confused between the btree_writepages and
 btrfs_writepages/btrfs_writepage
 methods. I thought btrfs_writepages was for writing the pages holding
 inodes and btree_writepages
 for writing the other indirect and leaf extents of the btree. Then, it
 seems that the write operations
 lead to update of the file system data structures in a top-down
 manner, i.e. first changing the inode
 and then the data extents. Is that correct?


 You are right that btrfs_writepages/writepage are for normal files and
 btree_writepages is for the metadata.  The write operations do start in data
 and then modify metadata later down the line if that is what you are getting
 at.


 Thirdly, it seems that the old extents maybe dropped before the new
 extents are flushed to the disk.
 What would happen if the write fails before the disk commit? What am I
 missing here?


 Yeah, the metadata isn't updated until the data is on the disk.  In
 ->fill_delalloc we set up a btrfs_ordered_extent that describes the range of
 the dirty pages we are writing.  When we've written all these pages we run
 btrfs_finish_ordered_io, which will drop the old extent entries if there are
 any and then add the new extent entries and update the references and such.
 So if something fails we just continue to point to the original file extent
 entries and return an EIO, we maintain consistency by making sure the
 metadata is updated only after the data is written out.  I hope that helps.
 Thanks,

 Josef



--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org


Re: basic questions regarding COW in Btrfs

2013-02-25 Thread Aastha Mehta
Ah okay, I now see how it works. Thanks a lot for your response.

Regards,
Aastha.


On 25 February 2013 18:27, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Feb 25, 2013 at 08:15:40AM -0700, Aastha Mehta wrote:
 Thanks again Josef.

 I understood that cow_file_range is called for a regular file. Just to
 clarify, in cow_file_range is cow done at the time of reserving
 extents in the extent btree for the io to be done in this delalloc? I
 see the following comment above find_free_extent() which is called
 while trying to reserve extents:

 /*
  * walks the btree of allocated extents and find a hole of a given size.
  * The key ins is changed to record the hole:
  * ins->objectid == block start
  * ins->flags = BTRFS_EXTENT_ITEM_KEY
  * ins->offset == number of blocks
  * Any available blocks before search_start are skipped.
  */

 This seems to be the only place where a cow might be done, because a
 key is being inserted into an extent which modifies it.


 The key isn't inserted at this time, it's just returned with those values for 
 us
 to do as we please.  There is no update of the btree until
 insert_reserved_extent/btrfs_mark_extent_written in btrfs_finish_ordered_io.
 Thanks,

 Josef


Re: basic questions regarding COW in Btrfs

2013-02-23 Thread Aastha Mehta
A gentle reminder on this one.

Thanks,
Aastha.

On 21 February 2013 18:32, Aastha Mehta aasth...@gmail.com wrote:
 Thanks a lot for the prompt response. I had seen that, but I am still
 not sure of where it really
 happens within fill_delalloc. Could you help me a little further in that path?

 Secondly, now I am confused between the btree_writepages and
 btrfs_writepages/btrfs_writepage methods. I thought btrfs_writepages
 was for writing the pages holding inodes and btree_writepages for
 writing the other indirect and leaf extents of the btree. Then it
 seems that the write operations lead to updates of the file system
 data structures in a top-down manner, i.e. first changing the inode
 and then the data extents. Is that correct?

 Thirdly, it seems that the old extents may be dropped before the new
 extents are flushed to the disk. What would happen if the write fails
 before the disk commit? What am I missing here?

 Thanks,
 Aastha.

 On 20 February 2013 18:54, Josef Bacik jba...@fusionio.com wrote:
 On Wed, Feb 20, 2013 at 10:28:10AM -0700, Aastha Mehta wrote:
 Hello,

 I am trying to understand the COW mechanism in Btrfs. Is it correct to
 say that unless the nodatacow option is specified, Btrfs always performs
 COW for all the data+metadata extents used in the system?


 So we always cow the metadata, but yes, nodatacow means we don't cow the
 actual data in the data extents.
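
(As an aside, the no-cow behaviour for data can also be requested per file via
the C file attribute rather than the nodatacow mount option; a rough sketch
with made-up paths -- the attribute only takes effect if it is set while the
file is still empty, on reasonably recent kernels:)

 # touch /mnt/btrfs/nocow-file
 # chattr +C /mnt/btrfs/nocow-file
 # lsattr /mnt/btrfs/nocow-file
     (the 'C' flag should show up in the listing)
 # dd if=/dev/zero of=/mnt/btrfs/nocow-file bs=4096 count=64
     (later overwrites of this file's data are then done in place;
      the metadata describing it is still cowed)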

 I saw that COWing is implemented in the btrfs_cow_block() function, which
 is called at the time of searching a slot for a particular item, while
 inserting into a new slot, committing transactions, while creating
 pending snapshots and a few other places.

 However, while tracing through the complete write path, I could not
 quite figure out when extents actually get COWed. Could you please
 point me to the place where COWing takes place? Is there any time
 when, for performance or any other reasons, the extents are not COWed
 but overwritten in place (apart from the explicit nodatacow flag being
 set during mount)?

 You'll want to look at the tree operation ->fill_delalloc().  That's where we
 do cow_file_range().  We allocate new space and write.  When we finish the
 ordered io we do btrfs_drop_extents() on the range we just wrote, which will
 free up any existing extents that exist, and then insert our new file extent.
 Thanks,

 Josef



--
Aastha Mehta
MPI-SWS, Germany
E-mail: aasth...@mpi-sws.org


Re: basic questions regarding COW in Btrfs

2013-02-21 Thread Aastha Mehta
Thanks a lot for the prompt response. I had seen that, but I am still
not sure where it really happens within fill_delalloc. Could you help
me a little further along that path?

Secondly, now I am confused between the btree_writepages and
btrfs_writepages/btrfs_writepage methods. I thought btrfs_writepages
was for writing the pages holding inodes and btree_writepages for
writing the other indirect and leaf extents of the btree. Then it
seems that the write operations lead to updates of the file system
data structures in a top-down manner, i.e. first changing the inode
and then the data extents. Is that correct?

Thirdly, it seems that the old extents may be dropped before the new
extents are flushed to the disk. What would happen if the write fails
before the disk commit? What am I missing here?

Thanks,
Aastha.

On 20 February 2013 18:54, Josef Bacik jba...@fusionio.com wrote:
 On Wed, Feb 20, 2013 at 10:28:10AM -0700, Aastha Mehta wrote:
 Hello,

 I am trying to understand the COW mechanism in Btrfs. Is it correct to
 say that unless the nodatacow option is specified, Btrfs always performs
 COW for all the data+metadata extents used in the system?


 So we always cow the metadata, but yes nodatacow means we don't cow the actual
 data in the data extents.

 I saw that COWing is implemented in the btrfs_cow_block() function, which
 is called at the time of searching a slot for a particular item, while
 inserting into a new slot, committing transactions, while creating
 pending snapshots and a few other places.

 However, while tracing through the complete write path, I could not
 quite figure out when extents actually get COWed. Could you please
 point me to the place where COWing takes place? Is there any time
 when, for performance or any other reasons, the extents are not COWed
 but overwritten in place (apart from the explicit nodatacow flag being
 set during mount)?

 You'll want to look at the tree operation ->fill_delalloc().  That's where we
 do cow_file_range().  We allocate new space and write.  When we finish the
 ordered io we do btrfs_drop_extents() on the range we just wrote, which will
 free up any existing extents that exist, and then insert our new file extent.
 Thanks,

 Josef


Re: basic questions regarding some btrfs features

2012-12-04 Thread Aastha Mehta
Thanks a lot for the explanation.

Regards,
Aastha.

On 3 December 2012 13:02, Hugo Mills h...@carfax.org.uk wrote:
 On Mon, Dec 03, 2012 at 10:52:41AM +0100, Aastha Mehta wrote:
 On 2 December 2012 23:46, Hugo Mills h...@carfax.org.uk wrote:
  On Sun, Dec 02, 2012 at 11:17:26PM +0100, Aastha Mehta wrote:
  I am looking at btrfs to understand some of its features. One of them
  is the snapshot feature. Please tell me if my following understanding
  about snapshots in btrfs is correct or not.
 
  Btrfs supports both readonly and writeable snapshots. Writeable
  snapshots are like clone volumes (or subvolumes as in btrfs). We get a
  point-in-time copy of the subvolume in both cases.
 
  I looked through the kernel code and it seems that creating a
  subvolume and taking a snapshot (readonly and writeable) all have a
  common ioctl interface.
 
  What I am not completely clear about is whether snapshots get the same
  fsid as the source subvolume or a different one.
 
 Yes, it's the same UUID, because they're all part of the same
  filesystem.
 
 Just to clarify, apart from the UUID, is the FSID in the fs_info of the
 root also the same for all snapshots of a subvolume?

  Also, I do not understand what it means to be able to take a
  snapshot of a snapshot.
 
 Snapshots are completely equal partners with their original
  subvolumes. This is not the case in, say, LVM.
 
  What are the benefits compared to, say, being able to take snapshots only
  of the active subvolume and not of the snapshots?
 
 Let's say you take a snapshot (B) of your root filesystem (A). Then
  you decide to roll back to using the old version, so you mount B as
  root instead of A. Later that night, your backup process starts up and
  tries to take a temporary read-only snapshot (C) of your root
  filesystem (which is now B) so that it can make a stable backup.
  That's a snapshot of a snapshot.
 
 Okay, but a snapshot can still be taken only of a subvolume that is in
 use. Is that correct?

Well, it depends on what you mean by in use. You can't snapshot
 something which doesn't appear somewhere in your directory hierarchy.

 In your example, C is taken on B after the file system was rolled back
 to version B. What happens when the file system version mounted is A
 (which contains snapshot B) and we take another snapshot D of this
 mounted version? Does the snapshot D contain B or only the active
 contents of A?

Snapshots are not recursive. If you have a subvolume inside another:

 subv1/subv2

 and then snapshot that

 # btrfs sub snap subv1 subv1-A

 you will end up with a subvolume subv1-A, containing an empty
 directory called subv2.
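
(A quick way to check this on the example above:)

 # ls -A subv1-A/subv2
     (prints nothing -- the nested subvolume's contents are not part of
      the snapshot, only an empty directory in its place)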

Note that you don't have to have subvolumes inside subvolumes at
 all, if you don't use the top level of your filesystem as anything
 other than a place to store and manage subvolumes.

Consider this btrfs filesystem layout:

 /                   subvolid=5 (=0)
     root            subvolid=256  (default subvol)
     home            subvolid=257
     snapshots       directory

 With this in fstab:

 /dev/sda    /               btrfs   subvolid=256
 /dev/sda    /home           btrfs   subvolid=257
 /dev/sda    /media/btrfs    btrfs   subvolid=5,noauto

 We get this filesystem hierarchy:

 /                   subvolid=256
     home            subvolid=257
     media           directory
         btrfs       subvolid=5

Note that the mount of the full filesystem on /media/btrfs isn't
 done automatically -- it only needs to be done when you're managing
 subvolumes.

So, we can take a snapshot of /, for example:

 # mount /media/btrfs
 # btrfs sub snap /media/btrfs/root /media/btrfs/snapshots/root.2012-12-03
 # umount /media/btrfs

 The FS (from its top level) now looks like this:

 /                       subvolid=5 (=0)
     root                subvolid=256  (default subvol)
     home                subvolid=257
     snapshots           directory
         root.2012-12-03 subvolid=258

 To roll back root temporarily to the earlier version, you can edit
 your boot manager's config to supply subvolid=258 as a mount
 parameter. To do so permanently, you can set the default subvolume to
 258, and optionally move the snapshot to /root within the btrfs
 filesystem:

 # mount /media/btrfs
 # mv /media/btrfs/root /media/btrfs/root.old
 # mv /media/btrfs/snapshots/root.2012-12-03 /media/btrfs/root
 # umount /media/btrfs
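
(The set-default route mentioned above would look roughly like this, 258
being the snapshot's subvolume ID from the listing; note that with the fstab
above, the explicit subvolid=256 option would also need updating:)

 # mount /media/btrfs
 # btrfs subvolume set-default 258 /media/btrfs
 # umount /media/btrfs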

Hugo.

 --
 === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
   PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- I'm on a 30-day diet. So far I've lost 18 days. ---


Re: basic questions regarding some btrfs features

2012-12-03 Thread Aastha Mehta
Hello,

Thank you so much for your prompt response. A few more questions inline.

On 2 December 2012 23:46, Hugo Mills h...@carfax.org.uk wrote:
 On Sun, Dec 02, 2012 at 11:17:26PM +0100, Aastha Mehta wrote:
 I am looking at btrfs to understand some of its features. One of them
 is the snapshot feature. Please tell me if my following understanding
 about snapshots in btrfs is correct or not.

 Btrfs supports both readonly and writeable snapshots. Writeable
 snapshots are like clone volumes (or subvolumes as in btrfs). We get a
 point-in-time copy of the subvolume in both cases.

 I looked through the kernel code and it seems that creating a
 subvolume and taking a snapshot (readonly and writeable) all have a
 common ioctl interface.

 What I am not completely clear about is whether snapshots get the same
 fsid as the source subvolume or a different one.

Yes, it's the same UUID, because they're all part of the same
 filesystem.
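
(A quick way to see this, with a made-up device name: btrfs filesystem show
prints one filesystem UUID for the whole filesystem, however many subvolumes
and snapshots it contains:)

 # btrfs filesystem show /dev/sdX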

Just to clarify, apart from the UUID, is the FSID in the fs_info of the
root also the same for all snapshots of a subvolume?

 Also, I do not understand what it means to be able to take a
 snapshot of a snapshot.

Snapshots are completely equal partners with their original
 subvolumes. This is not the case in, say, LVM.

 What are the benefits compared to, say, being able to take snapshots only
 of the active subvolume and not of the snapshots?

Let's say you take a snapshot (B) of your root filesystem (A). Then
 you decide to roll back to using the old version, so you mount B as
 root instead of A. Later that night, your backup process starts up and
 tries to take a temporary read-only snapshot (C) of your root
 filesystem (which is now B) so that it can make a stable backup.
 That's a snapshot of a snapshot.
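
(As a concrete command sequence for that scenario -- paths and names made up,
with the top-level subvolume mounted at /mnt/top for management:)

 # btrfs sub snap /mnt/top/root /mnt/top/root-B
     (writable snapshot B of the original root subvolume A)
 ... reboot with the root filesystem mounted from root-B,
     e.g. by passing rootflags=subvol=root-B ...
 # btrfs sub snap -r /mnt/top/root-B /mnt/top/backup-C
     (read-only snapshot C of B, which a backup job can read from safely)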

Okay, but a snapshot can still be taken only of a subvolume that is in
use. Is that correct? In your example, C is taken on B after the file
system was rolled back to version B. What happens when the file system
version mounted is A (which contains snapshot B) and we take another
snapshot D of this mounted version? Does the snapshot D contain B or
only the active contents of A?

 Probably before that, I need to get some clarity on why a subvolume
 always belongs in the directory of some parent subvolume. Is it
 possible to have more than one root subvolume, or more than one
 subvolume in the same parent subvolume directory?

No, there's precisely one top-level subvolume (subvolid=5).
 Everything else in the filesystem lives within that. However, you can
 have as many subvolumes as you like below that, and in whatever
 directories or subvolumes you want.
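
(For example, with the top-level subvolume mounted at a made-up /mnt/top:)

 # btrfs subvolume create /mnt/top/projects
 # btrfs subvolume create /mnt/top/projects/scratch
 # btrfs subvolume list /mnt/top
     (lists every subvolume in the filesystem with its ID; all of them
      ultimately live below the top-level subvolume, subvolid=5)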

Hugo.

 --
 === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
   PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Quidquid latine dictum sit,  altum videtur. ---


Thanks again,
Aastha.