2.6.37 BUG at inode.c:1616 (was Re: 2.6.37: Bug on btrfs while umount)

2011-01-10 Thread Andy Isaacson
On Thu, Jan 06, 2011 at 08:29:12PM -0500, Chris Mason wrote:
  [50010.838804] [ cut here ]
  [50010.838931] kernel BUG at fs/btrfs/inode.c:1616!
  [50010.839053] invalid opcode:  [#1] PREEMPT SMP
[snip]
  [50010.839653] Pid: 1681, comm: btrfs-endio-wri Not tainted 2.6.37 #1
 
 Could you please pull from the master branch of the btrfs unstable tree.
 We had a late fix that is related to this.

I saw BUG at inode.c:1616 while running 2.6.37-rc6-11882-g55ec86f, I saw
your message and upgraded to Linus tip (0c21e3a) + btrfs-unstable tip
(65e5341), and I just saw it again.  Including both BUG traces below.

The machine is a Core i7 with 12GB, with btrfs spanning three volumes:

Label: btr  uuid: 1271de53-b3d2-4d68-9d48-b19487e1c982
Total devices 3 FS bytes used 735.97GB
devid    1 size 18.65GB used 18.64GB path /dev/sda2
devid    2 size 512.00GB used 511.88GB path /dev/sdb1
devid    3 size 512.00GB used 225.26GB path /dev/sdc1

The primary writer to the filesystem is rtorrent; normally I have ffmpeg
writing to the filesystem at about 100 kbyte/sec as well, but it wasn't
running in this latest crash.

[ 9275.240027] [ cut here ]
[ 9275.249991] kernel BUG at fs/btrfs/inode.c:1616!
[ 9275.259914] invalid opcode:  [#1] SMP 
[ 9275.269794] last sysfs file: 
/sys/devices/pci:00/:00:1a.7/usb1/1-4/1-4:1.0/host8/target8:0:0/8:0:0:0/block/sdd/stat
[ 9275.280066] CPU 0 
[ 9275.280127] Modules linked in: tun ebtable_nat ebtables ipt_MASQUERADE 
iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack 
ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel 
kvm xfs exportfs loop snd_hda_codec_hdmi snd_hda_codec_realtek radeon ttm 
drm_kms_helper drm snd_hda_intel snd_hda_codec i2c_algo_bit snd_usb_audio 
uvcvideo snd_hwdep i2c_i801 snd_usbmidi_lib snd_pcm snd_rawmidi snd_timer 
videodev snd_seq_device snd v4l2_compat_ioctl32 pcspkr i2c_core serio_raw 
soundcore snd_page_alloc processor tpm_tis tpm tpm_bios evdev shpchp button 
thermal_sys ext3 jbd mbcache dm_mod btrfs zlib_deflate crc32c libcrc32c 
usb_storage uas sd_mod crc_t10dif ehci_hcd usbcore ahci libahci libata r8169 
scsi_mod mii nls_base [last unloaded: scsi_wait_scan]
[ 9275.358450] 
[ 9275.369821] Pid: 3654, comm: btrfs-endio-wri Not tainted 
2.6.37-03739-gccda756 #73 MSI X58 Pro-E (MS-7522)/MS-7522
[ 9275.381570] RIP: 0010:[a0152824]  [a0152824] 
T.1234+0x76/0x201 [btrfs]
[ 9275.393380] RSP: 0018:88025f275c30  EFLAGS: 00010286
[ 9275.405100] RAX: ffe4 RBX: 88032b596b40 RCX: 88032b596c60
[ 9275.416865] RDX:  RSI: ea000b17b8d0 RDI: fff4
[ 9275.428666] RBP: 88025f275cc0 R08: 0005 R09: 88025f2759a0
[ 9275.440522] R10: 88025f275970 R11: dead00100100 R12: 880083a7e888
[ 9275.452374] R13: 06e0c000 R14: 880331d7c800 R15: 8800bb38d880
[ 9275.464146] FS:  () GS:8800bf40() 
knlGS:
[ 9275.475923] CS:  0010 DS:  ES:  CR0: 8005003b
[ 9275.487673] CR2: 7f3eada57000 CR3: 01603000 CR4: 26e0
[ 9275.499547] DR0:  DR1:  DR2: 
[ 9275.511395] DR3:  DR6: 0ff0 DR7: 0400
[ 9275.523192] Process btrfs-endio-wri (pid: 3654, threadinfo 88025f274000, 
task 88032554b020)
[ 9275.535168] Stack:
[ 9275.546852]  06e0c000 1000 00b8c1376000 
1000
[ 9275.558616]  880331d7c800 0001 8800bb38d880 
880331d7c800
[ 9275.570416]  88025f275cb0 a014b53f 88025f275ce0 
8802e21ad7f0
[ 9275.582149] Call Trace:
[ 9275.593731]  [a014b53f] ? start_transaction+0x1a9/0x1d8 [btrfs]
[ 9275.605513]  [a0152e1e] btrfs_finish_ordered_io+0x1e6/0x2c2 [btrfs]
[ 9275.617426]  [a0152f14] btrfs_writepage_end_io_hook+0x1a/0x1c 
[btrfs]
[ 9275.629403]  [a0166871] end_bio_extent_writepage+0xae/0x159 [btrfs]
[ 9275.641463]  [81125947] bio_endio+0x2d/0x2f
[ 9275.653462]  [a01470a0] end_workqueue_fn+0x111/0x120 [btrfs]
[ 9275.665484]  [a016ecc2] worker_loop+0x195/0x4c4 [btrfs]
[ 9275.677451]  [a016eb2d] ? worker_loop+0x0/0x4c4 [btrfs]
[ 9275.689317]  [a016eb2d] ? worker_loop+0x0/0x4c4 [btrfs]
[ 9275.701079]  [81061a8b] kthread+0x82/0x8a
[ 9275.712839]  [8100aaa4] kernel_thread_helper+0x4/0x10
[ 9275.724455]  [81061a09] ? kthread+0x0/0x8a
[ 9275.735873]  [8100aaa0] ? kernel_thread_helper+0x0/0x10
[ 9275.747329] Code: 0f 0b eb fe 80 88 88 00 00 00 08 45 31 c9 48 8b 4d 88 4c 
8d 45 c0 4c 01 e9 4c 89 ea 4c 89 e6 4c 89 ff e8 7c 4c 00 00 85 c0 74 04 0f 0b 
eb fe 49 8b 84 24 a8 00 00 00 4c 89 6d a9 48 89 45 a0 c6 
[ 9275.771177] RIP  [a0152824] T.1234+0x76/0x201 [btrfs]
[ 9275.782973]  RSP 88025f275c30

Re: Synching a Backup Server

2011-01-10 Thread Hubert Kario
On Sunday 09 of January 2011 12:46:59 Alan Chandler wrote:
 On 07/01/11 16:20, Hubert Kario wrote:
  I usually create subvolumes in the btrfs root volume:
  
  /mnt/btrfs/
   |- server-a
   |- server-b
   \- server-c
  
  then create snapshots of these directories:
  
  /mnt/btrfs/
   |- server-a
   |- server-b
   |- server-c
   |- snapshots-server-a
   |   |- @GMT-2010.12.21-16.48.09
   |   \- @GMT-2010.12.22-16.45.14
   |- snapshots-server-b
   \- snapshots-server-c
  
  This way I can use the shadow_copy module for Samba to publish the
  snapshots to Windows clients.
 
 Can you post some actual commands to do this part?

# create the filesystem and mount it (this mounts the default subvolume)
mkfs.btrfs /dev/sdx
mount /dev/sdx /mnt/btrfs
# to be able to snapshot individual servers we have to put them in individual
# subvolumes
btrfs subvolume create /mnt/btrfs/server-a
btrfs subvolume create /mnt/btrfs/server-b
btrfs subvolume create /mnt/btrfs/server-c
# copy data over
rsync --exclude /proc [...] r...@server-a:/ /mnt/btrfs/server-a
rsync --exclude /proc [...] r...@server-b:/ /mnt/btrfs/server-b
rsync --exclude /proc [...] r...@server-c:/ /mnt/btrfs/server-c
# create snapshot directories (in the default subvolume)
mkdir /mnt/btrfs/{snapshots-server-a,snapshots-server-b,snapshots-server-c}
# create a snapshot from the synced data
btrfs subvolume snapshot /mnt/btrfs/server-a \
    /mnt/btrfs/snapshots-server-a/@GMT-2010.12.21-16.48.09
# copy new data over
rsync --inplace --exclude /proc [...] r...@server-a:/ /mnt/btrfs/server-a
# make a new snapshot
btrfs subvolume snapshot /mnt/btrfs/server-a \
    /mnt/btrfs/snapshots-server-a/@GMT-2010.12.22-16.45.14

In the end we have five subvolumes, two of which are snapshots of server-a.
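
The @GMT-... timestamp format above is what Samba's shadow-copy VFS modules
expect by default. A minimal sketch of a matching share definition, assuming
the shadow_copy2 module and the layout above (the share name and option
values are illustrative, not from the original setup):

[server-a]
    path = /mnt/btrfs/server-a
    vfs objects = shadow_copy2
    # where the @GMT-* snapshots live
    shadow:snapdir = /mnt/btrfs/snapshots-server-a
    shadow:basedir = /mnt/btrfs/server-a
    shadow:sort = desc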
 
 I am extremely confused about btrfs subvolumes v the root filesystem and
 mounting, particularly in relation to the default subvolume.
 
 For instance, if I create the initial file system using mkfs.btrfs and
 then mount it on /mnt/btrfs, is there already a default subvolume, or do
 I have to make one?  What happens when you unmount the whole filesystem
 and then come back?
 
 The wiki also makes the following statement
 
 *Note:* to be mounted the subvolume or snapshot have to be in the root
 of the btrfs filesystem.
 
 
 but you seem to have snapshots one layer down from the root.
 
 
 I am trying to use this method for my offsite backups - to a large spare
 sata disk loaded via a usb port.
 
 I want to create the main filesystem (and possibly a subvolume - this is
 where I start to get confused) and rsync my current daily backup files
 to it.  I would then also (just so I get the correct time - rather than
 do it at the next cycle, as explained below) take a snapshot with a time
 label. I would transport this disk offsite.
 
 I would repeat this in a month's time with a totally different disk.
 
 In a couple of months' time, when I come to recycle the first disk for
 my offsite backup, I would mount the retrieved disk (and again I am
 confused: mount the complete filesystem or the subvolume?), rsync again
 (--inplace? is this necessary?) the various backup files from my
 server, and take another snapshot.

You mount the default subvolume; this way you have access to all the data on
the HDD. --inplace is necessary.
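
As a sketch of the two mount styles (assuming the backup disk shows up as
/dev/sdx, a placeholder name):

# mounting plain gives you the default (top-level) subvolume, with all
# subvolumes and snapshots visible beneath it
mount /dev/sdx /mnt/btrfs
# a single subvolume can also be mounted by name
mount -o subvol=server-a /dev/sdx /mnt/server-a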

 
 I am hoping that this would effectively allow me to leave the snapshot I
 took last time in place; because not everything will have changed, it
 won't have used much space, so effectively I can keep quite a long
 stream of backup snapshots in place offsite.

yes

 
 Eventually of course the disk will start to become full, but I assume I
 can reclaim the space by deleting some of the old snapshots.

yes, of course:

btrfs subvolume delete /mnt/btrfs/snapshots-server-a/@GMT-2010.12.21-16.48.09

will reclaim the space used up by the deltas
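
For example, a hypothetical cleanup loop (the retention count is arbitrary;
the @GMT-* names sort chronologically, so the oldest come first):

# keep only the 12 newest snapshots of server-a (GNU head)
for snap in $(ls -d /mnt/btrfs/snapshots-server-a/@GMT-* | head -n -12); do
    btrfs subvolume delete "$snap"
done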

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl


Backup Command

2011-01-10 Thread Carl Cook

Here is my proposed cron:

btrfs subvolume snapshot hex:///home /media/backups/snapshots/hex-{DATE}

rsync --archive --hard-links --delete-during --delete-excluded --inplace 
--numeric-ids -e ssh --exclude-from=/media/backups/exclude-hex hex:///home 
/media/backups/hex

btrfs subvolume snapshot droog:///home /media/backups/snapshots/droog-{DATE}

rsync --archive --hard-links --delete-during --delete-excluded --inplace 
--numeric-ids -e ssh --exclude-from=/media/backups/exclude-droog droog:///home 
/media/backups/droog

Comments?  Criticisms?



Re: Backup Command

2011-01-10 Thread Hubert Kario
On Monday 10 of January 2011 14:25:32 Carl Cook wrote:
 Here is my proposed cron:
 
 btrfs subvolume snapshot hex:///home /media/backups/snapshots/hex-{DATE}
 
 rsync --archive --hard-links --delete-during --delete-excluded --inplace
 --numeric-ids -e ssh --exclude-from=/media/backups/exclude-hex hex:///home
 /media/backups/hex
 
 btrfs subvolume snapshot droog:///home
 /media/backups/snapshots/droog-{DATE}
 
 rsync --archive --hard-links --delete-during --delete-excluded --inplace
 --numeric-ids -e ssh --exclude-from=/media/backups/exclude-droog
 droog:///home /media/backups/droog
 
 Comments?  Criticisms?

This will make the dates associated with the snapshots offset by one cron
interval.

In other words, if you run the above script daily, you will have data from
2011.01.01 in the hex-2011.01.02 directory.

What I do is save the current date, take an LVM snapshot on the source, rsync
--inplace the data over, and then take a local snapshot, naming the directory
with the saved date. This way the date in the name of the backup directory is
accurate to about a second.
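
A minimal sketch of that ordering (the host, volume group, and paths are
placeholders, and /media/backups/hex is assumed to be a btrfs subvolume):

# remember when this backup cycle started
DATE=$(date -u +@GMT-%Y.%m.%d-%H.%M.%S)
# freeze the source with an LVM snapshot (hypothetical VG/LV names)
ssh root@hex 'lvcreate --snapshot --size 5G --name home-snap /dev/vg0/home &&
              mount /dev/vg0/home-snap /mnt/home-snap'
# sync from the frozen copy
rsync --archive --inplace --numeric-ids root@hex:/mnt/home-snap/ /media/backups/hex/
# snapshot the synced data under the saved date
btrfs subvolume snapshot /media/backups/hex /media/backups/snapshots/hex-"$DATE"
# drop the source-side LVM snapshot
ssh root@hex 'umount /mnt/home-snap && lvremove -f /dev/vg0/home-snap'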
-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl


Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs

2011-01-10 Thread Wu Fengguang
Shaohua,

On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
 Hi,
   We have file readahead to do async file reads, but no metadata
 readahead. For a list of files, their metadata is stored in fragmented
 disk space, and metadata reads are sync operations, which hurts the
 efficiency of readahead a lot. These patches try to add metadata readahead
 for btrfs.
   In btrfs, metadata is stored in the btree_inode. Ideally we could hook
 the inode up to an fd so we could use existing syscalls (readahead, mincore
 or the upcoming fincore) to do readahead, but the inode is hidden and there
 is no easy way to do this, from my understanding. So we add two ioctls for

If that is the main obstacle, why not do straightforward fincore()/
fadvise(), and add ioctls to btrfs to export/grab the hidden
btree_inode in any form?  This will address btrfs' specific issue, and
have the benefit of making the VFS part general enough. You know
ext2/3/4 already have block_dev ready for metadata readahead.

Thanks,
Fengguang

 this. One is like the readahead syscall, the other is like the
 mincore/fincore syscall.
   On a hard-disk-based netbook running MeeGo, the metadata readahead
 reduced boot time by about 3.5s on average, from a total of 16s.
   Last time I posted similar patches to the btrfs mailing list, adding the
 new ioctls in btrfs-specific ioctl code, but Christoph Hellwig asked for a
 generic interface so other filesystems can share some code, so I came up
 with this new one. Comments and suggestions are welcome!
 
 v1->v2:
 1. Added more comments and fixed return values, as suggested by Andrew Morton
 2. Fixed a race condition pointed out by Yan Zheng
 
 initial post:
 http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2
 
 Thanks,
 Shaohua
 


Re: Offline Deduplication for Btrfs

2011-01-10 Thread Ric Wheeler


I think that dedup has a variety of use cases that are all very dependent on
your workload. The approach you have here seems to be quite a reasonable one.

I did not see it in the code, but it would be great to be able to collect
statistics on how effective your hash is, and any counters for the extra IO
imposed.

It would also be very useful to have a paranoid mode where, when you see a
hash collision (dedup candidate), you fall back to a byte-by-byte compare to
verify that the collision is a true duplicate.  Keeping stats on how often
this is a false collision would be quite interesting as well :)


Ric



Re: Offline Deduplication for Btrfs

2011-01-10 Thread Josef Bacik
On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:

 I think that dedup has a variety of use cases that are all very dependent 
 on your workload. The approach you have here seems to be a quite 
 reasonable one.

 I did not see it in the code, but it is great to be able to collect 
 statistics on how effective your hash is and any counters for the extra 
 IO imposed.


So I have counters for how many extents are deduped and the overall file
savings, is that what you are talking about?

 Also very useful to have a paranoid mode where when you see a hash 
 collision (dedup candidate), you fall back to a byte-by-byte compare to 
 verify that the the collision is correct.  Keeping stats on how often 
 this is a false collision would be quite interesting as well :)


So I've always done a byte-by-byte compare, first in userspace but now it's in
the kernel, because frankly I don't trust hashing algorithms with my data.  It
would be simple enough to keep statistics on how often the byte-by-byte
compare comes out wrong, but really this is there to catch changes to the
file, so I suspect most of these statistics would show simply that the file
changed, not that the hash collided.  Thanks,

Josef
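
For illustration only, the hash-then-verify idea can be sketched in userspace
at whole-file granularity (the real code works on extents in the kernel; the
paths are placeholders and filenames are assumed to contain no whitespace):

# list content hashes, then find hashes that occur more than once
sha256sum /mnt/btrfs/data/* | sort > /tmp/hashes
awk '{print $1}' /tmp/hashes | uniq -d | while read h; do
    # files sharing a hash are only *candidates* for dedup
    grep "^$h " /tmp/hashes | awk '{print $2}' > /tmp/group
    first=$(head -n 1 /tmp/group)
    tail -n +2 /tmp/group | while read f; do
        # paranoid byte-by-byte compare before declaring a duplicate
        cmp -s "$first" "$f" && echo "duplicate: $f == $first"
    done
done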


Re: Offline Deduplication for Btrfs

2011-01-10 Thread Chris Mason
Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
 On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
 
  I think that dedup has a variety of use cases that are all very dependent 
  on your workload. The approach you have here seems to be a quite 
  reasonable one.
 
  I did not see it in the code, but it is great to be able to collect 
  statistics on how effective your hash is and any counters for the extra 
  IO imposed.
 
 
 So I have counters for how many extents are deduped and the overall file
 savings, is that what you are talking about?
 
  Also very useful to have a paranoid mode where when you see a hash 
  collision (dedup candidate), you fall back to a byte-by-byte compare to 
  verify that the the collision is correct.  Keeping stats on how often 
  this is a false collision would be quite interesting as well :)
 
 
 So I've always done a byte-by-byte compare, first in userspace but now its in
 kernel, because frankly I don't trust hashing algorithms with my data.  It 
 would
 be simple enough to keep statistics on how often the byte-by-byte compare 
 comes
 out wrong, but really this is to catch changes to the file, so I have a
 suspicion that most of these statistics would be simply that the file changed,
 not that the hash was a collision.  Thanks,

At least in the kernel you're comparing extents on disk that are
from a committed transaction, so the contents won't change.  We could read
into a private buffer instead of into the file's address space to make
this more reliable/strict.

-chris


Re: Offline Deduplication for Btrfs

2011-01-10 Thread Josef Bacik
On Mon, Jan 10, 2011 at 10:39:56AM -0500, Chris Mason wrote:
 Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
  On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
  
   I think that dedup has a variety of use cases that are all very dependent 
   on your workload. The approach you have here seems to be a quite 
   reasonable one.
  
   I did not see it in the code, but it is great to be able to collect 
   statistics on how effective your hash is and any counters for the extra 
   IO imposed.
  
  
  So I have counters for how many extents are deduped and the overall file
  savings, is that what you are talking about?
  
   Also very useful to have a paranoid mode where when you see a hash 
   collision (dedup candidate), you fall back to a byte-by-byte compare to 
   verify that the the collision is correct.  Keeping stats on how often 
   this is a false collision would be quite interesting as well :)
  
  
  So I've always done a byte-by-byte compare, first in userspace but now its 
  in
  kernel, because frankly I don't trust hashing algorithms with my data.  It 
  would
  be simple enough to keep statistics on how often the byte-by-byte compare 
  comes
  out wrong, but really this is to catch changes to the file, so I have a
  suspicion that most of these statistics would be simply that the file 
  changed,
  not that the hash was a collision.  Thanks,
 
 At least in the kernel, if you're comparing extents on disk that are
 from a committed transaction.  The contents won't change.  We could read
 into a private buffer instead of into the file's address space to make
 this more reliable/strict.
 

Right, sorry, I was talking about the userspace case.  Thanks,

Josef


Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs

2011-01-10 Thread Shaohua Li
On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
 On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
  On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
   Shaohua,
   
   On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
Hi,
  We have file readahead to do async file read, but has no metadata
readahead. For a list of files, their metadata is stored in fragmented
disk space and metadata read is a sync operation, which impacts the
efficiency of readahead much. The patches try to add metadata readahead
for btrfs.
  In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
the inode to a fd so we could use existing syscalls (readahead, mincore
or upcoming fincore) to do readahead, but the inode is hidden, there is
no easy way for this from my understanding. So we add two ioctls for
   
   If that is the main obstacle, why not do straightforward fincore()/
   fadvise(), and add ioctls to btrfs to export/grab the hidden
   btree_inode in any form?  This will address btrfs' specific issue, and
   have the benefit of making the VFS part general enough. You know
   ext2/3/4 already have block_dev ready for metadata readahead.
  I forgot to update this comment. Please see patch 2 and patch 4: both
  incore and readahead need btrfs-specific stuff involved, so we can't use
  generic fincore or something like it.
 
 You can if you like :)
 
 - fincore() can return the referenced bit, which is generally
   useful information
Metadata pages in ext2/3 don't have the referenced bit set, while btrfs pages
do, so we can't blindly filter out such pages by that bit. fincore could take
a parameter, or return a bit to distinguish referenced pages, but I don't
think that's a good API. This should be transparent to userspace.

 - btrfs_metadata_readahead() can be passed to some (faked)
   ->readpages() for use with fadvise.
This needs a filesystem-specific hook too; the difference is that your
proposal uses fadvise while I'm using an ioctl. There isn't a big difference.

BTW, it's hard to hook the btrfs_inode to an fd even with an ioctl; at least
I didn't find an easy way to do it. It might be possible by, for example,
adding a fake device or fake fs (anon_inode doesn't work here, IIRC), which
is a bit ugly. Until it's proved that a generic API can handle metadata
readahead, I don't want to do it.

Thanks,
Shaohua

this. One is like readahead syscall, the other is like mincore/fincore
syscall.
  Under a harddisk based netbook with Meego, the metadata readahead
reduced about 3.5s boot time in average from total 16s.
  Last time I posted similar patches to btrfs maillist, which adds the
new ioctls in btrfs specific ioctl code. But Christoph Hellwig asks we
have a generic interface to do this so other filesystem can share some
code, so I came up with the new one. Comments and suggestions are
welcome!

v1->v2:
1. Added more comments and fix return values suggested by Andrew Morton
2. fix a race condition pointed out by Yan Zheng

initial post:
http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2

Thanks,
Shaohua

  
  




Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs

2011-01-10 Thread Wu Fengguang
On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
 On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
  On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
   On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
Shaohua,

On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
 Hi,
  We have file readahead to do async file read, but has no metadata
 readahead. For a list of files, their metadata is stored in fragmented
 disk space and metadata read is a sync operation, which impacts the
 efficiency of readahead much. The patches try to add metadata
 readahead
 for btrfs.
   In btrfs, metadata is stored in btree_inode. Ideally, if we could 
 hook
 the inode to a fd so we could use existing syscalls (readahead, 
 mincore
 or upcoming fincore) to do readahead, but the inode is hidden, there 
 is
 no easy way for this from my understanding. So we add two ioctls for

If that is the main obstacle, why not do straightforward fincore()/
fadvise(), and add ioctls to btrfs to export/grab the hidden
btree_inode in any form?  This will address btrfs' specific issue, and
have the benefit of making the VFS part general enough. You know
ext2/3/4 already have block_dev ready for metadata readahead.
   I forgot to update this comment. Please see patch 2 and patch 4, both
   incore and readahead need btrfs specific staff involved, so we can't use
   generic fincore or something.
  
  You can if you like :)
  
  - fincore() can return the referenced bit, which is generally
useful information
 metadata page in ext2/3 doesn't have reference bit set, while btrfs has.
 we can't blindly filter out such pages with the bit.

block_dev inodes have the accessed bits. Look at the below output.

/dev/sda5 is a mounted ext4 partition.  The 'A'/'R' in the
dump_page_cache lines stand for Active/Referenced.

r...@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
r...@bay /home/wfg# cat /debug/tracing/trace
# tracer: nop
#
#   TASK-PIDCPU#TIMESTAMP  FUNCTION
#  | |   |  | |
 zsh-2950  [003]   879.500764: dump_inode_cache:0  
55643986944  170393621879 D___  BLKmount /dev/sda5
 zsh-2950  [003]   879.500774: dump_page_cache:0  2 
___AR_P20
 zsh-2950  [003]   879.500776: dump_page_cache:2  3 
R_P20
 zsh-2950  [003]   879.500777: dump_page_cache: 1026  5 
___AR_P20
 zsh-2950  [003]   879.500778: dump_page_cache: 1031  3 
___A__P20
 zsh-2950  [003]   879.500779: dump_page_cache: 1034  1 
___AR_P20
 zsh-2950  [003]   879.500780: dump_page_cache: 1035  2 
___A__P20
 zsh-2950  [003]   879.500781: dump_page_cache: 1037  1 
___AR_P20
 zsh-2950  [003]   879.500782: dump_page_cache: 1038  3 
R_P20
 zsh-2950  [003]   879.500782: dump_page_cache: 1041  1 
___A__P20
 zsh-2950  [003]   879.500783: dump_page_cache: 1057  1 
___AR_D___P20
 zsh-2950  [003]   879.500788: dump_page_cache: 1058  6 
___A__P20
 zsh-2950  [003]   879.500788: dump_page_cache: 9249  1 
___AR_P20
 zsh-2950  [003]   879.500789: dump_page_cache:   524289  1 
R_P20
 zsh-2950  [003]   879.500790: dump_page_cache:   524290  2 
___A__P20
 zsh-2950  [003]   879.500790: dump_page_cache:   524292  1 
___AR_P20
 zsh-2950  [003]   879.500791: dump_page_cache:   524293  1 
___A__P20
 zsh-2950  [003]   879.500796: dump_page_cache:   524294  9 
R_P20
 zsh-2950  [003]   879.500797: dump_page_cache:   524303  1 
___A__P20
 zsh-2950  [003]   879.500798: dump_page_cache:   987136  1 
___AR_P20
 zsh-2950  [003]   879.500798: dump_page_cache:  1048576  1 
R_P20
 zsh-2950  [003]   879.500799: dump_page_cache:  1048577  2 
___A__P20
 zsh-2950  [003]   879.500800: dump_page_cache:  1048579  1 
___AR_P20
 zsh-2950  [003]   879.500801: dump_page_cache:  1048580  5 
___A__P20
 zsh-2950  [003]   879.500802: dump_page_cache:  1048585  1 
___AR_P20
 zsh-2950  [003]   879.500805: dump_page_cache:  1048586  5 
___A__P20
 zsh-2950  [003]   879.500805: dump_page_cache:  1048591  1 
___AR_P

Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs

2011-01-10 Thread Shaohua Li
On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
 On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
  On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
   On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
 Shaohua,

 On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
  Hi,
We have file readahead to do async file read, but has no metadata
  readahead. For a list of files, their metadata is stored in 
  fragmented
  disk space and metadata read is a sync operation, which impacts the
  efficiency of readahead much. The patches try to add metadata
  readahead
  for btrfs.
In btrfs, metadata is stored in btree_inode. Ideally, if we could 
  hook
  the inode to a fd so we could use existing syscalls (readahead, 
  mincore
  or upcoming fincore) to do readahead, but the inode is hidden, 
  there is
  no easy way for this from my understanding. So we add two ioctls for

 If that is the main obstacle, why not do straightforward fincore()/
 fadvise(), and add ioctls to btrfs to export/grab the hidden
 btree_inode in any form?  This will address btrfs' specific issue, and
 have the benefit of making the VFS part general enough. You know
 ext2/3/4 already have block_dev ready for metadata readahead.
I forgot to update this comment. Please see patch 2 and patch 4, both
incore and readahead need btrfs specific staff involved, so we can't use
generic fincore or something.
  
   You can if you like :)
  
   - fincore() can return the referenced bit, which is generally
 useful information
  metadata page in ext2/3 doesn't have reference bit set, while btrfs has.
  we can't blindly filter out such pages with the bit.
 
 block_dev inodes have the accessed bits. Look at the below output.
 
 /dev/sda5 is a mounted ext4 partition.  The 'A'/'R' in the
 dump_page_cache lines stand for Active/Referenced.
Does ext4 already do readahead? Please check other filesystems.
Filesystems use bread-like APIs to read metadata, which definitely
don't set the referenced bit.

 r...@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
 r...@bay /home/wfg# cat /debug/tracing/trace
 # tracer: nop
 #
 #   TASK-PIDCPU#TIMESTAMP  FUNCTION
 #  | |   |  | |
  zsh-2950  [003]   879.500764: dump_inode_cache:0  
 55643986944  170393621879 D___  BLKmount /dev/sda5
  zsh-2950  [003]   879.500774: dump_page_cache:0  
 2 ___AR_P20
  zsh-2950  [003]   879.500776: dump_page_cache:2  
 3 R_P20
  zsh-2950  [003]   879.500777: dump_page_cache: 1026  
 5 ___AR_P20
  zsh-2950  [003]   879.500778: dump_page_cache: 1031  
 3 ___A__P20
  zsh-2950  [003]   879.500779: dump_page_cache: 1034  
 1 ___AR_P20
  zsh-2950  [003]   879.500780: dump_page_cache: 1035  
 2 ___A__P20
  zsh-2950  [003]   879.500781: dump_page_cache: 1037  
 1 ___AR_P20
  zsh-2950  [003]   879.500782: dump_page_cache: 1038  
 3 R_P20
  zsh-2950  [003]   879.500782: dump_page_cache: 1041  
 1 ___A__P20
  zsh-2950  [003]   879.500783: dump_page_cache: 1057  
 1 ___AR_D___P20
  zsh-2950  [003]   879.500788: dump_page_cache: 1058  
 6 ___A__P20
  zsh-2950  [003]   879.500788: dump_page_cache: 9249  
 1 ___AR_P20
  zsh-2950  [003]   879.500789: dump_page_cache:   524289  
 1 R_P20
  zsh-2950  [003]   879.500790: dump_page_cache:   524290  
 2 ___A__P20
  zsh-2950  [003]   879.500790: dump_page_cache:   524292  
 1 ___AR_P20
  zsh-2950  [003]   879.500791: dump_page_cache:   524293  
 1 ___A__P20
  zsh-2950  [003]   879.500796: dump_page_cache:   524294  
 9 R_P20
  zsh-2950  [003]   879.500797: dump_page_cache:   524303  
 1 ___A__P20
  zsh-2950  [003]   879.500798: dump_page_cache:   987136  
 1 ___AR_P20
  zsh-2950  [003]   879.500798: dump_page_cache:  1048576  
 1 R_P20
  zsh-2950  [003]   879.500799: dump_page_cache:  1048577  
 2 ___A__P20
  zsh-2950  [003]   879.500800: dump_page_cache:  1048579  
 1 ___AR_P20
  zsh-2950  [003]   879.500801: dump_page_cache:  1048580  
 5 ___A__P