2.6.37 BUG at inode.c:1616 (was Re: 2.6.37: Bug on btrfs while umount)
On Thu, Jan 06, 2011 at 08:29:12PM -0500, Chris Mason wrote:
> > [50010.838804] [ cut here ]
> > [50010.838931] kernel BUG at fs/btrfs/inode.c:1616!
> > [50010.839053] invalid opcode: [#1] PREEMPT SMP
> > [snip]
> > [50010.839653] Pid: 1681, comm: btrfs-endio-wri Not tainted 2.6.37 #1
>
> Could you please pull from the master branch of the btrfs unstable
> tree. We had a late fix that is related to this.

I saw the BUG at inode.c:1616 while running 2.6.37-rc6-11882-g55ec86f. I
saw your message and upgraded to Linus tip (0c21e3a) + btrfs-unstable tip
(65e5341), and I just saw it again. Including both BUG traces below.

The machine is a Core i7 with 12GB, with btrfs spanning three volumes:

Label: btr  uuid: 1271de53-b3d2-4d68-9d48-b19487e1c982
	Total devices 3 FS bytes used 735.97GB
	devid 1 size 18.65GB used 18.64GB path /dev/sda2
	devid 2 size 512.00GB used 511.88GB path /dev/sdb1
	devid 3 size 512.00GB used 225.26GB path /dev/sdc1

The primary writer to the filesystem is rtorrent; normally I have ffmpeg
writing to the filesystem at about 100 kbyte/sec as well, but it wasn't
running in this latest crash.

[ 9275.240027] [ cut here ]
[ 9275.249991] kernel BUG at fs/btrfs/inode.c:1616!
[ 9275.259914] invalid opcode: [#1] SMP
[ 9275.269794] last sysfs file: /sys/devices/pci:00/:00:1a.7/usb1/1-4/1-4:1.0/host8/target8:0:0/8:0:0:0/block/sdd/stat
[ 9275.280066] CPU 0
[ 9275.280127] Modules linked in: tun ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm xfs exportfs loop snd_hda_codec_hdmi snd_hda_codec_realtek radeon ttm drm_kms_helper drm snd_hda_intel snd_hda_codec i2c_algo_bit snd_usb_audio uvcvideo snd_hwdep i2c_i801 snd_usbmidi_lib snd_pcm snd_rawmidi snd_timer videodev snd_seq_device snd v4l2_compat_ioctl32 pcspkr i2c_core serio_raw soundcore snd_page_alloc processor tpm_tis tpm tpm_bios evdev shpchp button thermal_sys ext3 jbd mbcache dm_mod btrfs zlib_deflate crc32c libcrc32c usb_storage uas sd_mod crc_t10dif ehci_hcd usbcore ahci libahci libata r8169 scsi_mod mii nls_base [last unloaded: scsi_wait_scan]
[ 9275.358450]
[ 9275.369821] Pid: 3654, comm: btrfs-endio-wri Not tainted 2.6.37-03739-gccda756 #73 MSI X58 Pro-E (MS-7522)/MS-7522
[ 9275.381570] RIP: 0010:[a0152824]  [a0152824] T.1234+0x76/0x201 [btrfs]
[ 9275.393380] RSP: 0018:88025f275c30  EFLAGS: 00010286
[ 9275.405100] RAX: ffe4 RBX: 88032b596b40 RCX: 88032b596c60
[ 9275.416865] RDX: RSI: ea000b17b8d0 RDI: fff4
[ 9275.428666] RBP: 88025f275cc0 R08: 0005 R09: 88025f2759a0
[ 9275.440522] R10: 88025f275970 R11: dead00100100 R12: 880083a7e888
[ 9275.452374] R13: 06e0c000 R14: 880331d7c800 R15: 8800bb38d880
[ 9275.464146] FS: () GS:8800bf40() knlGS:
[ 9275.475923] CS: 0010 DS: ES: CR0: 8005003b
[ 9275.487673] CR2: 7f3eada57000 CR3: 01603000 CR4: 26e0
[ 9275.499547] DR0: DR1: DR2:
[ 9275.511395] DR3: DR6: 0ff0 DR7: 0400
[ 9275.523192] Process btrfs-endio-wri (pid: 3654, threadinfo 88025f274000, task 88032554b020)
[ 9275.535168] Stack:
[ 9275.546852]  06e0c000 1000 00b8c1376000 1000
[ 9275.558616]  880331d7c800 0001 8800bb38d880 880331d7c800
[ 9275.570416]  88025f275cb0 a014b53f 88025f275ce0 8802e21ad7f0
[ 9275.582149] Call Trace:
[ 9275.593731]  [a014b53f] ? start_transaction+0x1a9/0x1d8 [btrfs]
[ 9275.605513]  [a0152e1e] btrfs_finish_ordered_io+0x1e6/0x2c2 [btrfs]
[ 9275.617426]  [a0152f14] btrfs_writepage_end_io_hook+0x1a/0x1c [btrfs]
[ 9275.629403]  [a0166871] end_bio_extent_writepage+0xae/0x159 [btrfs]
[ 9275.641463]  [81125947] bio_endio+0x2d/0x2f
[ 9275.653462]  [a01470a0] end_workqueue_fn+0x111/0x120 [btrfs]
[ 9275.665484]  [a016ecc2] worker_loop+0x195/0x4c4 [btrfs]
[ 9275.677451]  [a016eb2d] ? worker_loop+0x0/0x4c4 [btrfs]
[ 9275.689317]  [a016eb2d] ? worker_loop+0x0/0x4c4 [btrfs]
[ 9275.701079]  [81061a8b] kthread+0x82/0x8a
[ 9275.712839]  [8100aaa4] kernel_thread_helper+0x4/0x10
[ 9275.724455]  [81061a09] ? kthread+0x0/0x8a
[ 9275.735873]  [8100aaa0] ? kernel_thread_helper+0x0/0x10
[ 9275.747329] Code: 0f 0b eb fe 80 88 88 00 00 00 08 45 31 c9 48 8b 4d 88 4c 8d 45 c0 4c 01 e9 4c 89 ea 4c 89 e6 4c 89 ff e8 7c 4c 00 00 85 c0 74 04 0f 0b eb fe 49 8b 84 24 a8 00 00 00 4c 89 6d a9 48 89 45 a0 c6
[ 9275.771177] RIP  [a0152824] T.1234+0x76/0x201 [btrfs]
[ 9275.782973] RSP 88025f275c30
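For anyone chasing a trace like this: a symbol+offset pair such as
btrfs_finish_ordered_io+0x1e6 can usually be mapped back to a source line
with gdb. A minimal sketch, assuming the btrfs module was built with debug
info (CONFIG_DEBUG_INFO) and matches the running kernel; the module path
is an assumption:

  # Sketch only: resolve an oops symbol+offset to a source line.
  gdb -batch -ex 'list *(btrfs_finish_ordered_io+0x1e6)' fs/btrfs/btrfs.ko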
Re: Synching a Backup Server
On Sunday 09 of January 2011 12:46:59 Alan Chandler wrote:
> On 07/01/11 16:20, Hubert Kario wrote:
> > I usually create subvolumes in the btrfs root volume:
> >
> > /mnt/btrfs/
> > |- server-a
> > |- server-b
> > \- server-c
> >
> > then create snapshots of these directories:
> >
> > /mnt/btrfs/
> > |- server-a
> > |- server-b
> > |- server-c
> > |- snapshots-server-a
> > |  |- @GMT-2010.12.21-16.48.09
> > |  \- @GMT-2010.12.22-16.45.14
> > |- snapshots-server-b
> > \- snapshots-server-c
> >
> > This way I can use the shadow_copy module for samba to publish the
> > snapshots to windows clients.
>
> Can you post some actual commands to do this part?

# create the default subvolume and mount it
mkfs.btrfs /dev/sdx
mount /dev/sdx /mnt/btrfs

# to be able to snapshot individual servers we have to put them in
# individual subvolumes
btrfs subvolume create /mnt/btrfs/server-a
btrfs subvolume create /mnt/btrfs/server-b
btrfs subvolume create /mnt/btrfs/server-c

# copy data over
rsync --exclude /proc [...] r...@server-a:/ /mnt/btrfs/server-a
rsync --exclude /proc [...] r...@server-b:/ /mnt/btrfs/server-b
rsync --exclude /proc [...] r...@server-c:/ /mnt/btrfs/server-c

# create snapshot directories (in the default subvolume)
mkdir /mnt/btrfs/{snapshots-server-a,snapshots-server-b,snapshots-server-c}

# create a snapshot from the synced data:
btrfs subvolume snapshot /mnt/btrfs/server-a \
    /mnt/btrfs/snapshots-server-a/@GMT-2010.12.21-16.48.09

# copy new data over:
rsync --inplace --exclude /proc [...] r...@server-a:/ /mnt/btrfs/server-a

# make a new snapshot
btrfs subvolume snapshot /mnt/btrfs/server-a \
    /mnt/btrfs/snapshots-server-a/@GMT-2010.12.22-16.45.14

In the end we have 5 subvolumes, 2 of which are snapshots of server-a.

> I am extremely confused about btrfs subvolumes vs the root filesystem
> and mounting, particularly in relation to the default subvolume.
>
> For instance, if I create the initial file system using mkfs.btrfs and
> then mount it on /mnt/btrfs, is there already a default subvolume? Or
> do I have to make one? What happens when you unmount the whole
> filesystem and then come back?
>
> The wiki also makes the following statement:
>
> *Note:* to be mounted the subvolume or snapshot have to be in the root
> of the btrfs filesystem.
>
> but you seem to have snapshots one layer down from the root.
>
> I am trying to use this method for my offsite backups - to a large
> spare sata disk loaded via a usb port. I want to create the main
> filesystem (and possibly a subvolume - this is where I start to get
> confused) and rsync my current daily backup files to it. I would then
> also (just so I get the correct time - rather than do it at the next
> cycle, as explained below) take a snapshot with a time label. I would
> transport this disk offsite. I would repeat this in a month's time with
> a totally different disk.
>
> In a couple of months' time - when I come to recycle the first disk for
> my offsite backup - I would mount the retrieved disk (and again I am
> confused - mount the complete filesystem or the subvolume?), rsync the
> various backup files from my server again (--inplace? is this
> necessary?) and take another snapshot.

You mount the default subvolume; this way you have access to all the data
on the HDD. --inplace is necessary.

> I am hoping that this would effectively allow me to leave the snapshot
> I took last time in place, and because not everything will have changed
> it won't have used much space - so effectively I can keep quite a long
> stream of backup snapshots in place offsite.

Yes.

> Eventually of course the disk will start to become full, but I assume I
> can reclaim the space by deleting some of the old snapshots.

Yes, of course:

btrfs subvolume delete /mnt/btrfs/snapshots-server-a/@GMT-2010.12.21-16.48.09

will reclaim the space used up by the deltas.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
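The cycle Hubert describes is easy to automate. A minimal sketch of the
rsync-then-snapshot loop (mine, not his actual script; the server names,
mount point, root@ logins and exclude list are assumptions lifted from the
example above):

  #!/bin/sh
  # Refresh each server's working copy in place, then snapshot it under a
  # shadow_copy-style @GMT-... name so samba can publish the snapshots.
  BASE=/mnt/btrfs
  for srv in server-a server-b server-c; do
      rsync -a --inplace --exclude /proc "root@$srv:/" "$BASE/$srv"
      btrfs subvolume snapshot "$BASE/$srv" \
          "$BASE/snapshots-$srv/@GMT-$(date -u +%Y.%m.%d-%H.%M.%S)"
  done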
Backup Command
Here is my proposed cron:

btrfs subvolume snapshot hex:///home /media/backups/snapshots/hex-{DATE}
rsync --archive --hard-links --delete-during --delete-excluded --inplace \
    --numeric-ids -e ssh --exclude-from=/media/backups/exclude-hex \
    hex:///home /media/backups/hex

btrfs subvolume snapshot droog:///home /media/backups/snapshots/droog-{DATE}
rsync --archive --hard-links --delete-during --delete-excluded --inplace \
    --numeric-ids -e ssh --exclude-from=/media/backups/exclude-droog \
    droog:///home /media/backups/droog

Comments? Criticisms?
Re: Backup Command
On Monday 10 of January 2011 14:25:32 Carl Cook wrote:
> Here is my proposed cron:
>
> btrfs subvolume snapshot hex:///home /media/backups/snapshots/hex-{DATE}
> rsync --archive --hard-links --delete-during --delete-excluded --inplace \
>     --numeric-ids -e ssh --exclude-from=/media/backups/exclude-hex \
>     hex:///home /media/backups/hex
>
> btrfs subvolume snapshot droog:///home /media/backups/snapshots/droog-{DATE}
> rsync --archive --hard-links --delete-during --delete-excluded --inplace \
>     --numeric-ids -e ssh --exclude-from=/media/backups/exclude-droog \
>     droog:///home /media/backups/droog
>
> Comments? Criticisms?

This will make the dates associated with the snapshots offset by how often
cron is run. In other words, if you run the above script daily, you will
have data from 2011.01.01 in the hex-2011.01.02 directory.

I save the current date, take an LVM snapshot on the source, rsync the
data over with --inplace, and then take a local snapshot, naming the
folder using the saved date. This way the date in the name of the backup
directory is exact to about a second.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
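That ordering, as a sketch (my reconstruction of Hubert's description,
simplified to a single local source; the LVM volume names, sizes and paths
are placeholders, and /media/backups/hex is assumed to be a btrfs
subvolume):

  #!/bin/sh
  # Capture the timestamp before syncing, freeze the source with an LVM
  # snapshot, sync, then name the btrfs snapshot with the saved date.
  STAMP=$(date -u +%Y.%m.%d-%H.%M.%S)
  lvcreate -s -n home-backup -L 1G /dev/vg0/home
  mount -o ro /dev/vg0/home-backup /mnt/home-snap
  rsync -a --inplace /mnt/home-snap/ /media/backups/hex/
  umount /mnt/home-snap
  lvremove -f /dev/vg0/home-backup
  btrfs subvolume snapshot /media/backups/hex \
      "/media/backups/snapshots/hex-$STAMP"

This way the snapshot is named for the moment the data was frozen, not for
whenever the sync happened to finish.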
Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
Shaohua,

On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> Hi,
>
> We have file readahead to do async file reads, but no metadata
> readahead. For a list of files, their metadata is stored in fragmented
> disk space and metadata read is a sync operation, which greatly impacts
> the efficiency of readahead. The patches try to add metadata readahead
> for btrfs.
>
> In btrfs, metadata is stored in btree_inode. Ideally we could hook the
> inode to a fd so we could use existing syscalls (readahead, mincore or
> the upcoming fincore) to do readahead, but the inode is hidden and
> there is no easy way to do this as far as I understand. So we add two
> ioctls for this. One is like the readahead syscall, the other is like
> mincore/fincore. On a harddisk-based netbook with MeeGo, the metadata
> readahead cut about 3.5s of boot time on average from a total of 16s.
>
> Last time I posted similar patches to the btrfs mailing list, adding
> the new ioctls in btrfs-specific ioctl code. But Christoph Hellwig
> asked for a generic interface so other filesystems can share some code,
> so I came up with this new one. Comments and suggestions are welcome!
>
> v1->v2:
> 1. Added more comments and fixed return values, as suggested by Andrew
>    Morton
> 2. Fixed a race condition pointed out by Yan Zheng
>
> initial post: http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2

If that is the main obstacle, why not do straightforward fincore()/
fadvise(), and add ioctls to btrfs to export/grab the hidden btree_inode
in any form? This will address btrfs' specific issue, and have the benefit
of making the VFS part general enough. You know ext2/3/4 already have
block_dev ready for metadata readahead.

Thanks,
Fengguang
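Fengguang's closing point can be tried from userspace without any new
ioctl: ext* metadata lives in the block device's page cache, so simply
reading the device node warms the metadata cache. A minimal sketch; the
device name and the amount read are placeholders, and a real tool would
read only the block-group metadata regions (e.g. derived from dumpe2fs
output):

  # Illustrative only: warm ext* metadata by reading the block device.
  dumpe2fs -h /dev/sda5                          # inspect the layout first
  dd if=/dev/sda5 of=/dev/null bs=1M count=64    # populate the bdev page cache

btrfs cannot use this trick because its metadata inode (btree_inode) is
not user-visible, which is the gap the proposed ioctls are meant to fill.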
Re: Offline Deduplication for Btrfs
I think that dedup has a variety of use cases that are all very dependent
on your workload. The approach you have here seems to be a quite
reasonable one.

I did not see it in the code, but it is great to be able to collect
statistics on how effective your hash is, and any counters for the extra
IO imposed.

It is also very useful to have a paranoid mode where, when you see a hash
collision (dedup candidate), you fall back to a byte-by-byte compare to
verify that the match is a real duplicate. Keeping stats on how often this
is a false collision would be quite interesting as well :)

Ric
Re: Offline Deduplication for Btrfs
On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
> I think that dedup has a variety of use cases that are all very
> dependent on your workload. The approach you have here seems to be a
> quite reasonable one.
>
> I did not see it in the code, but it is great to be able to collect
> statistics on how effective your hash is, and any counters for the
> extra IO imposed.

So I have counters for how many extents are deduped and the overall file
savings; is that what you are talking about?

> It is also very useful to have a paranoid mode where, when you see a
> hash collision (dedup candidate), you fall back to a byte-by-byte
> compare to verify that the match is a real duplicate. Keeping stats on
> how often this is a false collision would be quite interesting as
> well :)

So I've always done a byte-by-byte compare, first in userspace but now in
kernel, because frankly I don't trust hashing algorithms with my data. It
would be simple enough to keep statistics on how often the byte-by-byte
compare comes out wrong, but really this check is there to catch changes
to the file, so I suspect most of these statistics would show that the
file changed, not that the hash collided.

Thanks,

Josef
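The userspace version of that paranoid flow fits in a few lines of shell.
A toy sketch (my illustration, not Josef's actual dedup code; it assumes
filenames without whitespace, and sha256sum stands in for whatever hash
the patchset really uses):

  # Group files by content hash, then verify each candidate pair
  # byte-by-byte with cmp before trusting it as a duplicate.
  find /some/dir -type f -print0 | xargs -0 sha256sum | sort |
  awk 'prev == $1 { print prevfile, $2 } { prev = $1; prevfile = $2 }' |
  while read -r a b; do
      if cmp -s -- "$a" "$b"; then
          echo "verified duplicate: $a $b"
      else
          echo "hash collision or concurrent change: $a $b"
      fi
  done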
Re: Offline Deduplication for Btrfs
Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
> On Mon, Jan 10, 2011 at 10:28:14AM -0500, Ric Wheeler wrote:
> > [...]
> > It is also very useful to have a paranoid mode where, when you see a
> > hash collision (dedup candidate), you fall back to a byte-by-byte
> > compare to verify that the match is a real duplicate.
>
> So I've always done a byte-by-byte compare, first in userspace but now
> in kernel, because frankly I don't trust hashing algorithms with my
> data. It would be simple enough to keep statistics on how often the
> byte-by-byte compare comes out wrong, but really this check is there to
> catch changes to the file, so I suspect most of these statistics would
> show that the file changed, not that the hash collided.

At least in the kernel, if you're comparing extents on disk that are from
a committed transaction, the contents won't change. We could read into a
private buffer instead of into the file's address space to make this more
reliable/strict.

-chris
Re: Offline Deduplication for Btrfs
On Mon, Jan 10, 2011 at 10:39:56AM -0500, Chris Mason wrote:
> Excerpts from Josef Bacik's message of 2011-01-10 10:37:31 -0500:
> > [...]
> > So I've always done a byte-by-byte compare, first in userspace but
> > now in kernel, because frankly I don't trust hashing algorithms with
> > my data.
>
> At least in the kernel, if you're comparing extents on disk that are
> from a committed transaction, the contents won't change. We could read
> into a private buffer instead of into the file's address space to make
> this more reliable/strict.

Right, sorry, I was talking about the userspace case.

Thanks,

Josef
Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > If that is the main obstacle, why not do straightforward fincore()/
> > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > btree_inode in any form? This will address btrfs' specific issue,
> > > and have the benefit of making the VFS part general enough. You
> > > know ext2/3/4 already have block_dev ready for metadata readahead.
> >
> > I forgot to update this comment. Please see patch 2 and patch 4; both
> > incore and readahead need btrfs-specific stuff involved, so we can't
> > use a generic fincore or something like it.
>
> You can if you like :)
>
> - fincore() can return the referenced bit, which is generally useful
>   information

Metadata pages in ext2/3 don't have the referenced bit set, while btrfs's
do, so we can't blindly filter out such pages with the bit. fincore could
take a parameter, or return a bit to distinguish referenced pages, but I
don't think that's a good API - this should be transparent to userspace.

> - btrfs_metadata_readahead() can be passed to some (faked) ->readpages()
>   for use with fadvise.

This needs a filesystem-specific hook too; the difference is that your
proposal uses fadvise while I'm using an ioctl, and there isn't a big
difference between them. BTW, it's hard to hook the btrfs_inode to a fd
even with an ioctl; at least I didn't find an easy way to do it. It might
be possible, for example, by adding a fake device or fake fs (anon_inode
doesn't work here, IIRC), which is a bit ugly. Until it's proved that a
generic API can handle metadata readahead, I don't want to do it.

Thanks,
Shaohua
Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > [...]
> > - fincore() can return the referenced bit, which is generally useful
> >   information
>
> Metadata pages in ext2/3 don't have the referenced bit set, while
> btrfs's do, so we can't blindly filter out such pages with the bit.

block_dev inodes have the accessed bits. Look at the output below:
/dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the dump_page_cache
lines stand for Active/Referenced.

r...@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
r...@bay /home/wfg# cat /debug/tracing/trace
# tracer: nop
#
#       TASK-PID    CPU#    TIMESTAMP  FUNCTION
#          | |       |          |         |
     zsh-2950  [003]   879.500764: dump_inode_cache: 0 55643986944 170393621879 D___ BLKmount /dev/sda5
     zsh-2950  [003]   879.500774: dump_page_cache: 0 2 ___AR_P20
     zsh-2950  [003]   879.500776: dump_page_cache: 2 3 R_P20
     zsh-2950  [003]   879.500777: dump_page_cache: 1026 5 ___AR_P20
     zsh-2950  [003]   879.500778: dump_page_cache: 1031 3 ___A__P20
     zsh-2950  [003]   879.500779: dump_page_cache: 1034 1 ___AR_P20
     zsh-2950  [003]   879.500780: dump_page_cache: 1035 2 ___A__P20
     zsh-2950  [003]   879.500781: dump_page_cache: 1037 1 ___AR_P20
     zsh-2950  [003]   879.500782: dump_page_cache: 1038 3 R_P20
     zsh-2950  [003]   879.500782: dump_page_cache: 1041 1 ___A__P20
     zsh-2950  [003]   879.500783: dump_page_cache: 1057 1 ___AR_D___P20
     zsh-2950  [003]   879.500788: dump_page_cache: 1058 6 ___A__P20
     zsh-2950  [003]   879.500788: dump_page_cache: 9249 1 ___AR_P20
     zsh-2950  [003]   879.500789: dump_page_cache: 524289 1 R_P20
     zsh-2950  [003]   879.500790: dump_page_cache: 524290 2 ___A__P20
     zsh-2950  [003]   879.500790: dump_page_cache: 524292 1 ___AR_P20
     zsh-2950  [003]   879.500791: dump_page_cache: 524293 1 ___A__P20
     zsh-2950  [003]   879.500796: dump_page_cache: 524294 9 R_P20
     zsh-2950  [003]   879.500797: dump_page_cache: 524303 1 ___A__P20
     zsh-2950  [003]   879.500798: dump_page_cache: 987136 1 ___AR_P20
     zsh-2950  [003]   879.500798: dump_page_cache: 1048576 1 R_P20
     zsh-2950  [003]   879.500799: dump_page_cache: 1048577 2 ___A__P20
     zsh-2950  [003]   879.500800: dump_page_cache: 1048579 1 ___AR_P20
     zsh-2950  [003]   879.500801: dump_page_cache: 1048580 5 ___A__P20
     zsh-2950  [003]   879.500802: dump_page_cache: 1048585 1 ___AR_P20
     zsh-2950  [003]   879.500805: dump_page_cache: 1048586 5 ___A__P20
     zsh-2950  [003]   879.500805: dump_page_cache: 1048591 1 ___AR_P
Re: [PATCH v2 0/5] add new ioctls to do metadata readahead in btrfs
On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > [...]
> > Metadata pages in ext2/3 don't have the referenced bit set, while
> > btrfs's do, so we can't blindly filter out such pages with the bit.
>
> block_dev inodes have the accessed bits. Look at the output below:
> /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
> dump_page_cache lines stand for Active/Referenced.

ext4 already does readahead? Please check other filesystems: filesystems
use bread-like APIs to read metadata, which definitely don't set the
referenced bit.
> [snip]