Re: [PATCH] xfstests 276: fix error 'FIBMAP: Invalid argument'
HI, This issue still exists when xfstests is run on debian although xfstests has contained this patch. root@debian-i386:/home/zwu/xfstests# git describe linux-v3.8-122-gdc75401 root@debian-i386:/home/zwu/xfstests# uname -a Linux debian-i386 3.10.0-rc2+ #3 SMP Sun May 26 14:04:13 CST 2013 x86_64 GNU/Linux root@debian-i386:/home/zwu/xfstests# - output mismatch (see /home/zwu/xfstests/results/btrfs/276.out.bad) --- tests/btrfs/276.out2013-05-25 15:09:01.0 + +++ /home/zwu/xfstests/results/btrfs/276.out.bad2013-05-26 09:31:49.962876905 + @@ -1,3 +1,32393 @@ QA output created by 276 *** test backref walking +FIBMAP: Invalid argument +FIBMAP: Invalid argument +FIBMAP: Invalid argument +FIBMAP: Invalid argument +FIBMAP: Invalid argument ... (Run 'diff -u tests/btrfs/276.out /home/zwu/xfstests/results/btrfs/276.out.bad' to see the entire diff) [ 3877.215268] device fsid 8367602a-14fa-49eb-aff0-44da1faa3b45 devid 1 transid 239 /dev/vdb [ 3877.218417] btrfs: disk space caching is enabled Ran: btrfs/276 Failures: btrfs/276 Failed 1 of 1 tests On Wed, Mar 6, 2013 at 2:43 AM, Rich Johnston rjohns...@sgi.com wrote: This patch has been committed. --Rich commit eba3a5206cd7f8df73d6e6dbf2bb0afca677fca8 Author: Wang Sheng-Hui shh...@gmail.com Date: Thu Feb 28 02:05:56 2013 + xfstests 276: fix error 'FIBMAP: Invalid argument' Commit 05dadc0 (Btrfs: add fiemap's flag check) added validity checks to the fiemap flags and hence invalid flags are now being rejected. Test 276 passes an invalid fiemap flag to btrfs, and so that test fails since this commit was added. Btrfs doesn't support FIEMAP_FLAG_XATTR, which is enabled by -x option of filefrag, and will fail with 'FIBMAP: Invalid argument' for 'filefrag -vx'. 'filefrag -vx' fails on btrfs with 'FIEMAP failed with unsupported flags 2' Remove the '-x' option. ___ xfs mailing list x...@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/12] VFS hot tracking
HI, Al Viro. I have incorporated all comments from all reviewers and waited for so long time. If you have no comments, can you merge the patchset? thanks. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/12] VFS hot tracking
Ping... On Tue, May 14, 2013 at 8:59 AM, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com The patchset is trying to introduce hot tracking function in VFS layer, which will keep track of real disk I/O in memory. By it, you will easily know more details about disk I/O, and then detect where disk I/O hot spots are. Also, specific FS can take use of it to do accurate defragment, and hot relocation support, etc. After V1 was sent out, Chandra Seetharaman has reviewed and made a lot of comments, thanks a lot to him. Not it's time to send out its V2 for external review, any comments or ideas are appreciated, thanks. NOTE: The patchset can be obtained via my kernel dev git on github: git://github.com/wuzhy/kernel.git hot_tracking If you're interested, you can also review them via https://github.com/wuzhy/kernel/commits/hot_tracking For how to use and more other info and performance report, please check hot_tracking.txt in Documentation and following links: 1.) http://lwn.net/Articles/525651/ 2.) https://lkml.org/lkml/2012/12/20/199 Changelog from v1: - Refactored to be under RCU [Chandra Seetharaman] - Merged some code changes [Chandra Seetharaman] - Fixed some issues [Chandra Seetharaman] v1: - Solved 64 bits inode number issue. [David Sterba] - Embed struct hot_type in struct file_system_type [Darrick J. Wong] - Cleanup Some issues [David Sterba] - Use a static hot debugfs root [Greg KH] rfcv4: - Introduce hot func registering framework [Zhiyong] - Remove global variable for hot tracking [Zhiyong] - Add btrfs hot tracking support [Zhiyong] rfcv3: 1.) Rewritten debugfs support based seq_file operation. [Dave Chinner] 2.) Refactored workqueue support. [Dave Chinner] 3.) Turn some Micro into be tunable [Zhiyong, Liu Zheng] TIME_TO_KICK, and HEAT_UPDATE_DELAY 4.) Cleanedup a lot of other issues [Dave Chinner] rfcv2: 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner] 2.) Added memory shrinker [Dave Chinner] 3.) Converted to one workqueue to update map info periodically [Dave Chinner] 4.) Cleanedup a lot of other issues [Dave Chinner] rfcv1: 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner] 2.) The first three patches can probably just be flattened into one. [Marco Stornelli , Dave Chinner] Zhi Yong Wu (12): VFS hot tracking: introduce some data structures VFS hot tracking: add i/o freq tracking hooks VFS hot tracking: add one workqueue to update hot map VFS hot tracking: register one shrinker VFS hot tracking, rcu: introduce one rcu macro for list VFS hot tracking, seq_file: introduce one set of rcu seq_list interfaces VFS hot tracking: add debugfs support VFS hot tracking: add one ioctl interface VFS hot tracking, procfs: add two proc interfaces VFS hot tracking, btrfs: add hot tracking support VFS hot tracking: add documentation VFS hot tracking: add fs hot type support Documentation/filesystems/00-INDEX |2 + Documentation/filesystems/hot_tracking.txt | 256 ++ fs/Makefile|2 +- fs/btrfs/ctree.h |1 + fs/btrfs/super.c | 22 +- fs/compat_ioctl.c |5 + fs/dcache.c|2 + fs/direct-io.c |5 + fs/hot_tracking.c | 1320 fs/hot_tracking.h | 67 ++ fs/ioctl.c | 70 ++ fs/namei.c |2 + fs/seq_file.c | 37 + include/linux/fs.h |5 + include/linux/hot_tracking.h | 175 include/linux/rculist.h|5 + include/linux/seq_file.h |7 + kernel/sysctl.c| 14 + mm/filemap.c |6 + mm/page-writeback.c| 12 + mm/readahead.c |6 + 21 files changed, 2019 insertions(+), 2 deletions(-) create mode 100644 Documentation/filesystems/hot_tracking.txt create mode 100644 fs/hot_tracking.c create mode 100644 fs/hot_tracking.h create mode 100644 include/linux/hot_tracking.h -- 1.7.11.7 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/5] BTRFS hot relocation support
On Sun, May 19, 2013 at 6:41 PM, Martin Steigerwald mar...@lichtvoll.de wrote: Am Donnerstag, 9. Mai 2013, 07:13:56 schrieb Zhi Yong Wu: HI, all Hi! I saw that bcache will be merged into kernel upstream soon, so i want to know if btrfs hot relocation support is still meanful, if no, i will not continue to work on it. can anyone let me know this? thanks. I really look forward to VFS hot data tracking with BTRFS (and other filesystem) support. I also really want to see this will happen. ZFS and BTRFS have shown that RAID support within the filesystem can make a lot of sense. I think hot relocation is another area which can be done more accurate within the filesystem. I think there are several features only possible by going this route: 1) Mark files as hot via ioctl. MySQL can mark InnoDB journal files for example. It has been implemented to get its hot status of one specified file via ioctl. It is not hard to enable its set function. Before it is done, i more hope that the core of VFS hot tracking can get merged by kernel upstream. You know, it is harder if too new functions have been done. We can put it in to-do list at first, what do you think? 2) Proper BTRFS RAID support (as noted elsewhere in this thread) 3) Easier setup. With BTRFS flexibility I would expect that a SSD as hot data cache can be added and removed on the fly during filesystem is mounted. Only seems supported at mkfs-time as I read the patch docs, but from my basic technical understanding of BTRFS it can be extented to be done on the fly with a mounted FS as well. Yes, it is supported only at mkfs time. It shouldn't be hard to enable adding or removing a nonrotating disk as hot cache on the fly. You know, before this is done, i would like to make sure that the current design for BTRFS hot relocation is this RFC patchset is going in the direction and is appropriate. If it is going in wrong direction, we will be doing a lot of meanless work. E.g. How about the current idea on introducing one new block group for nonrotating disks. When working on this work, i were trying to change the least current btrfs code as far as possible. You know, if we change too many current btrfs code, it will introduce regression bug more easily, and so the patchset will be also harder to get accepted by btrfs upstream. What do you think? Block based caching can make sense for other use cases like raw device with Oracle DB on it or maybe a swap device. Or for filesystems not (yet) supporting VFS hot data tracking. Thanks, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/5] BTRFS hot relocation support
On Sun, May 19, 2013 at 6:41 PM, Martin Steigerwald mar...@lichtvoll.de wrote: Am Donnerstag, 9. Mai 2013, 07:13:56 schrieb Zhi Yong Wu: HI, all Hi! I saw that bcache will be merged into kernel upstream soon, so i want to know if btrfs hot relocation support is still meanful, if no, i will not continue to work on it. can anyone let me know this? thanks. I really look forward to VFS hot data tracking with BTRFS (and other filesystem) support. ZFS and BTRFS have shown that RAID support within the filesystem can make a lot of sense. I think hot relocation is another area which can be done more accurate within the filesystem. I think there are several features only possible by going this route: 1) Mark files as hot via ioctl. MySQL can mark InnoDB journal files for example. 2) Proper BTRFS RAID support (as noted elsewhere in this thread) After the current design draft for BTRFS hot relocation in this RFC get basic agreement as btrfs community, maintainer or main developers, i will dive into this further. By the way, it is in my to-do list, thanks. 3) Easier setup. With BTRFS flexibility I would expect that a SSD as hot data cache can be added and removed on the fly during filesystem is mounted. Only seems supported at mkfs-time as I read the patch docs, but from my basic technical understanding of BTRFS it can be extented to be done on the fly with a mounted FS as well. Block based caching can make sense for other use cases like raw device with Oracle DB on it or maybe a swap device. Or for filesystems not (yet) supporting VFS hot data tracking. Thanks, -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/5] BTRFS hot relocation support
HI, You should check the patchset about VFS hot tracking https://lwn.net/Articles/550495/ On Thu, May 16, 2013 at 3:12 PM, Kai Krakow hurikhan77+bt...@gmail.com wrote: Hi! I think such a solution as part of the filesystem could do much better than something outside of it (like bcache). But I'm not sure: What makes data hot? I think the most benefit is detecting random read access and mark only those data as hot, also writes should go to the SSD first and then should be spooled to the harddisks in background. Bcache does a lot regarding this. Since this is within the filesystem, users could even mark files as being always hot with some attribute or ioctl. This could be used by a boot- readahead and preload implementation to automatically make files hot used during booting or for preloading when I start an application. On the other side hot relocation should be able to reduce writes to the SSD as good as possible, for example: Do not defragment files during autodefrag, it makes no sense. Also write data in bursts of erase block size etc. And also important: What if the SSD dies due to wearing? Will it gracefully fall back to harddisk? What does relocation mean? Files (hot data) should only be cached in copy to SSD, and not moved there. It should be possible for btrfs to just drop a failing SSD from the filesystem without data loss because otherwise one should use two SSDs in raid-1 mode to get a safe cache storage. Altogether I think that a spinning media btrfs raid can outperform a single SSD so hot relocation should probably be used to reduce head movements because this is where SSD really excels. So everything that involves heavy head movement should go to SSD first, then written back to harddisk. And I think there's a lot potential to optimize because a COW filesystem like btrfs naturally has a lot of head movement. What do you think? BTW: I have not tried the one or the other yet because I'm still deciding which way to go. Your patches are more welcome because I do not need to migrate my storage to bcache-provided block devices. OTOH the bcache implementation looks a lot more mature (with regard to performance and safety) at this point because it provides many of the above mentioned features - most importantly gracefully handling failing SSDs. Regarding btrfs raid outperforms SSD: During boot my spinning media 3 device btrfs raid reads boot files with up to 600 MB/s (from LZ compressed fs), boot takes about 7 seconds until the display manager starts (which takes another 30 seconds but that's another story), and the system is pretty crowded with services I actually wouldn't need if I optimized for boot performance. But I think systemd's read-ahead implementation has a lot influence on this fast booting: It defragments and relocates boot files on btrfs during boot so the harddisks can sequentially read all this stuff. I think it also compresses boot files if compression is enabled because booting is IO bound, not CPU bound. Benchmarks showed that my btrfs raid could technically read up to 450 MB/s, so I think the 600 MB/s counts for decompressed data. A single SSD could not do that. For that same reason I created a small script to defragment and compress files used by the preload daemon. Without benchmarking it, this felt like another small performance boost. So I'm eager what could be next with some sort of SSD cache because the only problem left seems to be heavy head movement which slows down the system. Zhi Yong Wu zwu.ker...@gmail.com schrieb: HI, What do you think if its design approach goes correctly? Do you have any comments or better design idea for BTRFS hot relocation support? any comments are appreciated, thanks. On Mon, May 6, 2013 at 4:53 PM, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com The patchset as RFC is sent out mainly to see if it goes in the correct development direction. The patchset is trying to introduce hot relocation support for BTRFS. In hybrid storage environment, when the data in HDD disk get hot, it can be relocated to SSD disk by BTRFS hot relocation support automatically; also, if SSD disk ratio exceed its upper threshold, the data which get cold can be looked up and relocated to HDD disk to make more space in SSD disk at first, and then the data which get hot will be relocated to SSD disk automatically. BTRFS hot relocation mainly reserve block space from SSD disk at first, load the hot data to page cache from HDD, allocate block space from SSD disk, and finally write the data to SSD disk. If you'd like to play with it, pls pull the patchset from my git on github: https://github.com/wuzhy/kernel.git hot_reloc For how to use, please refer too the example below: root@debian-i386:~# echo 0 /sys/block/vdc/queue/rotational ^^^ Above command will hack /dev/vdc to be one SSD disk root@debian-i386:~# echo 99 /proc/sys/fs/hot-age
Re: [RFC 0/5] BTRFS hot relocation support
HI, What do you think if its design approach goes correctly? Do you have any comments or better design idea for BTRFS hot relocation support? any comments are appreciated, thanks. On Mon, May 6, 2013 at 4:53 PM, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com The patchset as RFC is sent out mainly to see if it goes in the correct development direction. The patchset is trying to introduce hot relocation support for BTRFS. In hybrid storage environment, when the data in HDD disk get hot, it can be relocated to SSD disk by BTRFS hot relocation support automatically; also, if SSD disk ratio exceed its upper threshold, the data which get cold can be looked up and relocated to HDD disk to make more space in SSD disk at first, and then the data which get hot will be relocated to SSD disk automatically. BTRFS hot relocation mainly reserve block space from SSD disk at first, load the hot data to page cache from HDD, allocate block space from SSD disk, and finally write the data to SSD disk. If you'd like to play with it, pls pull the patchset from my git on github: https://github.com/wuzhy/kernel.git hot_reloc For how to use, please refer too the example below: root@debian-i386:~# echo 0 /sys/block/vdc/queue/rotational ^^^ Above command will hack /dev/vdc to be one SSD disk root@debian-i386:~# echo 99 /proc/sys/fs/hot-age-interval root@debian-i386:~# echo 10 /proc/sys/fs/hot-update-interval root@debian-i386:~# echo 10 /proc/sys/fs/hot-reloc-interval root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL WARNING! - see http://btrfs.wiki.kernel.org before using [ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb [ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc [ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb [ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb [ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc adding device /dev/vdc id 2 [ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc fs created label (null) on /dev/vdb nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB Btrfs v0.20-rc1-254-gb0136aa-dirty root@debian-i386:~# mount -o hot_move /dev/vdb /data2 [ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb [ 144.870444] btrfs: disk space caching is enabled [ 144.904214] VFS: Turning on hot data tracking root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s root@debian-i386:~# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 16G 13G 2.2G 86% / tmpfs 4.8G 0 4.8G 0% /lib/init/rw udev 10M 176K 9.9M 2% /dev tmpfs 4.8G 0 4.8G 0% /dev/shm /dev/vdb 15G 2.0G 13G 14% /data2 root@debian-i386:~# btrfs fi df /data2 Data: total=3.01GB, used=2.00GB System: total=4.00MB, used=4.00KB Metadata: total=8.00MB, used=2.19MB Data_SSD: total=8.00MB, used=0.00 root@debian-i386:~# echo 108 /proc/sys/fs/hot-reloc-threshold ^^^ Above command will start HOT RLEOCATE, because The data temperature is currently 109 root@debian-i386:~# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 16G 13G 2.2G 86% / tmpfs 4.8G 0 4.8G 0% /lib/init/rw udev 10M 176K 9.9M 2% /dev tmpfs 4.8G 0 4.8G 0% /dev/shm /dev/vdb 15G 2.1G 13G 14% /data2 root@debian-i386:~# btrfs fi df /data2 Data: total=3.01GB, used=6.25MB System: total=4.00MB, used=4.00KB Metadata: total=8.00MB, used=2.26MB Data_SSD: total=2.01GB, used=2.00GB root@debian-i386:~# Zhi Yong Wu (5): vfs: add one list_head field btrfs: add one new block group btrfs: add one hot relocation kthread procfs: add three proc interfaces btrfs: add hot relocation support fs/btrfs/Makefile| 3 +- fs/btrfs/ctree.h | 26 +- fs/btrfs/extent-tree.c | 107 +- fs/btrfs/extent_io.c | 31 +- fs/btrfs/extent_io.h | 4 + fs/btrfs/file.c | 36 +- fs/btrfs/hot_relocate.c | 802 +++ fs/btrfs/hot_relocate.h | 48 +++ fs/btrfs/inode-map.c | 13 +- fs/btrfs/inode.c | 92 - fs/btrfs/ioctl.c | 23 +- fs/btrfs/relocation.c| 14 +- fs/btrfs/super.c | 30 +- fs/btrfs/volumes.c | 28 +- fs/hot_tracking.c| 1 + include/linux/btrfs.h| 4 + include/linux/hot_tracking.h | 1 + kernel/sysctl.c | 22 ++ 18 files changed, 1234 insertions(+), 51 deletions(-) create mode 100644 fs/btrfs/hot_relocate.c create mode 100644 fs/btrfs/hot_relocate.h
Re: [RFC 0/5] BTRFS hot relocation support
btrfs maintainer's opinion is very important, i guess. On Thu, May 9, 2013 at 2:30 PM, Stefan Behrens sbehr...@giantdisaster.de wrote: On 05/09/2013 01:13, Zhi Yong Wu wrote: HI, all I saw that bcache will be merged into kernel upstream soon, so i want to know if btrfs hot relocation support is still meanful, if no, i will not continue to work on it. can anyone let me know this? thanks. Which one is better? Please do some measurements. Select typical file system use cases, and publish and compare the measurement results of the two approaches. On Mon, May 6, 2013 at 4:53 PM, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com The patchset as RFC is sent out mainly to see if it goes in the correct development direction. The patchset is trying to introduce hot relocation support for BTRFS. In hybrid storage environment, when the data in HDD disk get hot, it can be relocated to SSD disk by BTRFS hot relocation support automatically; also, if SSD disk ratio exceed its upper threshold, the data which get cold can be looked up and relocated to HDD disk to make more space in SSD disk at first, and then the data which get hot will be relocated to SSD disk automatically. BTRFS hot relocation mainly reserve block space from SSD disk at first, load the hot data to page cache from HDD, allocate block space from SSD disk, and finally write the data to SSD disk. If you'd like to play with it, pls pull the patchset from my git on github: https://github.com/wuzhy/kernel.git hot_reloc For how to use, please refer too the example below: root@debian-i386:~# echo 0 /sys/block/vdc/queue/rotational ^^^ Above command will hack /dev/vdc to be one SSD disk root@debian-i386:~# echo 99 /proc/sys/fs/hot-age-interval root@debian-i386:~# echo 10 /proc/sys/fs/hot-update-interval root@debian-i386:~# echo 10 /proc/sys/fs/hot-reloc-interval root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL WARNING! - see http://btrfs.wiki.kernel.org before using [ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb [ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc [ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb [ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb [ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc adding device /dev/vdc id 2 [ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc fs created label (null) on /dev/vdb nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB Btrfs v0.20-rc1-254-gb0136aa-dirty root@debian-i386:~# mount -o hot_move /dev/vdb /data2 [ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb [ 144.870444] btrfs: disk space caching is enabled [ 144.904214] VFS: Turning on hot data tracking root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s root@debian-i386:~# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 16G 13G 2.2G 86% / tmpfs 4.8G 0 4.8G 0% /lib/init/rw udev 10M 176K 9.9M 2% /dev tmpfs 4.8G 0 4.8G 0% /dev/shm /dev/vdb 15G 2.0G 13G 14% /data2 root@debian-i386:~# btrfs fi df /data2 Data: total=3.01GB, used=2.00GB System: total=4.00MB, used=4.00KB Metadata: total=8.00MB, used=2.19MB Data_SSD: total=8.00MB, used=0.00 root@debian-i386:~# echo 108 /proc/sys/fs/hot-reloc-threshold ^^^ Above command will start HOT RLEOCATE, because The data temperature is currently 109 root@debian-i386:~# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 16G 13G 2.2G 86% / tmpfs 4.8G 0 4.8G 0% /lib/init/rw udev 10M 176K 9.9M 2% /dev tmpfs 4.8G 0 4.8G 0% /dev/shm /dev/vdb 15G 2.1G 13G 14% /data2 root@debian-i386:~# btrfs fi df /data2 Data: total=3.01GB, used=6.25MB System: total=4.00MB, used=4.00KB Metadata: total=8.00MB, used=2.26MB Data_SSD: total=2.01GB, used=2.00GB root@debian-i386:~# Zhi Yong Wu (5): vfs: add one list_head field btrfs: add one new block group btrfs: add one hot relocation kthread procfs: add three proc interfaces btrfs: add hot relocation support fs/btrfs/Makefile| 3 +- fs/btrfs/ctree.h | 26 +- fs/btrfs/extent-tree.c | 107 +- fs/btrfs/extent_io.c | 31 +- fs/btrfs/extent_io.h | 4 + fs/btrfs/file.c | 36 +- fs/btrfs/hot_relocate.c | 802 +++ fs/btrfs/hot_relocate.h | 48 +++ fs/btrfs/inode-map.c | 13 +- fs/btrfs/inode.c | 92 - fs/btrfs/ioctl.c | 23 +- fs/btrfs/relocation.c
Re: [RFC 0/5] BTRFS hot relocation support
HI, all I saw that bcache will be merged into kernel upstream soon, so i want to know if btrfs hot relocation support is still meanful, if no, i will not continue to work on it. can anyone let me know this? thanks. On Mon, May 6, 2013 at 4:53 PM, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com The patchset as RFC is sent out mainly to see if it goes in the correct development direction. The patchset is trying to introduce hot relocation support for BTRFS. In hybrid storage environment, when the data in HDD disk get hot, it can be relocated to SSD disk by BTRFS hot relocation support automatically; also, if SSD disk ratio exceed its upper threshold, the data which get cold can be looked up and relocated to HDD disk to make more space in SSD disk at first, and then the data which get hot will be relocated to SSD disk automatically. BTRFS hot relocation mainly reserve block space from SSD disk at first, load the hot data to page cache from HDD, allocate block space from SSD disk, and finally write the data to SSD disk. If you'd like to play with it, pls pull the patchset from my git on github: https://github.com/wuzhy/kernel.git hot_reloc For how to use, please refer too the example below: root@debian-i386:~# echo 0 /sys/block/vdc/queue/rotational ^^^ Above command will hack /dev/vdc to be one SSD disk root@debian-i386:~# echo 99 /proc/sys/fs/hot-age-interval root@debian-i386:~# echo 10 /proc/sys/fs/hot-update-interval root@debian-i386:~# echo 10 /proc/sys/fs/hot-reloc-interval root@debian-i386:~# mkfs.btrfs -d single -m single -h /dev/vdb /dev/vdc -f WARNING! - Btrfs v0.20-rc1-254-gb0136aa-dirty IS EXPERIMENTAL WARNING! - see http://btrfs.wiki.kernel.org before using [ 140.279011] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 1 transid 16 /dev/vdb [ 140.283650] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc [ 140.517089] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb [ 140.550759] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 3 /dev/vdb [ 140.552473] device fsid c563a6dc-f192-41a9-9fe1-5a3aa01f5e4c devid 2 transid 16 /dev/vdc adding device /dev/vdc id 2 [ 140.636215] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 2 transid 3 /dev/vdc fs created label (null) on /dev/vdb nodesize 4096 leafsize 4096 sectorsize 4096 size 14.65GB Btrfs v0.20-rc1-254-gb0136aa-dirty root@debian-i386:~# mount -o hot_move /dev/vdb /data2 [ 144.855471] device fsid 197d47a7-b9cd-46a8-9360-eb087b119424 devid 1 transid 6 /dev/vdb [ 144.870444] btrfs: disk space caching is enabled [ 144.904214] VFS: Turning on hot data tracking root@debian-i386:~# dd if=/dev/zero of=/data2/test1 bs=1M count=2048 2048+0 records in 2048+0 records out 2147483648 bytes (2.1 GB) copied, 23.4948 s, 91.4 MB/s root@debian-i386:~# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 16G 13G 2.2G 86% / tmpfs 4.8G 0 4.8G 0% /lib/init/rw udev 10M 176K 9.9M 2% /dev tmpfs 4.8G 0 4.8G 0% /dev/shm /dev/vdb 15G 2.0G 13G 14% /data2 root@debian-i386:~# btrfs fi df /data2 Data: total=3.01GB, used=2.00GB System: total=4.00MB, used=4.00KB Metadata: total=8.00MB, used=2.19MB Data_SSD: total=8.00MB, used=0.00 root@debian-i386:~# echo 108 /proc/sys/fs/hot-reloc-threshold ^^^ Above command will start HOT RLEOCATE, because The data temperature is currently 109 root@debian-i386:~# df -h Filesystem Size Used Avail Use% Mounted on /dev/vda1 16G 13G 2.2G 86% / tmpfs 4.8G 0 4.8G 0% /lib/init/rw udev 10M 176K 9.9M 2% /dev tmpfs 4.8G 0 4.8G 0% /dev/shm /dev/vdb 15G 2.1G 13G 14% /data2 root@debian-i386:~# btrfs fi df /data2 Data: total=3.01GB, used=6.25MB System: total=4.00MB, used=4.00KB Metadata: total=8.00MB, used=2.26MB Data_SSD: total=2.01GB, used=2.00GB root@debian-i386:~# Zhi Yong Wu (5): vfs: add one list_head field btrfs: add one new block group btrfs: add one hot relocation kthread procfs: add three proc interfaces btrfs: add hot relocation support fs/btrfs/Makefile| 3 +- fs/btrfs/ctree.h | 26 +- fs/btrfs/extent-tree.c | 107 +- fs/btrfs/extent_io.c | 31 +- fs/btrfs/extent_io.h | 4 + fs/btrfs/file.c | 36 +- fs/btrfs/hot_relocate.c | 802 +++ fs/btrfs/hot_relocate.h | 48 +++ fs/btrfs/inode-map.c | 13 +- fs/btrfs/inode.c | 92 - fs/btrfs/ioctl.c | 23 +- fs/btrfs/relocation.c| 14 +- fs/btrfs/super.c | 30 +- fs/btrfs/volumes.c | 28 +- fs/hot_tracking.c| 1 + include/linux/btrfs.h| 4 + include/linux/hot_tracking.h | 1 + kernel/sysctl.c | 22 ++ 18 files changed, 1234 insertions(+), 51 deletions(-) create mode 100644 fs/btrfs/hot_relocate.c create mode 100644
Re: [PATCH] btrfs-progs: update mkfs.btrfs help info for raid5/6
I found this patch had been merged into David's unstable btrfs-progs git, but not into its master git, so don't know the reason. If it need me to send v2 based on your comments, please let me know, thanks. On Sun, Apr 21, 2013 at 5:33 PM, Zhi Yong Wu zwu.ker...@gmail.com wrote: I found this patch had been merged into David's unstable btrfs-progs git, but not into its master git, so don't know the reason. If it need me to send v2 based on your comments, please let me know, thanks. On Sun, Apr 21, 2013 at 7:48 AM, Eric Sandeen sand...@redhat.com wrote: On 4/20/13 4:13 PM, Frank A. Kingswood wrote: On 04/03/13 01:37, zwu.ker...@gmail.com wrote: From: Zhi Yong Wuwu...@linux.vnet.ibm.com Since raid5/6 support was introduced, we should update mkfs.btrfs help info. Signed-off-by: Zhi Yong Wuwu...@linux.vnet.ibm.com --- mkfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mkfs.c b/mkfs.c index 5ece186..f9f26a5 100644 --- a/mkfs.c +++ b/mkfs.c @@ -326,7 +326,7 @@ static void print_usage(void) fprintf(stderr, options:\n); fprintf(stderr, \t -A --alloc-start the offset to start the FS\n); fprintf(stderr, \t -b --byte-count total number of bytes in the FS\n); -fprintf(stderr, \t -d --data data profile, raid0, raid1, raid10, dup or single\n); +fprintf(stderr, \t -d --data data profile, raid0, raid1, raid5, raid6, raid10, dup or single\n); fprintf(stderr, \t -l --leafsize size of btree leaves\n); fprintf(stderr, \t -L --label set a label\n); fprintf(stderr, \t -m --metadata metadata profile, values like data profile\n); Nitpick: that line of output is now longer than 80. Perhaps instead of \t have a few spaces at the start of each line. or maybe: +fprintf(stderr, \t -d --data data profile: [raid0|raid1|raid5|raid6|raid10|dup|single]\n); I think that fits... Frank -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] How to enable btrfs reserve block space from specific device?
On Wed, Mar 27, 2013 at 7:34 PM, Ilya Dryomov idryo...@gmail.com wrote: On Wed, Mar 27, 2013 at 09:45:59AM +0800, Zhi Yong Wu wrote: HI, When i work on btrfs hot relocation feature, i hit one question about block reservation. btrfs hot relocation need to reserve block space from specific devices such as SSD, but current btrfs reserving code doesn't differentiate if block space is reserved from SSD or HDD. In order to make btrfs support this, I thought that we can introduce one new block group or flag, but this will maybe make large impact on current existing other functions. For this, does any guy have some better idea? thanks. Hi, Sorry for the late reply. I have a lot on my plate right now, so I No matter, thanks for you reply. it is definitely very helpful to me, thanks. haven't looked at your WIP code. However, based on my knowledge, I can tell you that block reservation problem is completely orthogonal to the way you choose to handle hot data storage. A separate block group is one way to do it, and probably the easiest one. Block groups - chunks is a 1:1 mapping, so, for now, it might make sense to have one giant Yes, bg:chunks is 1:1 mapping. block group spanning over the entire hot data device. OTOH, IIRC the If so,when hot relocation is enabled, will the other btrfs functions reserve block space only from HDD? In this case, will SSD seem to be only used as one cache? original implementation from IBM just used the rotation flag to detect yes, it determine if one device is SSD by one flag in request queue of that device. hot data devices. You have a lot of choices here, but, if you are planning on implementing mirrored/striped hot data devices in future, This is maybe next item, i will put it and meta relocation support in my TODO list. I'd definitely go with block groups, since that'll give you some of the infrastructure for that for free. Thanks, Ilya -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] How to enable btrfs reserve block space from specific device?
HI, When i work on btrfs hot relocation feature, i hit one question about block reservation. btrfs hot relocation need to reserve block space from specific devices such as SSD, but current btrfs reserving code doesn't differentiate if block space is reserved from SSD or HDD. In order to make btrfs support this, I thought that we can introduce one new block group or flag, but this will maybe make large impact on current existing other functions. For this, does any guy have some better idea? thanks. -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [POSSIBLE SPAM] Re: Hybrid Storage proposal
On Tue, Feb 26, 2013 at 9:08 PM, Matias Bjørling m...@itu.dk wrote: On 02/26/2013 12:04 PM, Zhi Yong Wu wrote: HI, It's a bit long so that i haven't read its whole, but i want to know if it has any collision with my ongoing feature btrfs hot relocation/migration? It will utilize the hot track patch set you're been creating for VFS. I think the methods are complementary. Do you have a description of the relocation/migration approach? sorry, no currently. i will post out them after they are completed. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Hybrid Storage proposal
times. Implementation details: - This can be implemented on a per-tree basis. E.g. specific extent trees, checksum tree or other frequently updated tree. * Data updates Data are placed two places. If small, directly inside the tree leafs or if larger, within extents. If an inode is known to be hot, we cache the updates. - Write-through cache for user data If the cache isn't safe, i.e. no duplicate copies. The cache can still be used using write-through and cache subsequent reads. This is similar to a pure disk block-based cache approach. 2.5 Other - Warmup cache at mount time Reread the SSD cache on mount to enjoy a preheated cache of the bulk storage. This can be archived by storing information about the cache and reconstruct the cache tree. - (By David Sterba) Implement appropriate debugfs/sysfs hooks for monitoring the cache and get information about the size of trees. This is useful for deciding if a tree should be cached on an SSD or not. E.g. the checksum tree might always be in memory, but if it isn't, it should be cached on the SSD storage to lower checksum tree seeks on the bulk storage. 2.6 Summary The following list of items have to be addressed for the first full patchset: - Cache lookup - Cache type (write through, write back, hot tracking, etc.) - Data structure for lookup cache - Allow for prioritized storage (e.g. PCMSSDHDD) - Eviction strategy. LRU, LFU, FIFO, Temparature-based (VFS hot track) - Disk layout for cache storage Here we presented our design space for a hybrid drive solution, as well as what it would take to carry it out. We are very much open to any kind of input, feedback or new ideas you might have. Sincerely, Matias Jesper -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 02/19] vfs: initialize and free data structures
On Wed, Nov 7, 2012 at 6:24 AM, David Sterba d...@jikos.cz wrote: On Mon, Oct 29, 2012 at 12:30:44PM +0800, zwu.ker...@gmail.com wrote: +/* Frees the entire hot_range_tree. */ +static void hot_inode_item_free(struct kref *kref) +{ + struct hot_comm_item *comm_item = container_of(kref, + struct hot_comm_item, refs); + struct hot_inode_item *he = container_of(comm_item, + struct hot_inode_item, hot_inode); + + hot_range_tree_free(he); + radix_tree_delete(he-hot_inode_tree, he-i_ino); void *radix_tree_delete(struct radix_tree_root *root, unsigned long index) and he::i_ino is u64, this will not work when sizeof(unsigned long) != sizeof(u64) (iirc this is a known limitation of radix tree implementation). This will work on 64bit only, not sure if this is intentional. Fixed, thanks. + kmem_cache_free(hot_inode_item_cachep, he); +} + +/* Frees the entire hot_inode_tree. */ +static void hot_inode_tree_exit(struct hot_info *root) +{ + struct hot_inode_item *hi_nodes[8]; + u64 ino = 0; + int i, n; nitpick, put the declarations on separate lines + + while (1) { + spin_lock(root-lock); + n = radix_tree_gang_lookup(root-hot_inode_tree, +(void **)hi_nodes, ino, +ARRAY_SIZE(hi_nodes)); + if (!n) { + spin_unlock(root-lock); + break; + } + + ino = hi_nodes[n - 1]-i_ino + 1; + for (i = 0; i n; i++) + hot_inode_item_put(hi_nodes[i]); + spin_unlock(root-lock); + } +} + /* * Initialize kmem cache for hot_inode_item and hot_range_item. */ @@ -106,3 +197,36 @@ err: kmem_cache_destroy(hot_inode_item_cachep); } EXPORT_SYMBOL_GPL(hot_cache_init); + +/* + * Initialize the data structures for hot data tracking. + */ +int hot_track_init(struct super_block *sb) +{ + struct hot_info *root; + int ret = -ENOMEM; + + root = kzalloc(sizeof(struct hot_info), GFP_NOFS); + if (!root) { + printk(KERN_ERR %s: Failed to malloc memory for + hot_info\n, __func__); + return ret; minor: you can drop the variable ret and just reurn ENOMEM here + } + + sb-s_hot_root = root; + hot_inode_tree_init(root); + + printk(KERN_INFO VFS: Turning on hot data tracking\n); + + return 0; +} +EXPORT_SYMBOL_GPL(hot_track_init); david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 16/19] btrfs: add hot tracking support
On Wed, Nov 7, 2012 at 8:00 AM, David Sterba d...@jikos.cz wrote: On Mon, Oct 29, 2012 at 12:30:58PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Introduce one new mount option '-o hot_track', and add its parsing support. Its usage looks like: mount -o hot_track mount -o nouser,hot_track mount -o nouser,hot_track,loop mount -o hot_track,nouser Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com Reviewed-by: David Sterba dste...@suse.cz thanks for your review. -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
On Wed, Nov 7, 2012 at 6:45 AM, Darrick J. Wong darrick.w...@oracle.com wrote: On Mon, Oct 29, 2012 at 12:30:45PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add some util helpers to update access frequencies for one file or its range. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c| 179 ++ fs/hot_tracking.h|7 ++ include/linux/hot_tracking.h |2 + 3 files changed, 188 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 68591f0..0a7d9a3 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root) } } +struct hot_inode_item +*hot_inode_item_find(struct hot_info *root, u64 ino) +{ + struct hot_inode_item *he; + int ret; + +again: + spin_lock(root-lock); + he = radix_tree_lookup(root-hot_inode_tree, ino); + if (he) { + kref_get(he-hot_inode.refs); + spin_unlock(root-lock); + return he; + } + spin_unlock(root-lock); + + he = kmem_cache_zalloc(hot_inode_item_cachep, + GFP_KERNEL | GFP_NOFS); + if (!he) + return ERR_PTR(-ENOMEM); + + hot_inode_item_init(he, ino, root-hot_inode_tree); + + ret = radix_tree_preload(GFP_NOFS ~__GFP_HIGHMEM); + if (ret) { + kmem_cache_free(hot_inode_item_cachep, he); + return ERR_PTR(ret); + } + + spin_lock(root-lock); + ret = radix_tree_insert(root-hot_inode_tree, ino, he); + if (ret == -EEXIST) { + kmem_cache_free(hot_inode_item_cachep, he); + spin_unlock(root-lock); + radix_tree_preload_end(); + goto again; + } + spin_unlock(root-lock); + radix_tree_preload_end(); + + kref_get(he-hot_inode.refs); + return he; +} +EXPORT_SYMBOL_GPL(hot_inode_item_find); + +static struct hot_range_item +*hot_range_item_find(struct hot_inode_item *he, + u32 start) +{ + struct hot_range_item *hr; + int ret; + +again: + spin_lock(he-lock); + hr = radix_tree_lookup(he-hot_range_tree, start); + if (hr) { + kref_get(hr-hot_range.refs); + spin_unlock(he-lock); + return hr; + } + spin_unlock(he-lock); + + hr = kmem_cache_zalloc(hot_range_item_cachep, + GFP_KERNEL | GFP_NOFS); + if (!hr) + return ERR_PTR(-ENOMEM); + + hot_range_item_init(hr, start, he); + + ret = radix_tree_preload(GFP_NOFS ~__GFP_HIGHMEM); + if (ret) { + kmem_cache_free(hot_range_item_cachep, hr); + return ERR_PTR(ret); + } + + spin_lock(he-lock); + ret = radix_tree_insert(he-hot_range_tree, start, hr); + if (ret == -EEXIST) { + kmem_cache_free(hot_range_item_cachep, hr); + spin_unlock(he-lock); + radix_tree_preload_end(); + goto again; + } + spin_unlock(he-lock); + radix_tree_preload_end(); + + kref_get(hr-hot_range.refs); + return hr; +} + +/* + * This function does the actual work of updating + * the frequency numbers, whatever they turn out to be. + */ +static u64 hot_average_update(struct timespec old_atime, + struct timespec cur_time, u64 old_avg) +{ + struct timespec delta_ts; + u64 new_avg; + u64 new_delta; + + delta_ts = timespec_sub(cur_time, old_atime); + new_delta = timespec_to_ns(delta_ts) FREQ_POWER; + + new_avg = (old_avg FREQ_POWER) - old_avg + new_delta; + new_avg = new_avg FREQ_POWER; + + return new_avg; +} + +static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write) +{ + struct timespec cur_time = current_kernel_time(); + + if (write) { + freq_data-nr_writes += 1; + freq_data-avg_delta_writes = hot_average_update( + freq_data-last_write_time, + cur_time, + freq_data-avg_delta_writes); + freq_data-last_write_time = cur_time; + } else { + freq_data-nr_reads += 1; + freq_data-avg_delta_reads = hot_average_update( + freq_data-last_read_time, + cur_time, + freq_data-avg_delta_reads); I think you could just pass in a pointer to freq_data-avg_delta_{writes,reads} here... why? + freq_data-last_read_time = cur_time; + } +} + /* * Initialize kmem cache for hot_inode_item and hot_range_item. */ @@ -199,6 +330,54 @@ err: EXPORT_SYMBOL_GPL(hot_cache_init); /* + * Main function to update access frequency from read/writepage
Re: [RFC v4+ hot_track 10/19] vfs: introduce hot func register framework
On Wed, Nov 7, 2012 at 7:30 AM, Darrick J. Wong darrick.w...@oracle.com wrote: On Mon, Oct 29, 2012 at 12:30:52PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Introduce one framwork to enable that specific FS can register its own hot tracking functions. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c| 78 ++ include/linux/hot_tracking.h | 25 + 2 files changed, 96 insertions(+), 7 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 0ef9cad..c6c6138 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -24,6 +24,9 @@ #include linux/limits.h #include hot_tracking.h +static DEFINE_SPINLOCK(hot_func_list_lock); +static LIST_HEAD(hot_func_list); + /* kmem_cache pointers for slab caches */ static struct kmem_cache *hot_inode_item_cachep __read_mostly; static struct kmem_cache *hot_range_item_cachep __read_mostly; @@ -305,20 +308,23 @@ static u64 hot_average_update(struct timespec old_atime, return new_avg; } -static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write) +static void hot_freq_data_update(struct hot_info *root, + struct hot_freq_data *freq_data, bool write) { struct timespec cur_time = current_kernel_time(); if (write) { freq_data-nr_writes += 1; - freq_data-avg_delta_writes = hot_average_update( + freq_data-avg_delta_writes = + root-hot_func_type-ops.hot_rw_freq_calc_fn( freq_data-last_write_time, cur_time, freq_data-avg_delta_writes); freq_data-last_write_time = cur_time; } else { freq_data-nr_reads += 1; - freq_data-avg_delta_reads = hot_average_update( + freq_data-avg_delta_reads = + root-hot_func_type-ops.hot_rw_freq_calc_fn( freq_data-last_read_time, cur_time, freq_data-avg_delta_reads); @@ -430,7 +436,7 @@ static void hot_map_array_update(struct hot_freq_data *freq_data, struct hot_comm_item *comm_item; struct hot_inode_item *he; struct hot_range_item *hr; - u32 temp = hot_temp_calc(freq_data); + u32 temp = root-hot_func_type-ops.hot_temp_calc_fn(freq_data); u8 a_temp = temp (32 - HEAT_MAP_BITS); u8 b_temp = freq_data-last_temp (32 - HEAT_MAP_BITS); @@ -511,7 +517,7 @@ static void hot_range_update(struct hot_inode_item *he, hr_nodes[i]-hot_range.hot_freq_data, root); spin_lock(hr_nodes[i]-hot_range.lock); - obsolete = hot_is_obsolete( + obsolete = root-hot_func_type-ops.hot_is_obsolete_fn( hr_nodes[i]-hot_range.hot_freq_data); spin_unlock(hr_nodes[i]-hot_range.lock); @@ -668,7 +674,7 @@ void hot_update_freqs(struct inode *inode, u64 start, } spin_lock(he-hot_inode.lock); - hot_freq_data_update(he-hot_inode.hot_freq_data, rw); + hot_freq_data_update(root, he-hot_inode.hot_freq_data, rw); spin_unlock(he-hot_inode.lock); /* @@ -685,7 +691,7 @@ void hot_update_freqs(struct inode *inode, u64 start, } spin_lock(hr-hot_range.lock); - hot_freq_data_update(hr-hot_range.hot_freq_data, rw); + hot_freq_data_update(root, hr-hot_range.hot_freq_data, rw); spin_unlock(hr-hot_range.lock); hot_range_item_put(hr); @@ -695,6 +701,61 @@ void hot_update_freqs(struct inode *inode, u64 start, } EXPORT_SYMBOL_GPL(hot_update_freqs); +static struct hot_func_type hot_func_def = { + .hot_func_name = hot_type_def, + .ops = { + .hot_rw_freq_calc_fn = hot_average_update, + .hot_temp_calc_fn= hot_temp_calc, + .hot_is_obsolete_fn = hot_is_obsolete, + }, +}; If these hot_ops are per-filesystem, why not just embed a struct hot_func_ops inside of struct file_system_type? That eliminates this _get function, this _get function is very small, only some loc, if hot_func_ops is embedded in struct file_system_type, i am afraid to introduce some regressions collision avoidance, etc. You can fill in NULL function pointers in fill in NULL func pointer? why? hot_track_init (or just code around them). --D + +static struct hot_func_type *hot_func_get(const char *name) +{ + struct hot_func_type *f, *h = hot_func_def; + + spin_lock(hot_func_list_lock); + list_for_each_entry(f, hot_func_list, list) { + if (!strcmp(f-hot_func_name, name)) + h = f; + } + spin_unlock(hot_func_list_lock
Re: [PATCH] xfs: add hot tracking support.
HI, Dave, I guess that you should add some hot tracking stuff in some xfs_show_xxx function, right? On Tue, Oct 16, 2012 at 8:04 AM, Dave Chinner da...@fromorbit.com wrote: From: Dave Chinner dchin...@redhat.com Connect up the VFS hot tracking support so XFS filesystems can make use of it. Signed-off-by: Dave Chinner dchin...@redhat.com --- fs/xfs/xfs_mount.h |1 + fs/xfs/xfs_super.c |9 + 2 files changed, 10 insertions(+) diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index a631ca3..d5e7277 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -215,6 +215,7 @@ typedef struct xfs_mount { #define XFS_MOUNT_WSYNC(1ULL 0) /* for nfs - all metadata ops must be synchronous except for space allocations */ +#define XFS_MOUNT_HOTTRACK (1ULL 1) /* hot inode tracking */ #define XFS_MOUNT_WAS_CLEAN(1ULL 3) #define XFS_MOUNT_FS_SHUTDOWN (1ULL 4) /* atomic stop of all filesystem operations, typically for diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index 56c2537..17786ff 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -61,6 +61,7 @@ #include linux/kthread.h #include linux/freezer.h #include linux/parser.h +#include linux/hot_tracking.h static const struct super_operations xfs_super_operations; static kmem_zone_t *xfs_ioend_zone; @@ -114,6 +115,7 @@ mempool_t *xfs_ioend_pool; #define MNTOPT_NODELAYLOG nodelaylog/* Delayed logging disabled */ #define MNTOPT_DISCARDdiscard/* Discard unused blocks */ #define MNTOPT_NODISCARD nodiscard /* Do not discard unused blocks */ +#define MNTOPT_HOTTRACKhot_track /* hot inode tracking */ /* * Table driven mount option parser. @@ -371,6 +373,8 @@ xfs_parseargs( mp-m_flags |= XFS_MOUNT_DISCARD; } else if (!strcmp(this_char, MNTOPT_NODISCARD)) { mp-m_flags = ~XFS_MOUNT_DISCARD; + } else if (!strcmp(this_char, MNTOPT_HOTTRACK)) { + mp-m_flags |= XFS_MOUNT_HOTTRACK; } else if (!strcmp(this_char, ihashsize)) { xfs_warn(mp, ihashsize no longer used, option is deprecated.); @@ -1040,6 +1044,9 @@ xfs_fs_put_super( { struct xfs_mount*mp = XFS_M(sb); + if (mp-m_flags XFS_MOUNT_HOTTRACK) + hot_track_exit(sb); + xfs_filestream_unmount(mp); xfs_unmountfs(mp); @@ -1470,6 +1477,8 @@ xfs_fs_fill_super( error = ENOMEM; goto out_unmount; } + if (mp-m_flags XFS_MOUNT_HOTTRACK) + hot_track_init(sb); return 0; -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 00/19] vfs: hot data tracking
On Mon, Oct 29, 2012 at 12:30 PM, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com NOTE: The patchset can be obtained via my kernel dev git on github: g...@github.com:wuzhy/kernel.git hot_tracking If you're interested, you can also can review them via https://github.com/wuzhy/kernel/commits/hot_tracking hi, guys, The latest code change has been pushed into my above dev git tree. If no further comments are done, i will post next version soon. For more info, please check hot_tracking.txt in Documentation TODO List: 1.) Need to do scalability or performance tests. 2.) Need one simpler but effective temp calc'ing function 3.) How to save the file temperature among the umount to be able to preserve the file tempreture after reboot Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to be very btrfs-specific. I've taken their code and attempted to make it more generic and integrate it at the VFS level. Changelog from v3: 1.) Rewritten debugfs support based seq_file operation. [Dave Chinner] 2.) Refactored workqueue support. [Dave Chinner] 3.) Turn some Micro into be tunable [Zhiyong, Zheng Liu] TIME_TO_KICK, and HEAT_UPDATE_DELAY 4.) Introduce hot func registering framework [Zhiyong] 5.) Remove global variable for hot tracking [Zhiyong] 6.) Add xfs hot tracking support [Dave Chinner] 7.) Add ext4 hot tracking support [Zheng Liu] 8.) Cleanedup a lot of other issues [Dave Chinner] v3: 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner] 2.) Added memory shrinker [Dave Chinner] v2: 1.) Converted to one workqueue to update map info periodically [Dave Chinner] 2.) Cleanedup a lot of other issues [Dave Chinner] v1: 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner] 2.) Add btrfs hot tracking support [Zhiyong] 3.) The first three patches can probably just be flattened into one. [Marco Stornelli , Dave Chinner] Dave Chinner (1): xfs: add hot tracking support Zheng Liu (1): ext4: add hot tracking support Zhi Yong Wu (17): vfs: introduce private radix tree structures vfs: initialize and free data structures vfs: add I/O frequency update function vfs: add two map arrays vfs: add hooks to enable hot tracking vfs: add temp calculation function vfs: add map info update function vfs: add aging function vfs: add one work queue vfs: introduce hot func register framework vfs: register one shrinker vfs: add one ioctl interface debugfs: introduce one function vfs: add debugfs support sysfs: add two hot_track proc files btrfs: add hot tracking support vfs: add documentation Documentation/filesystems/00-INDEX |2 + Documentation/filesystems/hot_tracking.txt | 262 ++ fs/Makefile|2 +- fs/btrfs/ctree.h |1 + fs/btrfs/super.c | 22 +- fs/compat_ioctl.c |5 + fs/dcache.c|2 + fs/debugfs/inode.c | 26 + fs/direct-io.c |6 + fs/ext4/ext4.h |3 + fs/ext4/super.c| 13 +- fs/hot_tracking.c | 1367 fs/hot_tracking.h | 58 ++ fs/ioctl.c | 78 ++ fs/xfs/xfs_mount.h |1 + fs/xfs/xfs_super.c | 16 + include/linux/debugfs.h|9 + include/linux/fs.h |4 + include/linux/hot_tracking.h | 149 +++ kernel/sysctl.c| 14 + mm/filemap.c |6 + mm/page-writeback.c| 12 + mm/readahead.c |6 + 23 files changed, 2061 insertions(+), 3 deletions(-) create mode 100644 Documentation/filesystems/hot_tracking.txt create mode 100644 fs/hot_tracking.c create mode 100644 fs/hot_tracking.h create mode 100644 include/linux/hot_tracking.h -- 1.7.6.5 -- To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
On Thu, Nov 8, 2012 at 2:49 AM, Darrick J. Wong darrick.w...@oracle.com wrote: On Wed, Nov 07, 2012 at 04:27:05PM +0800, Zhi Yong Wu wrote: On Wed, Nov 7, 2012 at 6:45 AM, Darrick J. Wong darrick.w...@oracle.com wrote: On Mon, Oct 29, 2012 at 12:30:45PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add some util helpers to update access frequencies for one file or its range. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c| 179 ++ fs/hot_tracking.h|7 ++ include/linux/hot_tracking.h |2 + 3 files changed, 188 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 68591f0..0a7d9a3 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root) } } +struct hot_inode_item +*hot_inode_item_find(struct hot_info *root, u64 ino) +{ + struct hot_inode_item *he; + int ret; + +again: + spin_lock(root-lock); + he = radix_tree_lookup(root-hot_inode_tree, ino); + if (he) { + kref_get(he-hot_inode.refs); + spin_unlock(root-lock); + return he; + } + spin_unlock(root-lock); + + he = kmem_cache_zalloc(hot_inode_item_cachep, + GFP_KERNEL | GFP_NOFS); + if (!he) + return ERR_PTR(-ENOMEM); + + hot_inode_item_init(he, ino, root-hot_inode_tree); + + ret = radix_tree_preload(GFP_NOFS ~__GFP_HIGHMEM); + if (ret) { + kmem_cache_free(hot_inode_item_cachep, he); + return ERR_PTR(ret); + } + + spin_lock(root-lock); + ret = radix_tree_insert(root-hot_inode_tree, ino, he); + if (ret == -EEXIST) { + kmem_cache_free(hot_inode_item_cachep, he); + spin_unlock(root-lock); + radix_tree_preload_end(); + goto again; + } + spin_unlock(root-lock); + radix_tree_preload_end(); + + kref_get(he-hot_inode.refs); + return he; +} +EXPORT_SYMBOL_GPL(hot_inode_item_find); + +static struct hot_range_item +*hot_range_item_find(struct hot_inode_item *he, + u32 start) +{ + struct hot_range_item *hr; + int ret; + +again: + spin_lock(he-lock); + hr = radix_tree_lookup(he-hot_range_tree, start); + if (hr) { + kref_get(hr-hot_range.refs); + spin_unlock(he-lock); + return hr; + } + spin_unlock(he-lock); + + hr = kmem_cache_zalloc(hot_range_item_cachep, + GFP_KERNEL | GFP_NOFS); + if (!hr) + return ERR_PTR(-ENOMEM); + + hot_range_item_init(hr, start, he); + + ret = radix_tree_preload(GFP_NOFS ~__GFP_HIGHMEM); + if (ret) { + kmem_cache_free(hot_range_item_cachep, hr); + return ERR_PTR(ret); + } + + spin_lock(he-lock); + ret = radix_tree_insert(he-hot_range_tree, start, hr); + if (ret == -EEXIST) { + kmem_cache_free(hot_range_item_cachep, hr); + spin_unlock(he-lock); + radix_tree_preload_end(); + goto again; + } + spin_unlock(he-lock); + radix_tree_preload_end(); + + kref_get(hr-hot_range.refs); + return hr; +} + +/* + * This function does the actual work of updating + * the frequency numbers, whatever they turn out to be. + */ +static u64 hot_average_update(struct timespec old_atime, + struct timespec cur_time, u64 old_avg) +{ + struct timespec delta_ts; + u64 new_avg; + u64 new_delta; + + delta_ts = timespec_sub(cur_time, old_atime); + new_delta = timespec_to_ns(delta_ts) FREQ_POWER; + + new_avg = (old_avg FREQ_POWER) - old_avg + new_delta; + new_avg = new_avg FREQ_POWER; + + return new_avg; +} + +static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write) +{ + struct timespec cur_time = current_kernel_time(); + + if (write) { + freq_data-nr_writes += 1; + freq_data-avg_delta_writes = hot_average_update( + freq_data-last_write_time, + cur_time, + freq_data-avg_delta_writes); + freq_data-last_write_time = cur_time; + } else { + freq_data-nr_reads += 1; + freq_data-avg_delta_reads = hot_average_update( + freq_data-last_read_time, + cur_time, + freq_data-avg_delta_reads); I think you could just pass in a pointer to freq_data-avg_delta_{writes
Re: [RFC v4+ hot_track 10/19] vfs: introduce hot func register framework
On Thu, Nov 8, 2012 at 2:58 AM, Darrick J. Wong darrick.w...@oracle.com wrote: On Wed, Nov 07, 2012 at 04:34:35PM +0800, Zhi Yong Wu wrote: On Wed, Nov 7, 2012 at 7:30 AM, Darrick J. Wong darrick.w...@oracle.com wrote: On Mon, Oct 29, 2012 at 12:30:52PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Introduce one framwork to enable that specific FS can register its own hot tracking functions. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c| 78 ++ include/linux/hot_tracking.h | 25 + 2 files changed, 96 insertions(+), 7 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 0ef9cad..c6c6138 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -24,6 +24,9 @@ #include linux/limits.h #include hot_tracking.h +static DEFINE_SPINLOCK(hot_func_list_lock); +static LIST_HEAD(hot_func_list); + /* kmem_cache pointers for slab caches */ static struct kmem_cache *hot_inode_item_cachep __read_mostly; static struct kmem_cache *hot_range_item_cachep __read_mostly; @@ -305,20 +308,23 @@ static u64 hot_average_update(struct timespec old_atime, return new_avg; } -static void hot_freq_data_update(struct hot_freq_data *freq_data, bool write) +static void hot_freq_data_update(struct hot_info *root, + struct hot_freq_data *freq_data, bool write) { struct timespec cur_time = current_kernel_time(); if (write) { freq_data-nr_writes += 1; - freq_data-avg_delta_writes = hot_average_update( + freq_data-avg_delta_writes = + root-hot_func_type-ops.hot_rw_freq_calc_fn( freq_data-last_write_time, cur_time, freq_data-avg_delta_writes); freq_data-last_write_time = cur_time; } else { freq_data-nr_reads += 1; - freq_data-avg_delta_reads = hot_average_update( + freq_data-avg_delta_reads = + root-hot_func_type-ops.hot_rw_freq_calc_fn( freq_data-last_read_time, cur_time, freq_data-avg_delta_reads); @@ -430,7 +436,7 @@ static void hot_map_array_update(struct hot_freq_data *freq_data, struct hot_comm_item *comm_item; struct hot_inode_item *he; struct hot_range_item *hr; - u32 temp = hot_temp_calc(freq_data); + u32 temp = root-hot_func_type-ops.hot_temp_calc_fn(freq_data); u8 a_temp = temp (32 - HEAT_MAP_BITS); u8 b_temp = freq_data-last_temp (32 - HEAT_MAP_BITS); @@ -511,7 +517,7 @@ static void hot_range_update(struct hot_inode_item *he, hr_nodes[i]-hot_range.hot_freq_data, root); spin_lock(hr_nodes[i]-hot_range.lock); - obsolete = hot_is_obsolete( + obsolete = root-hot_func_type-ops.hot_is_obsolete_fn( hr_nodes[i]-hot_range.hot_freq_data); spin_unlock(hr_nodes[i]-hot_range.lock); @@ -668,7 +674,7 @@ void hot_update_freqs(struct inode *inode, u64 start, } spin_lock(he-hot_inode.lock); - hot_freq_data_update(he-hot_inode.hot_freq_data, rw); + hot_freq_data_update(root, he-hot_inode.hot_freq_data, rw); spin_unlock(he-hot_inode.lock); /* @@ -685,7 +691,7 @@ void hot_update_freqs(struct inode *inode, u64 start, } spin_lock(hr-hot_range.lock); - hot_freq_data_update(hr-hot_range.hot_freq_data, rw); + hot_freq_data_update(root, hr-hot_range.hot_freq_data, rw); spin_unlock(hr-hot_range.lock); hot_range_item_put(hr); @@ -695,6 +701,61 @@ void hot_update_freqs(struct inode *inode, u64 start, } EXPORT_SYMBOL_GPL(hot_update_freqs); +static struct hot_func_type hot_func_def = { + .hot_func_name = hot_type_def, + .ops = { + .hot_rw_freq_calc_fn = hot_average_update, + .hot_temp_calc_fn= hot_temp_calc, + .hot_is_obsolete_fn = hot_is_obsolete, + }, +}; If these hot_ops are per-filesystem, why not just embed a struct hot_func_ops inside of struct file_system_type? That eliminates this _get function, this _get function is very small, only some loc, if hot_func_ops is embedded in struct file_system_type, i am afraid to introduce some regressions What kind of regressions are you afraid of, specifically? I don't think fstype is performance-critical enough to worry about wreaking havoc in the caches due to adding three function pointers. done, thanks. collision
Re: [RFC v4+ hot_track 02/19] vfs: initialize and free data structures
On Wed, Nov 7, 2012 at 6:24 AM, David Sterba d...@jikos.cz wrote: On Mon, Oct 29, 2012 at 12:30:44PM +0800, zwu.ker...@gmail.com wrote: +/* Frees the entire hot_range_tree. */ +static void hot_inode_item_free(struct kref *kref) +{ + struct hot_comm_item *comm_item = container_of(kref, + struct hot_comm_item, refs); + struct hot_inode_item *he = container_of(comm_item, + struct hot_inode_item, hot_inode); + + hot_range_tree_free(he); + radix_tree_delete(he-hot_inode_tree, he-i_ino); void *radix_tree_delete(struct radix_tree_root *root, unsigned long index) and he::i_ino is u64, this will not work when sizeof(unsigned long) != sizeof(u64) (iirc this is a known limitation of radix tree implementation). This will work on 64bit only, not sure if this is intentional. i actually also realized this. Do you have a better way to handle this? + kmem_cache_free(hot_inode_item_cachep, he); +} + +/* Frees the entire hot_inode_tree. */ +static void hot_inode_tree_exit(struct hot_info *root) +{ + struct hot_inode_item *hi_nodes[8]; + u64 ino = 0; + int i, n; nitpick, put the declarations on separate lines Will it have any issue? It has passed the check of checkpatch.pl. + + while (1) { + spin_lock(root-lock); + n = radix_tree_gang_lookup(root-hot_inode_tree, +(void **)hi_nodes, ino, +ARRAY_SIZE(hi_nodes)); + if (!n) { + spin_unlock(root-lock); + break; + } + + ino = hi_nodes[n - 1]-i_ino + 1; + for (i = 0; i n; i++) + hot_inode_item_put(hi_nodes[i]); + spin_unlock(root-lock); + } +} + /* * Initialize kmem cache for hot_inode_item and hot_range_item. */ @@ -106,3 +197,36 @@ err: kmem_cache_destroy(hot_inode_item_cachep); } EXPORT_SYMBOL_GPL(hot_cache_init); + +/* + * Initialize the data structures for hot data tracking. + */ +int hot_track_init(struct super_block *sb) +{ + struct hot_info *root; + int ret = -ENOMEM; + + root = kzalloc(sizeof(struct hot_info), GFP_NOFS); + if (!root) { + printk(KERN_ERR %s: Failed to malloc memory for + hot_info\n, __func__); + return ret; minor: you can drop the variable ret and just reurn ENOMEM here This variable will also be used in the following patches. + } + + sb-s_hot_root = root; + hot_inode_tree_init(root); + + printk(KERN_INFO VFS: Turning on hot data tracking\n); + + return 0; +} +EXPORT_SYMBOL_GPL(hot_track_init); david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
On Wed, Nov 7, 2012 at 6:37 AM, David Sterba d...@jikos.cz wrote: On Mon, Oct 29, 2012 at 12:30:45PM +0800, zwu.ker...@gmail.com wrote: --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c +struct hot_inode_item +*hot_inode_item_find(struct hot_info *root, u64 ino) +{ + struct hot_inode_item *he; + int ret; + +again: + spin_lock(root-lock); + he = radix_tree_lookup(root-hot_inode_tree, ino); + if (he) { + kref_get(he-hot_inode.refs); + spin_unlock(root-lock); + return he; + } + spin_unlock(root-lock); + + he = kmem_cache_zalloc(hot_inode_item_cachep, + GFP_KERNEL | GFP_NOFS); + if (!he) + return ERR_PTR(-ENOMEM); + + hot_inode_item_init(he, ino, root-hot_inode_tree); + + ret = radix_tree_preload(GFP_NOFS ~__GFP_HIGHMEM); + if (ret) { + kmem_cache_free(hot_inode_item_cachep, he); radix_tree_preload_end() + return ERR_PTR(ret); + } + + spin_lock(root-lock); + ret = radix_tree_insert(root-hot_inode_tree, ino, he); + if (ret == -EEXIST) { + kmem_cache_free(hot_inode_item_cachep, he); + spin_unlock(root-lock); + radix_tree_preload_end(); + goto again; + } + spin_unlock(root-lock); + radix_tree_preload_end(); + + kref_get(he-hot_inode.refs); + return he; +} +EXPORT_SYMBOL_GPL(hot_inode_item_find); + +static struct hot_range_item +*hot_range_item_find(struct hot_inode_item *he, + u32 start) +{ + struct hot_range_item *hr; + int ret; + +again: + spin_lock(he-lock); + hr = radix_tree_lookup(he-hot_range_tree, start); + if (hr) { + kref_get(hr-hot_range.refs); + spin_unlock(he-lock); + return hr; + } + spin_unlock(he-lock); + + hr = kmem_cache_zalloc(hot_range_item_cachep, + GFP_KERNEL | GFP_NOFS); + if (!hr) + return ERR_PTR(-ENOMEM); + + hot_range_item_init(hr, start, he); + + ret = radix_tree_preload(GFP_NOFS ~__GFP_HIGHMEM); + if (ret) { + kmem_cache_free(hot_range_item_cachep, hr); radix_tree_preload_end() I checked some kernel existing cases about the usage of radix_tree_preload(), it seems that when radix_tree_preload() fail, its error handling doesn't need call radix_tree_preload_end() any more. + return ERR_PTR(ret); + } + + spin_lock(he-lock); + ret = radix_tree_insert(he-hot_range_tree, start, hr); + if (ret == -EEXIST) { + kmem_cache_free(hot_range_item_cachep, hr); + spin_unlock(he-lock); + radix_tree_preload_end(); ditto. + goto again; + } + spin_unlock(he-lock); + radix_tree_preload_end(); + + kref_get(hr-hot_range.refs); + return hr; +} david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 05/19] vfs: add hooks to enable hot tracking
On Wed, Nov 7, 2012 at 6:51 AM, David Sterba d...@jikos.cz wrote: On Mon, Oct 29, 2012 at 12:30:47PM +0800, zwu.ker...@gmail.com wrote: --- a/mm/readahead.c +++ b/mm/readahead.c @@ -19,6 +19,7 @@ #include linux/pagemap.h #include linux/syscalls.h #include linux/file.h +#include linux/hot_tracking.h /* * Initialise a struct file's readahead state. Assumes that the caller has @@ -138,6 +139,11 @@ static int read_pages(struct address_space *mapping, struct file *filp, out: blk_finish_plug(plug); + /* Hot data tracking */ + hot_update_freqs(mapping-host, (u64)(list_entry(pages-prev,\ + struct page, lru)-index) PAGE_CACHE_SHIFT, + (u64)nr_pages * PAGE_CACHE_SIZE, 0); There's a stale \ at the end of the line, and I find this formatting hard to read. Does the following look acceptable? yes, great, thanks. hot_update_freqs(mapping-host, (u64)(list_entry(pages-prev, struct page, lru)-index) PAGE_CACHE_SHIFT, (u64)nr_pages * PAGE_CACHE_SIZE, 0); + return ret; } -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 10/19] vfs: introduce hot func register framework
On Wed, Nov 7, 2012 at 7:14 AM, David Sterba d...@jikos.cz wrote: On Mon, Oct 29, 2012 at 12:30:52PM +0800, zwu.ker...@gmail.com wrote: +static struct hot_func_type *hot_func_get(const char *name) +{ + struct hot_func_type *f, *h = hot_func_def; + + spin_lock(hot_func_list_lock); + list_for_each_entry(f, hot_func_list, list) { + if (!strcmp(f-hot_func_name, name)) + h = f; You probably want to break here Good catch, done, thanks. + } + spin_unlock(hot_func_list_lock); + + return h; +} + +int hot_func_register(struct hot_func_type *h) +{ + struct hot_func_type *f, *t = NULL; + + /* register, don't allow duplicate names */ + spin_lock(hot_func_list_lock); + list_for_each_entry(f, hot_func_list, list) { + if (!strcmp(f-hot_func_name, h-hot_func_name)) + t = f; if duplicate names are not allowed, then a warning may make sense to let us know that something is wrong done, thanks. + } + + if (t) { + spin_unlock(hot_func_list_lock); + return -EBUSY; + } + + list_add_tail(h-list, hot_func_list); + spin_unlock(hot_func_list_lock); + + return 0; +} +EXPORT_SYMBOL_GPL(hot_func_register); --- a/include/linux/hot_tracking.h +++ b/include/linux/hot_tracking.h @@ -73,6 +75,25 @@ struct hot_range_item { u32 len; /* length in bytes */ }; +typedef u64 (hot_rw_freq_calc_fn) (struct timespec old_atime, + struct timespec cur_time, u64 old_avg); +typedef u32 (hot_temp_calc_fn) (struct hot_freq_data *freq_data); +typedef bool (hot_is_obsolete_fn) (struct hot_freq_data *freq_data); I'm thinking, whether these typedefs are useful, similar ops structures do not introduce them, also when you pick a struct member names exactly same as the typedefs: +struct hot_func_ops { + hot_rw_freq_calc_fn *hot_rw_freq_calc_fn; + hot_temp_calc_fn *hot_temp_calc_fn; + hot_is_obsolete_fn *hot_is_obsolete_fn; +}; My suggestion is to make the types explicit in the structure. sorry, i don't get your point, can you elaborate it about how to do this? +/* identifies an hot func type */ +struct hot_func_type { + char hot_func_name[HOT_NAME_MAX]; 'name' would be sufficient IMHO done, thanks. + /* fields provided by specific FS */ + struct hot_func_ops ops; + struct list_head list; +}; david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 12/19] vfs: add one ioctl interface
On Wed, Nov 7, 2012 at 7:30 AM, David Sterba d...@jikos.cz wrote: On Mon, Oct 29, 2012 at 12:30:54PM +0800, zwu.ker...@gmail.com wrote: +static int ioctl_heat_info(struct file *file, void __user *argp) +{ + struct inode *inode = file-f_dentry-d_inode; + struct hot_heat_info *heat_info; + struct hot_inode_item *he; + int ret = 0; + + heat_info = kmalloc(sizeof(struct hot_heat_info), + GFP_KERNEL | GFP_NOFS); heat_info is small enough to fit onto the stack, so you can avoid the kmalloc, I don't think there are deep callstacks to be expected. ok, done. Nevertheless, if you want to use kmalloc here, then please check the return value and use GFP_KERNEL. thanks for your pointing out. + + if (copy_from_user((void *) heat_info, + argp, + sizeof(struct hot_heat_info)) != 0) { + ret = -EFAULT; + goto err; + } + + he = hot_inode_item_find(inode-i_sb-s_hot_root, inode-i_ino); + if (!he) { + /* we don't have any info on this file yet */ + ret = -ENODATA; + goto err; + } + + spin_lock(he-hot_inode.lock); + heat_info-avg_delta_reads = + (__u64) he-hot_inode.hot_freq_data.avg_delta_reads; + heat_info-avg_delta_writes = + (__u64) he-hot_inode.hot_freq_data.avg_delta_writes; + heat_info-last_read_time = + (__u64) timespec_to_ns(he-hot_inode.hot_freq_data.last_read_time); + heat_info-last_write_time = + (__u64) timespec_to_ns(he-hot_inode.hot_freq_data.last_write_time); + heat_info-num_reads = + (__u32) he-hot_inode.hot_freq_data.nr_reads; + heat_info-num_writes = + (__u32) he-hot_inode.hot_freq_data.nr_writes; + + if (heat_info-live 0) { + /* + * got a request for live temperature, + * call hot_hash_calc_temperature to recalculate + */ + heat_info-temp = + inode-i_sb-s_hot_root-hot_func_type-ops.hot_temp_calc_fn( + he-hot_inode.hot_freq_data); + } else { + /* not live temperature, get it from the hashlist */ + heat_info-temp = he-hot_inode.hot_freq_data.last_temp; + } + spin_unlock(he-hot_inode.lock); + + hot_inode_item_put(he); + + if (copy_to_user(argp, (void *) heat_info, + sizeof(struct hot_heat_info))) { + ret = -EFAULT; + goto err; + } + +err: + kfree(heat_info); + return ret; +} david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 14/19] vfs: add debugfs support
On Wed, Nov 7, 2012 at 7:45 AM, David Sterba d...@jikos.cz wrote: On Mon, Oct 29, 2012 at 12:30:56PM +0800, zwu.ker...@gmail.com wrote: +static int hot_range_seq_show(struct seq_file *seq, void *v) +{ + struct hot_range_item *hr = v; + struct hot_inode_item *he = hr-hot_inode; + struct hot_freq_data *freq_data = hr-hot_range.hot_freq_data; + + /* Always lock hot_inode_item first */ + spin_lock(he-hot_inode.lock); + spin_lock(hr-hot_range.lock); + seq_printf(seq, inode #%llu, range start \ the # seems unnecessary to me OK, removed. + %llu (range len %u) reads %u, writes %u, + avg read time %llu, avg write time %llu, temp %u\n, compiler will complain if it sees a %llu format and not the expected type of 'unsigned long long' When built, i haven't seen any warning report about this... + he-i_ino, (unsigned long long)he-i_ino, + (u64)hr-start * RANGE_SIZE, + hr-len, + freq_data-nr_reads, + freq_data-nr_writes, + freq_data-avg_delta_reads / NSEC_PER_MSEC, + freq_data-avg_delta_writes / NSEC_PER_MSEC, + freq_data-last_temp (32 - HEAT_MAP_BITS)); + spin_unlock(hr-hot_range.lock); + spin_unlock(he-hot_inode.lock); + + return 0; +} + +static int hot_inode_seq_show(struct seq_file *seq, void *v) +{ + struct hot_inode_item *he = v; + struct hot_freq_data *freq_data = he-hot_inode.hot_freq_data; + + spin_lock(he-hot_inode.lock); + seq_printf(seq, inode #%llu, reads %u, writes %u, \ + avg read time %llu, avg write time %llu, temp %u\n, (same here) ditto. + he-i_ino, + freq_data-nr_reads, + freq_data-nr_writes, + freq_data-avg_delta_reads / NSEC_PER_MSEC, + freq_data-avg_delta_writes / NSEC_PER_MSEC, + freq_data-last_temp (32 - HEAT_MAP_BITS)); + spin_unlock(he-hot_inode.lock); + + return 0; +} +static void *hot_spot_range_seq_next(struct seq_file *seq, void *v, loff_t *pos) +{ + struct hot_info *root = seq-private; + struct hot_range_item *hr_next, *hr = v; + struct hot_comm_item *comm_item; + struct list_head *n_list; + int i = + hr-hot_range.hot_freq_data.last_temp (32 - HEAT_MAP_BITS); now I have noticed that I've seen the ... (32 - HEAT_MAP_BITS) expression so many times that it tend to think it deserves a helper function This helper function has existed, hot_raw_shift(), i will replace this with it. + + n_list = seq_list_next(hr-hot_range.n_list, + root-heat_range_map[i].node_list, pos); + hot_range_item_put(hr); +next: + if (n_list) { + comm_item = container_of(n_list, + struct hot_comm_item, n_list); + hr_next = container_of(comm_item, + struct hot_range_item, hot_range); + kref_get(hr_next-hot_range.refs); + return hr_next; + } else if (--i = 0) { + n_list = seq_list_next(root-heat_range_map[i].node_list, + root-heat_range_map[i].node_list, pos); + goto next; + } + + return NULL; +} + +static void hot_debugfs_exit(struct super_block *sb) +{ + struct dentry *vol_dentry; + + vol_dentry = debugfs_get_dentry(sb-s_id, + sb-s_hot_root-debugfs_root, strlen(sb-s_id)); + /* remove all debugfs entries recursively from the volume root */ + if (vol_dentry) + debugfs_remove_recursive(vol_dentry); + else + BUG_ON(1); BUG() done, thanks. + + if (list_empty(sb-s_hot_root-debugfs_root-d_subdirs)) + debugfs_remove(sb-s_hot_root-debugfs_root); +} + +/* david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
On Mon, Nov 5, 2012 at 7:07 PM, Steven Whitehouse swhit...@redhat.com wrote: Hi, On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add some util helpers to update access frequencies for one file or its range. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c| 179 ++ fs/hot_tracking.h|7 ++ include/linux/hot_tracking.h |2 + 3 files changed, 188 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 68591f0..0a7d9a3 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root) } } +struct hot_inode_item +*hot_inode_item_find(struct hot_info *root, u64 ino) +{ + struct hot_inode_item *he; + int ret; + +again: + spin_lock(root-lock); + he = radix_tree_lookup(root-hot_inode_tree, ino); + if (he) { + kref_get(he-hot_inode.refs); + spin_unlock(root-lock); + return he; + } + spin_unlock(root-lock); + + he = kmem_cache_zalloc(hot_inode_item_cachep, + GFP_KERNEL | GFP_NOFS); This doesn't look quite right... which of these two did you mean? I assume probably just GFP_NOFS Yes, good catch, thanks. + if (!he) + return ERR_PTR(-ENOMEM); + + hot_inode_item_init(he, ino, root-hot_inode_tree); + + ret = radix_tree_preload(GFP_NOFS ~__GFP_HIGHMEM); + if (ret) { + kmem_cache_free(hot_inode_item_cachep, he); + return ERR_PTR(ret); + } + + spin_lock(root-lock); + ret = radix_tree_insert(root-hot_inode_tree, ino, he); + if (ret == -EEXIST) { + kmem_cache_free(hot_inode_item_cachep, he); + spin_unlock(root-lock); + radix_tree_preload_end(); + goto again; + } + spin_unlock(root-lock); + radix_tree_preload_end(); + + kref_get(he-hot_inode.refs); + return he; +} +EXPORT_SYMBOL_GPL(hot_inode_item_find); + +static struct hot_range_item +*hot_range_item_find(struct hot_inode_item *he, + u32 start) +{ + struct hot_range_item *hr; + int ret; + +again: + spin_lock(he-lock); + hr = radix_tree_lookup(he-hot_range_tree, start); + if (hr) { + kref_get(hr-hot_range.refs); + spin_unlock(he-lock); + return hr; + } + spin_unlock(he-lock); + + hr = kmem_cache_zalloc(hot_range_item_cachep, + GFP_KERNEL | GFP_NOFS); Likewise, here too. ditto Steve. -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse swhit...@redhat.com wrote: Hi, On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add a per-superblock workqueue and a delayed_work to run periodic work to update map info on each superblock. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c| 85 ++ fs/hot_tracking.h|3 + include/linux/hot_tracking.h |3 + 3 files changed, 91 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index fff0038..0ef9cad 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -15,9 +15,12 @@ #include linux/module.h #include linux/spinlock.h #include linux/hardirq.h +#include linux/kthread.h +#include linux/freezer.h #include linux/fs.h #include linux/blkdev.h #include linux/types.h +#include linux/list_sort.h #include linux/limits.h #include hot_tracking.h @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root) } } +/* Temperature compare function*/ +static int hot_temp_cmp(void *priv, struct list_head *a, + struct list_head *b) +{ + struct hot_comm_item *ap = + container_of(a, struct hot_comm_item, n_list); + struct hot_comm_item *bp = + container_of(b, struct hot_comm_item, n_list); + + int diff = ap-hot_freq_data.last_temp + - bp-hot_freq_data.last_temp; + if (diff 0) + return -1; + if (diff 0) + return 1; + return 0; +} + +/* + * Every sync period we update temperatures for + * each hot inode item and hot range item for aging + * purposes. + */ +static void hot_update_worker(struct work_struct *work) +{ + struct hot_info *root = container_of(to_delayed_work(work), + struct hot_info, update_work); + struct hot_inode_item *hi_nodes[8]; + u64 ino = 0; + int i, n; + + while (1) { + n = radix_tree_gang_lookup(root-hot_inode_tree, +(void **)hi_nodes, ino, +ARRAY_SIZE(hi_nodes)); + if (!n) + break; + + ino = hi_nodes[n - 1]-i_ino + 1; + for (i = 0; i n; i++) { + kref_get(hi_nodes[i]-hot_inode.refs); + hot_map_array_update( + hi_nodes[i]-hot_inode.hot_freq_data, root); + hot_range_update(hi_nodes[i], root); + hot_inode_item_put(hi_nodes[i]); + } + } + + /* Sort temperature map info */ + for (i = 0; i HEAT_MAP_SIZE; i++) { + list_sort(NULL, root-heat_inode_map[i].node_list, + hot_temp_cmp); + list_sort(NULL, root-heat_range_map[i].node_list, + hot_temp_cmp); + } + If this list can potentially have one (or more) entries per inode, then Only one hot_inode_item per inode, while maybe multiple hot_range_items per inode. filesystems with a lot of inodes (millions) may potentially exceed the max size of list which list_sort() can handle. If that happens it still works, but you'll get a warning message and it won't be as efficient. I haven't do so large scale test. If we want to find that issue, we need to do large scale performance test, before that, i want to make sure the code change is correct at first. To be honest, for that issue you pointed to, i also have such concern.But list_sort() performance looks good from the test result of the following URL: https://lkml.org/lkml/2010/1/20/485 It is something that we've run into with list_sort() and GFS2, but it only happens very rarely, Beside list_sort(), do you have any other way to share? For this concern, how does GFS2 resolve it? Steve. -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
On Mon, Nov 5, 2012 at 8:07 PM, Steven Whitehouse swhit...@redhat.com wrote: Hi, On Mon, 2012-11-05 at 19:55 +0800, Zhi Yong Wu wrote: On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse swhit...@redhat.com wrote: Hi, On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add a per-superblock workqueue and a delayed_work to run periodic work to update map info on each superblock. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c| 85 ++ fs/hot_tracking.h|3 + include/linux/hot_tracking.h |3 + 3 files changed, 91 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index fff0038..0ef9cad 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -15,9 +15,12 @@ #include linux/module.h #include linux/spinlock.h #include linux/hardirq.h +#include linux/kthread.h +#include linux/freezer.h #include linux/fs.h #include linux/blkdev.h #include linux/types.h +#include linux/list_sort.h #include linux/limits.h #include hot_tracking.h @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root) } } +/* Temperature compare function*/ +static int hot_temp_cmp(void *priv, struct list_head *a, + struct list_head *b) +{ + struct hot_comm_item *ap = + container_of(a, struct hot_comm_item, n_list); + struct hot_comm_item *bp = + container_of(b, struct hot_comm_item, n_list); + + int diff = ap-hot_freq_data.last_temp + - bp-hot_freq_data.last_temp; + if (diff 0) + return -1; + if (diff 0) + return 1; + return 0; +} + +/* + * Every sync period we update temperatures for + * each hot inode item and hot range item for aging + * purposes. + */ +static void hot_update_worker(struct work_struct *work) +{ + struct hot_info *root = container_of(to_delayed_work(work), + struct hot_info, update_work); + struct hot_inode_item *hi_nodes[8]; + u64 ino = 0; + int i, n; + + while (1) { + n = radix_tree_gang_lookup(root-hot_inode_tree, +(void **)hi_nodes, ino, +ARRAY_SIZE(hi_nodes)); + if (!n) + break; + + ino = hi_nodes[n - 1]-i_ino + 1; + for (i = 0; i n; i++) { + kref_get(hi_nodes[i]-hot_inode.refs); + hot_map_array_update( + hi_nodes[i]-hot_inode.hot_freq_data, root); + hot_range_update(hi_nodes[i], root); + hot_inode_item_put(hi_nodes[i]); + } + } + + /* Sort temperature map info */ + for (i = 0; i HEAT_MAP_SIZE; i++) { + list_sort(NULL, root-heat_inode_map[i].node_list, + hot_temp_cmp); + list_sort(NULL, root-heat_range_map[i].node_list, + hot_temp_cmp); + } + If this list can potentially have one (or more) entries per inode, then Only one hot_inode_item per inode, while maybe multiple hot_range_items per inode. filesystems with a lot of inodes (millions) may potentially exceed the max size of list which list_sort() can handle. If that happens it still works, but you'll get a warning message and it won't be as efficient. I haven't do so large scale test. If we want to find that issue, we need to do large scale performance test, before that, i want to make sure the code change is correct at first. To be honest, for that issue you pointed to, i also have such concern.But list_sort() performance looks good from the test result of the following URL: https://lkml.org/lkml/2010/1/20/485 Yes, I think it is good. Also, even when it says that it's performance is poor (via the warning message) it is still much better than the alternative (of not sorting) in the GFS2 case. So currently our workaround is to ignore the warning. Due to what we using it for (sorting the data blocks for ordered writeback) we only see it very occasionally when there has been lots of data write activity with little journal activity on a node with lots of RAM. OK. It is something that we've run into with list_sort() and GFS2, but it only happens very rarely, Beside list_sort(), do you have any other way to share? For this concern, how does GFS2 resolve it? That is an ongoing investigation :-) I've pondered various options... increase temp variable space in list_sort(), not using list_sort() and insertion sorting the blocks instead, flushing the ordered write data early if the list gets too long, figuring
Re: [RFC v4+ hot_track 00/19] vfs: hot data tracking
On Mon, Oct 29, 2012 at 6:30 PM, Andi Kleen a...@firstfloor.org wrote: zwu.ker...@gmail.com writes: TODO List: 1.) Need to do scalability or performance tests. You're changing some of the most performance critical code in the kernel. This step is absolutely not optional. ah, i know, but now i need to make sure all the codes are correct at first, then do these tests. -Andi -- a...@linux.intel.com -- Speaking for myself only -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 13/19] debugfs: introduce one function
On Tue, Oct 30, 2012 at 2:11 AM, Greg KH gre...@linuxfoundation.org wrote: On Mon, Oct 29, 2012 at 12:30:55PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com The debugfs function is used to get expected dentry. Huh? Why do you need this? Why haven't you added documentation for the It is used to determine if one sysfs directory has been created. OK, i will add some doc, thanks for your suggestion. function saying what it does? confused, greg k-h -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 13/19] debugfs: introduce one function
On Tue, Oct 30, 2012 at 6:34 AM, Greg KH gre...@linuxfoundation.org wrote: On Tue, Oct 30, 2012 at 06:25:50AM +0800, Zhi Yong Wu wrote: On Tue, Oct 30, 2012 at 2:11 AM, Greg KH gre...@linuxfoundation.org wrote: On Mon, Oct 29, 2012 at 12:30:55PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com The debugfs function is used to get expected dentry. Huh? Why do you need this? Why haven't you added documentation for the It is used to determine if one sysfs directory has been created. OK, i will add some doc, thanks for your suggestion. You didn't answer the why part here. How come you think you need ah, Let me say its scenario at first. If we do two mount ops as below: 1.) mount -o loop,hot_track image1 /data1 2.) mount -o loop,hot_track image2 /data2 The mount -o hot_track operation will automatically create one sysfs directory /sys/kernel/debug/hot_track. To prevent this dir being created again when 2.) is done, we need to know if it has existed at first. In my patch, i at first get its dentry by this new function, then determine if its d_inode field is NULL, if no, it means that this sysfs dir has existed. This is the reason that i want to add one new function. this? Can't you just save off the dentry you created somewhere so you don't need to look it up again? Because i can't find one appropriate place to save it. greg k-h -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4+ hot_track 13/19] debugfs: introduce one function
On Tue, Oct 30, 2012 at 6:54 AM, Greg KH gre...@linuxfoundation.org wrote: On Tue, Oct 30, 2012 at 06:45:19AM +0800, Zhi Yong Wu wrote: On Tue, Oct 30, 2012 at 6:34 AM, Greg KH gre...@linuxfoundation.org wrote: On Tue, Oct 30, 2012 at 06:25:50AM +0800, Zhi Yong Wu wrote: On Tue, Oct 30, 2012 at 2:11 AM, Greg KH gre...@linuxfoundation.org wrote: On Mon, Oct 29, 2012 at 12:30:55PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com The debugfs function is used to get expected dentry. Huh? Why do you need this? Why haven't you added documentation for the It is used to determine if one sysfs directory has been created. OK, i will add some doc, thanks for your suggestion. You didn't answer the why part here. How come you think you need ah, Let me say its scenario at first. If we do two mount ops as below: 1.) mount -o loop,hot_track image1 /data1 2.) mount -o loop,hot_track image2 /data2 The mount -o hot_track operation will automatically create one sysfs directory /sys/kernel/debug/hot_track. To prevent this dir being created again when 2.) is done, we need to know if it has existed at first. In my patch, i at first get its dentry by this new function, then determine if its d_inode field is NULL, if no, it means that this sysfs dir has existed. This is the reason that i want to add one new function. Why not do like the rest of the kernel does and just have a: static dentry *hot_track_root; and use that as your root debugfs directory dentry: ah, i'm one newbie, don't get familar with other kernel part, but this is one good point, i will apply it, thanks. if (!hot_track_root) { /* Create root directory */ hot_track_root = debugfs_create(...); } No need to look anything up :) thanks, greg k-h -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4 03/15] vfs,hot_track: add the function for collecting I/O frequency
On Sun, Oct 28, 2012 at 3:55 PM, Zheng Liu gnehzuil@gmail.com wrote: Hi Zhiyong, On Thu, Oct 25, 2012 at 11:08:55PM +0800, zwu.ker...@gmail.com wrote: [snip] @@ -199,6 +342,54 @@ err: } /* + * Main function to update access frequency from read/writepage(s) hooks + */ +inline void hot_update_freqs(struct inode *inode, u64 start, + u64 len, int rw) This function seems too big. So we really need to inline this function? As Dave said in his comments, it will add a function call overhead even when tracking is not enabled. a static inline function will just result in no extra overhead other than the if statement Regards, Zheng +{ + struct hot_info *root = inode-i_sb-s_hot_root; + struct hot_inode_item *he; + struct hot_range_item *hr; + u32 cur, end; + + if (!root || (len == 0)) + return; + + he = hot_inode_item_find(root, inode-i_ino); + if (IS_ERR(he)) { + WARN_ON(1); + return; + } + + spin_lock(he-hot_inode.lock); + hot_freq_data_update(he-hot_inode.hot_freq_data, rw); + spin_unlock(he-hot_inode.lock); + + /* + * Align ranges on RANGE_SIZE boundary + * to prevent proliferation of range structs + */ + end = (start + len + RANGE_SIZE - 1) RANGE_BITS; + for (cur = (start RANGE_BITS); cur end; cur++) { + hr = hot_range_item_find(he, cur); + if (IS_ERR(hr)) { + WARN_ON(1); + hot_inode_item_put(he); + return; + } + + spin_lock(hr-hot_range.lock); + hot_freq_data_update(hr-hot_range.hot_freq_data, rw); + spin_unlock(hr-hot_range.lock); + + hot_range_item_put(hr); + } + + hot_inode_item_put(he); +} +EXPORT_SYMBOL_GPL(hot_update_freqs); -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v4 03/15] vfs,hot_track: add the function for collecting I/O frequency
On Mon, Oct 29, 2012 at 10:01 AM, Dave Chinner da...@fromorbit.com wrote: On Sun, Oct 28, 2012 at 09:51:48PM +0800, Zhi Yong Wu wrote: On Sun, Oct 28, 2012 at 3:55 PM, Zheng Liu gnehzuil@gmail.com wrote: Hi Zhiyong, On Thu, Oct 25, 2012 at 11:08:55PM +0800, zwu.ker...@gmail.com wrote: [snip] @@ -199,6 +342,54 @@ err: } /* + * Main function to update access frequency from read/writepage(s) hooks + */ +inline void hot_update_freqs(struct inode *inode, u64 start, + u64 len, int rw) This function seems too big. So we really need to inline this function? As Dave said in his comments, it will add a function call overhead even when tracking is not enabled. a static inline function will just result in no extra overhead other than the if statement I don't think I said that with respect to this code. I think I said it w.r.t. a define or a small wrapper that decides to call hot_update_freqs(). A static inline fucntion should only be a couple of lines of code at most. A static function, OTOH, can be inlined by the compiler if the compiler thinks that is a win. But thanks for your explaination at first. . +EXPORT_SYMBOL_GPL(hot_update_freqs); ... it's an exported function, so it can't be inline or static, so using inline is wrong whatever way you look at it. ;) ah, but i' m surprised by why the compiler find this error. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-ext4 in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Tue, Oct 16, 2012 at 11:17 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com FS_IOC_GET_HEAT_INFO: return a struct containing the various metrics collected in btrfs_freq_data structs, and also return a I think you mean hot_freq_data :P calculated data temperature based on those metrics. Optionally, retrieve the temperature from the hot data hash list instead of recalculating it. To get the heat info for a specific file you have to know what file you want to get that info for, right? I can see the usefulness of asking for the heat data on a specific file, but how do you find the hot files in the first place? i.e. the big question the user interface needs to answer is what files are hot?. Once userspace knows what the hottest files are, it can open them and query the data via the above ioctl, but expecting userspace to iterate millions of inodes in a filesystem to find hot files is very inefficient. FWIW, if you were to return file handles to the hottest files, then Good idea. I am not very clear about how to implement it. file handles mean file_handle?? How to return them to the application? via debugfs? How many hottest files should be returned?? Top 100? the application could open and query them without even needing to know the path name to them. This woul dbe exceedingly useful for defragmentation programs, especially as that is the way xfs_fsr already operates on candidate files.(*) IOWs, sometimes the pathname is irrelevant to the operations that applications want to perform - all they care about having an efficient method of finding the inode they want and getting a file descriptor that points to the file. Given the heat map info fits right in to the sort of operations defrag and data mover tools already do, it kind of makes sense to optimise the interface towards those uses (*) i.e. finds them via bulkstat which returns handle information along with all the other inode data, then opens the file by handle to do the defrag work FS_IOC_GET_HEAT_OPTS: return an integer representing the current state of hot data tracking and migration: 0 = do nothing 1 = track frequency of access FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and migration, as described above. I can't see how this is a manageable interface. It is not persistent, so after every filesystem mount you'd have to set the flag on all your inodes again. Hence, for the moment, I'd suggest that dropping per-inode tracking control until all the core issues are sorted out Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 00/13] vfs: hot data tracking
On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com NOTE: The patchset is currently post out mainly to make sure it is going in the correct direction and hope to get some helpful comments from other guys. For more infomation, please check hot_tracking.txt in Documentation TODO List: 1) Fix OOM issues - the hot inode tracking caches grow very large and don't get trimmed under memory pressure. From slabtop, after creating roughly 24 million single byte files(*) on a machine with 8GB RAM: OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 23859510 23859476 99%0.12K 795317 30 3181268K hot_range_item 23859441 23859439 99%0.16K 1037367 23 4149468K hot_inode_item 572530 572530 100%0.55K 817907327160K radix_tree_node 241706 241406 99%0.22K 14218 17 56872K xfs_ili 241206 241204 99%1.06K 804023321608K xfs_inode The inode tracking is trying to track all 24 million inodes even though they have been written only once, and there are only 240,000 inodes in the cache at this point in time. That was the last update that slabtop got, so it is indicative of the impending OOM situation that occurred. Changelog from v2: 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner] 2.) Added memory shrinker [Dave Chinner] I haven't looked at the shrinker, but clearly it is not working, HI, Dave, Some guys suggest that when inode slab cache is shrinked, the hot_inode[range]_item slab is accordingly also shrinked, this will make hot tracking don't need to register its own shrinker. Do you think of it? If you don't like above idea. Do you have any good suggestion on how to remove hot_inode_item and hot_range_item? otherwise the above OOM situation would not be occurring. Cheers, Dave. (*) Tested on an empty 17TB XFS filesystem with: $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc meta-data=/dev/vdc isize=256agcount=17, agsize=268435455 blks = sectsz=512 attr=2, projid32bit=0 data = bsize=4096 blocks=4563402735, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal log bsize=4096 blocks=131072, version=2 = sectsz=512 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch $ sudo chmod 777 /mnt/scratch $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ /mnt/scratch/6 -d /mnt/scratch/7 . 0 21601 16679.3 12552262 0 22401 15412.4 12588587 0 23201 16367.6 14199322 0 24001 15680.4 15741205 hangs here w/ OOM -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 09/13] vfs: add one wq to update map info periodically
On Tue, Oct 16, 2012 at 8:27 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:31PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add a per-superblock workqueue and a work_struct to run periodic work to update map info on each superblock. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c| 94 ++ fs/hot_tracking.h|3 + include/linux/hot_tracking.h |2 + 3 files changed, 99 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index a8dc599..f333c47 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -15,6 +15,8 @@ #include linux/module.h #include linux/spinlock.h #include linux/hardirq.h +#include linux/kthread.h +#include linux/freezer.h #include linux/fs.h #include linux/blkdev.h #include linux/types.h @@ -623,6 +625,88 @@ static void hot_map_array_exit(struct hot_info *root) } /* + * Update temperatures for each hot inode item and + * hot range item for aging purposes + */ +static void hot_temperature_update_work(struct work_struct *work) +{ + struct hot_update_work *hot_work = + container_of(work, struct hot_update_work, work); + struct hot_info *root = hot_work-hot_info; + struct hot_inode_item *hi_nodes[8]; + unsigned long delay = HZ * HEAT_UPDATE_DELAY; + u64 ino = 0; + int i, n; + + do { + while (1) { + spin_lock(root-lock); + n = radix_tree_gang_lookup(root-hot_inode_tree, +(void **)hi_nodes, ino, +ARRAY_SIZE(hi_nodes)); + if (!n) { + spin_unlock(root-lock); + break; + } + + ino = hi_nodes[n - 1]-i_ino + 1; + for (i = 0; i n; i++) { + kref_get(hi_nodes[i]-hot_inode.refs); + hot_map_array_update( + hi_nodes[i]-hot_inode.hot_freq_data, root); + hot_range_update(hi_nodes[i], root); + hot_inode_item_put(hi_nodes[i]); + } + spin_unlock(root-lock); This is a lot of work to do under a spin lock. Perhaps you should get a reference on all the nodes, then drop the root-lock and then update all the nodes in a separate loop. OK, done + } + + if (unlikely(freezing(current))) { + __refrigerator(true); + } else { + set_current_state(TASK_INTERRUPTIBLE); + if (!kthread_should_stop()) { + schedule_timeout(delay); + } + __set_current_state(TASK_RUNNING); + } + } while (!kthread_should_stop()); I don't think you understand workqueues fully. A work queue worker function is not something that executes endlessly. It is a one-shot function that does the work once, not an endless loop that has to delay it's execution for periodic work. ah, i have done this based on your following suggestions, thanks. If you need periodic work, then you should use a struct delayed_work and queue the next work iteration to be run a later time. See, for example, xfs_syncd_worker() and xfs_syncd_queue_sync() and how that reschedules itself for periodic work. It also means you don't have to handle kthread freezing, as the WQ infrastructure takes care of that for you. ditto. This is why unmount is hanging for me - this work never completes, so flush_workqueue() will never return. got it, thanks. +} + +static int hot_wq_init(struct hot_info *root) +{ + struct hot_update_work *hot_work; + int ret = 0; + + root-update_wq = alloc_workqueue( + hot_temperature_update, WQ_MEM_RECLAIM | WQ_UNBOUND, 1); + if (!root-update_wq) { + printk(KERN_ERR %s: failed to create + temperature update workqueue\n, + __func__); + return 1; + } + + hot_work = kmalloc(sizeof(*hot_work), GFP_NOFS); + if (hot_work) { + hot_work-hot_info = root; + INIT_WORK(hot_work-work, hot_temperature_update_work); + queue_work(root-update_wq, hot_work-work); + } else { + printk(KERN_ERR %s: failed to create update work\n, + __func__); + ret = 1; + } I don't understand why you need a separate hot_work structure. just embed a struct delayed_work in the struct hot_info and use container_of() to get the struct hot_info from the work structure. As such, there's no need for a separate function just
Re: [RFC v3 00/13] vfs: hot data tracking
On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com NOTE: The patchset is currently post out mainly to make sure it is going in the correct direction and hope to get some helpful comments from other guys. For more infomation, please check hot_tracking.txt in Documentation TODO List: 1) Fix OOM issues - the hot inode tracking caches grow very large and don't get trimmed under memory pressure. From slabtop, after creating roughly 24 million single byte files(*) on a machine with 8GB RAM: OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME 23859510 23859476 99%0.12K 795317 30 3181268K hot_range_item 23859441 23859439 99%0.16K 1037367 23 4149468K hot_inode_item 572530 572530 100%0.55K 817907327160K radix_tree_node 241706 241406 99%0.22K 14218 17 56872K xfs_ili 241206 241204 99%1.06K 804023321608K xfs_inode The inode tracking is trying to track all 24 million inodes even though they have been written only once, and there are only 240,000 inodes in the cache at this point in time. That was the last update that slabtop got, so it is indicative of the impending OOM situation that occurred. Changelog from v2: 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner] 2.) Added memory shrinker [Dave Chinner] I haven't looked at the shrinker, but clearly it is not working, otherwise the above OOM situation would not be occurring. Cheers, Dave. (*) Tested on an empty 17TB XFS filesystem with: $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc meta-data=/dev/vdc isize=256agcount=17, agsize=268435455 blks = sectsz=512 attr=2, projid32bit=0 data = bsize=4096 blocks=4563402735, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal log bsize=4096 blocks=131072, version=2 = sectsz=512 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch $ sudo chmod 777 /mnt/scratch $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ /mnt/scratch/6 -d /mnt/scratch/7 . 0 21601 16679.3 12552262 0 22401 15412.4 12588587 0 23201 16367.6 14199322 0 24001 15680.4 15741205 hangs here w/ OOM In this test, i haven't see you enable hot_track function via mount, why did it meet OOM? -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 09/13] vfs: add one wq to update map info periodically
On Thu, Oct 18, 2012 at 10:25 AM, Zheng Liu gnehzuil@gmail.com wrote: On Wed, Oct 17, 2012 at 02:34:15PM +0800, Zhi Yong Wu wrote: diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h index d19e64a..7a79a6d 100644 --- a/fs/hot_tracking.h +++ b/fs/hot_tracking.h @@ -36,6 +36,9 @@ */ #define TIME_TO_KICK 400 +/* set how often to update temperatures (seconds) */ +#define HEAT_UPDATE_DELAY 400 FWIW, 400 seconds is an unusual time period. It's expected that periodic work might take place at intervals of 5 minutes, 10 minutes, etc, not 6m40s. It's much easier to predict and understand behaviour if it's at a interval of whole units like minutes, especially when looking at timestamped event traces. Hence 300s (5 minutes) makes a lot more sense as a period for updates... got it. thanks. Hi Zhi Yong, IMHO we'd better to make this value parameterized, and then the user can adjust this value dynamically. Yeah, this has been in my TODO list. But i want to make the core function can work at first. Regards, Zheng -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 00/13] vfs: hot data tracking
On Thu, Oct 18, 2012 at 12:29 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote: On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com (*) Tested on an empty 17TB XFS filesystem with: $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc meta-data=/dev/vdc isize=256agcount=17, agsize=268435455 blks = sectsz=512 attr=2, projid32bit=0 data = bsize=4096 blocks=4563402735, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal log bsize=4096 blocks=131072, version=2 = sectsz=512 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch $ sudo chmod 777 /mnt/scratch $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ /mnt/scratch/6 -d /mnt/scratch/7 . 0 21601 16679.3 12552262 0 22401 15412.4 12588587 0 23201 16367.6 14199322 0 24001 15680.4 15741205 hangs here w/ OOM In this test, i haven't see you enable hot_track function via mount, why did it meet OOM? I copied the wrong mount command. It was definitely enabled. OK, BTW: fs_mark is the script written by you? After xfsprogs is installed, i haven't found this command. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 00/13] vfs: hot data tracking
On Thu, Oct 18, 2012 at 1:17 PM, Dave Chinner da...@fromorbit.com wrote: On Thu, Oct 18, 2012 at 12:44:47PM +0800, Zhi Yong Wu wrote: On Thu, Oct 18, 2012 at 12:29 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote: On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com (*) Tested on an empty 17TB XFS filesystem with: $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc meta-data=/dev/vdc isize=256agcount=17, agsize=268435455 blks = sectsz=512 attr=2, projid32bit=0 data = bsize=4096 blocks=4563402735, imaxpct=5 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal log bsize=4096 blocks=131072, version=2 = sectsz=512 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch $ sudo chmod 777 /mnt/scratch $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ /mnt/scratch/6 -d /mnt/scratch/7 . 0 21601 16679.3 12552262 0 22401 15412.4 12588587 0 23201 16367.6 14199322 0 24001 15680.4 15741205 hangs here w/ OOM In this test, i haven't see you enable hot_track function via mount, why did it meet OOM? I copied the wrong mount command. It was definitely enabled. OK, BTW: fs_mark is the script written by you? After xfsprogs is installed, i haven't found this command. # apt-get install fsmark thanks. let me try. Or get the source here: http://sourceforge.net/projects/fsmark/ Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 13/13] vfs: add documentation
On Mon, Oct 15, 2012 at 8:35 AM, Zheng Liu gnehzuil@gmail.com wrote: Hi Zhi Yong, [cut...] +3. The Design + +These include the following parts: + +* Hooks in existing vfs functions to track data access frequency + +* New rbtrees for tracking access frequency of inodes and sub-file ^^^ s/rbtrees/radix-trees +ranges (hot_rb.c) Now it seems that all codes are in the same file. HI, Zheng, Good catch, i will update them, thanks. Regards, Zheng -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 00/13] vfs: hot data tracking
On Mon, Oct 15, 2012 at 8:39 AM, Zheng Liu gnehzuil@gmail.com wrote: On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com NOTE: The patchset is currently post out mainly to make sure it is going in the correct direction and hope to get some helpful comments from other guys. For more infomation, please check hot_tracking.txt in Documentation Hi Zhi Yong, If I want to use this patch set in ext4, could I apply this patch set directly or I need to call some functions like in btrfs. Thanks. Hi ,Zheng, It is the latter. If you have any questions, please feel free to catch me. Regards, Zheng -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Mon, Oct 15, 2012 at 3:48 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com FS_IOC_GET_HEAT_INFO: return a struct containing the various metrics collected in btrfs_freq_data structs, and also return a calculated data temperature based on those metrics. Optionally, retrieve the temperature from the hot data hash list instead of recalculating it. FS_IOC_GET_HEAT_OPTS: return an integer representing the current state of hot data tracking and migration: 0 = do nothing 1 = track frequency of access FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and migration, as described above. . +struct hot_heat_info { + __u64 avg_delta_reads; + __u64 avg_delta_writes; + __u64 last_read_time; + __u64 last_write_time; + __u32 num_reads; + __u32 num_writes; + __u32 temperature; + __u8 live; + char filename[PATH_MAX]; Don't put the filename in the ioctl and open the file in the kernel. Have userspace open the file directly and issue the ioctl on the fd that is returned. OK, thanks. By the way, do you think that it is necessary to provide another new ioctl interface to set the temperature value? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 12/13] vfs: add debugfs support
On Mon, Oct 15, 2012 at 3:55 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add a /sys/kernel/debug/hot_track/device_name/ directory for each volume that contains two files. The first, `inode_data', contains the heat information for inodes that have been brought into the hot data map structures. The second, `range_data', contains similar information for subfile ranges. + /* create debugfs range_data file */ + debugfs_range_entry = debugfs_create_file(range_data, + S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO, + debugfs_volume_entry, + (void *) range_data, + hot_debugfs_range_fops); These should not be world readable. 0600 is probably the correct permissions for them as we do not want random users to be able to infer what files users are accessing from this information. Good catch, its mode should be S_IFREG | S_IRUSR | S_IWUSR Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 12/13] vfs: add debugfs support
On Mon, Oct 15, 2012 at 4:04 PM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add a /sys/kernel/debug/hot_track/device_name/ directory for each volume that contains two files. The first, `inode_data', contains the heat information for inodes that have been brought into the hot data map structures. The second, `range_data', contains similar information for subfile ranges. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c | 462 + fs/hot_tracking.h | 43 + 2 files changed, 505 insertions(+), 0 deletions(-) . +static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int len) +{ + struct lstring *debugfs_log = data-debugfs_log; + uint new_log_alloc_size; + char *new_log; + static char err_msg[] = No more memory!\n; + + if (len = data-log_alloc_size - debugfs_log-len) { .. + } + + memcpy(debugfs_log-str + debugfs_log-len, data-log_work_buff, len); + debugfs_log-len += (unsigned long) len; + + return len; +} + +/* Returns the number of bytes written to the log. */ +static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, ...) +{ + struct lstring *debugfs_log = data-debugfs_log; + va_list args; + int len; + static char trunc_msg[] = + The next message has been truncated.\n; + + if (debugfs_log-str == NULL) + return -1; + + spin_lock(data-log_lock); + + va_start(args, fmt); + len = vsnprintf(data-log_work_buff, + sizeof(data-log_work_buff), fmt, args); + va_end(args); + + if (len = sizeof(data-log_work_buff)) { + hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg)); + } + + len = hot_debugfs_copy(data, data-log_work_buff, len); + spin_unlock(data-log_lock); + + return len; +} Aren't you just recreating seq_printf() here? i.e. can't you replace all this complexity with generic seq_file/seq_operations constructs? It seems to be a good suggestion, let me try it. thanks. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Tue, Oct 16, 2012 at 11:17 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com FS_IOC_GET_HEAT_INFO: return a struct containing the various metrics collected in btrfs_freq_data structs, and also return a I think you mean hot_freq_data :P Yeah, sorry. calculated data temperature based on those metrics. Optionally, retrieve the temperature from the hot data hash list instead of recalculating it. To get the heat info for a specific file you have to know what file you want to get that info for, right? I can see the usefulness of Yes. asking for the heat data on a specific file, but how do you find the hot files in the first place? i.e. the big question the user interface needs to answer is what files are hot?. We only tell the user what the files' temperatures are, not what files are hot. Their temperatures are in the output of debugfs. Once userspace knows what the hottest files are, it can open them If the user need to know this type of info, it is easy for us to provide it. But i don't know what way the user hope to get it via. and query the data via the above ioctl, but expecting userspace to iterate millions of inodes in a filesystem to find hot files is very inefficient. FWIW, if you were to return file handles to the hottest files, then the application could open and query them without even needing to know the path name to them. This woul dbe exceedingly useful for defragmentation programs, especially as that is the way xfs_fsr already operates on candidate files.(*) ah. IOWs, sometimes the pathname is irrelevant to the operations that applications want to perform - all they care about having an efficient method of finding the inode they want and getting a file descriptor that points to the file. Given the heat map info fits right in to the sort of operations defrag and data mover tools already do, it kind of makes sense to optimise the interface towards those uses (*) i.e. finds them via bulkstat which returns handle information along with all the other inode data, then opens the file by handle to do the defrag work OK. FS_IOC_GET_HEAT_OPTS: return an integer representing the current state of hot data tracking and migration: 0 = do nothing 1 = track frequency of access FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and migration, as described above. I can't see how this is a manageable interface. It is not persistent, so after every filesystem mount you'd have to set the flag on all your inodes again. Hence, for the moment, I'd suggest that dropping per-inode tracking control until all the core issues are sorted out OK. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 02/13] vfs: introduce private radix tree structures
On Wed, Oct 10, 2012 at 11:34 PM, David Sterba d...@jikos.cz wrote: On Wed, Oct 10, 2012 at 06:07:24PM +0800, zwu.ker...@gmail.com wrote: +void hot_track_init(struct super_block *sb) +{ ... +} +void hot_track_exit(struct super_block *sb) +{ + hot_cache_exit(); +} Needs to be exported if btrfs is built as a module, otherwise does not link LDS arch/x86/boot/compressed/vmlinux.lds AS arch/x86/boot/compressed/head_64.o CC arch/x86/boot/compressed/misc.o CC arch/x86/boot/compressed/string.o CC arch/x86/boot/compressed/cmdline.o CC arch/x86/boot/compressed/early_serial_console.o OBJCOPY arch/x86/boot/compressed/vmlinux.bin ERROR: hot_track_init [fs/btrfs/btrfs.ko] undefined! ERROR: hot_track_exit [fs/btrfs/btrfs.ko] undefined! make[1]: *** [__modpost] Error 1 make: *** [modules] Error 2 make: *** Waiting for unfinished jobs Sorry for late response at first. Great, thanks. david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Thu, Oct 11, 2012 at 12:28 AM, David Sterba d...@jikos.cz wrote: Hi, On Wed, Oct 10, 2012 at 06:07:23PM +0800, zwu.ker...@gmail.com wrote: --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_MOUNT_CHECK_INTEGRITY (1 20) #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 21) #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 22) +#define BTRFS_MOUNT_HOT_TRACK(1 23) Please don't forget to add new options to btrfs_show_options(), otherwise we can't tell what filesystems have hot tracking enabled. Great catch, thanks. --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -303,7 +304,7 @@ enum { Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, Please add the new option to the end. OK. Opt_check_integrity, Opt_check_integrity_including_extent_data, Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_err, @@ -342,6 +343,7 @@ static match_table_t tokens = { {Opt_no_space_cache, nospace_cache}, {Opt_recovery, recovery}, {Opt_skip_balance, skip_balance}, + {Opt_hot_track, hot_track}, (also this one) ditto. {Opt_check_integrity, check_int}, {Opt_check_integrity_including_extent_data, check_int_data}, {Opt_check_integrity_print_mask, check_int_print_mask=%d}, @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) case Opt_skip_balance: btrfs_set_opt(info-mount_opt, SKIP_BALANCE); break; + case Opt_hot_track: It's a common patter in the surrounding code that a message is printed when enabling options, but the vfs prints its own, so I'm not sure if it's needed here as well. Just thinking, leave it as it is now. OK + btrfs_set_opt(info-mount_opt, HOT_TRACK); + break; #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY case Opt_check_integrity_including_extent_data: printk(KERN_INFO btrfs: enabling check integrity david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Thu, Oct 11, 2012 at 12:28 AM, David Sterba d...@jikos.cz wrote: Hi, On Wed, Oct 10, 2012 at 06:07:23PM +0800, zwu.ker...@gmail.com wrote: --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_MOUNT_CHECK_INTEGRITY (1 20) #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 21) #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 22) +#define BTRFS_MOUNT_HOT_TRACK(1 23) Please don't forget to add new options to btrfs_show_options(), otherwise we can't tell what filesystems have hot tracking enabled. --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -303,7 +304,7 @@ enum { Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, Please add the new option to the end. To be honest, it can't be added to the end, if you check Opt_err's pattern value, you will find it is NULL, it will cause match_one() return 1. So if we add Opt_hot_track to the end of this array, it will be covered by match_token(), so i prefer to add it to Opt_fatal_errors. Do you think of it? Opt_check_integrity, Opt_check_integrity_including_extent_data, Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_err, @@ -342,6 +343,7 @@ static match_table_t tokens = { {Opt_no_space_cache, nospace_cache}, {Opt_recovery, recovery}, {Opt_skip_balance, skip_balance}, + {Opt_hot_track, hot_track}, (also this one) {Opt_check_integrity, check_int}, {Opt_check_integrity_including_extent_data, check_int_data}, {Opt_check_integrity_print_mask, check_int_print_mask=%d}, @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) case Opt_skip_balance: btrfs_set_opt(info-mount_opt, SKIP_BALANCE); break; + case Opt_hot_track: It's a common patter in the surrounding code that a message is printed when enabling options, but the vfs prints its own, so I'm not sure if it's needed here as well. Just thinking, leave it as it is now. + btrfs_set_opt(info-mount_opt, HOT_TRACK); + break; #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY case Opt_check_integrity_including_extent_data: printk(KERN_INFO btrfs: enabling check integrity david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Thu, Oct 11, 2012 at 10:41 PM, David Sterba d...@jikos.cz wrote: On Thu, Oct 11, 2012 at 10:35:28PM +0800, Zhi Yong Wu wrote: --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -303,7 +304,7 @@ enum { Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, Please add the new option to the end. To be honest, it can't be added to the end, if you check Opt_err's pattern value, you will find it is NULL, it will cause match_one() return 1. So if we add Opt_hot_track to the end of this array, it will be covered by match_token(), so i prefer to add it to Opt_fatal_errors. Do you think of it? Ah, sorry, I was not clear what the 'end' means here. The Opt_err is a end-of-the-list token, so yes please add hot_track between Opt_fatal_errors and Opt_err. Done, thanks. It will be appreciated if you can make comments on other patches of this series.:) david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Wed, Oct 10, 2012 at 7:59 PM, Lukáš Czerner lczer...@redhat.com wrote: On Wed, 10 Oct 2012, zwu.ker...@gmail.com wrote: Date: Wed, 10 Oct 2012 18:07:23 +0800 From: zwu.ker...@gmail.com To: linux-fsde...@vger.kernel.org Cc: linux-e...@vger.kernel.org, linux-btrfs@vger.kernel.org, linux-ker...@vger.kernel.org, linux...@linux.vnet.ibm.com, v...@zeniv.linux.org.uk, da...@fromorbit.com, d...@jikos.cz, ty...@mit.edu, c...@us.ibm.com, Zhi Yong Wu wu...@linux.vnet.ibm.com Subject: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' From: Zhi Yong Wu wu...@linux.vnet.ibm.com Introduce one new mount option '-o hot_track', and add its parsing support. Its usage looks like: mount -o hot_track mount -o nouser,hot_track mount -o nouser,hot_track,loop mount -o hot_track,nouser This patch should probably be at the end of the series. Can you let me know your reason? I think that it is not necessary to be at the end of the series. -Lukas Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/btrfs/ctree.h |1 + fs/btrfs/super.c |7 ++- 2 files changed, 7 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 9821b67..094bec6 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_MOUNT_CHECK_INTEGRITY (1 20) #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 21) #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 22) +#define BTRFS_MOUNT_HOT_TRACK(1 23) #define btrfs_clear_opt(o, opt) ((o) = ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 83d6f9f..00be9e3 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -41,6 +41,7 @@ #include linux/slab.h #include linux/cleancache.h #include linux/ratelimit.h +#include linux/hot_tracking.h #include compat.h #include delayed-inode.h #include ctree.h @@ -303,7 +304,7 @@ enum { Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, Opt_check_integrity, Opt_check_integrity_including_extent_data, Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_err, @@ -342,6 +343,7 @@ static match_table_t tokens = { {Opt_no_space_cache, nospace_cache}, {Opt_recovery, recovery}, {Opt_skip_balance, skip_balance}, + {Opt_hot_track, hot_track}, {Opt_check_integrity, check_int}, {Opt_check_integrity_including_extent_data, check_int_data}, {Opt_check_integrity_print_mask, check_int_print_mask=%d}, @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) case Opt_skip_balance: btrfs_set_opt(info-mount_opt, SKIP_BALANCE); break; + case Opt_hot_track: + btrfs_set_opt(info-mount_opt, HOT_TRACK); + break; #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY case Opt_check_integrity_including_extent_data: printk(KERN_INFO btrfs: enabling check integrity -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Wed, Oct 10, 2012 at 9:11 PM, Lukáš Czerner lczer...@redhat.com wrote: On Wed, 10 Oct 2012, Zhi Yong Wu wrote: Date: Wed, 10 Oct 2012 20:21:48 +0800 From: Zhi Yong Wu zwu.ker...@gmail.com To: Lukáš Czerner lczer...@redhat.com Cc: linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org, linux-btrfs@vger.kernel.org, linux-ker...@vger.kernel.org, linux...@linux.vnet.ibm.com, v...@zeniv.linux.org.uk, da...@fromorbit.com, d...@jikos.cz, ty...@mit.edu, c...@us.ibm.com, Zhi Yong Wu wu...@linux.vnet.ibm.com Subject: Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' On Wed, Oct 10, 2012 at 7:59 PM, Lukáš Czerner lczer...@redhat.com wrote: On Wed, 10 Oct 2012, zwu.ker...@gmail.com wrote: Date: Wed, 10 Oct 2012 18:07:23 +0800 From: zwu.ker...@gmail.com To: linux-fsde...@vger.kernel.org Cc: linux-e...@vger.kernel.org, linux-btrfs@vger.kernel.org, linux-ker...@vger.kernel.org, linux...@linux.vnet.ibm.com, v...@zeniv.linux.org.uk, da...@fromorbit.com, d...@jikos.cz, ty...@mit.edu, c...@us.ibm.com, Zhi Yong Wu wu...@linux.vnet.ibm.com Subject: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' From: Zhi Yong Wu wu...@linux.vnet.ibm.com Introduce one new mount option '-o hot_track', and add its parsing support. Its usage looks like: mount -o hot_track mount -o nouser,hot_track mount -o nouser,hot_track,loop mount -o hot_track,nouser This patch should probably be at the end of the series. Can you let me know your reason? I think that it is not necessary to be at the end of the series. Simply because you're adding the mount option which does not do anything yet. Moreover you change the implementation of the hot track as you go. You should enable this once it is ready to use, not the other way around. So, please move this at the end of the patch set when the feature is supposed to be ready to use. OK, done, thanks. If you have comments on other patches, it will be appreciated. Thanks! -Lukas -Lukas Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/btrfs/ctree.h |1 + fs/btrfs/super.c |7 ++- 2 files changed, 7 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 9821b67..094bec6 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_MOUNT_CHECK_INTEGRITY (1 20) #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 21) #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 22) +#define BTRFS_MOUNT_HOT_TRACK(1 23) #define btrfs_clear_opt(o, opt) ((o) = ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 83d6f9f..00be9e3 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -41,6 +41,7 @@ #include linux/slab.h #include linux/cleancache.h #include linux/ratelimit.h +#include linux/hot_tracking.h #include compat.h #include delayed-inode.h #include ctree.h @@ -303,7 +304,7 @@ enum { Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, Opt_check_integrity, Opt_check_integrity_including_extent_data, Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_err, @@ -342,6 +343,7 @@ static match_table_t tokens = { {Opt_no_space_cache, nospace_cache}, {Opt_recovery, recovery}, {Opt_skip_balance, skip_balance}, + {Opt_hot_track, hot_track}, {Opt_check_integrity, check_int}, {Opt_check_integrity_including_extent_data, check_int_data}, {Opt_check_integrity_print_mask, check_int_print_mask=%d}, @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char *options) case Opt_skip_balance: btrfs_set_opt(info-mount_opt, SKIP_BALANCE); break; + case Opt_hot_track: + btrfs_set_opt(info-mount_opt, HOT_TRACK); + break; #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY case Opt_check_integrity_including_extent_data: printk(KERN_INFO btrfs: enabling check integrity -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 05/10] vfs: introduce one hash table
On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Adds a hash table structure which contains a lot of hash list and is used to efficiently look up the data temperature of a file or its ranges. In each hash list of hash table, the hash node will keep track of temperature info. So, let me see if I've got the relationship straight: - sb-s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode - hot_inode_item contains access frequency data for that inode - hot_inode_item holds a heat hash node to index the access frequency data for that inode - hot_inode_item.hot_range_tree indexes hot_range_items for that inode - hot_range_item contains access frequency data for that range - hot_range_item holds a heat hash node to index the access frequency data for that range - sb-s_hot_info.heat_inode_hl indexes per-inode heat hash nodes - sb-s_hot_info.heat_range_hl indexes per-range heat hash nodes Correct. How about some ascii art? :) Just looking at the hot inode item case (the range item case is the same pattern, though), we have: heat_inode_hl hot_inode_tree | | | V | +---hot_inode_item---+ +---+ | frequency data | | V^ V | ...--hot_inode_item--... |...--hot_inode_item-- | frequency data | frequency data | ^| ^ | || | | || | +--hot_hash_node--hot_hash_node--hot_hash_node-- Great, can we put them in hot_tracking.txt in Documentation? There's no actual data stored in the hot_hash_node, just pointer back to the frequency data, a hlist_node and a pointer to the hashlist head. IOWs, I agree with Ram that this does not need to exist and just embedding a hlist_node inside the hot_inode_item is all that is needed. i.e: heat_inode_hl hot_inode_tree | | | V | +---hot_inode_item---+ | | frequency data | +---+ | hlist_node | | V^ | V | ...--hot_inode_item--... | | ...--hot_inode_item-- | frequency data | |frequency data +--hlist_node---+ +---hlist_node---. There's no need for separate allocations, initialisations, locks and reference counting - all that is already in the hot_inode_item. The items have the same lifecycle limitations - a hot_hash_node must be torn down before the frequency data it points to is freed. Finally, there's no difference in how you move it between lists. How will you know if one hot_inode_item should be moved between lists when its freq data is changed? Indeed, calling it a hash is wrong - there's not hashing at all - it keeping an array of list where each entry corresponds to a specific temperature. It is a *heat map*, not a hash list. i.e. inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE. OK. As it is, there aren't any users of the heat maps that are generated in this patch set - it's not even exported to userspace or to debugfs, so I'm not sure how it will be used yet. How are these heat maps going to be used by filesystems, Zhi? In hot_hash_calc_temperature(), you can see that one hot_inode or hot_range's freq data will be distilled into one temperature value, then it will be inserted to the heat map based on its temperature. When the file corresponding to the inode or range got hotter or cold, its location will be changed in the heat map based on its new temperature in hot_hash_update_hash_table(). And the user will retrieve those freq data and temperature info via debugfs or ioctl interfaces. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 06/10] vfs: enable hot data tracking
On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Miscellaneous features that implement hot data tracking and generally make the hot data functions a bit more friendly. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/direct-io.c | 10 ++ include/linux/hot_tracking.h | 11 +++ mm/filemap.c |8 mm/page-writeback.c | 21 + mm/readahead.c |9 + 5 files changed, 59 insertions(+), 0 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index f86c720..3773f44 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -37,6 +37,7 @@ #include linux/uio.h #include linux/atomic.h #include linux/prefetch.h +#include hot_tracking.h /* * How many user pages to map in one call to get_user_pages(). This determines @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, prefetch(bdev-bd_queue); prefetch((char *)bdev-bd_queue + SMP_CACHE_BYTES); + /* Hot data tracking */ + if (TRACK_THIS_INODE(iocb-ki_filp-f_mapping-host) + iov_length(iov, nr_segs) 0) { + hot_rb_update_freqs(iocb-ki_filp-f_mapping-host, + (u64)offset, + (u64)iov_length(iov, nr_segs), + rw WRITE); + } That's a bit messy. I'd prefer a static inline function that hides all this. e.g. Do you think of moving the condition into hot_inode_udate_freqs(), not adding another new function? track_hot_inode_ranges(inode, offset, length, rw) { if (inode-i_sb-s_flags MS_HOT_TRACKING) hot_inode_freq_update(inode, offset, length, rw); } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 5ad5ce2..552c861 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -35,6 +35,7 @@ #include linux/buffer_head.h /* __set_page_dirty_buffers */ #include linux/pagevec.h #include linux/timer.h +#include linux/hot_tracking.h #include trace/events/writeback.h /* @@ -1895,13 +1896,33 @@ EXPORT_SYMBOL(generic_writepages); int do_writepages(struct address_space *mapping, struct writeback_control *wbc) { int ret; + pgoff_t start = 0; + u64 prev_count = 0, count = 0; if (wbc-nr_to_write = 0) return 0; + + /* Hot data tracking */ + if (TRACK_THIS_INODE(mapping-host) + wbc-range_cyclic) { + start = mapping-writeback_index PAGE_CACHE_SHIFT; + prev_count = (u64)wbc-nr_to_write; + } Why only wbc-range_cyclic? This won't record things like synchronous writes or fsync-triggered writes, are are far more likely to be to hot ranges in a file... sorry, i don't undersand what wbc-range_cyclic means. OK, i will fix it in next version. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature
On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Fork and run one kernel kthread to calculate that temperature based on some metrics kept in custom frequency data structs, and store the info in the hash table. No new kthreads, please. Use a per-superblock workqueue and a struct delayed_work to run periodic work on each superblock. If no new kthread is created, which kthread will work on these delayed_work tasks? That will also remove all the nasty, nasty !hot_track_temperature_update_kthread checks from the code, too. Also, I'd separate the work that the workqueue does from the patch that introduces the work queue. That way there is only one new thing to comment on in the patch. Further, I'd separate the aging code from the code that updates the temperature map into it's own patch as well.. Finally, you're going to need a shrinker to control the amount of memory that is used in tracking hot regions - if we are throwing inodes out of memory due to memory pressure, we most definitely are going to need to reduce the amount of memory the tracking code is using, even if it means losing useful information (i.e. the shrinker accelerates the aging process). Great, I agree with you. Given the above, and the other comments earlier in the series, there's not a lot of point in me spending time commenting on ethe code in detail here as it will change significantly as a result of all the earlier comments OK, i will complete the code change based on all your earlier comments ASAP. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 05/10] vfs: introduce one hash table
On Thu, Sep 27, 2012 at 2:57 PM, Dave Chinner da...@fromorbit.com wrote: On Thu, Sep 27, 2012 at 02:23:16PM +0800, Zhi Yong Wu wrote: On Thu, Sep 27, 2012 at 11:43 AM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Adds a hash table structure which contains a lot of hash list and is used to efficiently look up the data temperature of a file or its ranges. In each hash list of hash table, the hash node will keep track of temperature info. So, let me see if I've got the relationship straight: - sb-s_hot_info.hot_inode_tree indexes hot_inode_items, one per inode - hot_inode_item contains access frequency data for that inode - hot_inode_item holds a heat hash node to index the access frequency data for that inode - hot_inode_item.hot_range_tree indexes hot_range_items for that inode - hot_range_item contains access frequency data for that range - hot_range_item holds a heat hash node to index the access frequency data for that range - sb-s_hot_info.heat_inode_hl indexes per-inode heat hash nodes - sb-s_hot_info.heat_range_hl indexes per-range heat hash nodes Correct. How about some ascii art? :) Just looking at the hot inode item case (the range item case is the same pattern, though), we have: heat_inode_hl hot_inode_tree | | | V | +---hot_inode_item---+ +---+ | frequency data | | V^ V | ...--hot_inode_item--... |...--hot_inode_item-- | frequency data | frequency data | ^| ^ | || | | || | +--hot_hash_node--hot_hash_node--hot_hash_node-- Great, can we put them in hot_tracking.txt in Documentation? There's no actual data stored in the hot_hash_node, just pointer back to the frequency data, a hlist_node and a pointer to the hashlist head. IOWs, I agree with Ram that this does not need to exist and just embedding a hlist_node inside the hot_inode_item is all that is needed. i.e: heat_inode_hl hot_inode_tree | | | V | +---hot_inode_item---+ | | frequency data | +---+ | hlist_node | | V^ | V | ...--hot_inode_item--... | | ...--hot_inode_item-- | frequency data | |frequency data +--hlist_node---+ +---hlist_node---. There's no need for separate allocations, initialisations, locks and reference counting - all that is already in the hot_inode_item. The items have the same lifecycle limitations - a hot_hash_node must be torn down before the frequency data it points to is freed. Finally, there's no difference in how you move it between lists. How will you know if one hot_inode_item should be moved between lists when its freq data is changed? Record the current temperature in the frequency data, and if it I know how to do it, thanks. changes, change the list it is on. Indeed, calling it a hash is wrong - there's not hashing at all - it keeping an array of list where each entry corresponds to a specific temperature. It is a *heat map*, not a hash list. i.e. inode_heat_map, not heat_inode_hl. HEAT_MAP_SIZE, not HASH_SIZE. OK. As it is, there aren't any users of the heat maps that are generated in this patch set - it's not even exported to userspace or to debugfs, so I'm not sure how it will be used yet. How are these heat maps going to be used by filesystems, Zhi? In hot_hash_calc_temperature(), you can see that one hot_inode or hot_range's freq data will be distilled into one temperature value, then it will be inserted to the heat map based on its temperature. When the file corresponding to the inode or range got hotter or cold, its location will be changed in the heat map based on its new temperature in hot_hash_update_hash_table(). Yes, but a hot_inode_item or hot_range_item can only have one location in the heat map, right? So it doesn't need external Yes. structure to point to the frequency data to track this OK. And the user will retrieve those freq data and temperature info via debugfs or ioctl interfaces. Right - but that data is only extracted after an initial hot_inode_tree lookup - The heat map itself is never directly used for lookups. If it's not used for lookups based on temperature, why is it needed? You mean we don't need hot_inode_tree? You know, after those hook functions collect the freq data for inode, they will store those raw info in hot_inode_tree. One private kthread will iterate
Re: [RFC v2 06/10] vfs: enable hot data tracking
On Thu, Sep 27, 2012 at 2:59 PM, Dave Chinner da...@fromorbit.com wrote: On Thu, Sep 27, 2012 at 02:28:12PM +0800, Zhi Yong Wu wrote: On Thu, Sep 27, 2012 at 11:54 AM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:31PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Miscellaneous features that implement hot data tracking and generally make the hot data functions a bit more friendly. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/direct-io.c | 10 ++ include/linux/hot_tracking.h | 11 +++ mm/filemap.c |8 mm/page-writeback.c | 21 + mm/readahead.c |9 + 5 files changed, 59 insertions(+), 0 deletions(-) diff --git a/fs/direct-io.c b/fs/direct-io.c index f86c720..3773f44 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -37,6 +37,7 @@ #include linux/uio.h #include linux/atomic.h #include linux/prefetch.h +#include hot_tracking.h /* * How many user pages to map in one call to get_user_pages(). This determines @@ -1297,6 +1298,15 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, prefetch(bdev-bd_queue); prefetch((char *)bdev-bd_queue + SMP_CACHE_BYTES); + /* Hot data tracking */ + if (TRACK_THIS_INODE(iocb-ki_filp-f_mapping-host) + iov_length(iov, nr_segs) 0) { + hot_rb_update_freqs(iocb-ki_filp-f_mapping-host, + (u64)offset, + (u64)iov_length(iov, nr_segs), + rw WRITE); + } That's a bit messy. I'd prefer a static inline function that hides all this. e.g. Do you think of moving the condition into hot_inode_udate_freqs(), not adding another new function? Moving it into hot_inode_udate_freqs() will add a function call overhead even when tracking is not enabled. a static inline function Can we not directly define hot_inode_udate_freqs to be a static inline?:) will just result in no extra overhead other than the if statement Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 07/10] vfs: fork one kthread to update data temperature
On Thu, Sep 27, 2012 at 3:01 PM, Dave Chinner da...@fromorbit.com wrote: On Thu, Sep 27, 2012 at 02:54:22PM +0800, Zhi Yong Wu wrote: On Thu, Sep 27, 2012 at 12:03 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:32PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Fork and run one kernel kthread to calculate that temperature based on some metrics kept in custom frequency data structs, and store the info in the hash table. No new kthreads, please. Use a per-superblock workqueue and a struct delayed_work to run periodic work on each superblock. If no new kthread is created, which kthread will work on these delayed_work tasks? One of the kworker threads that service the workqueue infrastructure. Got it, thanks Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Thu, Sep 27, 2012 at 3:05 PM, Dave Chinner da...@fromorbit.com wrote: On Thu, Sep 27, 2012 at 01:25:34PM +0800, Zhi Yong Wu wrote: On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Introduce one new mount option '-o hottrack', and add its parsing support. Its usage looks like: mount -o hottrack mount -o nouser,hottrack mount -o nouser,hottrack,loop mount -o hottrack,nouser I think that this option parsing should be done by the filesystem, even though the tracking functionality is in the VFS. That way ony the filesystems that can use the tracking information will turn it on, rather than being able to turn it on for everything regardless of whether it is useful or not. Along those lines, just using a normal superblock flag to indicate it is active (e.g. MS_HOT_INODE_TRACKING in sb-s_flags) means you don't need to allocate the sb-s_hot_info structure just to be able If we don't allocate one sb-s_hot_info, where will those hash list head and btree roots locate? I wrote that thinking (mistakenly) that s-hot)info was dynamically allocated rather than being embedded in the struct super_block. Indeed, if the mount option is held in s_flags, then it could be dynamically allocated, but I don't think that's really necessary... ah, you prefer allocating it, OK, let me try. thanks for your explaination. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 02/10] vfs: add support for updating access frequency
On Thu, Sep 27, 2012 at 10:19 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Sep 26, 2012 at 10:53:07AM +0800, Zhi Yong Wu wrote: On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.ker...@gmail.com wrote: I note that the code will always insert range items of a length RANGE_SIZE. This means you have a fixed object granularity and hence you have no need for a range based search. That is, you could use a radix tree where each entry in the radix tree points directly to the range object similar to how the page cache uses a radix tree for indexing pages. That brings the possibility of lockless range item lookups Great suggestion, but can we temporarily put it in TODO list? because it will bring one big code change. Sure. I just wanted to point out that there are better choices for indexing fixed size elements than rb-trees and why it might make sense to use a different type of tree. Got it, thanks. Moreover, it should also be better to use radix tree to hold hot_inode, not only hot_range. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Thu, Sep 27, 2012 at 10:20 AM, Dave Chinner da...@fromorbit.com wrote: On Wed, Sep 26, 2012 at 10:56:08AM +0800, Zhi Yong Wu wrote: On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Introduce one new mount option '-o hottrack', and add its parsing support. Its usage looks like: mount -o hottrack mount -o nouser,hottrack mount -o nouser,hottrack,loop mount -o hottrack,nouser I think that this option parsing should be done by the filesystem, even though the tracking functionality is in the VFS. That way ony the filesystems that can use the tracking information will turn it on, rather than being able to turn it on for everything regardless of whether it is useful or not. Along those lines, just using a normal superblock flag to indicate it is active (e.g. MS_HOT_INODE_TRACKING in sb-s_flags) means you don't need to allocate the sb-s_hot_info structure just to be able to check whether we are tracking hot inodes or not. This then means the hot inode tracking for the superblock can be initialised by the filesystem as part of it's fill_super method, along with the filesystem specific code that will use the hot tracking information the VFS gathers I can see what you mean, but don't know if other guys also agree with this. If so, all FS specific code which use hot tracking feature wll have to add the same chunk of code in it fill_super method. Is it good? Most filesystems will only need to add 3-4 lines of code to their existing parser, so it's not a big deal I think... OK, i try to do it. thanks a lot. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Introduce one new mount option '-o hottrack', and add its parsing support. Its usage looks like: mount -o hottrack mount -o nouser,hottrack mount -o nouser,hottrack,loop mount -o hottrack,nouser I think that this option parsing should be done by the filesystem, even though the tracking functionality is in the VFS. That way ony the filesystems that can use the tracking information will turn it on, rather than being able to turn it on for everything regardless of whether it is useful or not. Along those lines, just using a normal superblock flag to indicate it is active (e.g. MS_HOT_INODE_TRACKING in sb-s_flags) means you don't need to allocate the sb-s_hot_info structure just to be able If we don't allocate one sb-s_hot_info, where will those hash list head and btree roots locate? to check whether we are tracking hot inodes or not. This then means the hot inode tracking for the superblock can be initialised by the filesystem as part of it's fill_super method, along with the filesystem specific code that will use the hot tracking information the VFS gathers Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com One root structure hot_info is defined, is hooked up in super_block, and will be used to hold rb trees root, hash list root and some other information, etc. Adds hot_inode_tree struct to keep track of frequently accessed files, and be keyed by {inode, offset}. Trees contain hot_inode_items representing those files and ranges. Having these trees means that vfs can quickly determine the temperature of some data by doing some calculations on the hot_freq_data struct that hangs off of the tree item. Define two items hot_inode_item and hot_range_item, one of them represents one tracked file to keep track of its access frequency and the tree of ranges in this file, while the latter represents a file range of one inode. Each of the two structures contains a hot_freq_data struct with its frequency of access metrics (number of {reads, writes}, last {read,write} time, frequency of {reads,writes}). Also, each hot_inode_item contains one hot_range_tree struct which is keyed by {inode, offset, length} and used to keep track of all the ranges in this file. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com Just a coupl eof minor formatting things first up - I'll have more comments as I get deeper into the series. All comments are very reasonable, and will be applied. thanks for your review. +/* + * Initialize the inode tree. Should be called for each new inode + * access or other user of the hot_inode interface. + */ +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree) The names of these are a bit clunky. You probably don't need the _rb_ in the function name. i.e. hot_inode_tree_init() is sufficient, and if we every want to change in the tree type we don't have to rename every single function... . +/* + * Initialize a new hot_inode_item structure. The new structure is + * returned with a reference count of one and needs to be + * freed using free_inode_item() + */ +void hot_rb_inode_item_init(void *_item) +{ The usual naming convention for slab initialiser functions is to use a suffix of _once to indicate it is only ever called once per slab object instantiation, not every time the object is allocated fom the slab. See, for example, inode_init_once() and inode_init_always(). so, that would make this function hot_inode_item_init_once(). +/* init hot_inode_item and hot_range_item kmem cache */ +static int __init hot_rb_item_cache_init(void) +{ + hot_inode_item_cache = kmem_cache_create(hot_inode_item, + sizeof(struct hot_inode_item), 0, + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, + hot_rb_inode_item_init); + if (!hot_inode_item_cache) + goto inode_err; + + hot_range_item_cache = kmem_cache_create(hot_range_item, + sizeof(struct hot_range_item), 0, + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, + hot_rb_range_item_init); + if (!hot_range_item_cache) + goto range_err; + + return 0; + +range_err: + kmem_cache_destroy(hot_inode_item_cache); +inode_err: + return -ENOMEM; +} + +/* + * Initialize kmem cache for hot_inode_item + * and hot_range_item + */ +void __init hot_track_cache_init(void) +{ + if (hot_rb_item_cache_init()) + return; No real need to have a hot_rb_item_cache_init() function here - just open code it all in the hot_track_cache_init() function. +} diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h new file mode 100644 index 000..269b67a --- /dev/null +++ b/fs/hot_tracking.h @@ -0,0 +1,27 @@ +/* + * fs/hot_tracking.h + * + * Copyright (C) 2012 IBM Corp. All rights reserved. + * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com + *Ben Chociej bchoc...@gmail.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + */ + +#ifndef __HOT_TRACKING__ +#define __HOT_TRACKING__ + +#include linux/rbtree.h +#include linux/hot_tracking.h + +/* values for hot_freq_data flags */ +/* freq data struct is for an inode */ +#define FREQ_DATA_TYPE_INODE (1 0) +/* freq data struct is for a range */ +#define FREQ_DATA_TYPE_RANGE (1 1) The comments are redundant - the name of the object documents it's use sufficiently. ie. /* values for hot_freq_data flags */ #define FREQ_DATA_TYPE_INODE (1 0) #define FREQ_DATA_TYPE_RANGE (1 1) is just fine by itself. +/* A frequency data struct holds values that are used to + * determine temperature of files and file ranges. These structs + * are members
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com One root structure hot_info is defined, is hooked up in super_block, and will be used to hold rb trees root, hash list root and some other information, etc. Adds hot_inode_tree struct to keep track of frequently accessed files, and be keyed by {inode, offset}. Trees contain hot_inode_items representing those files and ranges. Having these trees means that vfs can quickly determine the temperature of some data by doing some calculations on the hot_freq_data struct that hangs off of the tree item. Define two items hot_inode_item and hot_range_item, one of them represents one tracked file to keep track of its access frequency and the tree of ranges in this file, while the latter represents a file range of one inode. Each of the two structures contains a hot_freq_data struct with its frequency of access metrics (number of {reads, writes}, last {read,write} time, frequency of {reads,writes}). Also, each hot_inode_item contains one hot_range_tree struct which is keyed by {inode, offset, length} and used to keep track of all the ranges in this file. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com Just a coupl eof minor formatting things first up - I'll have more comments as I get deeper into the series. OK, very look forward to seeing more on other patches, indeed thanks again. +/* + * Initialize the inode tree. Should be called for each new inode + * access or other user of the hot_inode interface. + */ +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree) The names of these are a bit clunky. You probably don't need the _rb_ in the function name. i.e. hot_inode_tree_init() is sufficient, and if we every want to change in the tree type we don't have to rename every single function... . +/* + * Initialize a new hot_inode_item structure. The new structure is + * returned with a reference count of one and needs to be + * freed using free_inode_item() + */ +void hot_rb_inode_item_init(void *_item) +{ The usual naming convention for slab initialiser functions is to use a suffix of _once to indicate it is only ever called once per slab object instantiation, not every time the object is allocated fom the slab. See, for example, inode_init_once() and inode_init_always(). so, that would make this function hot_inode_item_init_once(). +/* init hot_inode_item and hot_range_item kmem cache */ +static int __init hot_rb_item_cache_init(void) +{ + hot_inode_item_cache = kmem_cache_create(hot_inode_item, + sizeof(struct hot_inode_item), 0, + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, + hot_rb_inode_item_init); + if (!hot_inode_item_cache) + goto inode_err; + + hot_range_item_cache = kmem_cache_create(hot_range_item, + sizeof(struct hot_range_item), 0, + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, + hot_rb_range_item_init); + if (!hot_range_item_cache) + goto range_err; + + return 0; + +range_err: + kmem_cache_destroy(hot_inode_item_cache); +inode_err: + return -ENOMEM; +} + +/* + * Initialize kmem cache for hot_inode_item + * and hot_range_item + */ +void __init hot_track_cache_init(void) +{ + if (hot_rb_item_cache_init()) + return; No real need to have a hot_rb_item_cache_init() function here - just open code it all in the hot_track_cache_init() function. +} diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h new file mode 100644 index 000..269b67a --- /dev/null +++ b/fs/hot_tracking.h @@ -0,0 +1,27 @@ +/* + * fs/hot_tracking.h + * + * Copyright (C) 2012 IBM Corp. All rights reserved. + * Written by Zhi Yong Wu wu...@linux.vnet.ibm.com + *Ben Chociej bchoc...@gmail.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + */ + +#ifndef __HOT_TRACKING__ +#define __HOT_TRACKING__ + +#include linux/rbtree.h +#include linux/hot_tracking.h + +/* values for hot_freq_data flags */ +/* freq data struct is for an inode */ +#define FREQ_DATA_TYPE_INODE (1 0) +/* freq data struct is for a range */ +#define FREQ_DATA_TYPE_RANGE (1 1) The comments are redundant - the name of the object documents it's use sufficiently. ie. /* values for hot_freq_data flags */ #define FREQ_DATA_TYPE_INODE (1 0) #define FREQ_DATA_TYPE_RANGE (1 1) is just fine by itself. +/* A frequency data struct holds values that are used to + * determine temperature of files and file ranges. These structs + * are members
Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage
On Tue, Sep 25, 2012 at 6:14 PM, David Sterba d...@jikos.cz wrote: On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Some code pathes forget to free memory on exit. Same as with the fd's, kernel will free all memory for us at exit(). hi, can you let me know the pointer to exit() said by you? If there's lots of memory allocated, it may be even faster to leave the unallocation process to kernel as it will do it in one go, while the application would unnecessarily free it chunk by chunk. got it, thanks. david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 02/10] vfs: add support for updating access frequency
thanks a lot for your review in my heart, Dave. It is very helpful to me. On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Add some utils helpers to update access frequencies for one file or its range. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c | 359 + fs/hot_tracking.h | 15 +++ 2 files changed, 374 insertions(+), 0 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index 173054b..52ed926 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -106,6 +106,365 @@ inode_err: } /* + * Drops the reference out on hot_inode_item by one and free the structure + * if the reference count hits zero + */ +void hot_rb_free_hot_inode_item(struct hot_inode_item *he) hot_inode_item_put() +{ + if (!he) + return; It's a bug to call a put function on a kref counted item with a null pointer. Let the kernel crash so it is noticed and fixed. OK, will remove it. + + if (atomic_dec_and_test(he-refs.refcount)) { + WARN_ON(he-in_tree); + kmem_cache_free(hot_inode_item_cache, he); + } Isn't this abusing krefs? i.e. this should be: Sorry, thanks for your explaination as below: hot_inode_item_free() { WARN_ON(he-in_tree); kmem_cache_free(hot_inode_item_cache, he); } hot_inode_item_put() { kref_put(he-refs, hot_inode_item_free) } +/* + * Drops the reference out on hot_range_item by one and free the structure + * if the reference count hits zero + */ +static void hot_rb_free_hot_range_item(struct hot_range_item *hr) same comments as above. OK, thanks. +static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root, + unsigned long inode_num, + struct rb_node *node) static struct rb_node * hot_inode_item_find(. ) OK, thanks. +{ + struct rb_node **p = root-rb_node; + struct rb_node *parent = NULL; + struct hot_inode_item *entry; + + /* walk tree to find insertion point */ + while (*p) { + parent = *p; + entry = rb_entry(parent, struct hot_inode_item, rb_node); + + if (inode_num entry-i_ino) + p = (*p)-rb_left; + else if (inode_num entry-i_ino) + p = (*p)-rb_right; + else + return parent; + } + + entry = rb_entry(node, struct hot_inode_item, rb_node); + entry-in_tree = 1; + rb_link_node(node, parent, p); + rb_insert_color(node, root); So the hot inode item search key is the inode number? Why use an Yes rb-tree then? Wouldn't a btree be a more memory efficient way to hold a sparse index that has fixed key and pointer sizes? Yes, i know, but if we don't use btree, what will be better? Radix tree? Also, the API seems quite strange. you pass in the the rb_node and the inode number which instead of passing in the hot inode item that already holds this information. You then convert the rb_node back to a hot inode item to set the in_tree variable. So why not just pass in the struct hot_inode_item in the first place? Good catch, thanks for your remider. +static u64 hot_rb_range_end(struct hot_range_item *hr) hot_range_item_end() OK +{ + if (hr-start + hr-len hr-start) + return (u64)-1; + + return hr-start + hr-len - 1; +} + +static struct rb_node *hot_rb_insert_hot_range_item(struct rb_root *root, hot_range_item_find() OK + u64 start, + struct rb_node *node) +{ + struct rb_node **p = root-rb_node; + struct rb_node *parent = NULL; + struct hot_range_item *entry; + + /* ensure start is on a range boundary */ + start = start RANGE_SIZE_MASK; + /* walk tree to find insertion point */ + while (*p) { + parent = *p; + entry = rb_entry(parent, struct hot_range_item, rb_node); + + if (start entry-start) + p = (*p)-rb_left; + else if (start = hot_rb_range_end(entry)) Shouldn't an aligned end always be one byte short of the start offset of the next aligned region? i.e. start == hot_rb_range_end(entry) is an indication of an off-by one bug somewhere? This is really one good catch, thanks. + p = (*p)-rb_right; + else + return parent; + } + + entry = rb_entry(node, struct hot_range_item, rb_node); + entry-in_tree = 1; + rb_link_node(node, parent, p); + rb_insert_color(node, root); Same comments as the hot_inode_range. OK. Also, the start offset
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner da...@fromorbit.com wrote: On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Introduce one new mount option '-o hottrack', and add its parsing support. Its usage looks like: mount -o hottrack mount -o nouser,hottrack mount -o nouser,hottrack,loop mount -o hottrack,nouser I think that this option parsing should be done by the filesystem, even though the tracking functionality is in the VFS. That way ony the filesystems that can use the tracking information will turn it on, rather than being able to turn it on for everything regardless of whether it is useful or not. Along those lines, just using a normal superblock flag to indicate it is active (e.g. MS_HOT_INODE_TRACKING in sb-s_flags) means you don't need to allocate the sb-s_hot_info structure just to be able to check whether we are tracking hot inodes or not. This then means the hot inode tracking for the superblock can be initialised by the filesystem as part of it's fill_super method, along with the filesystem specific code that will use the hot tracking information the VFS gathers I can see what you mean, but don't know if other guys also agree with this. If so, all FS specific code which use hot tracking feature wll have to add the same chunk of code in it fill_super method. Is it good? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage
On Wed, Sep 26, 2012 at 1:14 AM, Goffredo Baroncelli kreij...@libero.it wrote: On 09/25/2012 12:14 PM, David Sterba wrote: On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wuwu...@linux.vnet.ibm.com Some code pathes forget to free memory on exit. Same as with the fd's, kernel will free all memory for us at exit(). I strongly disagree with this approach. The callee often don't know what happen after and before the call. The same is true for the programmer, because the code is quite often updated by several people. A clean exit() is the right thing to do as general rule. I don't see any valid reason (in the btrfs context) to do otherwise. Relying on the exit() for a proper clean-up increase the likelihood of bug when the code evolves (see my patch [RESPOST][BTRFS-PROGS][PATCH] btrfs_read_dev_super(): uninitialized variable for an example of what means an incorrect deallocation of resource). If there's lots of memory allocated, it may be even faster to leave the unallocation process to kernel as it will do it in one go, while the application would unnecessarily free it chunk by chunk. May be I am wrong, but I don't think that the increase of speed of the btrfs command is even measurable relying on exit instead of free()-ing each chunk of memory one at time The same should be true for the open()/close() I fully agree with you. In one same function, i find that some code path free system sources, while other code path doesn't. This is one nice way. My 2¢ BR G.Baroncelli david -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html . -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Tue, Sep 25, 2012 at 6:20 PM, Ram Pai linux...@us.ibm.com wrote: On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com One root structure hot_info is defined, is hooked up in super_block, and will be used to hold rb trees root, hash list root and some other information, etc. Adds hot_inode_tree struct to keep track of frequently accessed files, and be keyed by {inode, offset}. Trees contain hot_inode_items representing those files and ranges. Having these trees means that vfs can quickly determine the temperature of some data by doing some calculations on the hot_freq_data struct that hangs off of the tree item. Define two items hot_inode_item and hot_range_item, one of them represents one tracked file to keep track of its access frequency and the tree of ranges in this file, while the latter represents a file range of one inode. Each of the two structures contains a hot_freq_data struct with its frequency of access metrics (number of {reads, writes}, last {read,write} time, frequency of {reads,writes}). Also, each hot_inode_item contains one hot_range_tree struct which is keyed by {inode, offset, length} and used to keep track of all the ranges in this file. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- + ..snip.. +/* A tree that sits on the hot_info */ +struct hot_inode_tree { + struct rb_root map; + rwlock_t lock; +}; + +/* A tree of ranges for each inode in the hot_inode_tree */ +struct hot_range_tree { + struct rb_root map; + rwlock_t lock; +}; Can as well have a generic datastructure called hot_tree instead of having two different datastructure which basically are the same. OK. + +/* A frequency data struct holds values that are used to + * determine temperature of files and file ranges. These structs + * are members of hot_inode_item and hot_range_item + */ +struct hot_freq_data { + struct timespec last_read_time; + struct timespec last_write_time; + u32 nr_reads; + u32 nr_writes; + u64 avg_delta_reads; + u64 avg_delta_writes; + u8 flags; + u32 last_temperature; +}; + +/* An item representing an inode and its access frequency */ +struct hot_inode_item { + /* node for hot_inode_tree rb_tree */ + struct rb_node rb_node; + /* tree of ranges in this inode */ + struct hot_range_tree hot_range_tree; + /* frequency data for this inode */ + struct hot_freq_data hot_freq_data; + /* inode number, copied from inode */ + unsigned long i_ino; + /* used to check for errors in ref counting */ + u8 in_tree; + /* protects hot_freq_data, i_no, in_tree */ + spinlock_t lock; + /* prevents kfree */ + struct kref refs; +}; + +/* + * An item representing a range inside of an inode whose frequency + * is being tracked + */ +struct hot_range_item { + /* node for hot_range_tree rb_tree */ + struct rb_node rb_node; + /* frequency data for this range */ + struct hot_freq_data hot_freq_data; + /* the hot_inode_item associated with this hot_range_item */ + struct hot_inode_item *hot_inode; + /* starting offset of this range */ + u64 start; + /* length of this range */ + u64 len; + /* used to check for errors in ref counting */ + u8 in_tree; + /* protects hot_freq_data, start, len, and in_tree */ + spinlock_t lock; + /* prevents kfree */ + struct kref refs; +}; might as well have just one generic datastructure called hot_item with all the common fields and then have struct hot_inode_item { struct hot_item hot_inode; struct hot_tree hot_range_tree; unsigned long i_ino; } and struct hot_range_item { struct hot_item hot_range; u64 start; u64 len;/* length of this range */ } This should help you eliminate some duplicate code as well. OK, i will try to apply them. thanks. RP -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2 05/10] vfs: introduce one hash table
On Tue, Sep 25, 2012 at 5:54 PM, Ram Pai linux...@us.ibm.com wrote: On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Adds a hash table structure which contains a lot of hash list and is used to efficiently look up the data temperature of a file or its ranges. In each hash list of hash table, the hash node will keep track of temperature info. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- fs/hot_tracking.c| 77 - include/linux/hot_tracking.h | 35 +++ 2 files changed, 110 insertions(+), 2 deletions(-) diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c index fa89f70..5f96442 100644 --- a/fs/hot_tracking.c +++ b/fs/hot_tracking.c @@ -16,6 +16,7 @@ #include linux/module.h #include linux/spinlock.h #include linux/hardirq.h +#include linux/hash.h #include linux/fs.h #include linux/blkdev.h #include linux/types.h @@ -24,6 +25,9 @@ ...snip... +/* Hash list heads for hot hash table */ +struct hot_hash_head { + struct hlist_head hashhead; + rwlock_t rwlock; + u32 temperature; +}; + +/* Nodes stored in each hash list of hash table */ +struct hot_hash_node { + struct hlist_node hashnode; + struct list_head node; + struct hot_freq_data *hot_freq_data; + struct hot_hash_head *hlist; + spinlock_t lock; /* protects hlist */ + + /* + * number of references to this node + * equals 1 (hashlist entry) + */ + struct kref refs; +}; Dont see why you need yet another datastructure to hash the inode_item and the range_item into a hash list. You can just add another If there's such one structure, we can rapidly know which hash bucket one inode_item is currently linking to via its hlist field. This is useful if one inode_item corresponding to one file need to move from one bucket to another bucket when this file get cold or hotter. Does it make sense to you? hlist_node in the inode_item and range_item. This field can be then used to link into the corresponding hash list. You can use the container_of() get to the inode_item or the range_item using the hlist_node field. You can thus eliminate a lot of code. As what i said above, i don't think it can eliminate a lo of code. + /* An item representing an inode and its access frequency */ struct hot_inode_item { /* node for hot_inode_tree rb_tree */ @@ -68,6 +93,8 @@ struct hot_inode_item { spinlock_t lock; /* prevents kfree */ struct kref refs; + /* hashlist node for this inode */ + struct hot_hash_node *heat_node; this can be just struct hlist_node head_node; /* lookup hot_inode hash list */ Use this field to link it into the corresponding hashlist. }; this can be just /* @@ -91,6 +118,8 @@ struct hot_range_item { spinlock_t lock; /* prevents kfree */ struct kref refs; + /* hashlist node for this range */ + struct hot_hash_node *heat_node; this can be just struct hlist_node head_node; /* lookup hot_range hash list */ }; struct hot_info { @@ -98,6 +127,12 @@ struct hot_info { /* red-black tree that keeps track of fs-wide hot data */ struct hot_inode_tree hot_inode_tree; + + /* hash map of inode temperature */ + struct hot_hash_head heat_inode_hl[HEAT_HASH_SIZE]; + + /* hash map of range temperature */ + struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE]; }; #endif /* _LINUX_HOTTRACK_H */ -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/2] btrfs-progs: some bugfixes
Some misc bugs are found when i work on other tasks. Now send out them for interview, thanks. Zhi Yong Wu (2): btrfs-progs: Close file descriptor on exit btrfs-progs: Fix up memory leakage cmds-filesystem.c | 16 1 files changed, 12 insertions(+), 4 deletions(-) -- 1.7.6.5 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] btrfs-progs: Close file descriptor on exit
Need to close fd on exit. Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- cmds-filesystem.c | 10 -- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index b1457de..e62c4fd 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -77,18 +77,23 @@ static int cmd_df(int argc, char **argv) if (ret) { fprintf(stderr, ERROR: couldn't get space info on '%s' - %s\n, path, strerror(e)); + close(fd); free(sargs); return ret; } - if (!sargs-total_spaces) + if (!sargs-total_spaces) { + close(fd); return 0; + } count = sargs-total_spaces; sargs = realloc(sargs, sizeof(struct btrfs_ioctl_space_args) + (count * sizeof(struct btrfs_ioctl_space_info))); - if (!sargs) + if (!sargs) { + close(fd); return -ENOMEM; + } sargs-space_slots = count; sargs-total_spaces = 0; @@ -148,6 +153,7 @@ static int cmd_df(int argc, char **argv) printf(%s: total=%s, used=%s\n, description, total_bytes, used_bytes); } + close(fd); free(sargs); return 0; -- 1.7.6.5 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] btrfs-progs: Fix up memory leakage
Some code pathes forget to free memory on exit. Changelog from v1: Fix the variable is used uncorrectly. [Ram Pai] Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- cmds-filesystem.c |6 -- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index e62c4fd..9c43d35 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -47,7 +47,7 @@ static const char * const cmd_df_usage[] = { static int cmd_df(int argc, char **argv) { - struct btrfs_ioctl_space_args *sargs; + struct btrfs_ioctl_space_args *sargs, *sargs_orig; u64 count = 0, i; int ret; int fd; @@ -65,7 +65,7 @@ static int cmd_df(int argc, char **argv) return 12; } - sargs = malloc(sizeof(struct btrfs_ioctl_space_args)); + sargs_orig = sargs = malloc(sizeof(struct btrfs_ioctl_space_args)); if (!sargs) return -ENOMEM; @@ -83,6 +83,7 @@ static int cmd_df(int argc, char **argv) } if (!sargs-total_spaces) { close(fd); + free(sargs); return 0; } @@ -92,6 +93,7 @@ static int cmd_df(int argc, char **argv) (count * sizeof(struct btrfs_ioctl_space_info))); if (!sargs) { close(fd); + free(sargs_orig); return -ENOMEM; } -- 1.7.6.5 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: rework the code logic
Sorry, please ignore this patch. On Thu, Jul 5, 2012 at 3:26 PM, zwu.ker...@gmail.com wrote: From: Zhi Yong Wu wu...@linux.vnet.ibm.com Signed-off-by: Zhi Yong Wu wu...@linux.vnet.ibm.com --- extent-cache.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/extent-cache.c b/extent-cache.c index 3dd6434..a5084ee 100644 --- a/extent-cache.c +++ b/extent-cache.c @@ -136,10 +136,10 @@ struct cache_extent *find_first_cache_extent(struct cache_tree *tree, struct cache_extent *entry; ret = __tree_search(tree-root, start, 1, prev); - if (!ret) + if (!ret) { ret = prev; - if (!ret) return NULL; +} entry = rb_entry(ret, struct cache_extent, rb_node); return entry; } -- 1.7.6 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs_print_tree?
On Sun, Jul 1, 2012 at 5:41 PM, Mike Fleetwood mike.fleetw...@googlemail.com wrote: On 1 July 2012 05:53, Zhi Yong Wu zwu.ker...@gmail.com wrote: HI, Do anyone know where btrfs_print_tree is invoked? thanks. -- Regards, Zhi Yong Wu Is this the answer you are after? $ grep -r btrfs_print_tree fs/btrfs/ fs/btrfs/print-tree.c:void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *c) fs/btrfs/print-tree.c: btrfs_print_tree(root, next); fs/btrfs/print-tree.h:void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *t); No, i also did as this, but didn't find out who will invoke this function. From above output, we only saw that it invokes itself one time. Mike -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs_print_tree?
On Sun, Jul 1, 2012 at 6:16 PM, Jeff Liu jeff@oracle.com wrote: On 07/01/2012 05:49 PM, Zhi Yong Wu wrote: On Sun, Jul 1, 2012 at 5:41 PM, Mike Fleetwood mike.fleetw...@googlemail.com wrote: On 1 July 2012 05:53, Zhi Yong Wu zwu.ker...@gmail.com wrote: HI, Do anyone know where btrfs_print_tree is invoked? thanks. -- Regards, Zhi Yong Wu Is this the answer you are after? $ grep -r btrfs_print_tree fs/btrfs/ fs/btrfs/print-tree.c:void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *c) fs/btrfs/print-tree.c: btrfs_print_tree(root, next); fs/btrfs/print-tree.h:void btrfs_print_tree(struct btrfs_root *root, struct extent_buffer *t); No, i also did as this, but didn't find out who will invoke this function. From above output, we only saw that it invokes itself one time. Looks this is a helper routine exported to btrfs-progs previously, it is used by debug-tree, quick-test, etc... But this function has been implemented at btrfs-progs now, maybe it could be safely removed from kernel, not sure. :) Great, thanks both. Thanks, -Jeff Mike -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
btrfs_print_tree?
HI, Do anyone know where btrfs_print_tree is invoked? thanks. -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Can give some help?
HI, Can anyone let me know where the funtions are declared or defined, such as btrfs_header_nritems(), btrfs_header_level(), etc? thanks. -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Can give some help?
On Fri, Jun 29, 2012 at 9:47 PM, Hugo Mills h...@carfax.org.uk wrote: On Fri, Jun 29, 2012 at 09:41:47PM +0800, Zhi Yong Wu wrote: HI, Can anyone let me know where the funtions are declared or defined, such as btrfs_header_nritems(), btrfs_header_level(), etc? thanks. ctree.h, somewhere around or after line 1550. They're all accessor functions, defined by a set of macros. Look for the *_SETGET_* macros. The actual definitions of BTRFS_SETGET_FUNCS are in struct-funcs.h Great, thanks. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Great oxymorons of the world, no. 1: Family Holiday --- -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html