Re: [PATCH RESEND v1 00/16] vfs: hot data tracking
ping? any comments? On Thu, Dec 20, 2012 at 10:43 PM, wrote: > From: Zhi Yong Wu > > HI, guys, > > This patchset has been done scalability or performance tests > by fs_mark, ffsb and compilebench. > I have done the perf testing on Linux 3.7.0-rc8+ with Intel(R) Core(TM) > i7-3770 CPU @ 3.40GHz with 8 CPUs, 16G ram and 260G disk. > > Any comments or ideas are appreciated, thanks. > > NOTE: > > The patchset can be obtained via my kernel dev git on github: > git://github.com/wuzhy/kernel.git hot_tracking > If you're interested, you can also review them via > https://github.com/wuzhy/kernel/commits/hot_tracking > > For more info, please check hot_tracking.txt in Documentation > > Below is the perf testing report: > > 1. fs_mark test > > w/o: without hot tracking > w/ : with hot tracking > > Count Size FSUse% Files/sec App > Overhead > > w/ow/ w/o w/ w/o >w/ > > 80 1 2 3 13756.4 32144.9 > 53506275436291 > 160 1 4 5 1163.4 1799.3 > 20848119 21708216 > 240 1 6 6 1360.8 1252.5 > 67987058715322 > 320 1 8 8 1600.1 1196.3 > 57511296013792 > 400 1 9 9 1071.4 1191.2 > 17204725 26786369 > 480 1 10 10 1483.5 1447.9 > 19418383046 > 560 1 11 11 1457.9 1699.5 > 5783588 10074681 > 640 1 12 13 1658.8 1628.5 > 69926976185551 > 720 1 14 14 1662.4 1857.1 > 5796793 13772592 > 800 1 15 15 2930.0 2653.8 > 124316826152573 > 880 1 16 17 1630.8 1665.0 > 7666719 13682765 > 960 1 18 18 1530.3 1583.9 > 5823644 10171644 > 1040 1 19 19 1437.9 1798.6 > 209352246048083 > 1120 1 20 20 1529.0 1550.6 > 66474506003151 > 1200 1 21 22 1558.6 1501.8 > 12539509 18144939 > 1280 1 23 23 1644.2 1432.1 > 7074419 28101975 > 1360 1 24 24 1753.6 1650.2 > 7164297 20888972 > 1440 1 25 25 2750.0 1483.9 > 127566927441225 > 1520 1 27 27 1551.1 1514.3 > 57410668250443 > 1600 1 28 28 1610.8 1635.9 > 721938608545285 > 1680 1 29 29 1646.7 1907.7 > 8945856 11703513 > 1760 1 30 31 1496.6 2722.3 > 58589618989393 > 1840 1 32 32 1457.7 1565.7 > 10914475 26504660 > 1920 1 33 33 1437.6 1518.7 > 6708975 213303618 > 2000 1 34 34 1825.4 1521.1 > 5722086 12490907 > 2080 1 36 35 1718.4 1611.5 > 5873290 17942534 > 2160 1 37 37 2152.6 1536.9 > 1130506278717940 > 2240 1 38 38 2443.7 1788.2 > 7398122 19834765 > 2320 1 39 39 1518.5 1587.6 > 5770959 10134882 > 2400 1 41 41 1536.8 2164.0 > 57512487214626 > 2480 1 42 42 1576.6 2939.4 > 73903146070271 > 2560 1 43 43 1707.4 1535.9 > 110759396052896 > 2640 1 44 44 1522.5 1563.1 > 10142987 22549898 > 2720 1 46 46 1827.4 1608.5 > 11613016 24828125 > 2800 1 47 47 3420.5 1741.9 > 8059985 16599156 > 2880 1 48 48 1815.5 1944.4 > 78479319043277 > 2960 1 50 49 1650.0 1596.6 > 56363237929164 > 3040 1 51 51 1683.7 1573.3 > 5766323 19369146 > 3120
[PATCH v2 0/2] btrfs-progs: some bugfixes
Some misc bugs are found when i work on other tasks. Now send out them for interview, thanks. Zhi Yong Wu (2): btrfs-progs: Close file descriptor on exit btrfs-progs: Fix up memory leakage cmds-filesystem.c | 16 1 files changed, 12 insertions(+), 4 deletions(-) -- 1.7.6.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 1/2] btrfs-progs: Close file descriptor on exit
Need to close fd on exit. Signed-off-by: Zhi Yong Wu --- cmds-filesystem.c | 10 -- 1 files changed, 8 insertions(+), 2 deletions(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index b1457de..e62c4fd 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -77,18 +77,23 @@ static int cmd_df(int argc, char **argv) if (ret) { fprintf(stderr, "ERROR: couldn't get space info on '%s' - %s\n", path, strerror(e)); + close(fd); free(sargs); return ret; } - if (!sargs->total_spaces) + if (!sargs->total_spaces) { + close(fd); return 0; + } count = sargs->total_spaces; sargs = realloc(sargs, sizeof(struct btrfs_ioctl_space_args) + (count * sizeof(struct btrfs_ioctl_space_info))); - if (!sargs) + if (!sargs) { + close(fd); return -ENOMEM; + } sargs->space_slots = count; sargs->total_spaces = 0; @@ -148,6 +153,7 @@ static int cmd_df(int argc, char **argv) printf("%s: total=%s, used=%s\n", description, total_bytes, used_bytes); } + close(fd); free(sargs); return 0; -- 1.7.6.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 2/2] btrfs-progs: Fix up memory leakage
Some code pathes forget to free memory on exit. Changelog from v1: Fix the variable is used uncorrectly. [Ram Pai] Signed-off-by: Zhi Yong Wu --- cmds-filesystem.c |6 -- 1 files changed, 4 insertions(+), 2 deletions(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index e62c4fd..9c43d35 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -47,7 +47,7 @@ static const char * const cmd_df_usage[] = { static int cmd_df(int argc, char **argv) { - struct btrfs_ioctl_space_args *sargs; + struct btrfs_ioctl_space_args *sargs, *sargs_orig; u64 count = 0, i; int ret; int fd; @@ -65,7 +65,7 @@ static int cmd_df(int argc, char **argv) return 12; } - sargs = malloc(sizeof(struct btrfs_ioctl_space_args)); + sargs_orig = sargs = malloc(sizeof(struct btrfs_ioctl_space_args)); if (!sargs) return -ENOMEM; @@ -83,6 +83,7 @@ static int cmd_df(int argc, char **argv) } if (!sargs->total_spaces) { close(fd); + free(sargs); return 0; } @@ -92,6 +93,7 @@ static int cmd_df(int argc, char **argv) (count * sizeof(struct btrfs_ioctl_space_info))); if (!sargs) { close(fd); + free(sargs_orig); return -ENOMEM; } -- 1.7.6.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC 00/11] VFS: hot data tracking
Sorry, forgot CCed to Ted. On Tue, Sep 11, 2012 at 10:27 PM, wrote: > From: Zhi Yong Wu > > HI, folks > I have pushed the patchset to my kernel dev git tree: > g...@github.com:wuzhy/kernel.git > > Also, you can review it via > https://github.com/wuzhy/kernel/commits/hottrack > > NOTE: > > The patchset still has a lot of bugfix and cleanup to do. It is post > out mainly to make sure it is going in the correct direction and > hope to get some helpful comments from other guys. > > TODO List: > > 1.) Need to do scalability or performance tests. > 2.) Fix up bugs. > 3.) Strictly split this patchset to keep them in order > This patchset is in RFC state, i haven't strictly split it > When it is in PATCH state, i will strictly split it and let > them in order. > 4.) Turn some Micro in to tunables > TIME_TO_KICK, and HEAT_UPDATE_DELAY > 5.) Rafactor hot_hash_is_aging() > If you just made the timeout value a timespec and compared > the _timespecs_, you would be doing a lot fewer conversions. > 6.) Cleanup some unnecessary lock protect > 7.) Add more comments to explain how to calc temperature > > Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to > be very btrfs-specific. I've taken their code and attempted to > make it more generic and integrate it at the VFS level. > > INTRODUCTION: > > Essentially, this means maintaining some key stats > (like number of reads/writes, last read/write time, frequency of > reads/writes), then distilling those numbers down to a single > "temperature" value that reflects what data is "hot," and using that > temperature to move data to SSDs. > > The long-term goal of these patches is to allow some FSs, > e.g. Btrfs to intelligently utilize SSDs in a heterogenous volume. > Incidentally, this project has been motivated by > the Project Ideas page on the Btrfs wiki. > > Of course, users are warned not to run this code outside of development > environments. These patches are EXPERIMENTAL, and as such they might eat > your data and/or memory. That said, the code should be relatively safe > when the hottrack mount option are disabled. > > MOTIVATION: > > The overall goal of enabling hot data relocation to SSD has been > motivated by the Project Ideas page on the Btrfs wiki at > <https://btrfs.wiki.kernel.org/index.php/Project_ideas>. > It will divide into two steps. VFS provide hot data tracking function > while specific FS will provide hot data relocation function. > So as the first step of this goal, it is hoped that the patchset > for hot data tracking will eventually mature into VFS. > > This is essentially the traditional cache argument: SSD is fast and > expensive; HDD is cheap but slow. ZFS, for example, can already take > advantage of SSD caching. Btrfs should also be able to take advantage of > hybrid storage without many broad, sweeping changes to existing code. > > SUMMARY: > > - Hooks in existing vfs functions to track data access frequency > > - New rbtrees for tracking access frequency of inodes and sub-file > ranges (hot_rb.c) > The relationship between super_block and rbtree is as below: > super_block->s_hotinfo.hot_inode_tree > In include/linux/fs.h, one struct hot_info s_hotinfo is added to > super_block struct. Each FS instance can find hot tracking info > s_hotinfo via its super_block. In this hot_info, it store a lot of hot > tracking info such as hot_inode_tree, inode and range hash list, etc. > > - A hash list for indexing data by its temperature (hot_hash.c) > > - A debugfs interface for dumping data from the rbtrees (hot_debugfs.c) > > - A background kthread for updating inode heat info > > - Mount options for enabling temperature tracking(-o hottrack, default mean > disabled) > (hot_track.c) > > - An ioctl to retrieve the frequency information collected for a certain > file > > - Ioctls to enable/disable frequency tracking per inode. > > Usage syntax: > > root@debian-i386:~# mount -o hottrack /dev/sdb /mnt > [ 1505.894078] device label test devid 1 transid 29 /dev/sdb > [ 1505.952977] btrfs: disk space caching is enabled > [ 1506.069678] vfs: turning on hot data tracking > root@debian-i386:~# mount -t debugfs none /sys/kernel/debug > root@debian-i386:~# ls -l /sys/kernel/debug/vfs_hotdata/ > total 0 > drwxr-xr-x 2 root root 0 Aug 8 04:40 sdb > root@debian-i386:~# ls -l /sys/kernel/debug/vfs_hotdata/sdb > total 0 > -rw-r--r-- 1 root root 0 Aug 8 04:40 inode_data > -rw-r--r-- 1 root root 0 Aug 8 04:40 range_data > root@debian-i386:~# vi /mnt/file > root@debian-i386:~# cat /sys/kernel/debug/hot_track/
Re: [RFC 00/11] VFS: hot data tracking
hi, all maintainers. ping? any comments are appreciated, thanks. On Wed, Sep 12, 2012 at 10:31 PM, Zhi Yong Wu wrote: > Sorry, forgot CCed to Ted. > > On Tue, Sep 11, 2012 at 10:27 PM, wrote: >> From: Zhi Yong Wu >> >> HI, folks >> I have pushed the patchset to my kernel dev git tree: >> g...@github.com:wuzhy/kernel.git >> >> Also, you can review it via >> https://github.com/wuzhy/kernel/commits/hottrack >> >> NOTE: >> >> The patchset still has a lot of bugfix and cleanup to do. It is post >> out mainly to make sure it is going in the correct direction and >> hope to get some helpful comments from other guys. >> >> TODO List: >> >> 1.) Need to do scalability or performance tests. >> 2.) Fix up bugs. >> 3.) Strictly split this patchset to keep them in order >> This patchset is in RFC state, i haven't strictly split it >> When it is in PATCH state, i will strictly split it and let >> them in order. >> 4.) Turn some Micro in to tunables >> TIME_TO_KICK, and HEAT_UPDATE_DELAY >> 5.) Rafactor hot_hash_is_aging() >> If you just made the timeout value a timespec and compared >> the _timespecs_, you would be doing a lot fewer conversions. >> 6.) Cleanup some unnecessary lock protect >> 7.) Add more comments to explain how to calc temperature >> >> Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to >> be very btrfs-specific. I've taken their code and attempted to >> make it more generic and integrate it at the VFS level. >> >> INTRODUCTION: >> >> Essentially, this means maintaining some key stats >> (like number of reads/writes, last read/write time, frequency of >> reads/writes), then distilling those numbers down to a single >> "temperature" value that reflects what data is "hot," and using that >> temperature to move data to SSDs. >> >> The long-term goal of these patches is to allow some FSs, >> e.g. Btrfs to intelligently utilize SSDs in a heterogenous volume. >> Incidentally, this project has been motivated by >> the Project Ideas page on the Btrfs wiki. >> >> Of course, users are warned not to run this code outside of development >> environments. These patches are EXPERIMENTAL, and as such they might eat >> your data and/or memory. That said, the code should be relatively safe >> when the hottrack mount option are disabled. >> >> MOTIVATION: >> >> The overall goal of enabling hot data relocation to SSD has been >> motivated by the Project Ideas page on the Btrfs wiki at >> <https://btrfs.wiki.kernel.org/index.php/Project_ideas>. >> It will divide into two steps. VFS provide hot data tracking function >> while specific FS will provide hot data relocation function. >> So as the first step of this goal, it is hoped that the patchset >> for hot data tracking will eventually mature into VFS. >> >> This is essentially the traditional cache argument: SSD is fast and >> expensive; HDD is cheap but slow. ZFS, for example, can already take >> advantage of SSD caching. Btrfs should also be able to take advantage of >> hybrid storage without many broad, sweeping changes to existing code. >> >> SUMMARY: >> >> - Hooks in existing vfs functions to track data access frequency >> >> - New rbtrees for tracking access frequency of inodes and sub-file >> ranges (hot_rb.c) >> The relationship between super_block and rbtree is as below: >> super_block->s_hotinfo.hot_inode_tree >> In include/linux/fs.h, one struct hot_info s_hotinfo is added to >> super_block struct. Each FS instance can find hot tracking info >> s_hotinfo via its super_block. In this hot_info, it store a lot of hot >> tracking info such as hot_inode_tree, inode and range hash list, etc. >> >> - A hash list for indexing data by its temperature (hot_hash.c) >> >> - A debugfs interface for dumping data from the rbtrees (hot_debugfs.c) >> >> - A background kthread for updating inode heat info >> >> - Mount options for enabling temperature tracking(-o hottrack, default mean >> disabled) >> (hot_track.c) >> >> - An ioctl to retrieve the frequency information collected for a certain >> file >> >> - Ioctls to enable/disable frequency tracking per inode. >> >> Usage syntax: >> >> root@debian-i386:~# mount -o hottrack /dev/sdb /mnt >> [ 1505.894078] device label test devid 1 transid 29 /dev/sdb >> [ 1505.952977] btrfs: disk space ca
Re: VFS hot tracking: How to calculate data temperature?
On Fri, Nov 2, 2012 at 4:41 PM, Zheng Liu wrote: > On Fri, Nov 02, 2012 at 02:38:29PM +0800, Zhi Yong Wu wrote: >> Here also has another question. >> >> How to save the file temperature among the umount to be able to >> preserve the file tempreture after reboot? >> >> This above is the requirement from DB product. >> I thought that we can save file temperature in its inode struct, that >> is, add one new field in struct inode, then this info will be written >> to disk with inode. >> >> Any comments or ideas are appreciated, thanks. > > Hi Zhiyong, > > I think that we might define a callback function. If a filesystem wants > to save these data, it can implement a function to save them. The > filesystem can decide whether adding it or not by themselves. Great idea, temperature saving function is maybe very specific to FS. But i am wondering if we can find one generic way to save temperature info at first. > > BTW, actually I don't really care about how to save these data because I > only want to observe which file is accessed in real time, which is very > useful for me to track a problem in our product system. heh, but other guys or products care about this. > > Regards, > Zheng -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: VFS hot tracking: How to calculate data temperature?
On Sat, Nov 3, 2012 at 4:10 AM, Darrick J. Wong wrote: > On Fri, Nov 02, 2012 at 04:41:09PM +0800, Zheng Liu wrote: >> On Fri, Nov 02, 2012 at 02:38:29PM +0800, Zhi Yong Wu wrote: >> > Here also has another question. >> > >> > How to save the file temperature among the umount to be able to >> > preserve the file tempreture after reboot? >> > >> > This above is the requirement from DB product. >> > I thought that we can save file temperature in its inode struct, that >> > is, add one new field in struct inode, then this info will be written >> > to disk with inode. >> > >> > Any comments or ideas are appreciated, thanks. >> >> Hi Zhiyong, >> >> I think that we might define a callback function. If a filesystem wants >> to save these data, it can implement a function to save them. The >> filesystem can decide whether adding it or not by themselves. >> >> BTW, actually I don't really care about how to save these data because I >> only want to observe which file is accessed in real time, which is very >> useful for me to track a problem in our product system. > > I _think_ the vfs quota code simply asks the filesystem for a special > inode where it save the quota data in whatever (FS-agnostic) format it wants. > Have you considered something like that? No, but it is one good hint for my issue. thanks. > > (Or, maybe everyone secretly hates doing that? Secret files, yaaay...) ah, do you think of doing that? > > --D >> >> Regards, >> Zheng >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: VFS hot tracking: How to calculate data temperature?
On Sat, Nov 3, 2012 at 5:27 AM, Mingming.cao wrote: > On Fri, 2012-11-02 at 14:38 +0800, Zhi Yong Wu wrote: >> Here also has another question. >> >> How to save the file temperature among the umount to be able to >> preserve the file tempreture after reboot? >> >> This above is the requirement from DB product. >> I thought that we can save file temperature in its inode struct, that >> is, add one new field in struct inode, then this info will be written >> to disk with inode. >> >> Any comments or ideas are appreciated, thanks. >> >> > > Maybe could save the last file temperature with extended attributes. It seems that only ext4 has the concept of extended attributes. > Just save the per-inode temperature only for now. > > Mingming >> On Fri, Nov 2, 2012 at 12:04 PM, Zhi Yong Wu wrote: >> > HI, guys >> > >> >VFS hot tracking currently show result as below, and it is very >> > strange and not nice. >> > >> > inode #279, reads 0, writes 1, avg read time 18446744073709551615, >> > avg write time 5251566408153596, temp 109 >> > >> > Do anyone know if there is one simpler but effective way to calculate >> > data temperature? >> > >> > -- >> > Regards, >> > >> > Zhi Yong Wu >> >> >> > > > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ 00/24] 3.6.6-stable review
| 7 ++ > drivers/block/floppy.c | 27 +++-- > drivers/gpio/gpio-timberdale.c | 4 +- > drivers/gpio/gpiolib.c | 10 +- > drivers/gpu/drm/nouveau/nouveau_drv.c | 20 ++-- > drivers/gpu/drm/nouveau/nouveau_state.c | 4 +- > drivers/gpu/drm/nouveau/nv04_dac.c | 8 +- > drivers/gpu/drm/nouveau/nv04_dfp.c | 6 +- > drivers/gpu/drm/nouveau/nv04_tv.c | 4 +- > drivers/hid/hid-microsoft.c | 18 ++- > drivers/md/raid1.c | 2 +- > drivers/scsi/qla2xxx/qla_target.c | 25 ++-- > drivers/scsi/qla2xxx/qla_target.h | 1 + > drivers/scsi/qla2xxx/tcm_qla2xxx.c | 73 > drivers/target/target_core_sbc.c| 18 +++ > drivers/target/target_core_transport.c | 1 - > drivers/usb/serial/io_edgeport.c| 1 - > drivers/usb/serial/iuu_phoenix.c| 3 +- > drivers/usb/serial/mos7840.c| 198 > > fs/ceph/addr.c | 11 +- > fs/ceph/export.c| 2 + > fs/ceph/mds_client.c| 3 +- > fs/ext4/ialloc.c| 19 ++- > include/linux/ceph/osd_client.h | 2 +- > include/linux/ceph/osdmap.h | 6 +- > net/ceph/messenger.c| 6 +- > net/ceph/osd_client.c | 32 -- > net/ceph/osdmap.c | 18 ++- > 29 files changed, 310 insertions(+), 223 deletions(-) > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [ 00/24] 3.6.6-stable review
On Mon, Nov 5, 2012 at 3:41 PM, Greg Kroah-Hartman wrote: > On Mon, Nov 05, 2012 at 03:37:31PM +0800, Zhi Yong Wu wrote: >> HI, greg >> >> Some of the patchset haven't passed the check of checkpatch.pl as below: >> >> WARNING: line over 80 characters >> #85: FILE: drivers/block/floppy.c:4319: >> + device_remove_file(&floppy_device[drive].dev, >> &dev_attr_cmos); >> >> WARNING: line over 80 characters >> #419: FILE: drivers/scsi/qla2xxx/qla_target.c:744: >> + ha->tgt.tgt_ops->update_sess(sess, fcport->d_id, >> fcport->loop_id, >> >> WARNING: line over 80 characters >> #420: FILE: drivers/scsi/qla2xxx/qla_target.c:745: >> + (fcport->flags & >> FCF_CONF_COMP_SUPPORTED)); >> >> WARNING: line over 80 characters >> #443: FILE: drivers/scsi/qla2xxx/qla_target.c:871: >> + ha->tgt.tgt_ops->update_sess(sess, fcport->d_id, >> fcport->loop_id, >> >> WARNING: line over 80 characters >> #444: FILE: drivers/scsi/qla2xxx/qla_target.c:872: >> + (fcport->flags & >> FCF_CONF_COMP_SUPPORTED)); >> >> WARNING: line over 80 characters >> #488: FILE: drivers/scsi/qla2xxx/tcm_qla2xxx.c:1491: >> + sess->s_id.b.domain, sess->s_id.b.area, >> sess->s_id.b.al_pa, >> >> WARNING: line over 80 characters >> #514: FILE: drivers/scsi/qla2xxx/tcm_qla2xxx.c:1517: >> + WARN(btree_remove32(&lport->lport_fcport_map, key) != >> se_nacl, >> >> WARNING: line over 80 characters >> #516: FILE: drivers/scsi/qla2xxx/tcm_qla2xxx.c:1519: >> + sess->s_id.b.domain, sess->s_id.b.area, >> sess->s_id.b.al_pa); >> >> WARNING: line over 80 characters >> #519: FILE: drivers/scsi/qla2xxx/tcm_qla2xxx.c:1522: >> + sess->s_id.b.domain, sess->s_id.b.area, >> sess->s_id.b.al_pa); >> >> WARNING: line over 80 characters >> #530: FILE: drivers/scsi/qla2xxx/tcm_qla2xxx.c:1533: >> + btree_insert32(&lport->lport_fcport_map, key, se_nacl, >> GFP_ATOMIC); >> >> total: 0 errors, 10 warnings, 1094 lines checked > > That's fine, take those up with the people who send them in for the > upstream kernel, I take them no-changes for the stable releases. I will do after all three stable -rc kernels test are completed. > > And really, the 80 characters is just a "hint", it's not a hard and fast > rule for anything here. Especially for the old, and horrible, floppy > and scsi drivers :) ok, thanks. > > thanks, > > greg k-h -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: VFS hot tracking: How to calculate data temperature?
On Mon, Nov 5, 2012 at 4:28 PM, Dave Chinner wrote: > On Mon, Nov 05, 2012 at 10:35:50AM +0800, Zhi Yong Wu wrote: >> On Sat, Nov 3, 2012 at 5:27 AM, Mingming.cao wrote: >> > On Fri, 2012-11-02 at 14:38 +0800, Zhi Yong Wu wrote: >> >> Here also has another question. >> >> >> >> How to save the file temperature among the umount to be able to >> >> preserve the file tempreture after reboot? >> >> >> >> This above is the requirement from DB product. >> >> I thought that we can save file temperature in its inode struct, that >> >> is, add one new field in struct inode, then this info will be written >> >> to disk with inode. >> >> >> >> Any comments or ideas are appreciated, thanks. >> >> >> >> >> > >> > Maybe could save the last file temperature with extended attributes. >> It seems that only ext4 has the concept of extended attributes. > > All major filesystems have xattr support. They are used extensively > by the security and integrity subsystems, for example. got it, thanks. > > Saving the information might be something that is useful to certian > applications, but lets have the people that need that functionality > spell out their requirements before discussing how or what to > implement. Indeed, discussion shoul dreally focus on getting the > core, in-memory infrastructure sorted out first before trying to > expand the functionality further... ah, but the latest patchset need some love from experienced FS guys:)... > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: VFS hot tracking: How to calculate data temperature?
On Mon, Nov 5, 2012 at 6:33 PM, Steven Whitehouse wrote: > Hi, > > On Mon, 2012-11-05 at 16:44 +0800, Zhi Yong Wu wrote: >> On Mon, Nov 5, 2012 at 4:28 PM, Dave Chinner wrote: >> > On Mon, Nov 05, 2012 at 10:35:50AM +0800, Zhi Yong Wu wrote: >> >> On Sat, Nov 3, 2012 at 5:27 AM, Mingming.cao wrote: >> >> > On Fri, 2012-11-02 at 14:38 +0800, Zhi Yong Wu wrote: >> >> >> Here also has another question. >> >> >> >> >> >> How to save the file temperature among the umount to be able to >> >> >> preserve the file tempreture after reboot? >> >> >> >> >> >> This above is the requirement from DB product. >> >> >> I thought that we can save file temperature in its inode struct, that >> >> >> is, add one new field in struct inode, then this info will be written >> >> >> to disk with inode. >> >> >> >> >> >> Any comments or ideas are appreciated, thanks. >> >> >> >> >> >> >> >> > >> >> > Maybe could save the last file temperature with extended attributes. >> >> It seems that only ext4 has the concept of extended attributes. >> > >> > All major filesystems have xattr support. They are used extensively >> > by the security and integrity subsystems, for example. >> got it, thanks. >> > >> > Saving the information might be something that is useful to certian >> > applications, but lets have the people that need that functionality >> > spell out their requirements before discussing how or what to >> > implement. Indeed, discussion shoul dreally focus on getting the >> > core, in-memory infrastructure sorted out first before trying to >> > expand the functionality further... >> ah, but the latest patchset need some love from experienced FS guys:)... > > There is one other possible issue with saving the data into the > filesystem, which is that it may disturb what you are trying to measure. > Some filesystems (GFS2 is one) store data for small inodes in the same > block as the inode itself. So that means the accesses to the saved hot > tracking info may potentially affect the data access times too. Also > there is a very limited amount of space to expand the number of fields > in the inode, so xattr may be the only solution, depending on how much > data needs to be stored in each case. Very good analysis, two possible issues are very meanful, thanks. > > In the GFS2 case (I don't think it is unique in this) xattrs are stored > out of line and having to access them in every open means an extra block > read per inode, which again has performance implications. > > So that is not an insurmountable problem, but something to take into > account in selecting a solution, In summary, you look like preferring to xattr as its solution. > > Steve. > > > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
On Mon, Nov 5, 2012 at 7:07 PM, Steven Whitehouse wrote: > Hi, > > On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Add some util helpers to update access frequencies >> for one file or its range. >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/hot_tracking.c| 179 >> ++ >> fs/hot_tracking.h|7 ++ >> include/linux/hot_tracking.h |2 + >> 3 files changed, 188 insertions(+), 0 deletions(-) >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> index 68591f0..0a7d9a3 100644 >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root) >> } >> } >> >> +struct hot_inode_item >> +*hot_inode_item_find(struct hot_info *root, u64 ino) >> +{ >> + struct hot_inode_item *he; >> + int ret; >> + >> +again: >> + spin_lock(&root->lock); >> + he = radix_tree_lookup(&root->hot_inode_tree, ino); >> + if (he) { >> + kref_get(&he->hot_inode.refs); >> + spin_unlock(&root->lock); >> + return he; >> + } >> + spin_unlock(&root->lock); >> + >> + he = kmem_cache_zalloc(hot_inode_item_cachep, >> + GFP_KERNEL | GFP_NOFS); > This doesn't look quite right... which of these two did you mean? I > assume probably just GFP_NOFS Yes, good catch, thanks. > >> + if (!he) >> + return ERR_PTR(-ENOMEM); >> + >> + hot_inode_item_init(he, ino, &root->hot_inode_tree); >> + >> + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); >> + if (ret) { >> + kmem_cache_free(hot_inode_item_cachep, he); >> + return ERR_PTR(ret); >> + } >> + >> + spin_lock(&root->lock); >> + ret = radix_tree_insert(&root->hot_inode_tree, ino, he); >> + if (ret == -EEXIST) { >> + kmem_cache_free(hot_inode_item_cachep, he); >> + spin_unlock(&root->lock); >> + radix_tree_preload_end(); >> + goto again; >> + } >> + spin_unlock(&root->lock); >> + radix_tree_preload_end(); >> + >> + kref_get(&he->hot_inode.refs); >> + return he; >> +} >> +EXPORT_SYMBOL_GPL(hot_inode_item_find); >> + >> +static struct hot_range_item >> +*hot_range_item_find(struct hot_inode_item *he, >> + u32 start) >> +{ >> + struct hot_range_item *hr; >> + int ret; >> + >> +again: >> + spin_lock(&he->lock); >> + hr = radix_tree_lookup(&he->hot_range_tree, start); >> + if (hr) { >> + kref_get(&hr->hot_range.refs); >> + spin_unlock(&he->lock); >> + return hr; >> + } >> + spin_unlock(&he->lock); >> + >> + hr = kmem_cache_zalloc(hot_range_item_cachep, >> + GFP_KERNEL | GFP_NOFS); > Likewise, here too. ditto > > Steve. > > > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse wrote: > Hi, > > On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Add a per-superblock workqueue and a delayed_work >> to run periodic work to update map info on each superblock. >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/hot_tracking.c| 85 >> ++ >> fs/hot_tracking.h|3 + >> include/linux/hot_tracking.h |3 + >> 3 files changed, 91 insertions(+), 0 deletions(-) >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> index fff0038..0ef9cad 100644 >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -15,9 +15,12 @@ >> #include >> #include >> #include >> +#include >> +#include >> #include >> #include >> #include >> +#include >> #include >> #include "hot_tracking.h" >> >> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root) >> } >> } >> >> +/* Temperature compare function*/ >> +static int hot_temp_cmp(void *priv, struct list_head *a, >> + struct list_head *b) >> +{ >> + struct hot_comm_item *ap = >> + container_of(a, struct hot_comm_item, n_list); >> + struct hot_comm_item *bp = >> + container_of(b, struct hot_comm_item, n_list); >> + >> + int diff = ap->hot_freq_data.last_temp >> + - bp->hot_freq_data.last_temp; >> + if (diff > 0) >> + return -1; >> + if (diff < 0) >> + return 1; >> + return 0; >> +} >> + >> +/* >> + * Every sync period we update temperatures for >> + * each hot inode item and hot range item for aging >> + * purposes. >> + */ >> +static void hot_update_worker(struct work_struct *work) >> +{ >> + struct hot_info *root = container_of(to_delayed_work(work), >> + struct hot_info, update_work); >> + struct hot_inode_item *hi_nodes[8]; >> + u64 ino = 0; >> + int i, n; >> + >> + while (1) { >> + n = radix_tree_gang_lookup(&root->hot_inode_tree, >> +(void **)hi_nodes, ino, >> +ARRAY_SIZE(hi_nodes)); >> + if (!n) >> + break; >> + >> + ino = hi_nodes[n - 1]->i_ino + 1; >> + for (i = 0; i < n; i++) { >> + kref_get(&hi_nodes[i]->hot_inode.refs); >> + hot_map_array_update( >> + &hi_nodes[i]->hot_inode.hot_freq_data, root); >> + hot_range_update(hi_nodes[i], root); >> + hot_inode_item_put(hi_nodes[i]); >> + } >> + } >> + >> + /* Sort temperature map info */ >> + for (i = 0; i < HEAT_MAP_SIZE; i++) { >> + list_sort(NULL, &root->heat_inode_map[i].node_list, >> + hot_temp_cmp); >> + list_sort(NULL, &root->heat_range_map[i].node_list, >> + hot_temp_cmp); >> + } >> + > > If this list can potentially have one (or more) entries per inode, then Only one hot_inode_item per inode, while maybe multiple hot_range_items per inode. > filesystems with a lot of inodes (millions) may potentially exceed the > max size of list which list_sort() can handle. If that happens it still > works, but you'll get a warning message and it won't be as efficient. I haven't do so large scale test. If we want to find that issue, we need to do large scale performance test, before that, i want to make sure the code change is correct at first. To be honest, for that issue you pointed to, i also have such concern.But list_sort() performance looks good from the test result of the following URL: https://lkml.org/lkml/2010/1/20/485 > > It is something that we've run into with list_sort() and GFS2, but it > only happens very rarely, Beside list_sort(), do you have any other way to share? For this concern, how does GFS2 resolve it? > > Steve. > > > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: VFS hot tracking: How to calculate data temperature?
On Mon, Nov 5, 2012 at 7:57 PM, Steven Whitehouse wrote: > Hi, > > On Mon, 2012-11-05 at 19:46 +0800, Zhi Yong Wu wrote: >> On Mon, Nov 5, 2012 at 6:33 PM, Steven Whitehouse >> wrote: >> > Hi, >> > >> > On Mon, 2012-11-05 at 16:44 +0800, Zhi Yong Wu wrote: >> >> On Mon, Nov 5, 2012 at 4:28 PM, Dave Chinner wrote: >> >> > On Mon, Nov 05, 2012 at 10:35:50AM +0800, Zhi Yong Wu wrote: >> >> >> On Sat, Nov 3, 2012 at 5:27 AM, Mingming.cao wrote: >> >> >> > On Fri, 2012-11-02 at 14:38 +0800, Zhi Yong Wu wrote: >> >> >> >> Here also has another question. >> >> >> >> >> >> >> >> How to save the file temperature among the umount to be able to >> >> >> >> preserve the file tempreture after reboot? >> >> >> >> >> >> >> >> This above is the requirement from DB product. >> >> >> >> I thought that we can save file temperature in its inode struct, >> >> >> >> that >> >> >> >> is, add one new field in struct inode, then this info will be >> >> >> >> written >> >> >> >> to disk with inode. >> >> >> >> >> >> >> >> Any comments or ideas are appreciated, thanks. >> >> >> >> >> >> >> >> >> >> >> > >> >> >> > Maybe could save the last file temperature with extended attributes. >> >> >> It seems that only ext4 has the concept of extended attributes. >> >> > >> >> > All major filesystems have xattr support. They are used extensively >> >> > by the security and integrity subsystems, for example. >> >> got it, thanks. >> >> > >> >> > Saving the information might be something that is useful to certian >> >> > applications, but lets have the people that need that functionality >> >> > spell out their requirements before discussing how or what to >> >> > implement. Indeed, discussion shoul dreally focus on getting the >> >> > core, in-memory infrastructure sorted out first before trying to >> >> > expand the functionality further... >> >> ah, but the latest patchset need some love from experienced FS >> >> guys:)... >> > >> > There is one other possible issue with saving the data into the >> > filesystem, which is that it may disturb what you are trying to measure. >> > Some filesystems (GFS2 is one) store data for small inodes in the same >> > block as the inode itself. So that means the accesses to the saved hot >> > tracking info may potentially affect the data access times too. Also >> > there is a very limited amount of space to expand the number of fields >> > in the inode, so xattr may be the only solution, depending on how much >> > data needs to be stored in each case. >> Very good analysis, two possible issues are very meanful, thanks. >> > >> > In the GFS2 case (I don't think it is unique in this) xattrs are stored >> > out of line and having to access them in every open means an extra block >> > read per inode, which again has performance implications. >> > >> > So that is not an insurmountable problem, but something to take into >> > account in selecting a solution, >> In summary, you look like preferring to xattr as its solution. >> > > Well, that depends on exactly how large the data to be stored is, and > other factors. It will add overhead to the storage/retrieval but at > least it is fairly generic (wrt on-disk format) so likely to be easier > to retrofit to existing filesystems. Do you have some idea with more details about how to retrofit to existing FS?:) > > I suspect this may be one of those cases where there is no obvious right > answer and it is a case of selecting the least worst option, if that > makes sense? Then we can only check which solution is better via large scale performance test. > > Steve. > > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 09/19] vfs: add one work queue
On Mon, Nov 5, 2012 at 8:07 PM, Steven Whitehouse wrote: > Hi, > > On Mon, 2012-11-05 at 19:55 +0800, Zhi Yong Wu wrote: >> On Mon, Nov 5, 2012 at 7:21 PM, Steven Whitehouse >> wrote: >> > Hi, >> > >> > On Mon, 2012-10-29 at 12:30 +0800, zwu.ker...@gmail.com wrote: >> >> From: Zhi Yong Wu >> >> >> >> Add a per-superblock workqueue and a delayed_work >> >> to run periodic work to update map info on each superblock. >> >> >> >> Signed-off-by: Zhi Yong Wu >> >> --- >> >> fs/hot_tracking.c| 85 >> >> ++ >> >> fs/hot_tracking.h|3 + >> >> include/linux/hot_tracking.h |3 + >> >> 3 files changed, 91 insertions(+), 0 deletions(-) >> >> >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> >> index fff0038..0ef9cad 100644 >> >> --- a/fs/hot_tracking.c >> >> +++ b/fs/hot_tracking.c >> >> @@ -15,9 +15,12 @@ >> >> #include >> >> #include >> >> #include >> >> +#include >> >> +#include >> >> #include >> >> #include >> >> #include >> >> +#include >> >> #include >> >> #include "hot_tracking.h" >> >> >> >> @@ -557,6 +560,67 @@ static void hot_map_array_exit(struct hot_info *root) >> >> } >> >> } >> >> >> >> +/* Temperature compare function*/ >> >> +static int hot_temp_cmp(void *priv, struct list_head *a, >> >> + struct list_head *b) >> >> +{ >> >> + struct hot_comm_item *ap = >> >> + container_of(a, struct hot_comm_item, n_list); >> >> + struct hot_comm_item *bp = >> >> + container_of(b, struct hot_comm_item, n_list); >> >> + >> >> + int diff = ap->hot_freq_data.last_temp >> >> + - bp->hot_freq_data.last_temp; >> >> + if (diff > 0) >> >> + return -1; >> >> + if (diff < 0) >> >> + return 1; >> >> + return 0; >> >> +} >> >> + >> >> +/* >> >> + * Every sync period we update temperatures for >> >> + * each hot inode item and hot range item for aging >> >> + * purposes. >> >> + */ >> >> +static void hot_update_worker(struct work_struct *work) >> >> +{ >> >> + struct hot_info *root = container_of(to_delayed_work(work), >> >> + struct hot_info, update_work); >> >> + struct hot_inode_item *hi_nodes[8]; >> >> + u64 ino = 0; >> >> + int i, n; >> >> + >> >> + while (1) { >> >> + n = radix_tree_gang_lookup(&root->hot_inode_tree, >> >> +(void **)hi_nodes, ino, >> >> +ARRAY_SIZE(hi_nodes)); >> >> + if (!n) >> >> + break; >> >> + >> >> + ino = hi_nodes[n - 1]->i_ino + 1; >> >> + for (i = 0; i < n; i++) { >> >> + kref_get(&hi_nodes[i]->hot_inode.refs); >> >> + hot_map_array_update( >> >> + &hi_nodes[i]->hot_inode.hot_freq_data, >> >> root); >> >> + hot_range_update(hi_nodes[i], root); >> >> + hot_inode_item_put(hi_nodes[i]); >> >> + } >> >> + } >> >> + >> >> + /* Sort temperature map info */ >> >> + for (i = 0; i < HEAT_MAP_SIZE; i++) { >> >> + list_sort(NULL, &root->heat_inode_map[i].node_list, >> >> + hot_temp_cmp); >> >> + list_sort(NULL, &root->heat_range_map[i].node_list, >> >> + hot_temp_cmp); >> >> + } >> >> + >> > >> > If this list can potentially have one (or more) entries per inode, then >> Only one hot_inode_item per inode, while maybe multiple >> hot_range_items per inode. >> > filesystems with a lot of inodes (millions) may potentially exceed the >> &g
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner wrote: > On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> One root structure hot_info is defined, is hooked >> up in super_block, and will be used to hold rb trees >> root, hash list root and some other information, etc. >> Adds hot_inode_tree struct to keep track of >> frequently accessed files, and be keyed by {inode, offset}. >> Trees contain hot_inode_items representing those files >> and ranges. >> Having these trees means that vfs can quickly determine the >> temperature of some data by doing some calculations on the >> hot_freq_data struct that hangs off of the tree item. >> Define two items hot_inode_item and hot_range_item, >> one of them represents one tracked file >> to keep track of its access frequency and the tree of >> ranges in this file, while the latter represents >> a file range of one inode. >> Each of the two structures contains a hot_freq_data >> struct with its frequency of access metrics (number of >> {reads, writes}, last {read,write} time, frequency of >> {reads,writes}). >> Also, each hot_inode_item contains one hot_range_tree >> struct which is keyed by {inode, offset, length} >> and used to keep track of all the ranges in this file. >> >> Signed-off-by: Zhi Yong Wu > > Just a coupl eof minor formatting things first up - I'll have more > comments as I get deeper into the series. All comments are very reasonable, and will be applied. thanks for your review. > > >> +/* >> + * Initialize the inode tree. Should be called for each new inode >> + * access or other user of the hot_inode interface. >> + */ >> +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree) > > The names of these are a bit clunky. You probably don't need the > "_rb_" in the function name. i.e. hot_inode_tree_init() is > sufficient, and if we every want to change in the tree type we don't > have to rename every single function... > > . >> +/* >> + * Initialize a new hot_inode_item structure. The new structure is >> + * returned with a reference count of one and needs to be >> + * freed using free_inode_item() >> + */ >> +void hot_rb_inode_item_init(void *_item) >> +{ > > The usual naming convention for slab initialiser functions is to use > a suffix of "_once" to indicate it is only ever called once per > slab object instantiation, not every time the object is allocated > fom the slab. See, for example, inode_init_once() and > inode_init_always(). > > so, that would make this function hot_inode_item_init_once(). > > >> +/* init hot_inode_item and hot_range_item kmem cache */ >> +static int __init hot_rb_item_cache_init(void) >> +{ >> + hot_inode_item_cache = kmem_cache_create("hot_inode_item", >> + sizeof(struct hot_inode_item), 0, >> + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, >> + hot_rb_inode_item_init); >> + if (!hot_inode_item_cache) >> + goto inode_err; >> + >> + hot_range_item_cache = kmem_cache_create("hot_range_item", >> + sizeof(struct hot_range_item), 0, >> + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, >> + hot_rb_range_item_init); >> + if (!hot_range_item_cache) >> + goto range_err; >> + >> + return 0; >> + >> +range_err: >> + kmem_cache_destroy(hot_inode_item_cache); >> +inode_err: >> + return -ENOMEM; >> +} >> + >> +/* >> + * Initialize kmem cache for hot_inode_item >> + * and hot_range_item >> + */ >> +void __init hot_track_cache_init(void) >> +{ >> + if (hot_rb_item_cache_init()) >> + return; > > No real need to have a hot_rb_item_cache_init() function here - just > open code it all in the hot_track_cache_init() function. > >> +} >> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h >> new file mode 100644 >> index 000..269b67a >> --- /dev/null >> +++ b/fs/hot_tracking.h >> @@ -0,0 +1,27 @@ >> +/* >> + * fs/hot_tracking.h >> + * >> + * Copyright (C) 2012 IBM Corp. All rights reserved. >> + * Written by Zhi Yong Wu >> + *Ben Chociej >> + * >> + * This program is free software; you can redistribute it and/or >> + * modify it under the terms of the GNU General Public >>
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Tue, Sep 25, 2012 at 3:37 PM, Dave Chinner wrote: > On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> One root structure hot_info is defined, is hooked >> up in super_block, and will be used to hold rb trees >> root, hash list root and some other information, etc. >> Adds hot_inode_tree struct to keep track of >> frequently accessed files, and be keyed by {inode, offset}. >> Trees contain hot_inode_items representing those files >> and ranges. >> Having these trees means that vfs can quickly determine the >> temperature of some data by doing some calculations on the >> hot_freq_data struct that hangs off of the tree item. >> Define two items hot_inode_item and hot_range_item, >> one of them represents one tracked file >> to keep track of its access frequency and the tree of >> ranges in this file, while the latter represents >> a file range of one inode. >> Each of the two structures contains a hot_freq_data >> struct with its frequency of access metrics (number of >> {reads, writes}, last {read,write} time, frequency of >> {reads,writes}). >> Also, each hot_inode_item contains one hot_range_tree >> struct which is keyed by {inode, offset, length} >> and used to keep track of all the ranges in this file. >> >> Signed-off-by: Zhi Yong Wu > > Just a coupl eof minor formatting things first up - I'll have more > comments as I get deeper into the series. OK, very look forward to seeing more on other patches, indeed thanks again. > > >> +/* >> + * Initialize the inode tree. Should be called for each new inode >> + * access or other user of the hot_inode interface. >> + */ >> +static void hot_rb_inode_tree_init(struct hot_inode_tree *tree) > > The names of these are a bit clunky. You probably don't need the > "_rb_" in the function name. i.e. hot_inode_tree_init() is > sufficient, and if we every want to change in the tree type we don't > have to rename every single function... > > . >> +/* >> + * Initialize a new hot_inode_item structure. The new structure is >> + * returned with a reference count of one and needs to be >> + * freed using free_inode_item() >> + */ >> +void hot_rb_inode_item_init(void *_item) >> +{ > > The usual naming convention for slab initialiser functions is to use > a suffix of "_once" to indicate it is only ever called once per > slab object instantiation, not every time the object is allocated > fom the slab. See, for example, inode_init_once() and > inode_init_always(). > > so, that would make this function hot_inode_item_init_once(). > > >> +/* init hot_inode_item and hot_range_item kmem cache */ >> +static int __init hot_rb_item_cache_init(void) >> +{ >> + hot_inode_item_cache = kmem_cache_create("hot_inode_item", >> + sizeof(struct hot_inode_item), 0, >> + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, >> + hot_rb_inode_item_init); >> + if (!hot_inode_item_cache) >> + goto inode_err; >> + >> + hot_range_item_cache = kmem_cache_create("hot_range_item", >> + sizeof(struct hot_range_item), 0, >> + SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, >> + hot_rb_range_item_init); >> + if (!hot_range_item_cache) >> + goto range_err; >> + >> + return 0; >> + >> +range_err: >> + kmem_cache_destroy(hot_inode_item_cache); >> +inode_err: >> + return -ENOMEM; >> +} >> + >> +/* >> + * Initialize kmem cache for hot_inode_item >> + * and hot_range_item >> + */ >> +void __init hot_track_cache_init(void) >> +{ >> + if (hot_rb_item_cache_init()) >> + return; > > No real need to have a hot_rb_item_cache_init() function here - just > open code it all in the hot_track_cache_init() function. > >> +} >> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h >> new file mode 100644 >> index 000..269b67a >> --- /dev/null >> +++ b/fs/hot_tracking.h >> @@ -0,0 +1,27 @@ >> +/* >> + * fs/hot_tracking.h >> + * >> + * Copyright (C) 2012 IBM Corp. All rights reserved. >> + * Written by Zhi Yong Wu >> + *Ben Chociej >> + * >> + * This program is free software; you can redistribute it and/or >> + * modify it under the terms of the GNU General Public >> + * Li
Re: [PATCH v2 1/2] btrfs-progs: Close file descriptor on exit
On Tue, Sep 25, 2012 at 6:12 PM, David Sterba wrote: > On Tue, Sep 25, 2012 at 10:02:15AM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Need to close fd on exit. > > Strictly you don't need to, kernel will do that at exit() time. I know, but it is not so nice. > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage
On Tue, Sep 25, 2012 at 6:14 PM, David Sterba wrote: > On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Some code pathes forget to free memory on exit. > > Same as with the fd's, kernel will free all memory for us at exit(). hi, can you let me know the pointer to exit() said by you? > > If there's lots of memory allocated, it may be even faster to leave the > unallocation process to kernel as it will do it in one go, while the > application would unnecessarily free it chunk by chunk. got it, thanks. > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v2 02/10] vfs: add support for updating access frequency
thanks a lot for your review in my heart, Dave. It is very helpful to me. On Tue, Sep 25, 2012 at 5:17 PM, Dave Chinner wrote: > On Sun, Sep 23, 2012 at 08:56:27PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Add some utils helpers to update access frequencies >> for one file or its range. >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/hot_tracking.c | 359 >> + >> fs/hot_tracking.h | 15 +++ >> 2 files changed, 374 insertions(+), 0 deletions(-) >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> index 173054b..52ed926 100644 >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -106,6 +106,365 @@ inode_err: >> } >> >> /* >> + * Drops the reference out on hot_inode_item by one and free the structure >> + * if the reference count hits zero >> + */ >> +void hot_rb_free_hot_inode_item(struct hot_inode_item *he) > > hot_inode_item_put() > >> +{ >> + if (!he) >> + return; > > It's a bug to call a put function on a kref counted item with a null > pointer. Let the kernel crash so it is noticed and fixed. OK, will remove it. > >> + >> + if (atomic_dec_and_test(&he->refs.refcount)) { >> + WARN_ON(he->in_tree); >> + kmem_cache_free(hot_inode_item_cache, he); >> + } > > Isn't this abusing krefs? i.e. this should be: Sorry, thanks for your explaination as below: > > hot_inode_item_free() > { > WARN_ON(he->in_tree); > kmem_cache_free(hot_inode_item_cache, he); > } > > hot_inode_item_put() > { > kref_put(&he->refs, hot_inode_item_free) > } > >> +/* >> + * Drops the reference out on hot_range_item by one and free the structure >> + * if the reference count hits zero >> + */ >> +static void hot_rb_free_hot_range_item(struct hot_range_item *hr) > > same comments as above. OK, thanks. > >> +static struct rb_node *hot_rb_insert_hot_inode_item(struct rb_root *root, >> + unsigned long inode_num, >> + struct rb_node *node) > > static struct rb_node * > hot_inode_item_find(. ) OK, thanks. > >> +{ >> + struct rb_node **p = &root->rb_node; >> + struct rb_node *parent = NULL; >> + struct hot_inode_item *entry; >> + >> + /* walk tree to find insertion point */ >> + while (*p) { >> + parent = *p; >> + entry = rb_entry(parent, struct hot_inode_item, rb_node); >> + >> + if (inode_num < entry->i_ino) >> + p = &(*p)->rb_left; >> + else if (inode_num > entry->i_ino) >> + p = &(*p)->rb_right; >> + else >> + return parent; >> + } >> + >> + entry = rb_entry(node, struct hot_inode_item, rb_node); >> + entry->in_tree = 1; >> + rb_link_node(node, parent, p); >> + rb_insert_color(node, root); > > So the hot inode item search key is the inode number? Why use an Yes > rb-tree then? Wouldn't a btree be a more memory efficient way to > hold a sparse index that has fixed key and pointer sizes? Yes, i know, but if we don't use btree, what will be better? Radix tree? > > Also, the API seems quite strange. you pass in the the rb_node and > the inode number which instead of passing in the hot inode item that > already holds this information. You then convert the rb_node back to > a hot inode item to set the in_tree variable. So why not just pass > in the struct hot_inode_item in the first place? Good catch, thanks for your remider. > >> +static u64 hot_rb_range_end(struct hot_range_item *hr) > > hot_range_item_end() OK > >> +{ >> + if (hr->start + hr->len < hr->start) >> + return (u64)-1; >> + >> + return hr->start + hr->len - 1; >> +} >> + >> +static struct rb_node *hot_rb_insert_hot_range_item(struct rb_root *root, > > hot_range_item_find() OK > >> + u64 start, >> + struct rb_node *node) >> +{ >> + struct rb_node **p = &root->rb_node; >> + struct rb_node *parent = NULL; >> + struct hot_range_item *entry; >> + >> + /* ensure start is on a range boundary */ >> + start = start & RANGE_SIZE_MASK; >
Re: [RFC v2 03/10] vfs: add one new mount option '-o hottrack'
On Tue, Sep 25, 2012 at 5:28 PM, Dave Chinner wrote: > On Sun, Sep 23, 2012 at 08:56:28PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Introduce one new mount option '-o hottrack', >> and add its parsing support. >> Its usage looks like: >>mount -o hottrack >>mount -o nouser,hottrack >>mount -o nouser,hottrack,loop >>mount -o hottrack,nouser > > I think that this option parsing should be done by the filesystem, > even though the tracking functionality is in the VFS. That way ony > the filesystems that can use the tracking information will turn it > on, rather than being able to turn it on for everything regardless > of whether it is useful or not. > > Along those lines, just using a normal superblock flag to indicate > it is active (e.g. MS_HOT_INODE_TRACKING in sb->s_flags) means you > don't need to allocate the sb->s_hot_info structure just to be able > to check whether we are tracking hot inodes or not. > > This then means the hot inode tracking for the superblock can be > initialised by the filesystem as part of it's fill_super method, > along with the filesystem specific code that will use the hot > tracking information the VFS gathers I can see what you mean, but don't know if other guys also agree with this. If so, all FS specific code which use hot tracking feature wll have to add the same chunk of code in it fill_super method. Is it good? > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 2/2] btrfs-progs: Fix up memory leakage
On Wed, Sep 26, 2012 at 1:14 AM, Goffredo Baroncelli wrote: > On 09/25/2012 12:14 PM, David Sterba wrote: >> >> On Tue, Sep 25, 2012 at 10:02:16AM +0800, zwu.ker...@gmail.com wrote: >>> >>> From: Zhi Yong Wu >>> >>>Some code pathes forget to free memory on exit. >> >> >> Same as with the fd's, kernel will free all memory for us at exit(). > > > I strongly disagree with this approach. The callee often don't know what > happen after and before the call. The same is true for the programmer, > because the code is quite often updated by several people. A clean exit() is > the right thing to do as general rule. I don't see any valid reason (in the > btrfs context) to do otherwise. > > Relying on the exit() for a proper clean-up increase the likelihood of bug > when the code evolves (see my patch [RESPOST][BTRFS-PROGS][PATCH] > btrfs_read_dev_super(): uninitialized variable for an example of what means > an incorrect deallocation of resource). > > >> If there's lots of memory allocated, it may be even faster to leave the >> unallocation process to kernel as it will do it in one go, while the >> application would unnecessarily free it chunk by chunk. > > > May be I am wrong, but I don't think that the increase of speed of the btrfs > "command" is even measurable relying on exit instead of free()-ing each > chunk of memory one at time The same should be true for the > open()/close() I fully agree with you. In one same function, i find that some code path free system sources, while other code path doesn't. This is one nice way. > > My 2¢ > > BR > G.Baroncelli > >> >> david >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> . >> > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v2 01/10] vfs: introduce private rb structures
On Tue, Sep 25, 2012 at 6:20 PM, Ram Pai wrote: > On Sun, Sep 23, 2012 at 08:56:26PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> One root structure hot_info is defined, is hooked >> up in super_block, and will be used to hold rb trees >> root, hash list root and some other information, etc. >> Adds hot_inode_tree struct to keep track of >> frequently accessed files, and be keyed by {inode, offset}. >> Trees contain hot_inode_items representing those files >> and ranges. >> Having these trees means that vfs can quickly determine the >> temperature of some data by doing some calculations on the >> hot_freq_data struct that hangs off of the tree item. >> Define two items hot_inode_item and hot_range_item, >> one of them represents one tracked file >> to keep track of its access frequency and the tree of >> ranges in this file, while the latter represents >> a file range of one inode. >> Each of the two structures contains a hot_freq_data >> struct with its frequency of access metrics (number of >> {reads, writes}, last {read,write} time, frequency of >> {reads,writes}). >> Also, each hot_inode_item contains one hot_range_tree >> struct which is keyed by {inode, offset, length} >> and used to keep track of all the ranges in this file. >> >> Signed-off-by: Zhi Yong Wu >> --- >> + > ..snip.. > >> +/* A tree that sits on the hot_info */ >> +struct hot_inode_tree { >> + struct rb_root map; >> + rwlock_t lock; >> +}; >> + >> +/* A tree of ranges for each inode in the hot_inode_tree */ >> +struct hot_range_tree { >> + struct rb_root map; >> + rwlock_t lock; >> +}; > > Can as well have a generic datastructure called hot_tree instead > of having two different datastructure which basically are the same. OK. > >> + >> +/* A frequency data struct holds values that are used to >> + * determine temperature of files and file ranges. These structs >> + * are members of hot_inode_item and hot_range_item >> + */ >> +struct hot_freq_data { >> + struct timespec last_read_time; >> + struct timespec last_write_time; >> + u32 nr_reads; >> + u32 nr_writes; >> + u64 avg_delta_reads; >> + u64 avg_delta_writes; >> + u8 flags; >> + u32 last_temperature; >> +}; >> + >> +/* An item representing an inode and its access frequency */ >> +struct hot_inode_item { >> + /* node for hot_inode_tree rb_tree */ >> + struct rb_node rb_node; >> + /* tree of ranges in this inode */ >> + struct hot_range_tree hot_range_tree; >> + /* frequency data for this inode */ >> + struct hot_freq_data hot_freq_data; >> + /* inode number, copied from inode */ >> + unsigned long i_ino; >> + /* used to check for errors in ref counting */ >> + u8 in_tree; >> + /* protects hot_freq_data, i_no, in_tree */ >> + spinlock_t lock; >> + /* prevents kfree */ >> + struct kref refs; >> +}; >> + >> +/* >> + * An item representing a range inside of an inode whose frequency >> + * is being tracked >> + */ >> +struct hot_range_item { >> + /* node for hot_range_tree rb_tree */ >> + struct rb_node rb_node; >> + /* frequency data for this range */ >> + struct hot_freq_data hot_freq_data; >> + /* the hot_inode_item associated with this hot_range_item */ >> + struct hot_inode_item *hot_inode; >> + /* starting offset of this range */ >> + u64 start; >> + /* length of this range */ >> + u64 len; >> + /* used to check for errors in ref counting */ >> + u8 in_tree; >> + /* protects hot_freq_data, start, len, and in_tree */ >> + spinlock_t lock; >> + /* prevents kfree */ >> + struct kref refs; >> +}; > > might as well have just one generic datastructure called hot_item with > all the common fields and then have > > struct hot_inode_item { > struct hot_item hot_inode; > struct hot_tree hot_range_tree; > unsigned long i_ino; > } > > and > > struct hot_range_item { > struct hot_item hot_range; > u64 start; > u64 len;/* length of this range */ > } > > This should help you eliminate some duplicate code as well. OK, i will try to apply them. thanks. > > > RP > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v2 05/10] vfs: introduce one hash table
On Tue, Sep 25, 2012 at 5:54 PM, Ram Pai wrote: > On Sun, Sep 23, 2012 at 08:56:30PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Adds a hash table structure which contains >> a lot of hash list and is used to efficiently >> look up the data temperature of a file or its >> ranges. >> In each hash list of hash table, the hash node >> will keep track of temperature info. >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/hot_tracking.c| 77 >> - >> include/linux/hot_tracking.h | 35 +++ >> 2 files changed, 110 insertions(+), 2 deletions(-) >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> index fa89f70..5f96442 100644 >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -16,6 +16,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -24,6 +25,9 @@ > > ...snip... > >> +/* Hash list heads for hot hash table */ >> +struct hot_hash_head { >> + struct hlist_head hashhead; >> + rwlock_t rwlock; >> + u32 temperature; >> +}; >> + >> +/* Nodes stored in each hash list of hash table */ >> +struct hot_hash_node { >> + struct hlist_node hashnode; >> + struct list_head node; >> + struct hot_freq_data *hot_freq_data; >> + struct hot_hash_head *hlist; >> + spinlock_t lock; /* protects hlist */ >> + >> + /* >> + * number of references to this node >> + * equals 1 (hashlist entry) >> + */ >> + struct kref refs; >> +}; > > Dont see why you need yet another datastructure to hash the inode_item > and the range_item into a hash list. You can just add another If there's such one structure, we can rapidly know which hash bucket one inode_item is currently linking to via its hlist field. This is useful if one inode_item corresponding to one file need to move from one bucket to another bucket when this file get cold or hotter. Does it make sense to you? > hlist_node in the inode_item and range_item. This field can be then used > to link into the corresponding hash list. > > You can use the container_of() get to the inode_item or the range_item > using the hlist_node field. > > You can thus eliminate a lot of code. As what i said above, i don't think it can eliminate a lo of code. > >> + >> /* An item representing an inode and its access frequency */ >> struct hot_inode_item { >> /* node for hot_inode_tree rb_tree */ >> @@ -68,6 +93,8 @@ struct hot_inode_item { >> spinlock_t lock; >> /* prevents kfree */ >> struct kref refs; >> + /* hashlist node for this inode */ >> + struct hot_hash_node *heat_node; > > this can be just > struct hlist_node head_node; /* lookup hot_inode hash list */ > > Use this field to link it into the corresponding hashlist. > >> }; >> > this can be just >> /* >> @@ -91,6 +118,8 @@ struct hot_range_item { >> spinlock_t lock; >> /* prevents kfree */ >> struct kref refs; >> + /* hashlist node for this range */ >> + struct hot_hash_node *heat_node; > > this can be just > struct hlist_node head_node; /* lookup hot_range hash list */ > > >> }; >> >> struct hot_info { >> @@ -98,6 +127,12 @@ struct hot_info { >> >> /* red-black tree that keeps track of fs-wide hot data */ >> struct hot_inode_tree hot_inode_tree; >> + >> + /* hash map of inode temperature */ >> + struct hot_hash_head heat_inode_hl[HEAT_HASH_SIZE]; >> + >> + /* hash map of range temperature */ >> + struct hot_hash_head heat_range_hl[HEAT_HASH_SIZE]; >> }; >> >> #endif /* _LINUX_HOTTRACK_H */ > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4 03/15] vfs,hot_track: add the function for collecting I/O frequency
On Sun, Oct 28, 2012 at 3:55 PM, Zheng Liu wrote: > Hi Zhiyong, > > On Thu, Oct 25, 2012 at 11:08:55PM +0800, zwu.ker...@gmail.com wrote: > [snip] >> @@ -199,6 +342,54 @@ err: >> } >> >> /* >> + * Main function to update access frequency from read/writepage(s) hooks >> + */ >> +inline void hot_update_freqs(struct inode *inode, u64 start, >> + u64 len, int rw) > > This function seems too big. So we really need to inline this function? As Dave said in his comments, it will add a function call overhead even when tracking is not enabled. a static inline function will just result in no extra overhead other than the if statement > > Regards, > Zheng > >> +{ >> + struct hot_info *root = inode->i_sb->s_hot_root; >> + struct hot_inode_item *he; >> + struct hot_range_item *hr; >> + u32 cur, end; >> + >> + if (!root || (len == 0)) >> + return; >> + >> + he = hot_inode_item_find(root, inode->i_ino); >> + if (IS_ERR(he)) { >> + WARN_ON(1); >> + return; >> + } >> + >> + spin_lock(&he->hot_inode.lock); >> + hot_freq_data_update(&he->hot_inode.hot_freq_data, rw); >> + spin_unlock(&he->hot_inode.lock); >> + >> + /* >> + * Align ranges on RANGE_SIZE boundary >> + * to prevent proliferation of range structs >> + */ >> + end = (start + len + RANGE_SIZE - 1) >> RANGE_BITS; >> + for (cur = (start >> RANGE_BITS); cur < end; cur++) { >> + hr = hot_range_item_find(he, cur); >> + if (IS_ERR(hr)) { >> + WARN_ON(1); >> + hot_inode_item_put(he); >> + return; >> + } >> + >> + spin_lock(&hr->hot_range.lock); >> + hot_freq_data_update(&hr->hot_range.hot_freq_data, rw); >> + spin_unlock(&hr->hot_range.lock); >> + >> + hot_range_item_put(hr); >> + } >> + >> + hot_inode_item_put(he); >> +} >> +EXPORT_SYMBOL_GPL(hot_update_freqs); -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4 03/15] vfs,hot_track: add the function for collecting I/O frequency
On Mon, Oct 29, 2012 at 10:01 AM, Dave Chinner wrote: > On Sun, Oct 28, 2012 at 09:51:48PM +0800, Zhi Yong Wu wrote: >> On Sun, Oct 28, 2012 at 3:55 PM, Zheng Liu wrote: >> > Hi Zhiyong, >> > >> > On Thu, Oct 25, 2012 at 11:08:55PM +0800, zwu.ker...@gmail.com wrote: >> > [snip] >> >> @@ -199,6 +342,54 @@ err: >> >> } >> >> >> >> /* >> >> + * Main function to update access frequency from read/writepage(s) hooks >> >> + */ >> >> +inline void hot_update_freqs(struct inode *inode, u64 start, >> >> + u64 len, int rw) >> > >> > This function seems too big. So we really need to inline this function? >> As Dave said in his comments, it will add a function call >> overhead even when tracking is not enabled. a static inline function >> will just result in no extra overhead other than the if >> statement > > I don't think I said that with respect to this code. I think I said > it w.r.t. a define or a small wrapper that decides to call > hot_update_freqs(). A static inline fucntion should only be a > couple of lines of code at most. > > A static function, OTOH, can be inlined by the compiler if the > compiler thinks that is a win. But thanks for your explaination at first. > > . > >> >> +EXPORT_SYMBOL_GPL(hot_update_freqs); > > ... it's an exported function, so it can't be inline or static, so > using "inline" is wrong whatever way you look at it. ;) ah, but i' m surprised by why the compiler find this error. > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 00/19] vfs: hot data tracking
On Mon, Oct 29, 2012 at 6:30 PM, Andi Kleen wrote: > zwu.ker...@gmail.com writes: >> >> TODO List: >> >> 1.) Need to do scalability or performance tests. > > You're changing some of the most performance critical code in the > kernel. This step is absolutely not optional. ah, i know, but now i need to make sure all the codes are correct at first, then do these tests. > > -Andi > > -- > a...@linux.intel.com -- Speaking for myself only -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 13/19] debugfs: introduce one function
On Tue, Oct 30, 2012 at 2:11 AM, Greg KH wrote: > On Mon, Oct 29, 2012 at 12:30:55PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> The debugfs function is used to get expected dentry. > > Huh? Why do you need this? Why haven't you added documentation for the It is used to determine if one sysfs directory has been created. OK, i will add some doc, thanks for your suggestion. > function saying what it does? > > confused, > > greg k-h -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 15/19] sysfs: add two hot_track proc files
On Tue, Oct 30, 2012 at 2:10 AM, Greg KH wrote: > On Mon, Oct 29, 2012 at 12:30:57PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Add two proc files hot-kick-time and hot-update-delay >> under the dir /proc/sys/fs/ in order to turn >> TIME_TO_KICK and HEAT_UPDATE_DELAY into be tunable. > > As you say, these are proc files, not sysfs files, so please fix the > Subject: up here. ah, OK, i will fix it, thanks for your pointing it out. > > thanks, > > greg k-h -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 13/19] debugfs: introduce one function
On Tue, Oct 30, 2012 at 6:34 AM, Greg KH wrote: > On Tue, Oct 30, 2012 at 06:25:50AM +0800, Zhi Yong Wu wrote: >> On Tue, Oct 30, 2012 at 2:11 AM, Greg KH wrote: >> > On Mon, Oct 29, 2012 at 12:30:55PM +0800, zwu.ker...@gmail.com wrote: >> >> From: Zhi Yong Wu >> >> >> >> The debugfs function is used to get expected dentry. >> > >> > Huh? Why do you need this? Why haven't you added documentation for the >> It is used to determine if one sysfs directory has been created. OK, i >> will add some doc, thanks for your suggestion. > > You didn't answer the "why" part here. How come you think you need ah, Let me say its scenario at first. If we do two mount ops as below: 1.) mount -o loop,hot_track image1 /data1 2.) mount -o loop,hot_track image2 /data2 The mount -o hot_track operation will automatically create one sysfs directory /sys/kernel/debug/hot_track. To prevent this dir being created again when 2.) is done, we need to know if it has existed at first. In my patch, i at first get its dentry by this new function, then determine if its d_inode field is NULL, if no, it means that this sysfs dir has existed. This is the reason that i want to add one new function. > this? Can't you just save off the dentry you created somewhere so you > don't need to look it up again? Because i can't find one appropriate place to save it. > > greg k-h -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 13/19] debugfs: introduce one function
On Tue, Oct 30, 2012 at 6:54 AM, Greg KH wrote: > On Tue, Oct 30, 2012 at 06:45:19AM +0800, Zhi Yong Wu wrote: >> On Tue, Oct 30, 2012 at 6:34 AM, Greg KH wrote: >> > On Tue, Oct 30, 2012 at 06:25:50AM +0800, Zhi Yong Wu wrote: >> >> On Tue, Oct 30, 2012 at 2:11 AM, Greg KH >> >> wrote: >> >> > On Mon, Oct 29, 2012 at 12:30:55PM +0800, zwu.ker...@gmail.com wrote: >> >> >> From: Zhi Yong Wu >> >> >> >> >> >> The debugfs function is used to get expected dentry. >> >> > >> >> > Huh? Why do you need this? Why haven't you added documentation for the >> >> It is used to determine if one sysfs directory has been created. OK, i >> >> will add some doc, thanks for your suggestion. >> > >> > You didn't answer the "why" part here. How come you think you need >> ah, Let me say its scenario at first. If we do two mount ops as below: >> 1.) mount -o loop,hot_track image1 /data1 >> 2.) mount -o loop,hot_track image2 /data2 >> >> The mount -o hot_track operation will automatically create one sysfs >> directory /sys/kernel/debug/hot_track. To prevent this dir being >> created again when 2.) is done, we need to know if it has existed at >> first. In my patch, i at first get its dentry by this new function, >> then determine if its d_inode field is NULL, if no, it means that this >> sysfs dir has existed. >> This is the reason that i want to add one new function. > > Why not do like the rest of the kernel does and just have a: > static dentry *hot_track_root; > and use that as your root debugfs directory dentry: ah, i'm one newbie, don't get familar with other kernel part, but this is one good point, i will apply it, thanks. > > if (!hot_track_root) { > /* Create root directory */ > hot_track_root = debugfs_create(...); > } > > No need to look anything up :) > > thanks, > > greg k-h -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] rbtree: Add some necessary condition checks
On Mon, Sep 2, 2013 at 4:57 PM, Michel Lespinasse wrote: > On Sun, Sep 1, 2013 at 11:30 PM, Zhi Yong Wu wrote: >> In Tue, Aug 27, 2013 at 6:01 AM, Michel Lespinasse wrote: >>> On Fri, Aug 23, 2013 at 7:45 AM, wrote: >>>> From: Zhi Yong Wu >>>> >>>> Signed-off-by: Zhi Yong Wu >>>> --- >>>> include/linux/rbtree_augmented.h | 3 ++- >>>> lib/rbtree.c | 5 +++-- >>>> 2 files changed, 5 insertions(+), 3 deletions(-) >>> >>> So, you are saying that the checks are necessary, but you are not saying >>> why. >>> >>> The way I see it, the checks are *not* necessary, because the rbtree >>> invariants guarantee them to be true. The only way for the checks to >>> fail would be if people directly manipulate the rbtrees without going >>> through the proper APIs, and if they do that then I think they're on >>> their own. So to me, I think it's the same situation as dereferencing >>> a pointer without checking if it's NULL, because you know it should >>> never be NULL - which in my eyes is perfectly acceptable. >> In my patchset, some rbtree APIs to be invoked, and I think that those >> rbtree APIs are used corrently, Below is the pointer of its code: >> https://github.com/wuzhy/kernel/compare/torvalds:master...hot_tracking >> But I hit some issues when using compilebench to do perf benchmark. >> compile dir kernel-7 691MB in 8.92 seconds (77.53 MB/s) > > Thanks for the link - I now better understand where you are coming > from with these fixes. > > Going back to the original message: > >> diff --git a/include/linux/rbtree_augmented.h >> b/include/linux/rbtree_augmented.h >> index fea49b5..7d19770 100644 >> --- a/include/linux/rbtree_augmented.h >> +++ b/include/linux/rbtree_augmented.h >> @@ -199,7 +199,8 @@ __rb_erase_augmented(struct rb_node *node, struct >> rb_root *root, >> } >> >> successor->rb_left = tmp = node->rb_left; >> - rb_set_parent(tmp, successor); >> + if (tmp) >> + rb_set_parent(tmp, successor); >> >> pc = node->__rb_parent_color; >> tmp = __rb_parent(pc); > > Note that node->rb_left was already fetched at the top of > __rb_erase_augmented(), and was checked to be non-NULL at the time - > otherwise we would have executed 'Case 1' in that function. So, you > are not expected to find tmp == NULL here. > >> diff --git a/lib/rbtree.c b/lib/rbtree.c >> index c0e31fe..2cb01ba 100644 >> --- a/lib/rbtree.c >> +++ b/lib/rbtree.c >> @@ -214,7 +214,7 @@ rb_erase_color(struct rb_node *parent, struct >> rb_root *root, >> */ >> sibling = parent->rb_right; >> if (node != sibling) { /* node == parent->rb_left */ >> - if (rb_is_red(sibling)) { >> + if (sibling && rb_is_red(sibling)) { >> /* >> * Case 1 - left rotate at parent >> * > > Note the loop invariants quoted just above: > > /* > * Loop invariants: > * - node is black (or NULL on first iteration) > * - node is not the root (parent is not NULL) > * - All leaf paths going through parent and node have a > * black node count that is 1 lower than other leaf paths. > */ > > Because of these, each path from sibling to a leaf must include at > least one black node, which implies that sibling can't be NULL - or to > put it another way, if sibling is null then the expected invariants > were violated before we even got there. > >> @@ -226,7 +226,8 @@ rb_erase_color(struct rb_node *parent, struct >> rb_root *root, >> */ >> parent->rb_right = tmp1 = sibling->rb_left; >> sibling->rb_left = parent; >> - rb_set_parent_color(tmp1, parent, RB_BLACK); >> + if (tmp1) >> + rb_set_parent_color(tmp1, parent, >> RB_BLACK); >> __rb_rotate_set_parents(parent, sibling, >> root, >> RB_RED); >> augm
Re: [PATCH] rbtree: Add some necessary condition checks
On Thu, Sep 5, 2013 at 7:59 AM, Davidlohr Bueso wrote: > On Thu, 2013-09-05 at 01:22 +0800, Zhi Yong Wu wrote: >> On Mon, Sep 2, 2013 at 4:57 PM, Michel Lespinasse wrote: >> > On Sun, Sep 1, 2013 at 11:30 PM, Zhi Yong Wu wrote: >> >> In Tue, Aug 27, 2013 at 6:01 AM, Michel Lespinasse >> >> wrote: >> >>> On Fri, Aug 23, 2013 at 7:45 AM, wrote: >> >>>> From: Zhi Yong Wu >> >>>> >> >>>> Signed-off-by: Zhi Yong Wu >> >>>> --- >> >>>> include/linux/rbtree_augmented.h | 3 ++- >> >>>> lib/rbtree.c | 5 +++-- >> >>>> 2 files changed, 5 insertions(+), 3 deletions(-) >> >>> >> >>> So, you are saying that the checks are necessary, but you are not saying >> >>> why. >> >>> >> >>> The way I see it, the checks are *not* necessary, because the rbtree >> >>> invariants guarantee them to be true. The only way for the checks to >> >>> fail would be if people directly manipulate the rbtrees without going >> >>> through the proper APIs, and if they do that then I think they're on >> >>> their own. So to me, I think it's the same situation as dereferencing >> >>> a pointer without checking if it's NULL, because you know it should >> >>> never be NULL - which in my eyes is perfectly acceptable. >> >> In my patchset, some rbtree APIs to be invoked, and I think that those >> >> rbtree APIs are used corrently, Below is the pointer of its code: >> >> https://github.com/wuzhy/kernel/compare/torvalds:master...hot_tracking >> >> But I hit some issues when using compilebench to do perf benchmark. >> >> compile dir kernel-7 691MB in 8.92 seconds (77.53 MB/s) >> > >> > Thanks for the link - I now better understand where you are coming >> > from with these fixes. >> > >> > Going back to the original message: >> > >> >> diff --git a/include/linux/rbtree_augmented.h >> >> b/include/linux/rbtree_augmented.h >> >> index fea49b5..7d19770 100644 >> >> --- a/include/linux/rbtree_augmented.h >> >> +++ b/include/linux/rbtree_augmented.h >> >> @@ -199,7 +199,8 @@ __rb_erase_augmented(struct rb_node *node, struct >> >> rb_root *root, >> >> } >> >> >> >> successor->rb_left = tmp = node->rb_left; >> >> - rb_set_parent(tmp, successor); >> >> + if (tmp) >> >> + rb_set_parent(tmp, successor); >> >> >> >> pc = node->__rb_parent_color; >> >> tmp = __rb_parent(pc); >> > >> > Note that node->rb_left was already fetched at the top of >> > __rb_erase_augmented(), and was checked to be non-NULL at the time - >> > otherwise we would have executed 'Case 1' in that function. So, you >> > are not expected to find tmp == NULL here. >> > >> >> diff --git a/lib/rbtree.c b/lib/rbtree.c >> >> index c0e31fe..2cb01ba 100644 >> >> --- a/lib/rbtree.c >> >> +++ b/lib/rbtree.c >> >> @@ -214,7 +214,7 @@ rb_erase_color(struct rb_node *parent, struct >> >> rb_root *root, >> >> */ >> >> sibling = parent->rb_right; >> >> if (node != sibling) { /* node == parent->rb_left */ >> >> - if (rb_is_red(sibling)) { >> >> + if (sibling && rb_is_red(sibling)) { >> >> /* >> >> * Case 1 - left rotate at parent >> >> * >> > >> > Note the loop invariants quoted just above: >> > >> > /* >> > * Loop invariants: >> > * - node is black (or NULL on first iteration) >> > * - node is not the root (parent is not NULL) >> > * - All leaf paths going through parent and node have a >> > * black node count that is 1 lower than other leaf >> > paths. >> > */ >> > >> > Because of these, each path from sibling to a leaf must include at >> > least one black node, which implies that sibling can't be NULL - or to >
Re: [PATCH] rbtree: Add some necessary condition checks
On Thu, Sep 5, 2013 at 9:12 AM, Davidlohr Bueso wrote: > On Thu, 2013-09-05 at 08:37 +0800, Zhi Yong Wu wrote: >> On Thu, Sep 5, 2013 at 7:59 AM, Davidlohr Bueso wrote: >> > On Thu, 2013-09-05 at 01:22 +0800, Zhi Yong Wu wrote: >> >> On Mon, Sep 2, 2013 at 4:57 PM, Michel Lespinasse >> >> wrote: >> >> > On Sun, Sep 1, 2013 at 11:30 PM, Zhi Yong Wu >> >> > wrote: >> >> >> In Tue, Aug 27, 2013 at 6:01 AM, Michel Lespinasse >> >> >> wrote: >> >> >>> On Fri, Aug 23, 2013 at 7:45 AM, wrote: >> >> >>>> From: Zhi Yong Wu >> >> >>>> >> >> >>>> Signed-off-by: Zhi Yong Wu >> >> >>>> --- >> >> >>>> include/linux/rbtree_augmented.h | 3 ++- >> >> >>>> lib/rbtree.c | 5 +++-- >> >> >>>> 2 files changed, 5 insertions(+), 3 deletions(-) >> >> >>> >> >> >>> So, you are saying that the checks are necessary, but you are not >> >> >>> saying why. >> >> >>> >> >> >>> The way I see it, the checks are *not* necessary, because the rbtree >> >> >>> invariants guarantee them to be true. The only way for the checks to >> >> >>> fail would be if people directly manipulate the rbtrees without going >> >> >>> through the proper APIs, and if they do that then I think they're on >> >> >>> their own. So to me, I think it's the same situation as dereferencing >> >> >>> a pointer without checking if it's NULL, because you know it should >> >> >>> never be NULL - which in my eyes is perfectly acceptable. >> >> >> In my patchset, some rbtree APIs to be invoked, and I think that those >> >> >> rbtree APIs are used corrently, Below is the pointer of its code: >> >> >> https://github.com/wuzhy/kernel/compare/torvalds:master...hot_tracking >> >> >> But I hit some issues when using compilebench to do perf benchmark. >> >> >> compile dir kernel-7 691MB in 8.92 seconds (77.53 MB/s) >> >> > >> >> > Thanks for the link - I now better understand where you are coming >> >> > from with these fixes. >> >> > >> >> > Going back to the original message: >> >> > >> >> >> diff --git a/include/linux/rbtree_augmented.h >> >> >> b/include/linux/rbtree_augmented.h >> >> >> index fea49b5..7d19770 100644 >> >> >> --- a/include/linux/rbtree_augmented.h >> >> >> +++ b/include/linux/rbtree_augmented.h >> >> >> @@ -199,7 +199,8 @@ __rb_erase_augmented(struct rb_node *node, struct >> >> >> rb_root *root, >> >> >> } >> >> >> >> >> >> successor->rb_left = tmp = node->rb_left; >> >> >> - rb_set_parent(tmp, successor); >> >> >> + if (tmp) >> >> >> + rb_set_parent(tmp, successor); >> >> >> >> >> >> pc = node->__rb_parent_color; >> >> >> tmp = __rb_parent(pc); >> >> > >> >> > Note that node->rb_left was already fetched at the top of >> >> > __rb_erase_augmented(), and was checked to be non-NULL at the time - >> >> > otherwise we would have executed 'Case 1' in that function. So, you >> >> > are not expected to find tmp == NULL here. >> >> > >> >> >> diff --git a/lib/rbtree.c b/lib/rbtree.c >> >> >> index c0e31fe..2cb01ba 100644 >> >> >> --- a/lib/rbtree.c >> >> >> +++ b/lib/rbtree.c >> >> >> @@ -214,7 +214,7 @@ rb_erase_color(struct rb_node *parent, struct >> >> >> rb_root *root, >> >> >> */ >> >> >> sibling = parent->rb_right; >> >> >> if (node != sibling) { /* node == parent->rb_left */ >> >> >> - if (rb_is_red(sibling)) { >> >> >> + if (sibling && rb_is_red(sibling)) { >> >> >> /* >> >> >> * Case 1 - left r
Re: [PATCH] rbtree: Add some necessary condition checks
In Tue, Aug 27, 2013 at 6:01 AM, Michel Lespinasse wrote: > On Fri, Aug 23, 2013 at 7:45 AM, wrote: >> From: Zhi Yong Wu >> >> Signed-off-by: Zhi Yong Wu >> --- >> include/linux/rbtree_augmented.h | 3 ++- >> lib/rbtree.c | 5 +++-- >> 2 files changed, 5 insertions(+), 3 deletions(-) > > So, you are saying that the checks are necessary, but you are not saying why. > > The way I see it, the checks are *not* necessary, because the rbtree > invariants guarantee them to be true. The only way for the checks to > fail would be if people directly manipulate the rbtrees without going > through the proper APIs, and if they do that then I think they're on > their own. So to me, I think it's the same situation as dereferencing > a pointer without checking if it's NULL, because you know it should > never be NULL - which in my eyes is perfectly acceptable. In my patchset, some rbtree APIs to be invoked, and I think that those rbtree APIs are used corrently, Below is the pointer of its code: https://github.com/wuzhy/kernel/compare/torvalds:master...hot_tracking But I hit some issues when using compilebench to do perf benchmark. compile dir kernel-7 691MB in 8.92 seconds (77.53 MB/s) [ 411.597890] general protection fault: [#1] SMP [ 411.598011] Modules linked in: [ 411.598011] CPU: 0 PID: 24 Comm: kswapd0 Not tainted 3.11.0-rc7+ #1235 [ 411.598011] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 411.598011] task: 88029366df70 ti: 88029211e000 task.ti: 88029211e000 [ 411.598011] RIP: 0010:[] [] rb_erase+0x1bd/0x390 [ 411.598011] RSP: 0018:88029211fb38 EFLAGS: 00010206 [ 411.598011] RAX: 8801484805f8 RBX: 8801484804e0 RCX: 00880155f00a [ 411.598011] RDX: 880155f00ae1 RSI: 88028863 RDI: 880155f00ad8 [ 411.598011] RBP: 88029211fb38 R08: 880155f00ae1 R09: [ 411.598011] R10: 8801484805f8 R11: R12: 88028863 [ 411.598011] R13: 880148480520 R14: 3af0 R15: 8801484805b0 [ 411.598011] FS: () GS:88029fc0() knlGS: [ 411.598011] CS: 0010 DS: ES: CR0: 8005003b [ 411.598011] CR2: f7798000 CR3: 00028a69d000 CR4: 06f0 [ 411.598011] Stack: [ 411.598011] 88029211fb68 811dd6f9 8801484804e0 88028863 [ 411.598011] 811dd1d0 3af0 88029211fb78 811dd79c [ 411.598011] 88029211fbe8 811dd8f9 88029211fba8 880288632008 [ 411.598011] Call Trace: [ 411.598011] [] hot_inode_item_free+0x29/0xa0 [ 411.598011] [] ? hot_mem_limit_sum+0x10/0x10 [ 411.598011] [] hot_inode_item_put+0x2c/0x30 [ 411.598011] [] hot_item_evict.part.5+0xa9/0x120 [ 411.598011] [] hot_track_prune+0x3a/0x70 [ 411.598011] [] shrink_slab+0x153/0x2f0 [ 411.598011] [] ? up_read+0x23/0x40 [ 411.598011] [] balance_pgdat+0x491/0x5d0 [ 411.598011] [] kswapd+0x190/0x470 [ 411.598011] [] ? wake_up_bit+0x40/0x40 [ 411.598011] [] ? balance_pgdat+0x5d0/0x5d0 [ 411.598011] [] kthread+0xea/0xf0 [ 411.598011] [] ? flush_kthread_work+0x1b0/0x1b0 [ 411.598011] [] ret_from_fork+0x7c/0xb0 [ 411.598011] [] ? flush_kthread_work+0x1b0/0x1b0 [ 411.598011] Code: 49 89 d0 49 83 c8 01 4c 89 07 48 89 d7 48 89 ca 48 8b 4a 08 49 89 d0 49 83 c8 01 48 85 c9 48 89 48 10 48 89 42 08 4c 89 07 74 0c <48> 8b 39 83 e7 01 48 09 c7 48 89 39 48 8b 08 48 89 0a 48 83 e1 [ 411.598011] RIP [] rb_erase+0x1bd/0x390 [ 411.598011] RSP [ 411.669881] ---[ end trace 89d346eca258dcf8 ]--- (gdb) l *rb_erase+0x1bd 0x8149952d is in rb_erase (include/linux/rbtree_augmented.h:101). 96#define rb_is_red(rb) __rb_is_red((rb)->__rb_parent_color) 97#define rb_is_black(rb)__rb_is_black((rb)->__rb_parent_color) 98 99static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p) 100{ 101rb->__rb_parent_color = rb_color(rb) | (unsigned long)p; 102} 103 104static inline void rb_set_parent_color(struct rb_node *rb, 105 struct rb_node *p, int color) (gdb) compile dir kernel-27 691MB in 14.26 seconds (48.50 MB/s) create dir kernel-64334 222MB in 12.78 seconds (17.40 MB/s) [ 1630.646485] BUG: unable to handle kernel NULL pointer dereference at (null) [ 1630.647010] IP: [] rb_erase+0x62/0x390 [ 1630.647010] PGD 289f78067 PUD 291de4067 PMD 0 [ 1630.647010] Oops: [#1] SMP [ 1630.647010] Modules linked in: [ 1630.647010] CPU: 0 PID: 24 Comm: kswapd0 Not tainted 3.11.0-rc7+ #1235 [ 1630.647010] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 1630.647010] task: 88029366df70 ti: 88029211e000 task.ti: 88029211e000 [ 1630.647010] RIP: 0010:[] [] rb_erase+0x62/0x390 [ 1630.647010] RSP: 0018:88029211fb38 EFLAGS: 00010282 [ 1630.647010] RAX: 8800a57b3a08 R
Re: [PATCH] rbtree: Add some necessary condition checks
On Mon, Sep 2, 2013 at 4:57 PM, Michel Lespinasse wrote: > On Sun, Sep 1, 2013 at 11:30 PM, Zhi Yong Wu wrote: >> In Tue, Aug 27, 2013 at 6:01 AM, Michel Lespinasse wrote: >>> On Fri, Aug 23, 2013 at 7:45 AM, wrote: >>>> From: Zhi Yong Wu >>>> >>>> Signed-off-by: Zhi Yong Wu >>>> --- >>>> include/linux/rbtree_augmented.h | 3 ++- >>>> lib/rbtree.c | 5 +++-- >>>> 2 files changed, 5 insertions(+), 3 deletions(-) >>> >>> So, you are saying that the checks are necessary, but you are not saying >>> why. >>> >>> The way I see it, the checks are *not* necessary, because the rbtree >>> invariants guarantee them to be true. The only way for the checks to >>> fail would be if people directly manipulate the rbtrees without going >>> through the proper APIs, and if they do that then I think they're on >>> their own. So to me, I think it's the same situation as dereferencing >>> a pointer without checking if it's NULL, because you know it should >>> never be NULL - which in my eyes is perfectly acceptable. >> In my patchset, some rbtree APIs to be invoked, and I think that those >> rbtree APIs are used corrently, Below is the pointer of its code: >> https://github.com/wuzhy/kernel/compare/torvalds:master...hot_tracking >> But I hit some issues when using compilebench to do perf benchmark. >> compile dir kernel-7 691MB in 8.92 seconds (77.53 MB/s) > > Thanks for the link - I now better understand where you are coming > from with these fixes. > > Going back to the original message: > >> diff --git a/include/linux/rbtree_augmented.h >> b/include/linux/rbtree_augmented.h >> index fea49b5..7d19770 100644 >> --- a/include/linux/rbtree_augmented.h >> +++ b/include/linux/rbtree_augmented.h >> @@ -199,7 +199,8 @@ __rb_erase_augmented(struct rb_node *node, struct >> rb_root *root, >> } >> >> successor->rb_left = tmp = node->rb_left; >> - rb_set_parent(tmp, successor); >> + if (tmp) >> + rb_set_parent(tmp, successor); >> >> pc = node->__rb_parent_color; >> tmp = __rb_parent(pc); > > Note that node->rb_left was already fetched at the top of > __rb_erase_augmented(), and was checked to be non-NULL at the time - > otherwise we would have executed 'Case 1' in that function. So, you If 'Case 1' is executed, this line of code is also done, how about the result? 'Case 1' seems *not* to change node->rb_left at all. > are not expected to find tmp == NULL here. > >> diff --git a/lib/rbtree.c b/lib/rbtree.c >> index c0e31fe..2cb01ba 100644 >> --- a/lib/rbtree.c >> +++ b/lib/rbtree.c >> @@ -214,7 +214,7 @@ rb_erase_color(struct rb_node *parent, struct >> rb_root *root, >> */ >> sibling = parent->rb_right; >> if (node != sibling) { /* node == parent->rb_left */ >> - if (rb_is_red(sibling)) { >> + if (sibling && rb_is_red(sibling)) { >> /* >> * Case 1 - left rotate at parent >> * > > Note the loop invariants quoted just above: > > /* > * Loop invariants: > * - node is black (or NULL on first iteration) > * - node is not the root (parent is not NULL) > * - All leaf paths going through parent and node have a > * black node count that is 1 lower than other leaf paths. > */ > > Because of these, each path from sibling to a leaf must include at > least one black node, which implies that sibling can't be NULL - or to > put it another way, if sibling is null then the expected invariants > were violated before we even got there. In theory, i can understand what you mean, But don't know why and where it got violated. > >> @@ -226,7 +226,8 @@ rb_erase_color(struct rb_node *parent, struct >> rb_root *root, >> */ >> parent->rb_right = tmp1 = sibling->rb_left; >> sibling->rb_left = parent; >> - rb_set_parent_color(tmp1, parent, RB_BLACK); >> + if (tmp1) >> + rb_set_parent_col
Re: [PATCH] rbtree: Add some necessary condition checks
On Tue, Sep 3, 2013 at 1:48 PM, Michel Lespinasse wrote: > On Mon, Sep 2, 2013 at 9:45 PM, Zhi Yong Wu wrote: >> On Mon, Sep 2, 2013 at 4:57 PM, Michel Lespinasse wrote: >>> Thanks for the link - I now better understand where you are coming >>> from with these fixes. >>> >>> Going back to the original message: >>> >>>> diff --git a/include/linux/rbtree_augmented.h >>>> b/include/linux/rbtree_augmented.h >>>> index fea49b5..7d19770 100644 >>>> --- a/include/linux/rbtree_augmented.h >>>> +++ b/include/linux/rbtree_augmented.h >>>> @@ -199,7 +199,8 @@ __rb_erase_augmented(struct rb_node *node, struct >>>> rb_root *root, >>>> } >>>> >>>> successor->rb_left = tmp = node->rb_left; >>>> - rb_set_parent(tmp, successor); >>>> + if (tmp) >>>> + rb_set_parent(tmp, successor); >>>> >>>> pc = node->__rb_parent_color; >>>> tmp = __rb_parent(pc); >>> >>> Note that node->rb_left was already fetched at the top of >>> __rb_erase_augmented(), and was checked to be non-NULL at the time - >>> otherwise we would have executed 'Case 1' in that function. So, you >> If 'Case 1' is executed, this line of code is also done, how about the >> result? >> 'Case 1' seems *not* to change node->rb_left at all. > > Wait, I believe this line of code is executed only in Case 2 and Case 3 ? Yes. > >>>> diff --git a/lib/rbtree.c b/lib/rbtree.c >>>> index c0e31fe..2cb01ba 100644 >>>> --- a/lib/rbtree.c >>>> +++ b/lib/rbtree.c >>>> @@ -214,7 +214,7 @@ rb_erase_color(struct rb_node *parent, struct >>>> rb_root *root, >>>> */ >>>> sibling = parent->rb_right; >>>> if (node != sibling) { /* node == parent->rb_left */ >>>> - if (rb_is_red(sibling)) { >>>> + if (sibling && rb_is_red(sibling)) { >>>> /* >>>> * Case 1 - left rotate at parent >>>> * >>> >>> Note the loop invariants quoted just above: >>> >>> /* >>> * Loop invariants: >>> * - node is black (or NULL on first iteration) >>> * - node is not the root (parent is not NULL) >>> * - All leaf paths going through parent and node have a >>> * black node count that is 1 lower than other leaf paths. >>> */ >>> >>> Because of these, each path from sibling to a leaf must include at >>> least one black node, which implies that sibling can't be NULL - or to >>> put it another way, if sibling is null then the expected invariants >>> were violated before we even got there. >> In theory, i can understand what you mean, But don't know why and >> where it got violated. > > Same here. My point is, I don't think we can fix the issue without > answering that question. Yes, but how to look for this answer is the key. :) > >>> Now I had a quick look at your code and I couldn't tell at which point >>> the invariants are violated. However I did notice a couple suspicious >>> things in the very first patch >>> (f5c8f2b256d87ac0bf789a787e6b795ac0c736e8): >>> >>> 1- In both hot_range_tree_free() and and hot_tree_exit(), you try to >>> destroy rb trees by iterating on each node with rb_next() and then >> yes, but this item may not been freed immediately, You can know each item >> has its ref count. > > Are items guaranteed to have another refcount than the one we're dropping ? Those items maybe have another refcount. You know, other function may be referencing those items. e.g. hot_freqs_update(). > >>> freeing them. Note that rb_next() can reference prior nodes, which >>> have already been freed in your scheme, so that seems quite unsafe. >> I checked rb_next() function, and found that if its prior nodes are >> freed, is this node's parent not NULL? > > No, if the parent was freed with just a put() operation, the child > will still have a pointer to it. This is why I suggested using > rb_erase() on each node before freeing them, so that we don't k
Re: [PATCH] rbtree: Add some necessary condition checks
On Tue, Sep 3, 2013 at 1:48 PM, Michel Lespinasse wrote: > On Mon, Sep 2, 2013 at 9:45 PM, Zhi Yong Wu wrote: >> On Mon, Sep 2, 2013 at 4:57 PM, Michel Lespinasse wrote: >>> Thanks for the link - I now better understand where you are coming >>> from with these fixes. >>> >>> Going back to the original message: >>> >>>> diff --git a/include/linux/rbtree_augmented.h >>>> b/include/linux/rbtree_augmented.h >>>> index fea49b5..7d19770 100644 >>>> --- a/include/linux/rbtree_augmented.h >>>> +++ b/include/linux/rbtree_augmented.h >>>> @@ -199,7 +199,8 @@ __rb_erase_augmented(struct rb_node *node, struct >>>> rb_root *root, >>>> } >>>> >>>> successor->rb_left = tmp = node->rb_left; >>>> - rb_set_parent(tmp, successor); >>>> + if (tmp) >>>> + rb_set_parent(tmp, successor); >>>> >>>> pc = node->__rb_parent_color; >>>> tmp = __rb_parent(pc); >>> >>> Note that node->rb_left was already fetched at the top of >>> __rb_erase_augmented(), and was checked to be non-NULL at the time - >>> otherwise we would have executed 'Case 1' in that function. So, you >> If 'Case 1' is executed, this line of code is also done, how about the >> result? >> 'Case 1' seems *not* to change node->rb_left at all. > > Wait, I believe this line of code is executed only in Case 2 and Case 3 ? > >>>> diff --git a/lib/rbtree.c b/lib/rbtree.c >>>> index c0e31fe..2cb01ba 100644 >>>> --- a/lib/rbtree.c >>>> +++ b/lib/rbtree.c >>>> @@ -214,7 +214,7 @@ rb_erase_color(struct rb_node *parent, struct >>>> rb_root *root, >>>> */ >>>> sibling = parent->rb_right; >>>> if (node != sibling) { /* node == parent->rb_left */ >>>> - if (rb_is_red(sibling)) { >>>> + if (sibling && rb_is_red(sibling)) { >>>> /* >>>> * Case 1 - left rotate at parent >>>> * >>> >>> Note the loop invariants quoted just above: >>> >>> /* >>> * Loop invariants: >>> * - node is black (or NULL on first iteration) >>> * - node is not the root (parent is not NULL) >>> * - All leaf paths going through parent and node have a >>> * black node count that is 1 lower than other leaf paths. >>> */ >>> >>> Because of these, each path from sibling to a leaf must include at >>> least one black node, which implies that sibling can't be NULL - or to >>> put it another way, if sibling is null then the expected invariants >>> were violated before we even got there. >> In theory, i can understand what you mean, But don't know why and >> where it got violated. > > Same here. My point is, I don't think we can fix the issue without > answering that question. > >>> Now I had a quick look at your code and I couldn't tell at which point >>> the invariants are violated. However I did notice a couple suspicious >>> things in the very first patch >>> (f5c8f2b256d87ac0bf789a787e6b795ac0c736e8): >>> >>> 1- In both hot_range_tree_free() and and hot_tree_exit(), you try to >>> destroy rb trees by iterating on each node with rb_next() and then >> yes, but this item may not been freed immediately, You can know each item >> has its ref count. > > Are items guaranteed to have another refcount than the one we're dropping ? > >>> freeing them. Note that rb_next() can reference prior nodes, which >>> have already been freed in your scheme, so that seems quite unsafe. >> I checked rb_next() function, and found that if its prior nodes are >> freed, is this node's parent not NULL? > > No, if the parent was freed with just a put() operation, the child > will still have a pointer to it. This is why I suggested using > rb_erase() on each node before freeing them, so that we don't keep > pointers to freed nodes. > >>> The simplest fix would be to do a full rb_erase() on each node before >> full rb_erase()? sorry, i don't get
Re: [PATCH v5 00/10] VFS hot tracking
On Tue, Sep 24, 2013 at 8:20 AM, Al Viro wrote: > On Tue, Sep 17, 2013 at 06:17:45AM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> The patchset is trying to introduce hot tracking function in >> VFS layer, which will keep track of real disk I/O in memory. >> By it, you will easily know more details about disk I/O, and >> then detect where disk I/O hot spots are. Also, specific FS >> can take use of it to do accurate defragment, and hot relocation >> support, etc. >> >> Now it's time to send out its V5 for external review, and >> any comments or ideas are appreciated, thanks. > > FWIW, the most fundamental problem I see with this is that the data > you are collecting is very sensitive to VM pressure details. The > hotspots wrt accesses (i.e. the stuff accessed all the time) will > not generate a lot of IO - they'll just sit in cache and look > very cold for your code. The stuff that is accessed very rarely > will also look cold. The hot ones will be those that get in and > out of cache often; IOW, the ones that are borderline - a bit less > memory pressure and they would've stayed in cache. I would expect > that to vary very much from run to run, which would make its use > for decisions like SSD vs. HD rather dubious... Are you suggesting to collect the hot info when the data is accessed while in cache? Of course, i will do the perf testings in some scenarios where VFS hot tracking is taking effect. > > Do you have that data collected on some real tasks under varying > memory pressure conditions? How stable the results are? Can you say what some real tasks are with more details? What kind of tests are you suggesting? and what results are you expecting to see? > > Another question: do you have perf profiles for system with that > stuff enabled, on boxen with many CPUs? You are using a lot of No, i will try to do it, and let you know its result. > spinlocks there; how much contention and cacheline ping-pong are > you getting on root->t_lock, for example? Ditto for cacheline > ping-pong on root->hot_cnt, while we are at it... Sorry, What kind of tests are you suggesting? and what results are you expecting to see? You know, i am one newbie for VFS, can you say with more details? how to do this test? > > Comments re code: > > * don't export the stuff until it's used by a module. And as > a general policy, please, do not use EXPORT_SYMBOL_GPL in fs/*. > Either don't export at all, or pick a sane API that would not > expose the guts of your code (== wouldn't require you to look > at the users to decide what will and will not break on changes > in your code) and export that. As far as I'm concerned, > all variants of EXPORT_SYMBOL are greatly overused and > EXPORT_SYMBOL_GPL is an exercise in masturbation... OK, i will make appropriate change based on your comments, thanks. > > * hot_inode_item_lookup() is a couple of functions smashed together; > split it, please, and lose the "alloc" argument. Do you mean that it should be split into two functions "alloc" function and "lookup" function? > > * hot_inode_item_unlink() is used when the last link is killed > by unlink(2), but not when it's killed by successful rename(2). > Why? Since we are using inode for collecting the hot info, rename(2) doesn't destroy that information as inodeis kept the same. > > * what happens when one opens a file, unlinks it and starts doing > IO? hot_freqs_update() will be called, re-creating the inode item > unlink has killed... Since the file won't be opened and used anymore by anybody else we don't bother about it. But i will improve it based on your comments. hot_freqs_update() will directly return and not re-create the inode item when this file has been unlinked. > > * for pity sake, use inlines - the same hot_freqs_update() on a filesystem > that doesn't have this stuff enabled will *still* branch pretty far > out of line, only to return after checking that ->s_hot_root is NULL. > Incidentally, we still have about twenty spare bits in inode flags; > use one (S_TEMP_COLLECTED, or something like that) instead of that > check. Checking it is considerably cheaper than checking ->i_sb->s_hot_root. OK, will use inline and bits flag in the next post, thanks. > > * hot_bit_shift() is bloody annoying. Why does true mean "up", false - > "down" and why is it easier to memorize that than just use explicit << > and >>? OK, will make appropriate changed based on your comments, thanks. > > * file->f_dentry->d_inode is spelled file_inode(file), TYVM (so does > file->f_path.dentry->d_inode, actually). Ditto. > > * #ifdef
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Wed, Oct 10, 2012 at 7:59 PM, Lukáš Czerner wrote: > On Wed, 10 Oct 2012, zwu.ker...@gmail.com wrote: > >> Date: Wed, 10 Oct 2012 18:07:23 +0800 >> From: zwu.ker...@gmail.com >> To: linux-fsde...@vger.kernel.org >> Cc: linux-e...@vger.kernel.org, linux-bt...@vger.kernel.org, >> linux-kernel@vger.kernel.org, linux...@linux.vnet.ibm.com, >> v...@zeniv.linux.org.uk, da...@fromorbit.com, d...@jikos.cz, >> ty...@mit.edu, c...@us.ibm.com, Zhi Yong Wu >> Subject: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' >> >> From: Zhi Yong Wu >> >> Introduce one new mount option '-o hot_track', >> and add its parsing support. >> Its usage looks like: >>mount -o hot_track >>mount -o nouser,hot_track >>mount -o nouser,hot_track,loop >>mount -o hot_track,nouser > > This patch should probably be at the end of the series. Can you let me know your reason? I think that it is not necessary to be at the end of the series. > > -Lukas > >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/btrfs/ctree.h |1 + >> fs/btrfs/super.c |7 ++- >> 2 files changed, 7 insertions(+), 1 deletions(-) >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> index 9821b67..094bec6 100644 >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args { >> #define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20) >> #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21) >> #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22) >> +#define BTRFS_MOUNT_HOT_TRACK(1 << 23) >> >> #define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt) >> #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt) >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >> index 83d6f9f..00be9e3 100644 >> --- a/fs/btrfs/super.c >> +++ b/fs/btrfs/super.c >> @@ -41,6 +41,7 @@ >> #include >> #include >> #include >> +#include >> #include "compat.h" >> #include "delayed-inode.h" >> #include "ctree.h" >> @@ -303,7 +304,7 @@ enum { >> Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, >> Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, >> Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, >> - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, >> + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, >> Opt_check_integrity, Opt_check_integrity_including_extent_data, >> Opt_check_integrity_print_mask, Opt_fatal_errors, >> Opt_err, >> @@ -342,6 +343,7 @@ static match_table_t tokens = { >> {Opt_no_space_cache, "nospace_cache"}, >> {Opt_recovery, "recovery"}, >> {Opt_skip_balance, "skip_balance"}, >> + {Opt_hot_track, "hot_track"}, >> {Opt_check_integrity, "check_int"}, >> {Opt_check_integrity_including_extent_data, "check_int_data"}, >> {Opt_check_integrity_print_mask, "check_int_print_mask=%d"}, >> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char >> *options) >> case Opt_skip_balance: >> btrfs_set_opt(info->mount_opt, SKIP_BALANCE); >> break; >> + case Opt_hot_track: >> + btrfs_set_opt(info->mount_opt, HOT_TRACK); >> + break; >> #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY >> case Opt_check_integrity_including_extent_data: >> printk(KERN_INFO "btrfs: enabling check integrity" >> -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Wed, Oct 10, 2012 at 9:11 PM, Lukáš Czerner wrote: > On Wed, 10 Oct 2012, Zhi Yong Wu wrote: > >> Date: Wed, 10 Oct 2012 20:21:48 +0800 >> From: Zhi Yong Wu >> To: Lukáš Czerner >> Cc: linux-fsde...@vger.kernel.org, linux-e...@vger.kernel.org, >> linux-bt...@vger.kernel.org, linux-kernel@vger.kernel.org, >> linux...@linux.vnet.ibm.com, v...@zeniv.linux.org.uk, >> da...@fromorbit.com, >> d...@jikos.cz, ty...@mit.edu, c...@us.ibm.com, >> Zhi Yong Wu >> Subject: Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' >> >> On Wed, Oct 10, 2012 at 7:59 PM, Lukáš Czerner wrote: >> > On Wed, 10 Oct 2012, zwu.ker...@gmail.com wrote: >> > >> >> Date: Wed, 10 Oct 2012 18:07:23 +0800 >> >> From: zwu.ker...@gmail.com >> >> To: linux-fsde...@vger.kernel.org >> >> Cc: linux-e...@vger.kernel.org, linux-bt...@vger.kernel.org, >> >> linux-kernel@vger.kernel.org, linux...@linux.vnet.ibm.com, >> >> v...@zeniv.linux.org.uk, da...@fromorbit.com, d...@jikos.cz, >> >> ty...@mit.edu, c...@us.ibm.com, Zhi Yong Wu >> >> Subject: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track' >> >> >> >> From: Zhi Yong Wu >> >> >> >> Introduce one new mount option '-o hot_track', >> >> and add its parsing support. >> >> Its usage looks like: >> >>mount -o hot_track >> >>mount -o nouser,hot_track >> >>mount -o nouser,hot_track,loop >> >>mount -o hot_track,nouser >> > >> > This patch should probably be at the end of the series. >> Can you let me know your reason? I think that it is not necessary to >> be at the end of the series. > > Simply because you're adding the mount option which does not do > anything yet. Moreover you change the implementation of the hot track > as you go. You should enable this once it is ready to use, not the other > way around. So, please move this at the end of the patch set when > the feature is supposed to be ready to use. OK, done, thanks. If you have comments on other patches, it will be appreciated. > > Thanks! > -Lukas > >> >> > >> > -Lukas >> > >> >> >> >> Signed-off-by: Zhi Yong Wu >> >> --- >> >> fs/btrfs/ctree.h |1 + >> >> fs/btrfs/super.c |7 ++- >> >> 2 files changed, 7 insertions(+), 1 deletions(-) >> >> >> >> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h >> >> index 9821b67..094bec6 100644 >> >> --- a/fs/btrfs/ctree.h >> >> +++ b/fs/btrfs/ctree.h >> >> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args { >> >> #define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20) >> >> #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21) >> >> #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22) >> >> +#define BTRFS_MOUNT_HOT_TRACK(1 << 23) >> >> >> >> #define btrfs_clear_opt(o, opt) ((o) &= ~BTRFS_MOUNT_##opt) >> >> #define btrfs_set_opt(o, opt)((o) |= BTRFS_MOUNT_##opt) >> >> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c >> >> index 83d6f9f..00be9e3 100644 >> >> --- a/fs/btrfs/super.c >> >> +++ b/fs/btrfs/super.c >> >> @@ -41,6 +41,7 @@ >> >> #include >> >> #include >> >> #include >> >> +#include >> >> #include "compat.h" >> >> #include "delayed-inode.h" >> >> #include "ctree.h" >> >> @@ -303,7 +304,7 @@ enum { >> >> Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, >> >> Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, >> >> Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, >> >> - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, >> >> + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, >> >> Opt_check_integrity, Opt_check_integrity_including_extent_data, >> >> Opt_check_integrity_print_mask, Opt_fatal_errors, >> >> Opt_err, >> >> @@ -342,6 +343,7 @@ static match_table_t tokens = { >> >> {Opt_no_space_cache, "nospace_cache"}, >> >> {Opt_recovery, "recovery"}, >> >> {Opt_skip_balance, &
Re: [PATCH] ext4: remove some unused code lines
On Sun, Dec 2, 2012 at 8:27 PM, Zheng Liu wrote: > On Thu, Nov 29, 2012 at 06:00:00PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu > > Please write a commit log to describe this patch, even though it is > quite simple and straightfoward Good suggestion, thanks. > > Otherwise, it looks good to me. > > Reviewed-by: Zheng Liu > > Regards, > - Zheng > >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/ext4/extents.c |2 -- >> 1 files changed, 0 insertions(+), 2 deletions(-) >> >> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c >> index 7011ac9..43ec639 100644 >> --- a/fs/ext4/extents.c >> +++ b/fs/ext4/extents.c >> @@ -2156,7 +2156,6 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t >> block, >> struct ext4_extent *ex) >> { >> struct ext4_ext_cache *cex; >> - struct ext4_sb_info *sbi; >> int ret = 0; >> >> /* >> @@ -2164,7 +2163,6 @@ ext4_ext_in_cache(struct inode *inode, ext4_lblk_t >> block, >>*/ >> spin_lock(&EXT4_I(inode)->i_block_reservation_lock); >> cex = &EXT4_I(inode)->i_cached_extent; >> - sbi = EXT4_SB(inode->i_sb); >> >> /* has cache valid data? */ >> if (cex->ec_len == 0) >> -- >> 1.7.6.5 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v1 resend hot_track 00/16] vfs: hot data tracking
delete kernel-2 in 48.01 seconds delete kernel-70151 in 47.60 seconds stat dir kernel-1 in 21.80 seconds read dir kernel-18 in 109.98 8.21 MB/s clean kernel-18 680MB in 7.78 seconds (87.49 MB/s) compile dir kernel-17 680MB in 54.39 seconds (12.51 MB/s) read dir kernel-17 in 108.52 8.32 MB/s stat dir kernel-18 in 19.48 seconds stat dir kernel-64334 in 22.04 seconds delete kernel-24150 in 44.36 seconds delete kernel-17 in 49.09 seconds stat dir kernel-1 in 18.16 seconds compile dir kernel-7 691MB in 48.90 seconds (14.14 MB/s) patch dir kernel-16 109MB in 103.71 seconds (1.06 MB/s) stat dir kernel-7 in 21.94 seconds create dir kernel-82195 222MB in 110.82 seconds (2.01 MB/s) delete kernel-82195 in 38.64 seconds stat dir kernel-3 in 22.88 seconds patch dir kernel-2717 109MB in 92.23 seconds (1.19 MB/s) patch dir kernel-5 109MB in 64.95 seconds (1.69 MB/s) read dir kernel-2717 in 97.88 2.33 MB/s delete kernel-29 in 40.59 seconds clean kernel-7 691MB in 5.09 seconds (135.87 MB/s) read dir kernel-4 in 59.42 3.74 MB/s stat dir kernel-78184 in 20.24 seconds patch dir kernel-0 109MB in 95.95 seconds (1.14 MB/s) patch dir kernel-3 109MB in 62.86 seconds (1.74 MB/s) create dir kernel-30226 222MB in 106.81 seconds (2.08 MB/s) read dir kernel-19 in 81.32 2.81 MB/s read dir kernel-9 in 74.65 3.06 MB/s delete kernel-5 in 42.04 seconds read dir kernel-7 in 61.95 3.68 MB/s patch dir kernel-57327 109MB in 97.85 seconds (1.12 MB/s) read dir kernel-11 in 58.85 3.78 MB/s run complete: == intial create total runs 30 avg 2.42 MB/s (user 13.60s sys 36.18s) create total runs 14 avg 2.27 MB/s (user 13.66s sys 36.94s) patch total runs 15 avg 1.82 MB/s (user 6.62s sys 36.93s) compile total runs 14 avg 15.01 MB/s (user 2.76s sys 18.29s) clean total runs 10 avg 110.29 MB/s (user 0.46s sys 3.21s) read tree total runs 11 avg 3.29 MB/s (user 11.04s sys 28.65s) read compiled tree total runs 4 avg 8.60 MB/s (user 13.16s sys 41.32s) delete tree total runs 10 avg 41.44 seconds (user 6.43s sys 25.19s) delete compiled tree total runs 4 avg 47.81 seconds (user 7.18s sys 29.27s) stat tree total runs 11 avg 20.41 seconds (user 6.39s sys 7.45s) stat compiled tree total runs 7 avg 23.97 seconds (user 7.24s sys 8.74s) On Fri, 2012-11-16 at 17:51 +0800, zwu.ker...@gmail.com wrote: > From: Zhi Yong Wu > > HI, guys, > > Any comments or ideas are appreciated, thanks. > > NOTE: > > The patchset can be obtained via my kernel dev git on github: > git://github.com/wuzhy/kernel.git hot_tracking > If you're interested, you can also review them via > https://github.com/wuzhy/kernel/commits/hot_tracking > > For more info, please check hot_tracking.txt in Documentation > > TODO List: > > 1.) Need to do scalability or performance tests. - Required > 2.) Need one simpler but efficient temp calculation function > 3.) How to save the file temperature among the umount to be able to > preserve the file tempreture after reboot - Optional > > Changelog: > > - Solved 64 bits inode number issue. [David Sterba] > - Embed struct hot_type in struct file_system_type [Darrick J. Wong] > - Cleanup Some issues [David Sterba] > - Use a static hot debugfs root [Greg KH] > - Rewritten debugfs support based on seq_file operation. [Dave Chinner] > - Refactored workqueue support. [Dave Chinner] > - Turn some Micro into be tunable [Zhiyong, Zheng Liu] >TIME_TO_KICK, and HEAT_UPDATE_DELAY > - Introduce hot func registering framework [Zhiyong] > - Remove global variable for hot tracking [Zhiyong] > - Add xfs hot tracking support [Dave Chinner] > - Add ext4 hot tracking support [Zheng Liu] > - Cleanedup a lot of other issues [Dave Chinner] > - Added memory shrinker [Dave Chinner] > - Converted to one workqueue to update map info periodically [Dave Chinner] > - Cleanedup a lot of other issues [Dave Chinner] > - Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner] > - Add btrfs hot tracking support [Zhiyong] > - The first three patches can probably just be flattened into one. > [Marco Stornelli , Dave Chinner] > > Zhi Yong Wu (16): > vfs: introduce some data structures > vfs: add init and cleanup functions > vfs: add I/O frequency update function > vfs: add two map arrays > vfs: add hooks to enable hot tracking > vfs: add temp calculation function > vfs: add map info update function > vfs: add aging function > vfs: add one work queue > vfs: add FS hot type support > vfs: register one shrinker > vfs: add one ioctl interface > vfs: add debugfs support > proc: add two hot_track proc files > btrfs: add hot tracking support > vfs: add documentation > > Documen
Re: [RFC v3 09/13] vfs: add one wq to update map info periodically
On Tue, Oct 16, 2012 at 8:27 AM, Dave Chinner wrote: > On Wed, Oct 10, 2012 at 06:07:31PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Add a per-superblock workqueue and a work_struct >> to run periodic work to update map info on each superblock. >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/hot_tracking.c| 94 >> ++ >> fs/hot_tracking.h|3 + >> include/linux/hot_tracking.h |2 + >> 3 files changed, 99 insertions(+), 0 deletions(-) >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> index a8dc599..f333c47 100644 >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -15,6 +15,8 @@ >> #include >> #include >> #include >> +#include >> +#include >> #include >> #include >> #include >> @@ -623,6 +625,88 @@ static void hot_map_array_exit(struct hot_info *root) >> } >> >> /* >> + * Update temperatures for each hot inode item and >> + * hot range item for aging purposes >> + */ >> +static void hot_temperature_update_work(struct work_struct *work) >> +{ >> + struct hot_update_work *hot_work = >> + container_of(work, struct hot_update_work, work); >> + struct hot_info *root = hot_work->hot_info; >> + struct hot_inode_item *hi_nodes[8]; >> + unsigned long delay = HZ * HEAT_UPDATE_DELAY; >> + u64 ino = 0; >> + int i, n; >> + >> + do { >> + while (1) { >> + spin_lock(&root->lock); >> + n = radix_tree_gang_lookup(&root->hot_inode_tree, >> +(void **)hi_nodes, ino, >> +ARRAY_SIZE(hi_nodes)); >> + if (!n) { >> + spin_unlock(&root->lock); >> + break; >> + } >> + >> + ino = hi_nodes[n - 1]->i_ino + 1; >> + for (i = 0; i < n; i++) { >> + kref_get(&hi_nodes[i]->hot_inode.refs); >> + hot_map_array_update( >> + &hi_nodes[i]->hot_inode.hot_freq_data, >> root); >> + hot_range_update(hi_nodes[i], root); >> + hot_inode_item_put(hi_nodes[i]); >> + } >> + spin_unlock(&root->lock); > > This is a lot of work to do under a spin lock. Perhaps you should > get a reference on all the nodes, then drop the root->lock and then > update all the nodes in a separate loop. OK, done > >> + } >> + >> + if (unlikely(freezing(current))) { >> + __refrigerator(true); >> + } else { >> + set_current_state(TASK_INTERRUPTIBLE); >> + if (!kthread_should_stop()) { >> + schedule_timeout(delay); >> + } >> + __set_current_state(TASK_RUNNING); >> + } >> + } while (!kthread_should_stop()); > > I don't think you understand workqueues fully. A work queue worker > function is not something that executes endlessly. It is a > "one-shot" function that does the work once, not an endless loop > that has to delay it's execution for periodic work. ah, i have done this based on your following suggestions, thanks. > > If you need periodic work, then you should use a struct delayed_work > and queue the next work iteration to be run a later time. See, for > example, xfs_syncd_worker() and xfs_syncd_queue_sync() and how that > reschedules itself for periodic work. It also means you don't have > to handle kthread freezing, as the WQ infrastructure takes care of > that for you. ditto. > > This is why unmount is hanging for me - this work never completes, > so flush_workqueue() will never return. got it, thanks. > >> +} >> + >> +static int hot_wq_init(struct hot_info *root) >> +{ >> + struct hot_update_work *hot_work; >> + int ret = 0; >> + >> + root->update_wq = alloc_workqueue( >> + "hot_temperature_update", WQ_MEM_RECLAIM | WQ_UNBOUND, 1); >> + if (!root->update_wq) { >> + printk(KERN_ERR "%s: failed to create " >> + "temperature
Re: [RFC v3 00/13] vfs: hot data tracking
On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner wrote: > On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> NOTE: >> >> The patchset is currently post out mainly to make sure >> it is going in the correct direction and hope to get some >> helpful comments from other guys. >> For more infomation, please check hot_tracking.txt in Documentation >> >> TODO List: > > 1) Fix OOM issues - the hot inode tracking caches grow very large > and don't get trimmed under memory pressure. From slabtop, after > creating roughly 24 million single byte files(*) on a machine with > 8GB RAM: > > OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME > 23859510 23859476 99%0.12K 795317 30 3181268K hot_range_item > 23859441 23859439 99%0.16K 1037367 23 4149468K hot_inode_item > 572530 572530 100%0.55K 817907327160K radix_tree_node > 241706 241406 99%0.22K 14218 17 56872K xfs_ili > 241206 241204 99%1.06K 804023321608K xfs_inode > > The inode tracking is trying to track all 24 million inodes even > though they have been written only once, and there are only 240,000 > inodes in the cache at this point in time. That was the last update > that slabtop got, so it is indicative of the impending OOM situation > that occurred. > >> Changelog from v2: >> 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner] >> 2.) Added memory shrinker [Dave Chinner] > > I haven't looked at the shrinker, but clearly it is not working, > otherwise the above OOM situation would not be occurring. > > Cheers, > > Dave. > > (*) Tested on an empty 17TB XFS filesystem with: > > $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc > meta-data=/dev/vdc isize=256agcount=17, agsize=268435455 > blks > = sectsz=512 attr=2, projid32bit=0 > data = bsize=4096 blocks=4563402735, imaxpct=5 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0 > log =internal log bsize=4096 blocks=131072, version=2 > = sectsz=512 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch > $ sudo chmod 777 /mnt/scratch > $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ > /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ > /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ > /mnt/scratch/6 -d /mnt/scratch/7 > . > 0 21601 16679.3 12552262 > 0 22401 15412.4 12588587 > 0 23201 16367.6 14199322 > 0 24001 15680.4 15741205 > In this test, i haven't see you enable hot_track function via mount, why did it meet OOM? > > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 09/13] vfs: add one wq to update map info periodically
On Thu, Oct 18, 2012 at 10:25 AM, Zheng Liu wrote: > On Wed, Oct 17, 2012 at 02:34:15PM +0800, Zhi Yong Wu wrote: >> >> diff --git a/fs/hot_tracking.h b/fs/hot_tracking.h >> >> index d19e64a..7a79a6d 100644 >> >> --- a/fs/hot_tracking.h >> >> +++ b/fs/hot_tracking.h >> >> @@ -36,6 +36,9 @@ >> >> */ >> >> #define TIME_TO_KICK 400 >> >> >> >> +/* set how often to update temperatures (seconds) */ >> >> +#define HEAT_UPDATE_DELAY 400 >> > >> > FWIW, 400 seconds is an unusual time period. It's expected that >> > periodic work might take place at intervals of 5 minutes, 10 >> > minutes, etc, not 6m40s. It's much easier to predict and understand >> > behaviour if it's at a interval of whole units like minutes, >> > especially when looking at timestamped event traces. Hence 300s (5 >> > minutes) makes a lot more sense as a period for updates... >> got it. thanks. > > Hi Zhi Yong, > > IMHO we'd better to make this value parameterized, and then the user > can adjust this value dynamically. Yeah, this has been in my TODO list. But i want to make the core function can work at first. > > Regards, > Zheng -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 00/13] vfs: hot data tracking
On Thu, Oct 18, 2012 at 12:29 PM, Dave Chinner wrote: > On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote: >> On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner wrote: >> > On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: >> >> From: Zhi Yong Wu > >> > (*) Tested on an empty 17TB XFS filesystem with: >> > >> > $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc >> > meta-data=/dev/vdc isize=256agcount=17, agsize=268435455 >> > blks >> > = sectsz=512 attr=2, projid32bit=0 >> > data = bsize=4096 blocks=4563402735, imaxpct=5 >> > = sunit=0 swidth=0 blks >> > naming =version 2 bsize=4096 ascii-ci=0 >> > log =internal log bsize=4096 blocks=131072, version=2 >> > = sectsz=512 sunit=1 blks, lazy-count=1 >> > realtime =none extsz=4096 blocks=0, rtextents=0 >> > $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch >> > $ sudo chmod 777 /mnt/scratch >> > $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ >> > /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ >> > /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ >> > /mnt/scratch/6 -d /mnt/scratch/7 >> > . >> > 0 21601 16679.3 12552262 >> > 0 22401 15412.4 12588587 >> > 0 23201 16367.6 14199322 >> > 0 24001 15680.4 15741205 >> > >> In this test, i haven't see you enable hot_track function via >> mount, why did it meet OOM? > > I copied the wrong mount command. It was definitely enabled. OK, BTW: fs_mark is the script written by you? After xfsprogs is installed, i haven't found this command. > > Cheers, > > Dave. > > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 00/13] vfs: hot data tracking
On Thu, Oct 18, 2012 at 1:17 PM, Dave Chinner wrote: > On Thu, Oct 18, 2012 at 12:44:47PM +0800, Zhi Yong Wu wrote: >> On Thu, Oct 18, 2012 at 12:29 PM, Dave Chinner wrote: >> > On Wed, Oct 17, 2012 at 04:57:14PM +0800, Zhi Yong Wu wrote: >> >> On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner wrote: >> >> > On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: >> >> >> From: Zhi Yong Wu >> > >> >> > (*) Tested on an empty 17TB XFS filesystem with: >> >> > >> >> > $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc >> >> > meta-data=/dev/vdc isize=256agcount=17, >> >> > agsize=268435455 blks >> >> > = sectsz=512 attr=2, projid32bit=0 >> >> > data = bsize=4096 blocks=4563402735, >> >> > imaxpct=5 >> >> > = sunit=0 swidth=0 blks >> >> > naming =version 2 bsize=4096 ascii-ci=0 >> >> > log =internal log bsize=4096 blocks=131072, version=2 >> >> > = sectsz=512 sunit=1 blks, lazy-count=1 >> >> > realtime =none extsz=4096 blocks=0, rtextents=0 >> >> > $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch >> >> > $ sudo chmod 777 /mnt/scratch >> >> > $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ >> >> > /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ >> >> > /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ >> >> > /mnt/scratch/6 -d /mnt/scratch/7 >> >> > . >> >> > 0 21601 16679.3 12552262 >> >> > 0 22401 15412.4 12588587 >> >> > 0 23201 16367.6 14199322 >> >> > 0 24001 15680.4 15741205 >> >> > >> >> In this test, i haven't see you enable hot_track function via >> >> mount, why did it meet OOM? >> > >> > I copied the wrong mount command. It was definitely enabled. >> OK, BTW: fs_mark is the script written by you? After xfsprogs is >> installed, i haven't found this command. > > # apt-get install fsmark thanks. let me try. > > Or get the source here: > > http://sourceforge.net/projects/fsmark/ > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Tue, Oct 16, 2012 at 11:17 AM, Dave Chinner wrote: > On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> FS_IOC_GET_HEAT_INFO: return a struct containing the various >> metrics collected in btrfs_freq_data structs, and also return a > > I think you mean hot_freq_data :P > >> calculated data temperature based on those metrics. Optionally, retrieve >> the temperature from the hot data hash list instead of recalculating it. > > To get the heat info for a specific file you have to know what file > you want to get that info for, right? I can see the usefulness of > asking for the heat data on a specific file, but how do you find the > hot files in the first place? i.e. the big question the user > interface needs to answer is "what files are hot?". > > Once userspace knows what the hottest files are, it can open them > and query the data via the above ioctl, but expecting userspace to > iterate millions of inodes in a filesystem to find hot files is very > inefficient. > > FWIW, if you were to return file handles to the hottest files, then Good idea. I am not very clear about how to implement it. file handles mean file_handle?? How to return them to the application? via debugfs? How many hottest files should be returned?? Top 100? > the application could open and query them without even needing to > know the path name to them. This woul dbe exceedingly useful for > defragmentation programs, especially as that is the way xfs_fsr > already operates on candidate files.(*) > > IOWs, sometimes the pathname is irrelevant to the operations that > applications want to perform - all they care about having an > efficient method of finding the inode they want and getting a file > descriptor that points to the file. Given the heat map info fits > right in to the sort of operations defrag and data mover tools > already do, it kind of makes sense to optimise the interface towards > those uses > > (*) i.e. finds them via bulkstat which returns handle information > along with all the other inode data, then opens the file by handle > to do the defrag work > >> FS_IOC_GET_HEAT_OPTS: return an integer representing the current >> state of hot data tracking and migration: >> >> 0 = do nothing >> 1 = track frequency of access >> >> FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and >> migration, as described above. > > I can't see how this is a manageable interface. It is not > persistent, so after every filesystem mount you'd have to set the > flag on all your inodes again. Hence, for the moment, I'd suggest > that dropping per-inode tracking control until all the core issues > are sorted out > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 00/13] vfs: hot data tracking
On Tue, Oct 16, 2012 at 4:42 AM, Dave Chinner wrote: > On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> NOTE: >> >> The patchset is currently post out mainly to make sure >> it is going in the correct direction and hope to get some >> helpful comments from other guys. >> For more infomation, please check hot_tracking.txt in Documentation >> >> TODO List: > > 1) Fix OOM issues - the hot inode tracking caches grow very large > and don't get trimmed under memory pressure. From slabtop, after > creating roughly 24 million single byte files(*) on a machine with > 8GB RAM: > > OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME > 23859510 23859476 99%0.12K 795317 30 3181268K hot_range_item > 23859441 23859439 99%0.16K 1037367 23 4149468K hot_inode_item > 572530 572530 100%0.55K 817907327160K radix_tree_node > 241706 241406 99%0.22K 14218 17 56872K xfs_ili > 241206 241204 99%1.06K 804023321608K xfs_inode > > The inode tracking is trying to track all 24 million inodes even > though they have been written only once, and there are only 240,000 > inodes in the cache at this point in time. That was the last update > that slabtop got, so it is indicative of the impending OOM situation > that occurred. > >> Changelog from v2: >> 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner] >> 2.) Added memory shrinker [Dave Chinner] > > I haven't looked at the shrinker, but clearly it is not working, HI, Dave, Some guys suggest that when inode slab cache is shrinked, the hot_inode[range]_item slab is accordingly also shrinked, this will make hot tracking don't need to register its own shrinker. Do you think of it? If you don't like above idea. Do you have any good suggestion on how to remove hot_inode_item and hot_range_item? > otherwise the above OOM situation would not be occurring. > > Cheers, > > Dave. > > (*) Tested on an empty 17TB XFS filesystem with: > > $ sudo mkfs.xfs -f -l size=131072b,sunit=8 /dev/vdc > meta-data=/dev/vdc isize=256agcount=17, agsize=268435455 > blks > = sectsz=512 attr=2, projid32bit=0 > data = bsize=4096 blocks=4563402735, imaxpct=5 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0 > log =internal log bsize=4096 blocks=131072, version=2 > = sectsz=512 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > $ sudo mount -o logbsize=256k /dev/vdc /mnt/scratch > $ sudo chmod 777 /mnt/scratch > $ fs_mark -D 1 -S0 -n 10 -s 1 -L 63 -d \ > /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d \ > /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d \ > /mnt/scratch/6 -d /mnt/scratch/7 > . > 0 21601 16679.3 12552262 > 0 22400000 1 15412.4 12588587 > 0 23201 16367.6 14199322 > 0 24001 15680.4 15741205 > > > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v1 resend hot_track 00/16] vfs: hot data tracking
HI, all guys. any comments or suggestions? On Thu, Dec 6, 2012 at 11:28 AM, Zhi Yong Wu wrote: > HI, guys > > THe perf testing is done separately with fs_mark, fio, ffsb and > compilebench in one kvm guest. > > Below is the performance testing report for hot tracking, and no obvious > perf downgrade is found. > > Note: original kernel means its source code is not changed; > kernel with enabled hot tracking means its source code is with hot > tracking patchset. > > The test env is set up as below: > > root@debian-i386:/home/zwu# uname -a > Linux debian-i386 3.7.0-rc8+ #266 SMP Tue Dec 4 12:17:55 CST 2012 x86_64 > GNU/Linux > > root@debian-i386:/home/zwu# mkfs.xfs -f -l > size=1310b,sunit=8 /home/zwu/bdev.img > meta-data=/home/zwu/bdev.img isize=256agcount=4, agsize=128000 > blks > = sectsz=512 attr=2, projid32bit=0 > data = bsize=4096 blocks=512000, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0 > log =internal log bsize=4096 blocks=1310, version=2 > = sectsz=512 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 > > 1.) original kernel > > root@debian-i386:/home/zwu# mount -o > loop,logbsize=256k /home/zwu/bdev.img /mnt/scratch > [ 1197.421616] XFS (loop0): Mounting Filesystem > [ 1197.567399] XFS (loop0): Ending clean mount > root@debian-i386:/home/zwu# mount > /dev/sda1 on / type ext3 (rw,errors=remount-ro) > tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755) > proc on /proc type proc (rw,noexec,nosuid,nodev) > sysfs on /sys type sysfs (rw,noexec,nosuid,nodev) > udev on /dev type tmpfs (rw,mode=0755) > tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev) > devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620) > none on /selinux type selinuxfs (rw,relatime) > debugfs on /sys/kernel/debug type debugfs (rw) > binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc > (rw,noexec,nosuid,nodev) > /dev/loop0 on /mnt/scratch type xfs (rw,logbsize=256k) > root@debian-i386:/home/zwu# free -m > total used free sharedbuffers > cached > Mem: 112109 2 0 4 > 53 > -/+ buffers/cache: 51 60 > Swap: 713 29684 > > 2.) kernel with enabled hot tracking > > root@debian-i386:/home/zwu# mount -o > hot_track,loop,logbsize=256k /home/zwu/bdev.img /mnt/scratch > [ 364.648470] XFS (loop0): Mounting Filesystem > [ 364.910035] XFS (loop0): Ending clean mount > [ 364.921063] VFS: Turning on hot data tracking > root@debian-i386:/home/zwu# mount > /dev/sda1 on / type ext3 (rw,errors=remount-ro) > tmpfs on /lib/init/rw type tmpfs (rw,nosuid,mode=0755) > proc on /proc type proc (rw,noexec,nosuid,nodev) > sysfs on /sys type sysfs (rw,noexec,nosuid,nodev) > udev on /dev type tmpfs (rw,mode=0755) > tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev) > devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=620) > none on /selinux type selinuxfs (rw,relatime) > binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc > (rw,noexec,nosuid,nodev) > /dev/loop0 on /mnt/scratch type xfs (rw,hot_track,logbsize=256k) > root@debian-i386:/home/zwu# free -m > total used free sharedbuffers > cached > Mem: 112107 4 0 2 > 34 > -/+ buffers/cache: 70 41 > Swap: 713 2711 > > 1. fs_mark test > > 1.) orginal kernel > > # ./fs_mark -D 100 -S0 -n 1000 -s 1 -L 30 -d /mnt/scratch/0 > -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 > -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 > -d /mnt/scratch/7 > # Version 3.3, 8 thread(s) starting at Wed Dec 5 03:20:58 2012 > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls. > # Directories: Time based hash between directories across 100 > subdirectories with 180 seconds per subdirectory. > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 > random bytes at end of name) > # Files info: size 1 bytes, written with an IO size of 16384 bytes per > write > # App overhead is time in microseconds spent in the test not doing file > writing related system calls. > > FSUse%Count SizeFiles/sec App Overhead > 2 80001375.6 27175895 > 3160001375.6 27478079 > 4240001346.0
Re: VFS hot tracking: How to calculate data temperature?
On Tue, Nov 6, 2012 at 4:39 PM, Zheng Liu wrote: > On Mon, Nov 05, 2012 at 10:29:39AM +0800, Zhi Yong Wu wrote: >> On Fri, Nov 2, 2012 at 4:41 PM, Zheng Liu wrote: >> > On Fri, Nov 02, 2012 at 02:38:29PM +0800, Zhi Yong Wu wrote: >> >> Here also has another question. >> >> >> >> How to save the file temperature among the umount to be able to >> >> preserve the file tempreture after reboot? >> >> >> >> This above is the requirement from DB product. >> >> I thought that we can save file temperature in its inode struct, that >> >> is, add one new field in struct inode, then this info will be written >> >> to disk with inode. >> >> >> >> Any comments or ideas are appreciated, thanks. >> > >> > Hi Zhiyong, >> > >> > I think that we might define a callback function. If a filesystem wants >> > to save these data, it can implement a function to save them. The >> > filesystem can decide whether adding it or not by themselves. >> Great idea, temperature saving function is maybe very specific to FS. >> But i am wondering if we can find one generic way to save temperature >> info at first. > > I don't think a generic way is better because it cannot support a > variety of filesystems. So maybe you must answer this question firstly: > how many filesystems do you want to save this info? such as ext4, xfs, > btrfs, etc. Then we can try to find a generic way. If only these three > filesystems you want to support, maybe saving in xattr is an optional > way. yes, xattr is one good choice from currect discussion result. Maybe we can provide one generic way, and one callback registering infrastructure, if FS register its own saving callback, this callback function will be used, otherwise the generic way will be applied. By the way, as what Dave mentioned, the patchset v4+ review has highest priority, then the way how to calc data temperature, and the lowest priority is the way how to save data temperature info. > > Regards, > Zheng -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 02/19] vfs: initialize and free data structures
On Wed, Nov 7, 2012 at 6:24 AM, David Sterba wrote: > On Mon, Oct 29, 2012 at 12:30:44PM +0800, zwu.ker...@gmail.com wrote: >> +/* Frees the entire hot_range_tree. */ >> +static void hot_inode_item_free(struct kref *kref) >> +{ >> + struct hot_comm_item *comm_item = container_of(kref, >> + struct hot_comm_item, refs); >> + struct hot_inode_item *he = container_of(comm_item, >> + struct hot_inode_item, hot_inode); >> + >> + hot_range_tree_free(he); >> + radix_tree_delete(he->hot_inode_tree, he->i_ino); > > void *radix_tree_delete(struct radix_tree_root *root, unsigned long index) > > and he::i_ino is u64, this will not work when > sizeof(unsigned long) != sizeof(u64) (iirc this is a known limitation of > radix tree implementation). This will work on 64bit only, not sure if > this is intentional. i actually also realized this. Do you have a better way to handle this? > >> + kmem_cache_free(hot_inode_item_cachep, he); >> +} >> + >> +/* Frees the entire hot_inode_tree. */ >> +static void hot_inode_tree_exit(struct hot_info *root) >> +{ >> + struct hot_inode_item *hi_nodes[8]; >> + u64 ino = 0; >> + int i, n; > > nitpick, put the declarations on separate lines Will it have any issue? It has passed the check of checkpatch.pl. > >> + >> + while (1) { >> + spin_lock(&root->lock); >> + n = radix_tree_gang_lookup(&root->hot_inode_tree, >> +(void **)hi_nodes, ino, >> +ARRAY_SIZE(hi_nodes)); >> + if (!n) { >> + spin_unlock(&root->lock); >> + break; >> + } >> + >> + ino = hi_nodes[n - 1]->i_ino + 1; >> + for (i = 0; i < n; i++) >> + hot_inode_item_put(hi_nodes[i]); >> + spin_unlock(&root->lock); >> + } >> +} >> + >> /* >> * Initialize kmem cache for hot_inode_item and hot_range_item. >> */ >> @@ -106,3 +197,36 @@ err: >> kmem_cache_destroy(hot_inode_item_cachep); >> } >> EXPORT_SYMBOL_GPL(hot_cache_init); >> + >> +/* >> + * Initialize the data structures for hot data tracking. >> + */ >> +int hot_track_init(struct super_block *sb) >> +{ >> + struct hot_info *root; >> + int ret = -ENOMEM; >> + >> + root = kzalloc(sizeof(struct hot_info), GFP_NOFS); >> + if (!root) { >> + printk(KERN_ERR "%s: Failed to malloc memory for " >> + "hot_info\n", __func__); >> + return ret; > > minor: you can drop the variable ret and just reurn ENOMEM here This variable will also be used in the following patches. > >> + } >> + >> + sb->s_hot_root = root; >> + hot_inode_tree_init(root); >> + >> + printk(KERN_INFO "VFS: Turning on hot data tracking\n"); >> + >> + return 0; >> +} >> +EXPORT_SYMBOL_GPL(hot_track_init); > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
On Wed, Nov 7, 2012 at 6:37 AM, David Sterba wrote: > On Mon, Oct 29, 2012 at 12:30:45PM +0800, zwu.ker...@gmail.com wrote: >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> +struct hot_inode_item >> +*hot_inode_item_find(struct hot_info *root, u64 ino) >> +{ >> + struct hot_inode_item *he; >> + int ret; >> + >> +again: >> + spin_lock(&root->lock); >> + he = radix_tree_lookup(&root->hot_inode_tree, ino); >> + if (he) { >> + kref_get(&he->hot_inode.refs); >> + spin_unlock(&root->lock); >> + return he; >> + } >> + spin_unlock(&root->lock); >> + >> + he = kmem_cache_zalloc(hot_inode_item_cachep, >> + GFP_KERNEL | GFP_NOFS); >> + if (!he) >> + return ERR_PTR(-ENOMEM); >> + >> + hot_inode_item_init(he, ino, &root->hot_inode_tree); >> + >> + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); >> + if (ret) { >> + kmem_cache_free(hot_inode_item_cachep, he); > > radix_tree_preload_end() > >> + return ERR_PTR(ret); >> + } >> + >> + spin_lock(&root->lock); >> + ret = radix_tree_insert(&root->hot_inode_tree, ino, he); >> + if (ret == -EEXIST) { >> + kmem_cache_free(hot_inode_item_cachep, he); >> + spin_unlock(&root->lock); >> + radix_tree_preload_end(); >> + goto again; >> + } >> + spin_unlock(&root->lock); >> + radix_tree_preload_end(); >> + >> + kref_get(&he->hot_inode.refs); >> + return he; >> +} >> +EXPORT_SYMBOL_GPL(hot_inode_item_find); >> + >> +static struct hot_range_item >> +*hot_range_item_find(struct hot_inode_item *he, >> + u32 start) >> +{ >> + struct hot_range_item *hr; >> + int ret; >> + >> +again: >> + spin_lock(&he->lock); >> + hr = radix_tree_lookup(&he->hot_range_tree, start); >> + if (hr) { >> + kref_get(&hr->hot_range.refs); >> + spin_unlock(&he->lock); >> + return hr; >> + } >> + spin_unlock(&he->lock); >> + >> + hr = kmem_cache_zalloc(hot_range_item_cachep, >> + GFP_KERNEL | GFP_NOFS); >> + if (!hr) >> + return ERR_PTR(-ENOMEM); >> + >> + hot_range_item_init(hr, start, he); >> + >> + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); >> + if (ret) { >> + kmem_cache_free(hot_range_item_cachep, hr); > > radix_tree_preload_end() I checked some kernel existing cases about the usage of radix_tree_preload(), it seems that when radix_tree_preload() fail, its error handling doesn't need call radix_tree_preload_end() any more. > >> + return ERR_PTR(ret); >> + } >> + >> + spin_lock(&he->lock); >> + ret = radix_tree_insert(&he->hot_range_tree, start, hr); >> + if (ret == -EEXIST) { >> + kmem_cache_free(hot_range_item_cachep, hr); >> + spin_unlock(&he->lock); >> + radix_tree_preload_end(); ditto. >> + goto again; >> + } >> + spin_unlock(&he->lock); >> + radix_tree_preload_end(); >> + >> + kref_get(&hr->hot_range.refs); >> + return hr; >> +} > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 05/19] vfs: add hooks to enable hot tracking
On Wed, Nov 7, 2012 at 6:51 AM, David Sterba wrote: > On Mon, Oct 29, 2012 at 12:30:47PM +0800, zwu.ker...@gmail.com wrote: >> --- a/mm/readahead.c >> +++ b/mm/readahead.c >> @@ -19,6 +19,7 @@ >> #include >> #include >> #include >> +#include >> >> /* >> * Initialise a struct file's readahead state. Assumes that the caller has >> @@ -138,6 +139,11 @@ static int read_pages(struct address_space *mapping, >> struct file *filp, >> out: >> blk_finish_plug(&plug); >> >> + /* Hot data tracking */ >> + hot_update_freqs(mapping->host, (u64)(list_entry(pages->prev,\ >> + struct page, lru)->index) << PAGE_CACHE_SHIFT, >> + (u64)nr_pages * PAGE_CACHE_SIZE, 0); > > There's a stale \ at the end of the line, and I find this formatting > hard to read. Does the following look acceptable? yes, great, thanks. > > hot_update_freqs(mapping->host, > (u64)(list_entry(pages->prev, struct page, lru)->index) > << PAGE_CACHE_SHIFT, > (u64)nr_pages * PAGE_CACHE_SIZE, 0); > >> + >> return ret; >> } >> -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 10/19] vfs: introduce hot func register framework
On Wed, Nov 7, 2012 at 7:14 AM, David Sterba wrote: > On Mon, Oct 29, 2012 at 12:30:52PM +0800, zwu.ker...@gmail.com wrote: >> +static struct hot_func_type *hot_func_get(const char *name) >> +{ >> + struct hot_func_type *f, *h = &hot_func_def; >> + >> + spin_lock(&hot_func_list_lock); >> + list_for_each_entry(f, &hot_func_list, list) { >> + if (!strcmp(f->hot_func_name, name)) >> + h = f; > > You probably want to break here Good catch, done, thanks. > >> + } >> + spin_unlock(&hot_func_list_lock); >> + >> + return h; >> +} >> + >> +int hot_func_register(struct hot_func_type *h) >> +{ >> + struct hot_func_type *f, *t = NULL; >> + >> + /* register, don't allow duplicate names */ >> + spin_lock(&hot_func_list_lock); >> + list_for_each_entry(f, &hot_func_list, list) { >> + if (!strcmp(f->hot_func_name, h->hot_func_name)) >> + t = f; > > if duplicate names are not allowed, then a warning may make sense to > let us know that something is wrong done, thanks. > >> + } >> + >> + if (t) { >> + spin_unlock(&hot_func_list_lock); >> + return -EBUSY; >> + } >> + >> + list_add_tail(&h->list, &hot_func_list); >> + spin_unlock(&hot_func_list_lock); >> + >> + return 0; >> +} >> +EXPORT_SYMBOL_GPL(hot_func_register); >> --- a/include/linux/hot_tracking.h >> +++ b/include/linux/hot_tracking.h >> @@ -73,6 +75,25 @@ struct hot_range_item { >> u32 len; /* length in bytes */ >> }; >> >> +typedef u64 (hot_rw_freq_calc_fn) (struct timespec old_atime, >> + struct timespec cur_time, u64 old_avg); >> +typedef u32 (hot_temp_calc_fn) (struct hot_freq_data *freq_data); >> +typedef bool (hot_is_obsolete_fn) (struct hot_freq_data *freq_data); > > I'm thinking, whether these typedefs are useful, similar ops structures > do not introduce them, also when you pick a struct member names exactly > same as the typedefs: > >> +struct hot_func_ops { >> + hot_rw_freq_calc_fn *hot_rw_freq_calc_fn; >> + hot_temp_calc_fn *hot_temp_calc_fn; >> + hot_is_obsolete_fn *hot_is_obsolete_fn; >> +}; > > My suggestion is to make the types explicit in the structure. sorry, i don't get your point, can you elaborate it about how to do this? > >> +/* identifies an hot func type */ >> +struct hot_func_type { >> + char hot_func_name[HOT_NAME_MAX]; > > 'name' would be sufficient IMHO done, thanks. > >> + /* fields provided by specific FS */ >> + struct hot_func_ops ops; >> + struct list_head list; >> +}; > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 12/19] vfs: add one ioctl interface
On Wed, Nov 7, 2012 at 7:30 AM, David Sterba wrote: > On Mon, Oct 29, 2012 at 12:30:54PM +0800, zwu.ker...@gmail.com wrote: >> +static int ioctl_heat_info(struct file *file, void __user *argp) >> +{ >> + struct inode *inode = file->f_dentry->d_inode; >> + struct hot_heat_info *heat_info; >> + struct hot_inode_item *he; >> + int ret = 0; >> + >> + heat_info = kmalloc(sizeof(struct hot_heat_info), >> + GFP_KERNEL | GFP_NOFS); > > heat_info is small enough to fit onto the stack, so you can avoid the > kmalloc, I don't think there are deep callstacks to be expected. ok, done. > Nevertheless, if you want to use kmalloc here, then please check the > return value and use GFP_KERNEL. thanks for your pointing out. > >> + >> + if (copy_from_user((void *) heat_info, >> + argp, >> + sizeof(struct hot_heat_info)) != 0) { >> + ret = -EFAULT; >> + goto err; >> + } >> + >> + he = hot_inode_item_find(inode->i_sb->s_hot_root, inode->i_ino); >> + if (!he) { >> + /* we don't have any info on this file yet */ >> + ret = -ENODATA; >> + goto err; >> + } >> + >> + spin_lock(&he->hot_inode.lock); >> + heat_info->avg_delta_reads = >> + (__u64) he->hot_inode.hot_freq_data.avg_delta_reads; >> + heat_info->avg_delta_writes = >> + (__u64) he->hot_inode.hot_freq_data.avg_delta_writes; >> + heat_info->last_read_time = >> + (__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_read_time); >> + heat_info->last_write_time = >> + (__u64) timespec_to_ns(&he->hot_inode.hot_freq_data.last_write_time); >> + heat_info->num_reads = >> + (__u32) he->hot_inode.hot_freq_data.nr_reads; >> + heat_info->num_writes = >> + (__u32) he->hot_inode.hot_freq_data.nr_writes; >> + >> + if (heat_info->live > 0) { >> + /* >> + * got a request for live temperature, >> + * call hot_hash_calc_temperature to recalculate >> + */ >> + heat_info->temp = >> + inode->i_sb->s_hot_root->hot_func_type->ops.hot_temp_calc_fn( >> + &he->hot_inode.hot_freq_data); >> + } else { >> + /* not live temperature, get it from the hashlist */ >> + heat_info->temp = he->hot_inode.hot_freq_data.last_temp; >> + } >> + spin_unlock(&he->hot_inode.lock); >> + >> + hot_inode_item_put(he); >> + >> + if (copy_to_user(argp, (void *) heat_info, >> + sizeof(struct hot_heat_info))) { >> + ret = -EFAULT; >> + goto err; >> + } >> + >> +err: >> + kfree(heat_info); >> + return ret; >> +} > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 14/19] vfs: add debugfs support
On Wed, Nov 7, 2012 at 7:45 AM, David Sterba wrote: > On Mon, Oct 29, 2012 at 12:30:56PM +0800, zwu.ker...@gmail.com wrote: >> +static int hot_range_seq_show(struct seq_file *seq, void *v) >> +{ >> + struct hot_range_item *hr = v; >> + struct hot_inode_item *he = hr->hot_inode; >> + struct hot_freq_data *freq_data = &hr->hot_range.hot_freq_data; >> + >> + /* Always lock hot_inode_item first */ >> + spin_lock(&he->hot_inode.lock); >> + spin_lock(&hr->hot_range.lock); >> + seq_printf(seq, "inode #%llu, range start " \ > > the # seems unnecessary to me OK, removed. > >> + "%llu (range len %u) reads %u, writes %u, " >> + "avg read time %llu, avg write time %llu, temp %u\n", > > compiler will complain if it sees a %llu format and not the expected > type of 'unsigned long long' When built, i haven't seen any warning report about this... > >> + he->i_ino, > > (unsigned long long)he->i_ino, > >> + (u64)hr->start * RANGE_SIZE, >> + hr->len, >> + freq_data->nr_reads, >> + freq_data->nr_writes, >> + freq_data->avg_delta_reads / NSEC_PER_MSEC, >> + freq_data->avg_delta_writes / NSEC_PER_MSEC, >> + freq_data->last_temp >> (32 - HEAT_MAP_BITS)); >> + spin_unlock(&hr->hot_range.lock); >> + spin_unlock(&he->hot_inode.lock); >> + >> + return 0; >> +} >> + >> +static int hot_inode_seq_show(struct seq_file *seq, void *v) >> +{ >> + struct hot_inode_item *he = v; >> + struct hot_freq_data *freq_data = &he->hot_inode.hot_freq_data; >> + >> + spin_lock(&he->hot_inode.lock); >> + seq_printf(seq, "inode #%llu, reads %u, writes %u, " \ >> + "avg read time %llu, avg write time %llu, temp %u\n", > > (same here) ditto. > >> + he->i_ino, >> + freq_data->nr_reads, >> + freq_data->nr_writes, >> + freq_data->avg_delta_reads / NSEC_PER_MSEC, >> + freq_data->avg_delta_writes / NSEC_PER_MSEC, >> + freq_data->last_temp >> (32 - HEAT_MAP_BITS)); >> + spin_unlock(&he->hot_inode.lock); >> + >> + return 0; >> +} >> >> +static void *hot_spot_range_seq_next(struct seq_file *seq, void *v, loff_t >> *pos) >> +{ >> + struct hot_info *root = seq->private; >> + struct hot_range_item *hr_next, *hr = v; >> + struct hot_comm_item *comm_item; >> + struct list_head *n_list; >> + int i = >> + hr->hot_range.hot_freq_data.last_temp >> (32 - HEAT_MAP_BITS); > > now I have noticed that I've seen the ... (32 - HEAT_MAP_BITS) > expression so many times that it tend to think it deserves a helper > function This helper function has existed, hot_raw_shift(), i will replace this with it. > >> + >> + n_list = seq_list_next(&hr->hot_range.n_list, >> + &root->heat_range_map[i].node_list, pos); >> + hot_range_item_put(hr); >> +next: >> + if (n_list) { >> + comm_item = container_of(n_list, >> + struct hot_comm_item, n_list); >> + hr_next = container_of(comm_item, >> + struct hot_range_item, hot_range); >> + kref_get(&hr_next->hot_range.refs); >> + return hr_next; >> + } else if (--i >= 0) { >> + n_list = seq_list_next(&root->heat_range_map[i].node_list, >> + &root->heat_range_map[i].node_list, pos); >> + goto next; >> + } >> + >> + return NULL; >> +} >> + >> +static void hot_debugfs_exit(struct super_block *sb) >> +{ >> + struct dentry *vol_dentry; >> + >> + vol_dentry = debugfs_get_dentry(sb->s_id, >> + sb->s_hot_root->debugfs_root, >> strlen(sb->s_id)); >> + /* remove all debugfs entries recursively from the volume root */ >> + if (vol_dentry) >> + debugfs_remove_recursive(vol_dentry); >> + else >> + BUG_ON(1); > > BUG() done, thanks. > >> + >> + if (list_empty(&sb->s_hot_root->debugfs_root->d_subdirs)) >> + debugfs_remove(sb->s_hot_root->debugfs_root); >> +} >> + >> +/* > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 16/19] btrfs: add hot tracking support
On Wed, Nov 7, 2012 at 8:00 AM, David Sterba wrote: > On Mon, Oct 29, 2012 at 12:30:58PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Introduce one new mount option '-o hot_track', >> and add its parsing support. >> Its usage looks like: >>mount -o hot_track >>mount -o nouser,hot_track >>mount -o nouser,hot_track,loop >>mount -o hot_track,nouser >> >> Signed-off-by: Zhi Yong Wu > Reviewed-by: David Sterba thanks for your review. -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
On Wed, Nov 7, 2012 at 6:45 AM, Darrick J. Wong wrote: > On Mon, Oct 29, 2012 at 12:30:45PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Add some util helpers to update access frequencies >> for one file or its range. >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/hot_tracking.c| 179 >> ++ >> fs/hot_tracking.h|7 ++ >> include/linux/hot_tracking.h |2 + >> 3 files changed, 188 insertions(+), 0 deletions(-) >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> index 68591f0..0a7d9a3 100644 >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info *root) >> } >> } >> >> +struct hot_inode_item >> +*hot_inode_item_find(struct hot_info *root, u64 ino) >> +{ >> + struct hot_inode_item *he; >> + int ret; >> + >> +again: >> + spin_lock(&root->lock); >> + he = radix_tree_lookup(&root->hot_inode_tree, ino); >> + if (he) { >> + kref_get(&he->hot_inode.refs); >> + spin_unlock(&root->lock); >> + return he; >> + } >> + spin_unlock(&root->lock); >> + >> + he = kmem_cache_zalloc(hot_inode_item_cachep, >> + GFP_KERNEL | GFP_NOFS); >> + if (!he) >> + return ERR_PTR(-ENOMEM); >> + >> + hot_inode_item_init(he, ino, &root->hot_inode_tree); >> + >> + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); >> + if (ret) { >> + kmem_cache_free(hot_inode_item_cachep, he); >> + return ERR_PTR(ret); >> + } >> + >> + spin_lock(&root->lock); >> + ret = radix_tree_insert(&root->hot_inode_tree, ino, he); >> + if (ret == -EEXIST) { >> + kmem_cache_free(hot_inode_item_cachep, he); >> + spin_unlock(&root->lock); >> + radix_tree_preload_end(); >> + goto again; >> + } >> + spin_unlock(&root->lock); >> + radix_tree_preload_end(); >> + >> + kref_get(&he->hot_inode.refs); >> + return he; >> +} >> +EXPORT_SYMBOL_GPL(hot_inode_item_find); >> + >> +static struct hot_range_item >> +*hot_range_item_find(struct hot_inode_item *he, >> + u32 start) >> +{ >> + struct hot_range_item *hr; >> + int ret; >> + >> +again: >> + spin_lock(&he->lock); >> + hr = radix_tree_lookup(&he->hot_range_tree, start); >> + if (hr) { >> + kref_get(&hr->hot_range.refs); >> + spin_unlock(&he->lock); >> + return hr; >> + } >> + spin_unlock(&he->lock); >> + >> + hr = kmem_cache_zalloc(hot_range_item_cachep, >> + GFP_KERNEL | GFP_NOFS); >> + if (!hr) >> + return ERR_PTR(-ENOMEM); >> + >> + hot_range_item_init(hr, start, he); >> + >> + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); >> + if (ret) { >> + kmem_cache_free(hot_range_item_cachep, hr); >> + return ERR_PTR(ret); >> + } >> + >> + spin_lock(&he->lock); >> + ret = radix_tree_insert(&he->hot_range_tree, start, hr); >> + if (ret == -EEXIST) { >> + kmem_cache_free(hot_range_item_cachep, hr); >> + spin_unlock(&he->lock); >> + radix_tree_preload_end(); >> + goto again; >> + } >> + spin_unlock(&he->lock); >> + radix_tree_preload_end(); >> + >> + kref_get(&hr->hot_range.refs); >> + return hr; >> +} >> + >> +/* >> + * This function does the actual work of updating >> + * the frequency numbers, whatever they turn out to be. >> + */ >> +static u64 hot_average_update(struct timespec old_atime, >> + struct timespec cur_time, u64 old_avg) >> +{ >> + struct timespec delta_ts; >> + u64 new_avg; >> + u64 new_delta; >> + >> + delta_ts = timespec_sub(cur_time, old_atime); >> + new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER; >> + >> + new_avg = (old_avg << FREQ_POWER) - old_avg + new_delta; >> + new_avg = new_avg >>
Re: [RFC v4+ hot_track 10/19] vfs: introduce hot func register framework
On Wed, Nov 7, 2012 at 7:30 AM, Darrick J. Wong wrote: > On Mon, Oct 29, 2012 at 12:30:52PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Introduce one framwork to enable that specific FS >> can register its own hot tracking functions. >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/hot_tracking.c| 78 >> ++ >> include/linux/hot_tracking.h | 25 + >> 2 files changed, 96 insertions(+), 7 deletions(-) >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> index 0ef9cad..c6c6138 100644 >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -24,6 +24,9 @@ >> #include >> #include "hot_tracking.h" >> >> +static DEFINE_SPINLOCK(hot_func_list_lock); >> +static LIST_HEAD(hot_func_list); >> + >> /* kmem_cache pointers for slab caches */ >> static struct kmem_cache *hot_inode_item_cachep __read_mostly; >> static struct kmem_cache *hot_range_item_cachep __read_mostly; >> @@ -305,20 +308,23 @@ static u64 hot_average_update(struct timespec >> old_atime, >> return new_avg; >> } >> >> -static void hot_freq_data_update(struct hot_freq_data *freq_data, bool >> write) >> +static void hot_freq_data_update(struct hot_info *root, >> + struct hot_freq_data *freq_data, bool write) >> { >> struct timespec cur_time = current_kernel_time(); >> >> if (write) { >> freq_data->nr_writes += 1; >> - freq_data->avg_delta_writes = hot_average_update( >> + freq_data->avg_delta_writes = >> + root->hot_func_type->ops.hot_rw_freq_calc_fn( >> freq_data->last_write_time, >> cur_time, >> freq_data->avg_delta_writes); >> freq_data->last_write_time = cur_time; >> } else { >> freq_data->nr_reads += 1; >> - freq_data->avg_delta_reads = hot_average_update( >> + freq_data->avg_delta_reads = >> + root->hot_func_type->ops.hot_rw_freq_calc_fn( >> freq_data->last_read_time, >> cur_time, >> freq_data->avg_delta_reads); >> @@ -430,7 +436,7 @@ static void hot_map_array_update(struct hot_freq_data >> *freq_data, >> struct hot_comm_item *comm_item; >> struct hot_inode_item *he; >> struct hot_range_item *hr; >> - u32 temp = hot_temp_calc(freq_data); >> + u32 temp = root->hot_func_type->ops.hot_temp_calc_fn(freq_data); >> u8 a_temp = temp >> (32 - HEAT_MAP_BITS); >> u8 b_temp = freq_data->last_temp >> (32 - HEAT_MAP_BITS); >> >> @@ -511,7 +517,7 @@ static void hot_range_update(struct hot_inode_item *he, >> &hr_nodes[i]->hot_range.hot_freq_data, root); >> >> spin_lock(&hr_nodes[i]->hot_range.lock); >> - obsolete = hot_is_obsolete( >> + obsolete = root->hot_func_type->ops.hot_is_obsolete_fn( >> &hr_nodes[i]->hot_range.hot_freq_data); >> spin_unlock(&hr_nodes[i]->hot_range.lock); >> >> @@ -668,7 +674,7 @@ void hot_update_freqs(struct inode *inode, u64 start, >> } >> >> spin_lock(&he->hot_inode.lock); >> - hot_freq_data_update(&he->hot_inode.hot_freq_data, rw); >> + hot_freq_data_update(root, &he->hot_inode.hot_freq_data, rw); >> spin_unlock(&he->hot_inode.lock); >> >> /* >> @@ -685,7 +691,7 @@ void hot_update_freqs(struct inode *inode, u64 start, >> } >> >> spin_lock(&hr->hot_range.lock); >> - hot_freq_data_update(&hr->hot_range.hot_freq_data, rw); >> + hot_freq_data_update(root, &hr->hot_range.hot_freq_data, rw); >> spin_unlock(&hr->hot_range.lock); >> >> hot_range_item_put(hr); >> @@ -695,6 +701,61 @@ void hot_update_freqs(struct inode *inode, u64 start, >> } >> EXPORT_SYMBOL_GPL(hot_update_freqs); >> >> +static struct hot_func_type hot_func_def = { >> + .hot_func_name = "hot_type_def", >> + .ops = { >> + .hot_rw_freq
Re: [PATCH] xfs: add hot tracking support.
HI, Dave, I guess that you should add some hot tracking stuff in some xfs_show_xxx function, right? On Tue, Oct 16, 2012 at 8:04 AM, Dave Chinner wrote: > > From: Dave Chinner > > Connect up the VFS hot tracking support so XFS filesystems can make > use of it. > > Signed-off-by: Dave Chinner > --- > fs/xfs/xfs_mount.h |1 + > fs/xfs/xfs_super.c |9 + > 2 files changed, 10 insertions(+) > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h > index a631ca3..d5e7277 100644 > --- a/fs/xfs/xfs_mount.h > +++ b/fs/xfs/xfs_mount.h > @@ -215,6 +215,7 @@ typedef struct xfs_mount { > #define XFS_MOUNT_WSYNC(1ULL << 0) /* for nfs - all > metadata ops >must be synchronous except >for space allocations */ > +#define XFS_MOUNT_HOTTRACK (1ULL << 1) /* hot inode tracking */ > #define XFS_MOUNT_WAS_CLEAN(1ULL << 3) > #define XFS_MOUNT_FS_SHUTDOWN (1ULL << 4) /* atomic stop of all > filesystem >operations, typically for > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > index 56c2537..17786ff 100644 > --- a/fs/xfs/xfs_super.c > +++ b/fs/xfs/xfs_super.c > @@ -61,6 +61,7 @@ > #include > #include > #include > +#include > > static const struct super_operations xfs_super_operations; > static kmem_zone_t *xfs_ioend_zone; > @@ -114,6 +115,7 @@ mempool_t *xfs_ioend_pool; > #define MNTOPT_NODELAYLOG "nodelaylog"/* Delayed logging disabled */ > #define MNTOPT_DISCARD"discard"/* Discard unused blocks */ > #define MNTOPT_NODISCARD "nodiscard" /* Do not discard unused blocks */ > +#define MNTOPT_HOTTRACK"hot_track" /* hot inode tracking */ > > /* > * Table driven mount option parser. > @@ -371,6 +373,8 @@ xfs_parseargs( > mp->m_flags |= XFS_MOUNT_DISCARD; > } else if (!strcmp(this_char, MNTOPT_NODISCARD)) { > mp->m_flags &= ~XFS_MOUNT_DISCARD; > + } else if (!strcmp(this_char, MNTOPT_HOTTRACK)) { > + mp->m_flags |= XFS_MOUNT_HOTTRACK; > } else if (!strcmp(this_char, "ihashsize")) { > xfs_warn(mp, > "ihashsize no longer used, option is deprecated."); > @@ -1040,6 +1044,9 @@ xfs_fs_put_super( > { > struct xfs_mount*mp = XFS_M(sb); > > + if (mp->m_flags & XFS_MOUNT_HOTTRACK) > + hot_track_exit(sb); > + > xfs_filestream_unmount(mp); > xfs_unmountfs(mp); > > @@ -1470,6 +1477,8 @@ xfs_fs_fill_super( > error = ENOMEM; > goto out_unmount; > } > + if (mp->m_flags & XFS_MOUNT_HOTTRACK) > + hot_track_init(sb); > > return 0; > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 00/19] vfs: hot data tracking
On Mon, Oct 29, 2012 at 12:30 PM, wrote: > From: Zhi Yong Wu > > NOTE: > > The patchset can be obtained via my kernel dev git on github: > g...@github.com:wuzhy/kernel.git hot_tracking > If you're interested, you can also can review them via > https://github.com/wuzhy/kernel/commits/hot_tracking hi, guys, The latest code change has been pushed into my above dev git tree. If no further comments are done, i will post next version soon. > > For more info, please check hot_tracking.txt in Documentation > > TODO List: > > 1.) Need to do scalability or performance tests. > 2.) Need one simpler but effective temp calc'ing function > 3.) How to save the file temperature among the umount to be able to > preserve the file tempreture after reboot > > Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to > be very btrfs-specific. I've taken their code and attempted to > make it more generic and integrate it at the VFS level. > > Changelog from v3: > 1.) Rewritten debugfs support based seq_file operation. [Dave Chinner] > 2.) Refactored workqueue support. [Dave Chinner] > 3.) Turn some Micro into be tunable [Zhiyong, Zheng Liu] >TIME_TO_KICK, and HEAT_UPDATE_DELAY > 4.) Introduce hot func registering framework [Zhiyong] > 5.) Remove global variable for hot tracking [Zhiyong] > 6.) Add xfs hot tracking support [Dave Chinner] > 7.) Add ext4 hot tracking support [Zheng Liu] > 8.) Cleanedup a lot of other issues [Dave Chinner] > > v3: > 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner] > 2.) Added memory shrinker [Dave Chinner] > > v2: > 1.) Converted to one workqueue to update map info periodically [Dave Chinner] > 2.) Cleanedup a lot of other issues [Dave Chinner] > > v1: > 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner] > 2.) Add btrfs hot tracking support [Zhiyong] > 3.) The first three patches can probably just be flattened into one. > [Marco Stornelli , Dave Chinner] > > Dave Chinner (1): > xfs: add hot tracking support > > Zheng Liu (1): > ext4: add hot tracking support > > Zhi Yong Wu (17): > vfs: introduce private radix tree structures > vfs: initialize and free data structures > vfs: add I/O frequency update function > vfs: add two map arrays > vfs: add hooks to enable hot tracking > vfs: add temp calculation function > vfs: add map info update function > vfs: add aging function > vfs: add one work queue > vfs: introduce hot func register framework > vfs: register one shrinker > vfs: add one ioctl interface > debugfs: introduce one function > vfs: add debugfs support > sysfs: add two hot_track proc files > btrfs: add hot tracking support > vfs: add documentation > > Documentation/filesystems/00-INDEX |2 + > Documentation/filesystems/hot_tracking.txt | 262 ++ > fs/Makefile|2 +- > fs/btrfs/ctree.h |1 + > fs/btrfs/super.c | 22 +- > fs/compat_ioctl.c |5 + > fs/dcache.c|2 + > fs/debugfs/inode.c | 26 + > fs/direct-io.c |6 + > fs/ext4/ext4.h |3 + > fs/ext4/super.c| 13 +- > fs/hot_tracking.c | 1367 > > fs/hot_tracking.h | 58 ++ > fs/ioctl.c | 78 ++ > fs/xfs/xfs_mount.h |1 + > fs/xfs/xfs_super.c | 16 + > include/linux/debugfs.h|9 + > include/linux/fs.h |4 + > include/linux/hot_tracking.h | 149 +++ > kernel/sysctl.c| 14 + > mm/filemap.c |6 + > mm/page-writeback.c| 12 + > mm/readahead.c |6 + > 23 files changed, 2061 insertions(+), 3 deletions(-) > create mode 100644 Documentation/filesystems/hot_tracking.txt > create mode 100644 fs/hot_tracking.c > create mode 100644 fs/hot_tracking.h > create mode 100644 include/linux/hot_tracking.h > > -- > 1.7.6.5 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v4+ hot_track 03/19] vfs: add I/O frequency update function
On Thu, Nov 8, 2012 at 2:49 AM, Darrick J. Wong wrote: > On Wed, Nov 07, 2012 at 04:27:05PM +0800, Zhi Yong Wu wrote: >> On Wed, Nov 7, 2012 at 6:45 AM, Darrick J. Wong >> wrote: >> > On Mon, Oct 29, 2012 at 12:30:45PM +0800, zwu.ker...@gmail.com wrote: >> >> From: Zhi Yong Wu >> >> >> >> Add some util helpers to update access frequencies >> >> for one file or its range. >> >> >> >> Signed-off-by: Zhi Yong Wu >> >> --- >> >> fs/hot_tracking.c| 179 >> >> ++ >> >> fs/hot_tracking.h|7 ++ >> >> include/linux/hot_tracking.h |2 + >> >> 3 files changed, 188 insertions(+), 0 deletions(-) >> >> >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> >> index 68591f0..0a7d9a3 100644 >> >> --- a/fs/hot_tracking.c >> >> +++ b/fs/hot_tracking.c >> >> @@ -172,6 +172,137 @@ static void hot_inode_tree_exit(struct hot_info >> >> *root) >> >> } >> >> } >> >> >> >> +struct hot_inode_item >> >> +*hot_inode_item_find(struct hot_info *root, u64 ino) >> >> +{ >> >> + struct hot_inode_item *he; >> >> + int ret; >> >> + >> >> +again: >> >> + spin_lock(&root->lock); >> >> + he = radix_tree_lookup(&root->hot_inode_tree, ino); >> >> + if (he) { >> >> + kref_get(&he->hot_inode.refs); >> >> + spin_unlock(&root->lock); >> >> + return he; >> >> + } >> >> + spin_unlock(&root->lock); >> >> + >> >> + he = kmem_cache_zalloc(hot_inode_item_cachep, >> >> + GFP_KERNEL | GFP_NOFS); >> >> + if (!he) >> >> + return ERR_PTR(-ENOMEM); >> >> + >> >> + hot_inode_item_init(he, ino, &root->hot_inode_tree); >> >> + >> >> + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); >> >> + if (ret) { >> >> + kmem_cache_free(hot_inode_item_cachep, he); >> >> + return ERR_PTR(ret); >> >> + } >> >> + >> >> + spin_lock(&root->lock); >> >> + ret = radix_tree_insert(&root->hot_inode_tree, ino, he); >> >> + if (ret == -EEXIST) { >> >> + kmem_cache_free(hot_inode_item_cachep, he); >> >> + spin_unlock(&root->lock); >> >> + radix_tree_preload_end(); >> >> + goto again; >> >> + } >> >> + spin_unlock(&root->lock); >> >> + radix_tree_preload_end(); >> >> + >> >> + kref_get(&he->hot_inode.refs); >> >> + return he; >> >> +} >> >> +EXPORT_SYMBOL_GPL(hot_inode_item_find); >> >> + >> >> +static struct hot_range_item >> >> +*hot_range_item_find(struct hot_inode_item *he, >> >> + u32 start) >> >> +{ >> >> + struct hot_range_item *hr; >> >> + int ret; >> >> + >> >> +again: >> >> + spin_lock(&he->lock); >> >> + hr = radix_tree_lookup(&he->hot_range_tree, start); >> >> + if (hr) { >> >> + kref_get(&hr->hot_range.refs); >> >> + spin_unlock(&he->lock); >> >> + return hr; >> >> + } >> >> + spin_unlock(&he->lock); >> >> + >> >> + hr = kmem_cache_zalloc(hot_range_item_cachep, >> >> + GFP_KERNEL | GFP_NOFS); >> >> + if (!hr) >> >> + return ERR_PTR(-ENOMEM); >> >> + >> >> + hot_range_item_init(hr, start, he); >> >> + >> >> + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); >> >> + if (ret) { >> >> + kmem_cache_free(hot_range_item_cachep, hr); >> >> + return ERR_PTR(ret); >> >> + } >> >> + >> >> + spin_lock(&he->lock); >> >> + ret = radix_tree_insert(&he->hot_range_tree, start, hr); >> >> + if (ret == -EEXIST) { >> >> + km
Re: [RFC v4+ hot_track 10/19] vfs: introduce hot func register framework
On Thu, Nov 8, 2012 at 2:58 AM, Darrick J. Wong wrote: > On Wed, Nov 07, 2012 at 04:34:35PM +0800, Zhi Yong Wu wrote: >> On Wed, Nov 7, 2012 at 7:30 AM, Darrick J. Wong >> wrote: >> > On Mon, Oct 29, 2012 at 12:30:52PM +0800, zwu.ker...@gmail.com wrote: >> >> From: Zhi Yong Wu >> >> >> >> Introduce one framwork to enable that specific FS >> >> can register its own hot tracking functions. >> >> >> >> Signed-off-by: Zhi Yong Wu >> >> --- >> >> fs/hot_tracking.c| 78 >> >> ++ >> >> include/linux/hot_tracking.h | 25 + >> >> 2 files changed, 96 insertions(+), 7 deletions(-) >> >> >> >> diff --git a/fs/hot_tracking.c b/fs/hot_tracking.c >> >> index 0ef9cad..c6c6138 100644 >> >> --- a/fs/hot_tracking.c >> >> +++ b/fs/hot_tracking.c >> >> @@ -24,6 +24,9 @@ >> >> #include >> >> #include "hot_tracking.h" >> >> >> >> +static DEFINE_SPINLOCK(hot_func_list_lock); >> >> +static LIST_HEAD(hot_func_list); >> >> + >> >> /* kmem_cache pointers for slab caches */ >> >> static struct kmem_cache *hot_inode_item_cachep __read_mostly; >> >> static struct kmem_cache *hot_range_item_cachep __read_mostly; >> >> @@ -305,20 +308,23 @@ static u64 hot_average_update(struct timespec >> >> old_atime, >> >> return new_avg; >> >> } >> >> >> >> -static void hot_freq_data_update(struct hot_freq_data *freq_data, bool >> >> write) >> >> +static void hot_freq_data_update(struct hot_info *root, >> >> + struct hot_freq_data *freq_data, bool write) >> >> { >> >> struct timespec cur_time = current_kernel_time(); >> >> >> >> if (write) { >> >> freq_data->nr_writes += 1; >> >> - freq_data->avg_delta_writes = hot_average_update( >> >> + freq_data->avg_delta_writes = >> >> + root->hot_func_type->ops.hot_rw_freq_calc_fn( >> >> freq_data->last_write_time, >> >> cur_time, >> >> freq_data->avg_delta_writes); >> >> freq_data->last_write_time = cur_time; >> >> } else { >> >> freq_data->nr_reads += 1; >> >> - freq_data->avg_delta_reads = hot_average_update( >> >> + freq_data->avg_delta_reads = >> >> + root->hot_func_type->ops.hot_rw_freq_calc_fn( >> >> freq_data->last_read_time, >> >> cur_time, >> >> freq_data->avg_delta_reads); >> >> @@ -430,7 +436,7 @@ static void hot_map_array_update(struct hot_freq_data >> >> *freq_data, >> >> struct hot_comm_item *comm_item; >> >> struct hot_inode_item *he; >> >> struct hot_range_item *hr; >> >> - u32 temp = hot_temp_calc(freq_data); >> >> + u32 temp = root->hot_func_type->ops.hot_temp_calc_fn(freq_data); >> >> u8 a_temp = temp >> (32 - HEAT_MAP_BITS); >> >> u8 b_temp = freq_data->last_temp >> (32 - HEAT_MAP_BITS); >> >> >> >> @@ -511,7 +517,7 @@ static void hot_range_update(struct hot_inode_item >> >> *he, >> >> &hr_nodes[i]->hot_range.hot_freq_data, >> >> root); >> >> >> >> spin_lock(&hr_nodes[i]->hot_range.lock); >> >> - obsolete = hot_is_obsolete( >> >> + obsolete = >> >> root->hot_func_type->ops.hot_is_obsolete_fn( >> >> >> >> &hr_nodes[i]->hot_range.hot_freq_data); >> >> spin_unlock(&hr_nodes[i]->hot_range.lock); >> >> >> >> @@ -668,7 +674,7 @@ void hot_update_freqs(struct inode *inode, u64 start, >> >> } >> >> >> >> spin_lock(&he->hot_inode.lock); >> >> - hot_freq_data_update(&he->hot_inode.hot_freq_data, rw); >> >> + hot_freq_data_update(
Re: [PATCH RESEND v1 00/16] vfs: hot data tracking
HI, The raw data is very big, i don't know if it is appropriate to post them here. If you want to get them, please let me know. On Thu, Dec 20, 2012 at 10:43 PM, wrote: > From: Zhi Yong Wu > > HI, guys, > > This patchset has been done scalability or performance tests > by fs_mark, ffsb and compilebench. > I have done the perf testing on Linux 3.7.0-rc8+ with Intel(R) Core(TM) > i7-3770 CPU @ 3.40GHz with 8 CPUs, 16G ram and 260G disk. > > Any comments or ideas are appreciated, thanks. > > NOTE: > > The patchset can be obtained via my kernel dev git on github: > git://github.com/wuzhy/kernel.git hot_tracking > If you're interested, you can also review them via > https://github.com/wuzhy/kernel/commits/hot_tracking > > For more info, please check hot_tracking.txt in Documentation > > Below is the perf testing report: > > 1. fs_mark test > > w/o: without hot tracking > w/ : with hot tracking > > Count Size FSUse% Files/sec App > Overhead > > w/ow/ w/o w/ w/o >w/ > > 80 1 2 3 13756.4 32144.9 > 53506275436291 > 160 1 4 5 1163.4 1799.3 > 20848119 21708216 > 240 1 6 6 1360.8 1252.5 > 67987058715322 > 320 1 8 8 1600.1 1196.3 > 57511296013792 > 400 1 9 9 1071.4 1191.2 > 17204725 26786369 > 480 1 10 10 1483.5 1447.9 > 19418383046 > 560 1 11 11 1457.9 1699.5 > 5783588 10074681 > 640 1 12 13 1658.8 1628.5 > 69926976185551 > 720 1 14 14 1662.4 1857.1 > 5796793 13772592 > 800 1 15 15 2930.0 2653.8 > 124316826152573 > 880 1 16 17 1630.8 1665.0 > 7666719 13682765 > 960 1 18 18 1530.3 1583.9 > 5823644 10171644 > 1040 1 19 19 1437.9 1798.6 > 209352246048083 > 1120 1 20 20 1529.0 1550.6 > 66474506003151 > 1200 1 21 22 1558.6 1501.8 > 12539509 18144939 > 1280 1 23 23 1644.2 1432.1 > 7074419 28101975 > 1360 1 24 24 1753.6 1650.2 > 7164297 20888972 > 1440 1 25 25 2750.0 1483.9 > 127566927441225 > 1520 1 27 27 1551.1 1514.3 > 57410668250443 > 1600 1 28 28 1610.8 1635.9 > 721938608545285 > 1680 1 29 29 1646.7 1907.7 > 8945856 11703513 > 1760 1 30 31 1496.6 2722.3 > 58589618989393 > 1840 1 32 32 1457.7 1565.7 > 10914475 26504660 > 1920 1 33 33 1437.6 1518.7 > 6708975 213303618 > 2000 1 34 34 1825.4 1521.1 > 5722086 12490907 > 2080 1 36 35 1718.4 1611.5 > 5873290 17942534 > 2160 1 37 37 2152.6 1536.9 > 1130506278717940 > 2240 1 38 38 2443.7 1788.2 > 7398122 19834765 > 2320 1 39 39 1518.5 1587.6 > 5770959 10134882 > 2400 1 41 41 1536.8 2164.0 > 57512487214626 > 2480 1 42 42 1576.6 2939.4 > 73903146070271 > 2560 1 43 43 1707.4 1535.9 > 110759396052896 > 2640 1 44 44 1522.5 1563.1 > 10142987 22549898 > 2720 1 46 46 1827.4 1608.5 > 11613016 24828125 > 2800 1 47 47 3420.5 1741.9 > 8059985 16599156 > 2880 1 48 48 1815.5 1944.4 > 78479319043277 > 2960 1 50 49 1650.0 1596.6 > 56363237929164 >
Re: [RFC v4+ hot_track 02/19] vfs: initialize and free data structures
On Wed, Nov 7, 2012 at 6:24 AM, David Sterba wrote: > On Mon, Oct 29, 2012 at 12:30:44PM +0800, zwu.ker...@gmail.com wrote: >> +/* Frees the entire hot_range_tree. */ >> +static void hot_inode_item_free(struct kref *kref) >> +{ >> + struct hot_comm_item *comm_item = container_of(kref, >> + struct hot_comm_item, refs); >> + struct hot_inode_item *he = container_of(comm_item, >> + struct hot_inode_item, hot_inode); >> + >> + hot_range_tree_free(he); >> + radix_tree_delete(he->hot_inode_tree, he->i_ino); > > void *radix_tree_delete(struct radix_tree_root *root, unsigned long index) > > and he::i_ino is u64, this will not work when > sizeof(unsigned long) != sizeof(u64) (iirc this is a known limitation of > radix tree implementation). This will work on 64bit only, not sure if > this is intentional. Fixed, thanks. > >> + kmem_cache_free(hot_inode_item_cachep, he); >> +} >> + >> +/* Frees the entire hot_inode_tree. */ >> +static void hot_inode_tree_exit(struct hot_info *root) >> +{ >> + struct hot_inode_item *hi_nodes[8]; >> + u64 ino = 0; >> + int i, n; > > nitpick, put the declarations on separate lines > >> + >> + while (1) { >> + spin_lock(&root->lock); >> + n = radix_tree_gang_lookup(&root->hot_inode_tree, >> +(void **)hi_nodes, ino, >> +ARRAY_SIZE(hi_nodes)); >> + if (!n) { >> + spin_unlock(&root->lock); >> + break; >> + } >> + >> + ino = hi_nodes[n - 1]->i_ino + 1; >> + for (i = 0; i < n; i++) >> + hot_inode_item_put(hi_nodes[i]); >> + spin_unlock(&root->lock); >> + } >> +} >> + >> /* >> * Initialize kmem cache for hot_inode_item and hot_range_item. >> */ >> @@ -106,3 +197,36 @@ err: >> kmem_cache_destroy(hot_inode_item_cachep); >> } >> EXPORT_SYMBOL_GPL(hot_cache_init); >> + >> +/* >> + * Initialize the data structures for hot data tracking. >> + */ >> +int hot_track_init(struct super_block *sb) >> +{ >> + struct hot_info *root; >> + int ret = -ENOMEM; >> + >> + root = kzalloc(sizeof(struct hot_info), GFP_NOFS); >> + if (!root) { >> + printk(KERN_ERR "%s: Failed to malloc memory for " >> + "hot_info\n", __func__); >> + return ret; > > minor: you can drop the variable ret and just reurn ENOMEM here > >> + } >> + >> + sb->s_hot_root = root; >> + hot_inode_tree_init(root); >> + >> + printk(KERN_INFO "VFS: Turning on hot data tracking\n"); >> + >> + return 0; >> +} >> +EXPORT_SYMBOL_GPL(hot_track_init); > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] xfs: introduce object readahead to log recovery
On Fri, Jul 26, 2013 at 7:35 PM, Dave Chinner wrote: > On Fri, Jul 26, 2013 at 02:36:15PM +0800, Zhi Yong Wu wrote: >> Dave, >> >> All comments are good to me, and will be applied to next version, thanks a >> lot. >> >> On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner wrote: >> > On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.ker...@gmail.com wrote: >> >> From: Zhi Yong Wu >> >> >> >> It can take a long time to run log recovery operation because it is >> >> single threaded and is bound by read latency. We can find that it took >> >> most of the time to wait for the read IO to occur, so if one object >> >> readahead is introduced to log recovery, it will obviously reduce the >> >> log recovery time. >> >> >> >> In dirty log case as below: >> >> data device: 0xfd10 >> >> log device: 0xfd10 daddr: 20480032 length: 20480 >> >> >> >> log tail: 7941 head: 11077 state: >> > >> > That's only a small log (10MB). As I've said on irc, readahead won't >> Yeah, it is one 10MB log, but how do you calculate it based on the above >> info? > > length = 20480 blocks. 20480 * 512 = 10MB Thanks. > >> > And the recovery time from this is between 15-17s: >> > >> > >> > log device: 0xfd20 daddr: 107374182032 length: 4173824 >> >^^^ almost 2GB >> > log tail: 19288 head: 264809 state: >> > >> > real0m17.913s >> > user0m0.000s >> > sys 0m2.381s >> > >> > And runs at 3-4000 read IOPs for most of that time. It's largely IO >> > bound, even on SSDs. >> > >> > With your patch: >> > >> > log tail: 35871 head: 308393 state: >> > real0m12.715s >> > user0m0.000s >> > sys 0m2.247s >> > >> > And it peaked at ~5000 read IOPS. >> How do you know its READ IOPS is ~5000? > > Other monitoring. iostat can tell you this, though I use PCP... thanks. > >> > Ok, so you've based the readahead on the transaction item list >> > having a next pointer. What I think you should do is turn this into >> > a readahead queue by moving objects to a new list. i.e. >> > >> > list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) { >> > >> > case XLOG_RECOVER_PASS2: >> > if (ra_qdepth++ >= MAX_QDEPTH) { >> > recover_items(log, trans, &buffer_list, >> > &ra_item_list); >> > ra_qdepth = 0; >> > } else { >> > xlog_recover_item_readahead(log, item); >> > list_move_tail(&item->ri_list, >> > &ra_item_list); >> > } >> > break; >> > ... >> > } >> > } >> > if (!list_empty(&ra_item_list)) >> > recover_items(log, trans, &buffer_list, &ra_item_list); >> > >> > I'd suggest that a queue depth somewhere between 10 and 100 will >> > be necessary to keep enough IO in flight to keep the pipeline full >> > and prevent recovery from having to wait on IO... >> Good suggestion, will apply it to next version, thanks. > > FWIW, I hacked a quick test of this into your patch here and a depth > of 100 brought the reocvery time down to under 8s. For other > workloads which have nothing but dirty inodes (like fsmark) a depth > of 100 drops the recovery time from ~100s to ~25s, and the iop rate > is peaking at well over 15,000 IOPS. So we definitely want to queue > up more than a single readahead... Excited, I will try it. By the way, how do you try the workload which has nothing but dirty dquote objects? > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] xfs: introduce object readahead to log recovery
On Mon, Jul 29, 2013 at 10:45 AM, Dave Chinner wrote: > On Mon, Jul 29, 2013 at 09:38:11AM +0800, Zhi Yong Wu wrote: >> By the way, how do you try the workload which has nothing but dirty >> dquote objects? > > Create quota limits for non-existent users. That will allocate the > dquots and dirty them. > > Or if you already have a few hundred thousand dquots, just change > the limits on them all to get them logged and dirty... OK, thanks. > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] xfs: fix an assertion failure
On Fri, Jul 26, 2013 at 7:37 PM, Dave Chinner wrote: > On Fri, Jul 26, 2013 at 02:01:23PM +0800, Zhi Yong Wu wrote: >> No, it still raised the same assertion as below: >> >> [ 521.715103] XFS: Assertion failed: !(bip->bli_item.li_flags & >> XFS_LI_IN_AIL), file: fs/xfs/xfs_buf_item.c, line: 940 > > How are you reproducing it? > > Can you take an event trace with trace-cmd and attach the output of > the report? After kernel base is switched to Linux 3.11-rc3, when i try to take event trace with trace-cmd, that issue doesn't appear any more in my environment. :) > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] xfs: introduce object readahead to log recovery
On Tue, Jul 30, 2013 at 9:10 PM, Brian Foster wrote: > On 07/30/2013 05:59 AM, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> It can take a long time to run log recovery operation because it is >> single threaded and is bound by read latency. We can find that it took >> most of the time to wait for the read IO to occur, so if one object >> readahead is introduced to log recovery, it will obviously reduce the >> log recovery time. >> >> Log recovery time stat: >> >> w/o this patchw/ this patch >> >> real:0m15.023s 0m7.802s >> user:0m0.001s 0m0.001s >> sys: 0m0.246s 0m0.107s >> >> Signed-off-by: Zhi Yong Wu >> --- > > Cool patch. I'm not terribly familiar with the log recovery code so take > my comments with a grain of salt, but a couple things I noticed on a > quick skim... > >> fs/xfs/xfs_log_recover.c | 162 >> +-- >> fs/xfs/xfs_log_recover.h | 2 + >> 2 files changed, 159 insertions(+), 5 deletions(-) >> > ... >> >> +STATIC int >> +xlog_recover_items_pass2( >> + struct xlog *log, >> + struct xlog_recover *trans, >> + struct list_head*buffer_list, >> + struct list_head*ra_list) > > A nit, but technically this function doesn't have to be involved with > readahead. Perhaps rename ra_list to item_list..? ok, applied. > >> +{ >> + int error = 0; >> + xlog_recover_item_t *item; >> + >> + list_for_each_entry(item, ra_list, ri_list) { >> + error = xlog_recover_commit_pass2(log, trans, >> + buffer_list, item); >> + if (error) >> + return error; >> + } >> + >> + return error; >> +} >> + >> /* >> * Perform the transaction. >> * >> @@ -3189,9 +3314,11 @@ xlog_recover_commit_trans( >> struct xlog_recover *trans, >> int pass) >> { >> - int error = 0, error2; >> - xlog_recover_item_t *item; >> + int error = 0, error2, ra_qdepth = 0; >> + xlog_recover_item_t *item, *next; >> LIST_HEAD (buffer_list); >> + LIST_HEAD (ra_list); >> + LIST_HEAD (all_ra_list); >> >> hlist_del(&trans->r_list); >> >> @@ -3199,14 +3326,21 @@ xlog_recover_commit_trans( >> if (error) >> return error; >> >> - list_for_each_entry(item, &trans->r_itemq, ri_list) { >> + list_for_each_entry_safe(item, next, &trans->r_itemq, ri_list) { >> switch (pass) { >> case XLOG_RECOVER_PASS1: >> error = xlog_recover_commit_pass1(log, trans, item); >> break; >> case XLOG_RECOVER_PASS2: >> - error = xlog_recover_commit_pass2(log, trans, >> - &buffer_list, item); >> + if (ra_qdepth++ >= XLOG_RECOVER_MAX_QDEPTH) { > > The counting mechanism looks strange and easy to break with future > changes. Why not increment ra_qdepth in the else bracket where it is > explicitly tied to the operation it tracks? ok. > >> + error = xlog_recover_items_pass2(log, trans, >> + &buffer_list, &ra_list); >> + list_splice_tail_init(&ra_list, &all_ra_list); >> + ra_qdepth = 0; > > So now we've queued up a bunch of items we've issued readahead on in > ra_list and we've executed the recovery on the list. What happens to the > current item? Good catch, the current item was missed. Done. > > Brian > >> + } else { >> + xlog_recover_ra_pass2(log, item); >> + list_move_tail(&item->ri_list, &ra_list); >> + } >> break; >> default: >> ASSERT(0); >> @@ -3216,9 +3350,27 @@ xlog_recover_commit_trans( >> goto out; >> } >> >> + if (!list_empty(&ra_list)) { >> + error = xlog_recove
Re: [PATCH v2] xfs: introduce object readahead to log recovery
On Wed, Jul 31, 2013 at 7:11 AM, Dave Chinner wrote: > On Tue, Jul 30, 2013 at 05:59:07PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> It can take a long time to run log recovery operation because it is >> single threaded and is bound by read latency. We can find that it took >> most of the time to wait for the read IO to occur, so if one object >> readahead is introduced to log recovery, it will obviously reduce the >> log recovery time. >> >> Log recovery time stat: >> >> w/o this patchw/ this patch >> >> real:0m15.023s 0m7.802s >> user:0m0.001s 0m0.001s >> sys: 0m0.246s 0m0.107s >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/xfs/xfs_log_recover.c | 162 >> +-- >> fs/xfs/xfs_log_recover.h | 2 + >> 2 files changed, 159 insertions(+), 5 deletions(-) >> >> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c >> index 7681b19..029826f 100644 >> --- a/fs/xfs/xfs_log_recover.c >> +++ b/fs/xfs/xfs_log_recover.c >> @@ -3116,6 +3116,111 @@ xlog_recover_free_trans( >> kmem_free(trans); >> } >> >> +STATIC void >> +xlog_recover_buffer_ra_pass2( >> + struct xlog *log, >> + struct xlog_recover_item*item) >> +{ >> + xfs_buf_log_format_t*buf_f = item->ri_buf[0].i_addr; >> + xfs_mount_t *mp = log->l_mp; > > struct xfs_buf_log_format > struct xfs_mount Why? *_t is also used in a lot of other places. > >> + >> + if (xlog_check_buffer_cancelled(log, buf_f->blf_blkno, >> + buf_f->blf_len, buf_f->blf_flags)) { >> + return; >> + } >> + >> + xfs_buf_readahead(mp->m_ddev_targp, buf_f->blf_blkno, >> + buf_f->blf_len, NULL); >> +} >> + >> +STATIC void >> +xlog_recover_inode_ra_pass2( >> + struct xlog *log, >> + struct xlog_recover_item*item) >> +{ >> + xfs_inode_log_format_t in_buf, *in_f; >> + xfs_mount_t *mp = log->l_mp; > > struct xfs_inode_log_format > struct xfs_mount > > and a separate declaration for each variable. > > Also, in_buf and in_f are not very good names as tehy don't follow > any commonly used XFs naming convention. The shorthand for a > struct xfs_inode_log_format variable is "ilf" and a pointer to such > an object is "ilfp". i.e: > > struct xfs_inode_log_format ilf_buf; > struct xfs_inode_log_format *ilfp; > >> +xlog_recover_dquot_ra_pass2( >> + struct xlog *log, >> + struct xlog_recover_item*item) >> +{ >> + xfs_mount_t *mp = log->l_mp; >> + struct xfs_disk_dquot *recddq; >> + int error; >> + xfs_dq_logformat_t *dq_f; >> + uinttype; > > More typedefs. > >> + >> + >> + if (mp->m_qflags == 0) >> + return; >> + >> + recddq = item->ri_buf[1].i_addr; >> + if (recddq == NULL) >> + return; >> + if (item->ri_buf[1].i_len < sizeof(xfs_disk_dquot_t)) >> + return; >> + >> + type = recddq->d_flags & (XFS_DQ_USER | XFS_DQ_PROJ | XFS_DQ_GROUP); >> + ASSERT(type); >> + if (log->l_quotaoffs_flag & type) >> + return; >> + >> + dq_f = item->ri_buf[0].i_addr; >> + ASSERT(dq_f); >> + error = xfs_qm_dqcheck(mp, recddq, dq_f->qlf_id, 0, XFS_QMOPT_DOWARN, >> +"xlog_recover_dquot_ra_pass2 (log copy)"); > > You don't need to do validation of the dquot in the transaction > here - it's unlikely to be corrupt. Just do the readahead like for a > normal buffer, and the validation can occur when recovering the > item in the next pass. Agreed, done. > >> + if (error) >> + return; >> + ASSERT(dq_f->qlf_len == 1); >> + >> + xfs_buf_readahead(mp->m_ddev_targp, dq_f->qlf_blkno, >> + dq_f->qlf_len, NULL); >> +} >> + >> +STATIC void >> +xlog_recover_ra_pass2( >> + struct xlog *log, >> + struct xlog_recover_item*item) >> +{ >> + switch (ITEM_TYPE(item)) { >
Re: [PATCH v2] xfs: introduce object readahead to log recovery
On Wed, Jul 31, 2013 at 9:35 PM, Ben Myers wrote: > Hey Zhi, > > On Wed, Jul 31, 2013 at 12:07:32PM +0800, Zhi Yong Wu wrote: >> On Wed, Jul 31, 2013 at 7:11 AM, Dave Chinner wrote: >> > On Tue, Jul 30, 2013 at 05:59:07PM +0800, zwu.ker...@gmail.com wrote: >> >> From: Zhi Yong Wu >> >> >> >> It can take a long time to run log recovery operation because it is >> >> single threaded and is bound by read latency. We can find that it took >> >> most of the time to wait for the read IO to occur, so if one object >> >> readahead is introduced to log recovery, it will obviously reduce the >> >> log recovery time. >> >> >> >> Log recovery time stat: >> >> >> >> w/o this patchw/ this patch >> >> >> >> real:0m15.023s 0m7.802s >> >> user:0m0.001s 0m0.001s >> >> sys: 0m0.246s 0m0.107s >> >> >> >> Signed-off-by: Zhi Yong Wu >> >> --- >> >> fs/xfs/xfs_log_recover.c | 162 >> >> +-- >> >> fs/xfs/xfs_log_recover.h | 2 + >> >> 2 files changed, 159 insertions(+), 5 deletions(-) >> >> >> >> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c >> >> index 7681b19..029826f 100644 >> >> --- a/fs/xfs/xfs_log_recover.c >> >> +++ b/fs/xfs/xfs_log_recover.c >> >> @@ -3116,6 +3116,111 @@ xlog_recover_free_trans( >> >> kmem_free(trans); >> >> } >> >> >> >> +STATIC void >> >> +xlog_recover_buffer_ra_pass2( >> >> + struct xlog *log, >> >> + struct xlog_recover_item*item) >> >> +{ >> >> + xfs_buf_log_format_t*buf_f = item->ri_buf[0].i_addr; >> >> + xfs_mount_t *mp = log->l_mp; >> > >> > struct xfs_buf_log_format >> > struct xfs_mount >> Why? *_t is also used in a lot of other places. > > It is just a general style preference for using the struct instead of the _t > in > the xfs codebase. Over the course of the past few years they've slowly been > converted in this direction, and we prefer not to add any more _t if it can be > avoided. Got it, thanks. I have sent out v3 with this style change. > > -Ben -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] scsi: fix the build warning
HI, If you'd like, you should draft one patch for this warning. On Thu, Aug 22, 2013 at 9:02 AM, Joe Perches wrote: > On Thu, 2013-08-22 at 08:44 +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Initialize csum variable to fix the build warning. > > Maybe it'd be better to change the variable > scsi_debug_guard type to bool? > > Something like: > --- > drivers/scsi/scsi_debug.c | 22 +++--- > 1 file changed, 7 insertions(+), 15 deletions(-) > > diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c > index cb4fefa..58375c3 100644 > --- a/drivers/scsi/scsi_debug.c > +++ b/drivers/scsi/scsi_debug.c > @@ -97,7 +97,7 @@ static const char * scsi_debug_version_date = "20100324"; > #define DEF_D_SENSE 0 > #define DEF_EVERY_NTH 0 > #define DEF_FAKE_RW0 > -#define DEF_GUARD 0 > +#define DEF_GUARD false > #define DEF_LBPU 0 > #define DEF_LBPWS 0 > #define DEF_LBPWS10 0 > @@ -169,7 +169,7 @@ static int scsi_debug_dix = DEF_DIX; > static int scsi_debug_dsense = DEF_D_SENSE; > static int scsi_debug_every_nth = DEF_EVERY_NTH; > static int scsi_debug_fake_rw = DEF_FAKE_RW; > -static int scsi_debug_guard = DEF_GUARD; > +static bool scsi_debug_guard = DEF_GUARD; > static int scsi_debug_lowest_aligned = DEF_LOWEST_ALIGNED; > static int scsi_debug_max_luns = DEF_MAX_LUNS; > static int scsi_debug_max_queue = SCSI_DEBUG_CANQUEUE; > @@ -1735,14 +1735,11 @@ static u16 dif_compute_csum(const void *buf, int len) > { > u16 csum; > > - switch (scsi_debug_guard) { > - case 1: > + if (scsi_debug_guard) > csum = ip_compute_csum(buf, len); > - break; > - case 0: > + else > csum = cpu_to_be16(crc_t10dif(buf, len)); > - break; > - } > + > return csum; > } > > @@ -2736,7 +2733,7 @@ module_param_named(dix, scsi_debug_dix, int, S_IRUGO); > module_param_named(dsense, scsi_debug_dsense, int, S_IRUGO | S_IWUSR); > module_param_named(every_nth, scsi_debug_every_nth, int, S_IRUGO | S_IWUSR); > module_param_named(fake_rw, scsi_debug_fake_rw, int, S_IRUGO | S_IWUSR); > -module_param_named(guard, scsi_debug_guard, int, S_IRUGO); > +module_param_named(guard, scsi_debug_guard, bool, S_IRUGO); > module_param_named(lbpu, scsi_debug_lbpu, int, S_IRUGO); > module_param_named(lbpws, scsi_debug_lbpws, int, S_IRUGO); > module_param_named(lbpws10, scsi_debug_lbpws10, int, S_IRUGO); > @@ -3312,11 +3309,6 @@ static int __init scsi_debug_init(void) > return -EINVAL; > } > > - if (scsi_debug_guard > 1) { > - printk(KERN_ERR "scsi_debug_init: guard must be 0 or 1\n"); > - return -EINVAL; > - } > - > if (scsi_debug_ato > 1) { > printk(KERN_ERR "scsi_debug_init: ato must be 0 or 1\n"); > return -EINVAL; > @@ -4028,7 +4020,7 @@ static int sdebug_driver_probe(struct device * dev) >(host_prot & SHOST_DIX_TYPE2_PROTECTION) ? " DIX2" : "", > (host_prot & SHOST_DIX_TYPE3_PROTECTION) ? " DIX3" : ""); > > - if (scsi_debug_guard == 1) > + if (scsi_debug_guard) > scsi_host_set_guard(hpnt, SHOST_DIX_GUARD_IP); > else > scsi_host_set_guard(hpnt, SHOST_DIX_GUARD_CRC); > > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] scsi: fix the build warning
On Thu, Aug 22, 2013 at 9:55 AM, Joe Perches wrote: > On Wed, 2013-08-21 at 21:25 -0400, Martin K. Petersen wrote: >> >>>>> "Joe" == Joe Perches writes: >> >> Joe> I don't get this build warning in the first place and I think the >> Joe> scsi_debug file is quite old and probably doesn't need to be >> Joe> changed at all. >> >> guard isn't a boolean, it selects the checksum algorithm used. > > OK, maybe this then... > --- > drivers/scsi/scsi_debug.c | 2 +- > 1 file changed, 1 insertions(+), 1 deletions(-) > > diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c > index cb4fefa..6fc2831 100644 > --- a/drivers/scsi/scsi_debug.c > +++ b/drivers/scsi/scsi_debug.c > @@ -3312,7 +3312,7 @@ static int __init scsi_debug_init(void) > return -EINVAL; > } > > - if (scsi_debug_guard > 1) { > + if (scsi_debug_guard < 0 || scsi_debug_guard > 1) { I don't think that it can fix that issue. > printk(KERN_ERR "scsi_debug_init: guard must be 0 or 1\n"); > return -EINVAL; > } > > -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3] xfs: introduce object readahead to log recovery
HI, xfs maintainers, any comments? On Wed, Jul 31, 2013 at 4:42 PM, wrote: > From: Zhi Yong Wu > > It can take a long time to run log recovery operation because it is > single threaded and is bound by read latency. We can find that it took > most of the time to wait for the read IO to occur, so if one object > readahead is introduced to log recovery, it will obviously reduce the > log recovery time. > > Log recovery time stat: > > w/o this patchw/ this patch > > real:0m15.023s 0m7.802s > user:0m0.001s 0m0.001s > sys: 0m0.246s 0m0.107s > > Signed-off-by: Zhi Yong Wu > --- > fs/xfs/xfs_log_recover.c | 159 > +-- > 1 file changed, 153 insertions(+), 6 deletions(-) > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index 7681b19..ebb00bc 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -3116,6 +3116,106 @@ xlog_recover_free_trans( > kmem_free(trans); > } > > +STATIC void > +xlog_recover_buffer_ra_pass2( > + struct xlog *log, > + struct xlog_recover_item*item) > +{ > + struct xfs_buf_log_format *buf_f = item->ri_buf[0].i_addr; > + struct xfs_mount*mp = log->l_mp; > + > + if (xlog_check_buffer_cancelled(log, buf_f->blf_blkno, > + buf_f->blf_len, buf_f->blf_flags)) { > + return; > + } > + > + xfs_buf_readahead(mp->m_ddev_targp, buf_f->blf_blkno, > + buf_f->blf_len, NULL); > +} > + > +STATIC void > +xlog_recover_inode_ra_pass2( > + struct xlog *log, > + struct xlog_recover_item*item) > +{ > + struct xfs_inode_log_format ilf_buf; > + struct xfs_inode_log_format *ilfp; > + struct xfs_mount*mp = log->l_mp; > + int error; > + > + if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) { > + ilfp = item->ri_buf[0].i_addr; > + } else { > + ilfp = &ilf_buf; > + memset(ilfp, 0, sizeof(*ilfp)); > + error = xfs_inode_item_format_convert(&item->ri_buf[0], ilfp); > + if (error) > + return; > + } > + > + if (xlog_check_buffer_cancelled(log, ilfp->ilf_blkno, ilfp->ilf_len, > 0)) > + return; > + > + xfs_buf_readahead(mp->m_ddev_targp, ilfp->ilf_blkno, > + ilfp->ilf_len, &xfs_inode_buf_ops); > +} > + > +STATIC void > +xlog_recover_dquot_ra_pass2( > + struct xlog *log, > + struct xlog_recover_item*item) > +{ > + struct xfs_mount*mp = log->l_mp; > + struct xfs_disk_dquot *recddq; > + struct xfs_dq_logformat *dq_f; > + uinttype; > + > + > + if (mp->m_qflags == 0) > + return; > + > + recddq = item->ri_buf[1].i_addr; > + if (recddq == NULL) > + return; > + if (item->ri_buf[1].i_len < sizeof(struct xfs_disk_dquot)) > + return; > + > + type = recddq->d_flags & (XFS_DQ_USER | XFS_DQ_PROJ | XFS_DQ_GROUP); > + ASSERT(type); > + if (log->l_quotaoffs_flag & type) > + return; > + > + dq_f = item->ri_buf[0].i_addr; > + ASSERT(dq_f); > + ASSERT(dq_f->qlf_len == 1); > + > + xfs_buf_readahead(mp->m_ddev_targp, dq_f->qlf_blkno, > + dq_f->qlf_len, NULL); > +} > + > +STATIC void > +xlog_recover_ra_pass2( > + struct xlog *log, > + struct xlog_recover_item*item) > +{ > + switch (ITEM_TYPE(item)) { > + case XFS_LI_BUF: > + xlog_recover_buffer_ra_pass2(log, item); > + break; > + case XFS_LI_INODE: > + xlog_recover_inode_ra_pass2(log, item); > + break; > + case XFS_LI_DQUOT: > + xlog_recover_dquot_ra_pass2(log, item); > + break; > + case XFS_LI_EFI: > + case XFS_LI_EFD: > + case XFS_LI_QUOTAOFF: > + default: > + break; > + } > +} > + > STATIC int > xlog_recover_commit_pass1( > struct xlog *log, > @@ -3177,6 +3277,26 @@ xlog_recover_com
Re: [PATCH] xfs: fix an assertion failure
No, it still raised the same assertion as below: [ 521.715103] XFS: Assertion failed: !(bip->bli_item.li_flags & XFS_LI_IN_AIL), file: fs/xfs/xfs_buf_item.c, line: 940 [ 521.716378] [ cut here ] [ 521.716934] kernel BUG at fs/xfs/xfs_message.c:108! [ 521.717364] invalid opcode: [#1] SMP [ 521.717364] Modules linked in: [ 521.717364] CPU: 0 PID: 6 Comm: kworker/u2:0 Not tainted 3.11.0-rc2+ #959 [ 521.717364] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 521.717364] Workqueue: writeback bdi_writeback_workfn (flush-8:0) [ 521.717364] task: 880293531fd0 ti: 88029353c000 task.ti: 88029353c000 [ 521.717364] RIP: 0010:[] [] assfail+0x22/0x30 [ 521.717364] RSP: 0018:88029353d5f8 EFLAGS: 00010292 [ 521.717364] RAX: 0068 RBX: 880289f603a0 RCX: 0006 [ 521.717364] RDX: 0006 RSI: 880293532758 RDI: 880293531fd0 [ 521.717364] RBP: 88029353d5f8 R08: R09: 0001 [ 521.717364] R10: R11: R12: 880291e4a480 [ 521.717364] R13: 880289f603a0 R14: 81342b2f R15: 8801b43a0810 [ 521.717364] FS: () GS:88029fc0() knlGS: [ 521.717364] CS: 0010 DS: ES: CR0: 8005003b [ 521.717364] CR2: f75045e4 CR3: 00029094c000 CR4: 06f0 [ 521.717364] Stack: [ 521.717364] 88029353d628 813428df 880289f603a0 0002 [ 521.717364] 880291e4a480 0001 88029353d668 81342b2f [ 521.717364] 880289f603a0 880276231fc0 [ 521.717364] Call Trace: [ 521.717364] [] xfs_buf_item_relse+0x4f/0xd0 [ 521.717364] [] xfs_buf_item_unlock+0x1cf/0x1f0 [ 521.717364] [] xfs_trans_free_items+0x7d/0xb0 [ 521.717364] [] xfs_trans_cancel+0x13c/0x1b0 [ 521.717364] [] xfs_iomap_write_allocate+0x249/0x350 [ 521.717364] [] xfs_map_blocks+0x2e2/0x350 [ 521.717364] [] xfs_vm_writepage+0x236/0x5e0 [ 521.717364] [] __writepage+0x1a/0x50 [ 521.717364] [] write_cache_pages+0x225/0x4a0 [ 521.717364] [] ? mapping_tagged+0x20/0x20 [ 521.717364] [] generic_writepages+0x4d/0x70 [ 521.717364] [] xfs_vm_writepages+0x50/0x70 [ 521.717364] [] do_writepages+0x21/0x50 [ 521.717364] [] __writeback_single_inode+0x40/0x230 [ 521.717364] [] writeback_sb_inodes+0x291/0x460 [ 521.717364] [] __writeback_inodes_wb+0x9f/0xd0 [ 521.717364] [] wb_writeback+0x24b/0x2e0 [ 521.717364] [] ? global_dirtyable_memory+0x1a/0x60 [ 521.717364] [] bdi_writeback_workfn+0x1d6/0x3d0 [ 521.717364] [] process_one_work+0x1eb/0x4f0 [ 521.717364] [] ? process_one_work+0x189/0x4f0 [ 521.717364] [] worker_thread+0x11b/0x370 [ 521.717364] [] ? rescuer_thread+0x330/0x330 [ 521.717364] [] kthread+0xea/0xf0 [ 521.717364] [] ? flush_kthread_work+0x1b0/0x1b0 [ 521.717364] [] ret_from_fork+0x7c/0xb0 [ 521.717364] [] ? flush_kthread_work+0x1b0/0x1b0 [ 521.717364] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 f1 41 89 d0 48 c7 c6 f0 b5 c7 81 48 89 fa 31 c0 48 89 e5 31 ff e8 de fb ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 [ 521.717364] RIP [] assfail+0x22/0x30 [ 521.717364] RSP [ 521.754783] ---[ end trace e13298ca69cdb092 ]--- I applied your patch as below: [root@f17 linux-2.6]# git diff diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c index bfc4e0c..98308f4 100644 --- a/fs/xfs/xfs_buf_item.c +++ b/fs/xfs/xfs_buf_item.c @@ -601,11 +601,9 @@ xfs_buf_item_unlock( } } } - if (clean) - xfs_buf_item_relse(bp); - else if (aborted) { + if (clean || aborted) { if (atomic_dec_and_test(&bip->bli_refcount)) { - ASSERT(XFS_FORCED_SHUTDOWN(lip->li_mountp)); + ASSERT(!aborted || XFS_FORCED_SHUTDOWN(lip->li_mountp)); xfs_buf_item_relse(bp); } } else [root@f17 linux-2.6]# On Fri, Jul 26, 2013 at 10:03 AM, Dave Chinner wrote: > On Thu, Jul 25, 2013 at 09:38:44PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> When running the compilebench, one assertion failure was found. >> This related line of code was introduced by commit 5f6bed76c0. >> >> commit 5f6bed76c0c85cb4d04885a5de00b629deee550b >> Author: Dave Chinner >> Date: Thu Jun 27 16:04:52 2013 +1000 >> >> xfs: Introduce an ordered buffer item > > Ok, so the assert was introduced there, and for good reason: if we > are about to free the xfs_buf_log_item, then it better not still be > referenced by the AIL. > >> XFS: Assertion failed: !(bip->bli_item.li_flags & XFS_LI_IN_AIL), file: >> fs/xfs/xfs_buf_item.c, line: 942 >> [ cut here ] >>
Re: [PATCH] xfs: introduce object readahead to log recovery
Dave, All comments are good to me, and will be applied to next version, thanks a lot. On Fri, Jul 26, 2013 at 10:50 AM, Dave Chinner wrote: > On Thu, Jul 25, 2013 at 04:23:39PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> It can take a long time to run log recovery operation because it is >> single threaded and is bound by read latency. We can find that it took >> most of the time to wait for the read IO to occur, so if one object >> readahead is introduced to log recovery, it will obviously reduce the >> log recovery time. >> >> In dirty log case as below: >> data device: 0xfd10 >> log device: 0xfd10 daddr: 20480032 length: 20480 >> >> log tail: 7941 head: 11077 state: > > That's only a small log (10MB). As I've said on irc, readahead won't Yeah, it is one 10MB log, but how do you calculate it based on the above info? > show any real difference on this and you need to be testing with > large logs. > >> >> This dirty ratio is about 15%. I am trying to do tests in larger scale >> and dirtier filesystem environment. >> >> Log recovery time stat: >> >> w/o this patchw/ this patch >> real 0m1.051s 0m0.965s >> sys 0m0.033s 0m0.035s >> >> iowait 0m1.018s 0m0.930s > > Well, it's not made much of a difference there. That's probably > within the noise of repeated log recovery runs. > > My simple test is: > > $ cat recovery_time.sh > #!/bin/bash > > cd /home/dave/src/compilebench-0.6/ > > mkfs.xfs -f /dev/vdc; > mount /dev/vdc /mnt/scratch > chmod 777 /mnt/scratch; > ./compilebench --no-sync -D /mnt/scratch & > sleep 55 > /home/dave/src/xfstests-dev/src/godown /mnt/scratch > umount /mnt/scratch > xfs_logprint -t /dev/vdc |head -20 > time mount /dev/vdc /mnt/scratch > umount /mnt/scratch > $ > > And the recovery time from this is between 15-17s: > > > log device: 0xfd20 daddr: 107374182032 length: 4173824 >^^^ almost 2GB > log tail: 19288 head: 264809 state: > > real0m17.913s > user0m0.000s > sys 0m2.381s > > And runs at 3-4000 read IOPs for most of that time. It's largely IO > bound, even on SSDs. > > With your patch: > > log tail: 35871 head: 308393 state: > real0m12.715s > user0m0.000s > sys 0m2.247s > > And it peaked at ~5000 read IOPS. How do you know its READ IOPS is ~5000? > > It's definitely an improvement, but there's a lot of dead time still > spent waiting for IO that we should be able to remove. Let's have a > look at the code... > > >> +STATIC void >> +xlog_recover_inode_ra_pass2( >> + struct xlog *log, >> + struct xlog_recover_item*item) >> +{ >> + xfs_inode_log_format_t *in_f; >> + xfs_mount_t *mp = log->l_mp; >> + int error; >> + int need_free = 0; >> + >> + if (item->ri_buf[0].i_len == sizeof(xfs_inode_log_format_t)) { >> + in_f = item->ri_buf[0].i_addr; >> + } else { >> + in_f = kmem_alloc(sizeof(xfs_inode_log_format_t), KM_SLEEP); >> + need_free = 1; >> + error = xfs_inode_item_format_convert(&item->ri_buf[0], in_f); >> + if (error) >> + goto error; >> + } > > I'd just put the conversion buffer on stack and avoid the need to > alloc/free memory here. There's plenty of stack space available > during recovery here, so let's use it to keep the overhead of > readahead down. OK, will use the buffer on stack. > >> +STATIC void >> +xlog_recover_dquot_ra_pass2( >> + struct xlog *log, >> + struct xlog_recover_item*item) >> +{ >> + xfs_mount_t *mp = log->l_mp; >> + xfs_buf_t *bp; >> + struct xfs_disk_dquot *recddq; >> + int error; >> + xfs_dq_logformat_t *dq_f; >> + uinttype; >> + >> + >> + if (mp->m_qflags == 0) >> + return; >> + >> + recddq = item->ri_buf[1].i_addr; >> + if (recddq == NULL) >> + return; >> + if (item->ri_buf[1].i_len < sizeof(xfs_disk_dquot_t)) >> + return; >> + >> + type = recddq->d_flags & (XFS_DQ_U
Re: [PATCH v3] xfs: introduce object readahead to log recovery
On Wed, Aug 14, 2013 at 1:35 PM, Dave Chinner wrote: > On Wed, Jul 31, 2013 at 04:42:45PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> It can take a long time to run log recovery operation because it is >> single threaded and is bound by read latency. We can find that it took >> most of the time to wait for the read IO to occur, so if one object >> readahead is introduced to log recovery, it will obviously reduce the >> log recovery time. >> >> Log recovery time stat: >> >> w/o this patchw/ this patch >> >> real:0m15.023s 0m7.802s >> user:0m0.001s 0m0.001s >> sys: 0m0.246s 0m0.107s > > This version works as advertised as well. > >> @@ -3216,6 +3351,18 @@ xlog_recover_commit_trans( >> goto out; >> } >> >> + if (!list_empty(&ra_list)) { >> + error = xlog_recover_items_pass2(log, trans, >> + &buffer_list, &ra_list); >> + if (error) >> + goto out; >> + >> + list_splice_tail_init(&ra_list, &done_list); >> + } >> + >> + if (!list_empty(&done_list)) >> + list_splice_init(&done_list, &trans->r_itemq); >> + >> xlog_recover_free_trans(trans); > > I think this still leaks the trans structure when an error occurs. > Indeed, I think this is a pre-existing leak, as the current code > will skip freeing the trans structure on item recovery failure and > nothing else frees it. So it appears to me to be busted before this > patch is added. Yes, i also found this and think so. > > Hence on a xlog_recover_items_pass2() error we need to splice the > ra-list to the done_list and free trans. i.e. the "if (error) goto > out;" lines in the above hunk do not need to be there, and the > "out:" label moved to above the call to xlog_recover_free_trans() so > the main loop does the right thing when an error occurs. Do you need to draft one patch to fix trans leaking? or can it be fixed in this patch? or will you draft one patch? > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2 00/12] VFS hot tracking
Ping... On Tue, May 14, 2013 at 8:59 AM, wrote: > From: Zhi Yong Wu > > The patchset is trying to introduce hot tracking function in > VFS layer, which will keep track of real disk I/O in memory. > By it, you will easily know more details about disk I/O, and > then detect where disk I/O hot spots are. Also, specific FS > can take use of it to do accurate defragment, and hot relocation > support, etc. > > After V1 was sent out, Chandra Seetharaman has reviewed and > made a lot of comments, thanks a lot to him. Not it's time to > send out its V2 for external review, any comments or ideas are > appreciated, thanks. > > NOTE: > > The patchset can be obtained via my kernel dev git on github: > git://github.com/wuzhy/kernel.git hot_tracking > If you're interested, you can also review them via > https://github.com/wuzhy/kernel/commits/hot_tracking > > For how to use and more other info and performance report, > please check hot_tracking.txt in Documentation and following > links: > 1.) http://lwn.net/Articles/525651/ > 2.) https://lkml.org/lkml/2012/12/20/199 > > Changelog from v1: > - Refactored to be under RCU [Chandra Seetharaman] > - Merged some code changes [Chandra Seetharaman] > - Fixed some issues [Chandra Seetharaman] > > v1: > - Solved 64 bits inode number issue. [David Sterba] > - Embed struct hot_type in struct file_system_type [Darrick J. Wong] > - Cleanup Some issues [David Sterba] > - Use a static hot debugfs root [Greg KH] > > rfcv4: > - Introduce hot func registering framework [Zhiyong] > - Remove global variable for hot tracking [Zhiyong] > - Add btrfs hot tracking support [Zhiyong] > > rfcv3: > 1.) Rewritten debugfs support based seq_file operation. [Dave Chinner] > 2.) Refactored workqueue support. [Dave Chinner] > 3.) Turn some Micro into be tunable [Zhiyong, Liu Zheng] >TIME_TO_KICK, and HEAT_UPDATE_DELAY > 4.) Cleanedup a lot of other issues [Dave Chinner] > > rfcv2: > 1.) Converted to Radix trees, not RB-tree [Zhiyong, Dave Chinner] > 2.) Added memory shrinker [Dave Chinner] > 3.) Converted to one workqueue to update map info periodically [Dave Chinner] > 4.) Cleanedup a lot of other issues [Dave Chinner] > > rfcv1: > 1.) Reduce new files and put all in fs/hot_tracking.[ch] [Dave Chinner] > 2.) The first three patches can probably just be flattened into one. > [Marco Stornelli , Dave Chinner] > > Zhi Yong Wu (12): > VFS hot tracking: introduce some data structures > VFS hot tracking: add i/o freq tracking hooks > VFS hot tracking: add one workqueue to update hot map > VFS hot tracking: register one shrinker > VFS hot tracking, rcu: introduce one rcu macro for list > VFS hot tracking, seq_file: introduce one set of rcu seq_list > interfaces > VFS hot tracking: add debugfs support > VFS hot tracking: add one ioctl interface > VFS hot tracking, procfs: add two proc interfaces > VFS hot tracking, btrfs: add hot tracking support > VFS hot tracking: add documentation > VFS hot tracking: add fs hot type support > > Documentation/filesystems/00-INDEX |2 + > Documentation/filesystems/hot_tracking.txt | 256 ++ > fs/Makefile|2 +- > fs/btrfs/ctree.h |1 + > fs/btrfs/super.c | 22 +- > fs/compat_ioctl.c |5 + > fs/dcache.c|2 + > fs/direct-io.c |5 + > fs/hot_tracking.c | 1320 > > fs/hot_tracking.h | 67 ++ > fs/ioctl.c | 70 ++ > fs/namei.c |2 + > fs/seq_file.c | 37 + > include/linux/fs.h |5 + > include/linux/hot_tracking.h | 175 > include/linux/rculist.h|5 + > include/linux/seq_file.h |7 + > kernel/sysctl.c| 14 + > mm/filemap.c |6 + > mm/page-writeback.c| 12 + > mm/readahead.c |6 + > 21 files changed, 2019 insertions(+), 2 deletions(-) > create mode 100644 Documentation/filesystems/hot_tracking.txt > create mode 100644 fs/hot_tracking.c > create mode 100644 fs/hot_tracking.h > create mode 100644 include/linux/hot_tracking.h > > -- > 1.7.11.7 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > t
Re: [PATCH v2 00/12] VFS hot tracking
HI, Al Viro. I have incorporated all comments from all reviewers and waited for so long time. If you have no comments, can you merge the patchset? thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v1 00/11] vfs: hot data tracking
On Mon, Sep 17, 2012 at 5:45 PM, Marco Stornelli wrote: > 2012/9/17 : >> From: Zhi Yong Wu >> >> NOTE: >> >> The patchset is currently post out mainly to make sure >> it is going in the correct direction and hope to get some >> helpful comments from other guys. >> >> TODO List: >> >> 1.) Need to do scalability or performance tests. >> 2.) Turn some Micro into tunables >>TIME_TO_KICK, and HEAT_UPDATE_DELAY >> 3.) Rafactor hot_hash_is_aging() >>If you just made the timeout value a timespec and compared >> the _timespecs_, you would be doing a lot fewer conversions. >> 4.) Cleanup some unnecessary lock protect >> 5.) Add more comments to explain how to calc temperature >> >> Ben Chociej, Matt Lupfer and Conor Scott originally wrote this code to >> be very btrfs-specific. I've taken their code and attempted to >> make it more generic and integrate it at the VFS level. >> >> INTRODUCTION: >> >> Essentially, this means maintaining some key stats >> (like number of reads/writes, last read/write time, frequency of >> reads/writes), then distilling those numbers down to a single >> "temperature" value that reflects what data is "hot," and using that >> temperature to move data to SSDs. >> >> The long-term goal of these patches is to allow some FSs, >> e.g. Btrfs to intelligently utilize SSDs in a heterogenous volume. >> Incidentally, this project has been motivated by >> the Project Ideas page on the Btrfs wiki. >> >> Of course, users are warned not to run this code outside of development >> environments. These patches are EXPERIMENTAL, and as such they might eat >> your data and/or memory. That said, the code should be relatively safe >> when the hottrack mount option are disabled. >> >> MOTIVATION: >> >> The overall goal of enabling hot data relocation to SSD has been >> motivated by the Project Ideas page on the Btrfs wiki at >> <https://btrfs.wiki.kernel.org/index.php/Project_ideas>. >> It will divide into two steps. VFS provide hot data tracking function >> while specific FS will provide hot data relocation function. >> So as the first step of this goal, it is hoped that the patchset >> for hot data tracking will eventually mature into VFS. >> >> This is essentially the traditional cache argument: SSD is fast and >> expensive; HDD is cheap but slow. ZFS, for example, can already take >> advantage of SSD caching. Btrfs should also be able to take advantage of >> hybrid storage without many broad, sweeping changes to existing code. >> >> SUMMARY: >> >> - Hooks in existing vfs functions to track data access frequency >> >> - New rbtrees for tracking access frequency of inodes and sub-file >> ranges (hot_rb.c) >> The relationship between super_block and rbtree is as below: >> super_block->s_hotinfo.hot_inode_tree >> In include/linux/fs.h, one struct hot_info s_hotinfo is added to >> super_block struct. Each FS instance can find hot tracking info >> s_hotinfo via its super_block. In this hot_info, it store a lot of hot >> tracking info such as hot_inode_tree, inode and range hash list, etc. >> >> - A hash list for indexing data by its temperature (hot_hash.c) >> >> - A debugfs interface for dumping data from the rbtrees (hot_debugfs.c) >> >> - A background kthread for updating inode heat info >> >> - Mount options for enabling temperature tracking(-o hottrack, default mean >> disabled) >> (hot_track.c) >> >> - An ioctl to retrieve the frequency information collected for a certain >> file >> >> - Ioctls to enable/disable frequency tracking per inode. >> >> Usage syntax: >> >> root@debian-i386:~# mount -o hottrack /dev/sdb /mnt >> [ 1505.894078] device label test devid 1 transid 29 /dev/sdb >> [ 1505.952977] btrfs: disk space caching is enabled >> [ 1506.069678] vfs: turning on hot data tracking >> root@debian-i386:~# mount -t debugfs none /sys/kernel/debug >> root@debian-i386:~# ls -l /sys/kernel/debug/vfs_hotdata/ >> total 0 >> drwxr-xr-x 2 root root 0 Aug 8 04:40 sdb >> root@debian-i386:~# ls -l /sys/kernel/debug/vfs_hotdata/sdb >> total 0 >> -rw-r--r-- 1 root root 0 Aug 8 04:40 inode_data >> -rw-r--r-- 1 root root 0 Aug 8 04:40 range_data >> root@debian-i386:~# vi /mnt/file >> root@debian-i386:~# cat /sys/kernel/debug/hot_track/sdb/inode_data >> inode #279, reads 0, writes 1, avg read time 18446744073709551615, >
Re: [RFC v1 00/11] vfs: hot data tracking
On Tue, Sep 18, 2012 at 5:30 AM, Dave Chinner wrote: > On Mon, Sep 17, 2012 at 03:18:34PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> NOTE: >> >> The patchset is currently post out mainly to make sure >> it is going in the correct direction and hope to get some >> helpful comments from other guys. >> >> TODO List: >> >> 1.) Need to do scalability or performance tests. >> 2.) Turn some Micro into tunables >>TIME_TO_KICK, and HEAT_UPDATE_DELAY >> 3.) Rafactor hot_hash_is_aging() >>If you just made the timeout value a timespec and compared >> the _timespecs_, you would be doing a lot fewer conversions. >> 4.) Cleanup some unnecessary lock protect >> 5.) Add more comments to explain how to calc temperature > > 0) Documentation. > >> Zhi Yong Wu (11): >> vfs: introduce one structure hot_info >> vfs: introduce one rb tree - hot_inode_tree >> vfs: introduce 2 rb tree items - inode and range > > These three patches can probably just be flattened into one. > Splitting out the first two doesn't add to the clarity of the > series - add the structures in the patch that actually uses them > so we don't ahve to jump between patches to see how they are used. It make sense to me, OK, thanks. > >> vfs: add support for updating access frequency >> vfs: add one new mount option '-o hottrack' >> vfs: add init and exit support >> vfs: introduce one hash table >> vfs: enable hot data tracking >> vfs: fork one private kthread to update temperature info >> vfs: add 3 new ioctl interfaces >> vfs: add debugfs support >> >> fs/Makefile |3 +- >> fs/compat_ioctl.c |8 + >> fs/dcache.c |2 + >> fs/direct-io.c| 10 + >> fs/hot_debugfs.c | 487 ++ >> fs/hot_debugfs.h | 60 + >> fs/hot_hash.c | 369 ++ >> fs/hot_hash.h | 108 >> fs/hot_rb.c | 648 >> + >> fs/hot_rb.h | 70 + >> fs/hot_track.c| 113 >> fs/hot_track.h| 23 ++ >> fs/ioctl.c| 132 + >> fs/namespace.c| 10 + >> fs/super.c| 11 + >> include/linux/fs.h| 15 + >> include/linux/hot_track.h | 169 >> mm/filemap.c |8 + >> mm/page-writeback.c | 21 ++ >> mm/readahead.c|9 + >> 20 files changed, 2275 insertions(+), 1 deletions(-) >> create mode 100644 fs/hot_debugfs.c >> create mode 100644 fs/hot_debugfs.h >> create mode 100644 fs/hot_hash.c >> create mode 100644 fs/hot_hash.h >> create mode 100644 fs/hot_rb.c >> create mode 100644 fs/hot_rb.h >> create mode 100644 fs/hot_track.c >> create mode 100644 fs/hot_track.h >> create mode 100644 include/linux/hot_track.h > > So, 9 new files for tracking all of this? I'm not sure that so > many new files are needed - putting it all in fs/hot_tracking.[ch] > might make more sense, or if all 8 fs/hot* files remain, putting > them in their own subdirectory might be an idea (like quota). If all functions are in two files, they will be large and the logic is not so clear. so i prefer the latter. thanks. > > I'll comment on the code when I get a bit of time to look at it. Great, very look forward to your more comments. > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v1 00/11] vfs: hot data tracking
On Tue, Sep 18, 2012 at 2:20 PM, Dave Chinner wrote: > On Tue, Sep 18, 2012 at 10:24:55AM +0800, Zhi Yong Wu wrote: >> On Tue, Sep 18, 2012 at 5:30 AM, Dave Chinner wrote: >> > On Mon, Sep 17, 2012 at 03:18:34PM +0800, zwu.ker...@gmail.com wrote: >> >> 20 files changed, 2275 insertions(+), 1 deletions(-) >> >> create mode 100644 fs/hot_debugfs.c >> >> create mode 100644 fs/hot_debugfs.h >> >> create mode 100644 fs/hot_hash.c >> >> create mode 100644 fs/hot_hash.h >> >> create mode 100644 fs/hot_rb.c >> >> create mode 100644 fs/hot_rb.h >> >> create mode 100644 fs/hot_track.c >> >> create mode 100644 fs/hot_track.h >> >> create mode 100644 include/linux/hot_track.h >> > >> > So, 9 new files for tracking all of this? I'm not sure that so >> > many new files are needed - putting it all in fs/hot_tracking.[ch] >> > might make more sense, or if all 8 fs/hot* files remain, putting >> > them in their own subdirectory might be an idea (like quota). >> If all functions are in two files, they will be large and the logic is >> not so clear. so i prefer the latter. thanks. > > A single file is much easier to handle when searching and editting, > though. And realistically: > > $ wc -l fs/*.c |sort -nr -k 1 |head -10 > 62483 total >4000 fs/namei.c >3281 fs/buffer.c >3169 fs/dcache.c >2677 fs/namespace.c >2364 fs/locks.c >2321 fs/exec.c >2120 fs/binfmt_elf.c >2053 fs/splice.c >1934 fs/eventpoll.c > $ > > There are plenty of files for a single function/subsystem of around > the aggregate size of this code, so everyone is used to dealing with > this amount of logic for a particular feature in a single file. > > I much prefer it that way, to tell the truth, because I find it > fairly common to have 5-10 files open at once when tracking > something through, say, the buffered IO path. If I've now got to > have another 5-10 files open just to track that same code path > through this index, then that makes it much more difficult to > manage than if were all in one .c file. > > IOWs, splitting code up into multiple files is not a win when the > resultant files are all small and interdependent and you need to > look at multiple files at the same time to understand what the code > does as a whole Just completed move them to one separate directory fs/hot_tracking/. But what you said makes sense to me, OK, i will adopt your suggestion, thanks. > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 02/13] vfs: introduce private radix tree structures
On Wed, Oct 10, 2012 at 11:34 PM, David Sterba wrote: > On Wed, Oct 10, 2012 at 06:07:24PM +0800, zwu.ker...@gmail.com wrote: >> +void hot_track_init(struct super_block *sb) >> +{ > ... >> +} > >> +void hot_track_exit(struct super_block *sb) >> +{ >> + hot_cache_exit(); >> +} > > Needs to be exported if btrfs is built as a module, otherwise does not > link > > LDS arch/x86/boot/compressed/vmlinux.lds > AS arch/x86/boot/compressed/head_64.o > CC arch/x86/boot/compressed/misc.o > CC arch/x86/boot/compressed/string.o > CC arch/x86/boot/compressed/cmdline.o > CC arch/x86/boot/compressed/early_serial_console.o > OBJCOPY arch/x86/boot/compressed/vmlinux.bin > ERROR: "hot_track_init" [fs/btrfs/btrfs.ko] undefined! > ERROR: "hot_track_exit" [fs/btrfs/btrfs.ko] undefined! > make[1]: *** [__modpost] Error 1 > make: *** [modules] Error 2 > make: *** Waiting for unfinished jobs Sorry for late response at first. Great, thanks. > > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Thu, Oct 11, 2012 at 12:28 AM, David Sterba wrote: > Hi, > > On Wed, Oct 10, 2012 at 06:07:23PM +0800, zwu.ker...@gmail.com wrote: >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args { >> #define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20) >> #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21) >> #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22) >> +#define BTRFS_MOUNT_HOT_TRACK(1 << 23) > > Please don't forget to add new options to btrfs_show_options(), > otherwise we can't tell what filesystems have hot tracking enabled. Great catch, thanks. > >> --- a/fs/btrfs/super.c >> +++ b/fs/btrfs/super.c >> @@ -303,7 +304,7 @@ enum { >> Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, >> Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, >> Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, >> - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, >> + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, > > Please add the new option to the end. OK. > >> Opt_check_integrity, Opt_check_integrity_including_extent_data, >> Opt_check_integrity_print_mask, Opt_fatal_errors, >> Opt_err, >> @@ -342,6 +343,7 @@ static match_table_t tokens = { >> {Opt_no_space_cache, "nospace_cache"}, >> {Opt_recovery, "recovery"}, >> {Opt_skip_balance, "skip_balance"}, >> + {Opt_hot_track, "hot_track"}, > > (also this one) ditto. > >> {Opt_check_integrity, "check_int"}, >> {Opt_check_integrity_including_extent_data, "check_int_data"}, >> {Opt_check_integrity_print_mask, "check_int_print_mask=%d"}, >> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char >> *options) >> case Opt_skip_balance: >> btrfs_set_opt(info->mount_opt, SKIP_BALANCE); >> break; >> + case Opt_hot_track: > > It's a common patter in the surrounding code that a message is printed > when enabling options, but the vfs prints its own, so I'm not sure if > it's needed here as well. Just thinking, leave it as it is now. OK > >> + btrfs_set_opt(info->mount_opt, HOT_TRACK); >> + break; >> #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY >> case Opt_check_integrity_including_extent_data: >> printk(KERN_INFO "btrfs: enabling check integrity" > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Thu, Oct 11, 2012 at 12:28 AM, David Sterba wrote: > Hi, > > On Wed, Oct 10, 2012 at 06:07:23PM +0800, zwu.ker...@gmail.com wrote: >> --- a/fs/btrfs/ctree.h >> +++ b/fs/btrfs/ctree.h >> @@ -1726,6 +1726,7 @@ struct btrfs_ioctl_defrag_range_args { >> #define BTRFS_MOUNT_CHECK_INTEGRITY (1 << 20) >> #define BTRFS_MOUNT_CHECK_INTEGRITY_INCLUDING_EXTENT_DATA (1 << 21) >> #define BTRFS_MOUNT_PANIC_ON_FATAL_ERROR (1 << 22) >> +#define BTRFS_MOUNT_HOT_TRACK(1 << 23) > > Please don't forget to add new options to btrfs_show_options(), > otherwise we can't tell what filesystems have hot tracking enabled. > >> --- a/fs/btrfs/super.c >> +++ b/fs/btrfs/super.c >> @@ -303,7 +304,7 @@ enum { >> Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, >> Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, >> Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, >> - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, >> + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, > > Please add the new option to the end. To be honest, it can't be added to the end, if you check Opt_err's pattern value, you will find it is NULL, it will cause match_one() return 1. So if we add Opt_hot_track to the end of this array, it will be covered by match_token(), so i prefer to add it to Opt_fatal_errors. Do you think of it? > >> Opt_check_integrity, Opt_check_integrity_including_extent_data, >> Opt_check_integrity_print_mask, Opt_fatal_errors, >> Opt_err, >> @@ -342,6 +343,7 @@ static match_table_t tokens = { >> {Opt_no_space_cache, "nospace_cache"}, >> {Opt_recovery, "recovery"}, >> {Opt_skip_balance, "skip_balance"}, >> + {Opt_hot_track, "hot_track"}, > > (also this one) > >> {Opt_check_integrity, "check_int"}, >> {Opt_check_integrity_including_extent_data, "check_int_data"}, >> {Opt_check_integrity_print_mask, "check_int_print_mask=%d"}, >> @@ -553,6 +555,9 @@ int btrfs_parse_options(struct btrfs_root *root, char >> *options) >> case Opt_skip_balance: >> btrfs_set_opt(info->mount_opt, SKIP_BALANCE); >> break; >> + case Opt_hot_track: > > It's a common patter in the surrounding code that a message is printed > when enabling options, but the vfs prints its own, so I'm not sure if > it's needed here as well. Just thinking, leave it as it is now. > >> + btrfs_set_opt(info->mount_opt, HOT_TRACK); >> + break; >> #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY >> case Opt_check_integrity_including_extent_data: >> printk(KERN_INFO "btrfs: enabling check integrity" > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 01/13] btrfs: add one new mount option '-o hot_track'
On Thu, Oct 11, 2012 at 10:41 PM, David Sterba wrote: > On Thu, Oct 11, 2012 at 10:35:28PM +0800, Zhi Yong Wu wrote: >> >> --- a/fs/btrfs/super.c >> >> +++ b/fs/btrfs/super.c >> >> @@ -303,7 +304,7 @@ enum { >> >> Opt_notreelog, Opt_ratio, Opt_flushoncommit, Opt_discard, >> >> Opt_space_cache, Opt_clear_cache, Opt_user_subvol_rm_allowed, >> >> Opt_enospc_debug, Opt_subvolrootid, Opt_defrag, Opt_inode_cache, >> >> - Opt_no_space_cache, Opt_recovery, Opt_skip_balance, >> >> + Opt_no_space_cache, Opt_recovery, Opt_skip_balance, Opt_hot_track, >> > >> > Please add the new option to the end. >> To be honest, it can't be added to the end, if you check Opt_err's >> pattern value, you will find it is NULL, it will cause match_one() >> return 1. So if we add Opt_hot_track to the end of this array, it will >> be covered by match_token(), so i prefer to add it to >> Opt_fatal_errors. Do you think of it? > > Ah, sorry, I was not clear what the 'end' means here. The Opt_err is a > end-of-the-list token, so yes please add hot_track between > Opt_fatal_errors and Opt_err. Done, thanks. It will be appreciated if you can make comments on other patches of this series.:) > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 13/13] vfs: add documentation
On Mon, Oct 15, 2012 at 8:35 AM, Zheng Liu wrote: > Hi Zhi Yong, > > [cut...] >> +3. The Design >> + >> +These include the following parts: >> + >> +* Hooks in existing vfs functions to track data access frequency >> + >> +* New rbtrees for tracking access frequency of inodes and sub-file > ^^^ s/rbtrees/radix-trees >> +ranges (hot_rb.c) > Now it seems that all codes are in the same file. HI, Zheng, Good catch, i will update them, thanks. > > Regards, > Zheng -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 00/13] vfs: hot data tracking
On Mon, Oct 15, 2012 at 8:39 AM, Zheng Liu wrote: > On Wed, Oct 10, 2012 at 06:07:22PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> NOTE: >> >> The patchset is currently post out mainly to make sure >> it is going in the correct direction and hope to get some >> helpful comments from other guys. >> For more infomation, please check hot_tracking.txt in Documentation > > Hi Zhi Yong, > > If I want to use this patch set in ext4, could I apply this patch set > directly or I need to call some functions like in btrfs. Thanks. Hi ,Zheng, It is the latter. If you have any questions, please feel free to catch me. > > Regards, > Zheng -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Mon, Oct 15, 2012 at 3:48 PM, Dave Chinner wrote: > On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> FS_IOC_GET_HEAT_INFO: return a struct containing the various >> metrics collected in btrfs_freq_data structs, and also return a >> calculated data temperature based on those metrics. Optionally, retrieve >> the temperature from the hot data hash list instead of recalculating it. >> >> FS_IOC_GET_HEAT_OPTS: return an integer representing the current >> state of hot data tracking and migration: >> >> 0 = do nothing >> 1 = track frequency of access >> >> FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and >> migration, as described above. > . >> +struct hot_heat_info { >> + __u64 avg_delta_reads; >> + __u64 avg_delta_writes; >> + __u64 last_read_time; >> + __u64 last_write_time; >> + __u32 num_reads; >> + __u32 num_writes; >> + __u32 temperature; >> + __u8 live; >> + char filename[PATH_MAX]; > > Don't put the filename in the ioctl and open the file in the kernel. > Have userspace open the file directly and issue the ioctl on the fd > that is returned. OK, thanks. By the way, do you think that it is necessary to provide another new ioctl interface to set the temperature value? > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 12/13] vfs: add debugfs support
On Mon, Oct 15, 2012 at 3:55 PM, Dave Chinner wrote: > On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Add a /sys/kernel/debug/hot_track// directory for each >> volume that contains two files. The first, `inode_data', contains the >> heat information for inodes that have been brought into the hot data map >> structures. The second, `range_data', contains similar information for >> subfile ranges. > >> + /* create debugfs range_data file */ >> + debugfs_range_entry = debugfs_create_file("range_data", >> + S_IFREG | S_IRUSR | S_IWUSR | S_IRUGO, >> + debugfs_volume_entry, >> + (void *) range_data, >> + &hot_debugfs_range_fops); > > These should not be world readable. 0600 is probably the correct > permissions for them as we do not want random users to be able to > infer what files users are accessing from this information. Good catch, its mode should be S_IFREG | S_IRUSR | S_IWUSR > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 12/13] vfs: add debugfs support
On Mon, Oct 15, 2012 at 4:04 PM, Dave Chinner wrote: > On Wed, Oct 10, 2012 at 06:07:34PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> Add a /sys/kernel/debug/hot_track// directory for each >> volume that contains two files. The first, `inode_data', contains the >> heat information for inodes that have been brought into the hot data map >> structures. The second, `range_data', contains similar information for >> subfile ranges. >> >> Signed-off-by: Zhi Yong Wu >> --- >> fs/hot_tracking.c | 462 >> + >> fs/hot_tracking.h | 43 + >> 2 files changed, 505 insertions(+), 0 deletions(-) > . >> +static int hot_debugfs_copy(struct debugfs_vol_data *data, char *msg, int >> len) >> +{ >> + struct lstring *debugfs_log = data->debugfs_log; >> + uint new_log_alloc_size; >> + char *new_log; >> + static char err_msg[] = "No more memory!\n"; >> + >> + if (len >= data->log_alloc_size - debugfs_log->len) { > .. >> + } >> + >> + memcpy(debugfs_log->str + debugfs_log->len, data->log_work_buff, len); >> + debugfs_log->len += (unsigned long) len; >> + >> + return len; >> +} >> + >> +/* Returns the number of bytes written to the log. */ >> +static int hot_debugfs_log(struct debugfs_vol_data *data, const char *fmt, >> ...) >> +{ >> + struct lstring *debugfs_log = data->debugfs_log; >> + va_list args; >> + int len; >> + static char trunc_msg[] = >> + "The next message has been truncated.\n"; >> + >> + if (debugfs_log->str == NULL) >> + return -1; >> + >> + spin_lock(&data->log_lock); >> + >> + va_start(args, fmt); >> + len = vsnprintf(data->log_work_buff, >> + sizeof(data->log_work_buff), fmt, args); >> + va_end(args); >> + >> + if (len >= sizeof(data->log_work_buff)) { >> + hot_debugfs_copy(data, trunc_msg, sizeof(trunc_msg)); >> + } >> + >> + len = hot_debugfs_copy(data, data->log_work_buff, len); >> + spin_unlock(&data->log_lock); >> + >> + return len; >> +} > > Aren't you just recreating seq_printf() here? i.e. can't you replace > all this complexity with generic seq_file/seq_operations constructs? It seems to be a good suggestion, let me try it. thanks. > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC v3 11/13] vfs: add 3 new ioctl interfaces
On Tue, Oct 16, 2012 at 11:17 AM, Dave Chinner wrote: > On Wed, Oct 10, 2012 at 06:07:33PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> >> FS_IOC_GET_HEAT_INFO: return a struct containing the various >> metrics collected in btrfs_freq_data structs, and also return a > > I think you mean hot_freq_data :P Yeah, sorry. > >> calculated data temperature based on those metrics. Optionally, retrieve >> the temperature from the hot data hash list instead of recalculating it. > > To get the heat info for a specific file you have to know what file > you want to get that info for, right? I can see the usefulness of Yes. > asking for the heat data on a specific file, but how do you find the > hot files in the first place? i.e. the big question the user > interface needs to answer is "what files are hot?". We only tell the user what the files' temperatures are, not what files are hot. Their temperatures are in the output of debugfs. > > Once userspace knows what the hottest files are, it can open them If the user need to know this type of info, it is easy for us to provide it. But i don't know what way the user hope to get it via. > and query the data via the above ioctl, but expecting userspace to > iterate millions of inodes in a filesystem to find hot files is very > inefficient. > > FWIW, if you were to return file handles to the hottest files, then > the application could open and query them without even needing to > know the path name to them. This woul dbe exceedingly useful for > defragmentation programs, especially as that is the way xfs_fsr > already operates on candidate files.(*) ah. > > IOWs, sometimes the pathname is irrelevant to the operations that > applications want to perform - all they care about having an > efficient method of finding the inode they want and getting a file > descriptor that points to the file. Given the heat map info fits > right in to the sort of operations defrag and data mover tools > already do, it kind of makes sense to optimise the interface towards > those uses > > (*) i.e. finds them via bulkstat which returns handle information > along with all the other inode data, then opens the file by handle > to do the defrag work OK. > >> FS_IOC_GET_HEAT_OPTS: return an integer representing the current >> state of hot data tracking and migration: >> >> 0 = do nothing >> 1 = track frequency of access >> >> FS_IOC_SET_HEAT_OPTS: change the state of hot data tracking and >> migration, as described above. > > I can't see how this is a manageable interface. It is not > persistent, so after every filesystem mount you'd have to set the > flag on all your inodes again. Hence, for the moment, I'd suggest > that dropping per-inode tracking control until all the core issues > are sorted out OK. > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND v1 02/16] vfs: add init and cleanup functions
On Thu, Jan 10, 2013 at 8:48 AM, David Sterba wrote: > On Thu, Dec 20, 2012 at 10:43:21PM +0800, zwu.ker...@gmail.com wrote: >> From: Zhi Yong Wu >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -107,3 +189,38 @@ err: >> kmem_cache_destroy(hot_inode_item_cachep); >> } >> EXPORT_SYMBOL_GPL(hot_cache_init); >> + >> +/* >> + * Initialize the data structures for hot data tracking. >> + */ >> +int hot_track_init(struct super_block *sb) >> +{ >> + struct hot_info *root; >> + int ret = -ENOMEM; >> + >> + root = kzalloc(sizeof(struct hot_info), GFP_NOFS); >> + if (!root) { >> + printk(KERN_ERR "%s: Failed to malloc memory for " >> + "hot_info\n", __func__); >> + return ret; >> + } >> + >> + hot_inode_tree_init(root); > > This function is supposed to be called from the filesystem init, please > add a sanity check that would catch multiple initialization attempts. Good catch, thanks. Done. > >> + >> + sb->s_hot_root = root; >> + >> + printk(KERN_INFO "VFS: Turning on hot data tracking\n"); >> + >> + return 0; >> +} >> +EXPORT_SYMBOL_GPL(hot_track_init); >> + >> +void hot_track_exit(struct super_block *sb) >> +{ >> + struct hot_info *root = sb->s_hot_root; > > another sanity check to catch the opposite. ditto. > > Why? The option is parsed and enabled from the filesystems, due to > unexpected bugs eg with remounting or incorrectly handled error paths, > vfs layer should IMHO rather warn than crash. thanks for your expalaination. > >> + >> + hot_inode_tree_exit(root); >> + sb->s_hot_root = NULL; >> + kfree(root); >> +} >> +EXPORT_SYMBOL_GPL(hot_track_exit); > > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND v1 03/16] vfs: add I/O frequency update function
gt;> + spin_lock(&he->lock); >> + hot_range_item_init(entry, start, he); >> + rb_link_node(&entry->hot_range.rb_node, parent, p); >> + rb_insert_color(&entry->hot_range.rb_node, >> + &he->hot_range_tree.map); >> + spin_unlock(&he->lock); >> + >> + kref_get(&entry->hot_range.refs); > > and here Done > >> + return entry; >> +} >> + >> +/* >> + * This function does the actual work of updating >> + * the frequency numbers, whatever they turn out to be. > > Can this function be described a bit better? This comment did not help. OK, i will > >> + */ >> +static void hot_rw_freq_calc(struct timespec old_atime, >> + struct timespec cur_time, u64 *avg) >> +{ >> + struct timespec delta_ts; >> + u64 new_delta; >> + >> + delta_ts = timespec_sub(cur_time, old_atime); >> + new_delta = timespec_to_ns(&delta_ts) >> FREQ_POWER; >> + >> + *avg = (*avg << FREQ_POWER) - *avg + new_delta; >> + *avg = *avg >> FREQ_POWER; >> +} >> + >> +static void hot_freq_data_update(struct hot_freq_data *freq_data, bool >> write) >> +{ >> + struct timespec cur_time = current_kernel_time(); >> + >> + if (write) { >> + freq_data->nr_writes += 1; > > The preferred style is > > freq_data->nr_writes++ OK, done. > >> + hot_rw_freq_calc(freq_data->last_write_time, >> + cur_time, >> + &freq_data->avg_delta_writes); >> + freq_data->last_write_time = cur_time; >> + } else { >> + freq_data->nr_reads += 1; > > (...) > >> + hot_rw_freq_calc(freq_data->last_read_time, >> + freq_data->last_read_time, >> + cur_time, >> + &freq_data->avg_delta_reads); >> + freq_data->last_read_time = cur_time; >> + } >> +} >> + > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND v1 05/16] vfs: add hooks to enable hot tracking
On Thu, Jan 10, 2013 at 8:52 AM, David Sterba wrote: > On Thu, Dec 20, 2012 at 10:43:24PM +0800, zwu.ker...@gmail.com wrote: >> --- a/fs/direct-io.c >> +++ b/fs/direct-io.c >> @@ -37,6 +37,7 @@ >> #include >> #include >> #include >> +#include "hot_tracking.h" >> >> /* >> * How many user pages to map in one call to get_user_pages(). This >> determines >> @@ -1299,6 +1300,11 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, >> struct inode *inode, >> prefetch(bdev->bd_queue); >> prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES); >> >> + /* Hot data tracking */ >> + hot_update_freqs(inode, offset, >> + iov_length(iov, nr_segs), >> + rw & WRITE); > > hot_update_freqs takes an 'int rw' directly, so you should pass plain > 'rw' here and do the 'rw & WRITE' check in hot_freq_data_update itself. OK, done. > >> + >> return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, >>nr_segs, get_block, end_io, >>submit_io, flags); >> --- a/mm/page-writeback.c >> +++ b/mm/page-writeback.c >> @@ -35,6 +35,7 @@ >> #include /* __set_page_dirty_buffers */ >> #include >> #include >> +#include >> #include >> >> /* >> @@ -1902,13 +1903,24 @@ EXPORT_SYMBOL(generic_writepages); >> int do_writepages(struct address_space *mapping, struct writeback_control >> *wbc) >> { >> int ret; >> + loff_t start = 0; >> + size_t count = 0; >> >> if (wbc->nr_to_write <= 0) >> return 0; >> + >> + start = mapping->writeback_index << PAGE_CACHE_SHIFT; >> + count = wbc->nr_to_write; >> + >> if (mapping->a_ops->writepages) >> ret = mapping->a_ops->writepages(mapping, wbc); >> else >> ret = generic_writepages(mapping, wbc); >> + >> + /* Hot data tracking */ >> + hot_update_freqs(mapping->host, start, >> + (count - wbc->nr_to_write) * PAGE_CACHE_SIZE, 1); > > I think the frequencies should not be updated in case of error returned > from writepages. OK, Done. > >> + >> return ret; >> } >> >> --- a/mm/readahead.c >> +++ b/mm/readahead.c >> @@ -138,6 +139,12 @@ static int read_pages(struct address_space *mapping, >> struct file *filp, >> out: >> blk_finish_plug(&plug); >> >> + /* Hot data tracking */ >> + hot_update_freqs(mapping->host, >> + (loff_t)(list_entry(pages->prev, struct page, lru)->index) >> + << PAGE_CACHE_SHIFT, >> + (size_t)nr_pages * PAGE_CACHE_SIZE, 0); > > same comment here Ditto. thanks. > >> + >> return ret; >> } > > > david -- Regards, Zhi Yong Wu -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RESEND v1 06/16] vfs: add temp calculation function
On Thu, Jan 10, 2013 at 8:53 AM, David Sterba wrote: > On Thu, Dec 20, 2012 at 10:43:25PM +0800, zwu.ker...@gmail.com wrote: >> --- a/fs/hot_tracking.c >> +++ b/fs/hot_tracking.c >> @@ -25,6 +25,14 @@ >> static struct kmem_cache *hot_inode_item_cachep __read_mostly; >> static struct kmem_cache *hot_range_item_cachep __read_mostly; >> >> +static u64 hot_raw_shift(u64 counter, u32 bits, bool dir) >> +{ >> + if (dir) >> + return counter << bits; >> + else >> + return counter >> bits; >> +} > > I don't understand the purpose of this function, it obscures a simple > bitwise shift. The following words seem to mean that you prefer removing this shifting function? > >> + >> /* >> * Initialize the inode tree. Should be called for each new inode >> * access or other user of the hot_inode interface. >> @@ -315,6 +323,72 @@ static void hot_freq_data_update(struct hot_freq_data >> *freq_data, bool write) >> } >> >> /* >> + * hot_temp_calc() is responsible for distilling the six heat >> + * criteria down into a single temperature value for the data, >> + * which is an integer between 0 and HEAT_MAX_VALUE. > > I didn't find HEAT_MAX_VALUE defined anywhere. This micro is only used in some comments, OK, i will replace it with 255. > >> + */ >> +static u32 hot_temp_calc(struct hot_freq_data *freq_data) >> +{ >> + u32 result = 0; >> + >> + struct timespec ckt = current_kernel_time(); >> + u64 cur_time = timespec_to_ns(&ckt); >> + >> + u32 nrr_heat = (u32)hot_raw_shift((u64)freq_data->nr_reads, >> + NRR_MULTIPLIER_POWER, true); >> + u32 nrw_heat = (u32)hot_raw_shift((u64)freq_data->nr_writes, >> + NRW_MULTIPLIER_POWER, true); > > So many typecasts, some of them unnecessary and in connection with > hot_raw_shift this is hard to read and understand. > > u32 nrr_heat = (u32)((u64)freq_data->nr_reads << > NRR_MULTIPLIER_POWER); Do you prefer this format instead of ? > > is not much better without a comment why this is doing the right thing. > >> + >> + u64 ltr_heat = >> + hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_read_time)), >> + LTR_DIVIDER_POWER, false); >> + u64 ltw_heat = >> + hot_raw_shift((cur_time - timespec_to_ns(&freq_data->last_write_time)), >> + LTW_DIVIDER_POWER, false); >> + >> + u64 avr_heat = >> + hot_raw_shiftu64) -1) - freq_data->avg_delta_reads), >> + AVR_DIVIDER_POWER, false); >> + u64 avw_heat = >> + hot_raw_shiftu64) -1) - freq_data->avg_delta_writes), >> + AVW_DIVIDER_POWER, false); >> + >> + /* ltr_heat is now guaranteed to be u32 safe */ >> + if (ltr_heat >= hot_raw_shift((u64) 1, 32, true)) >> + ltr_heat = 0; >> + else >> + ltr_heat = hot_raw_shift((u64) 1, 32, true) - ltr_heat; >> + >> + /* ltw_heat is now guaranteed to be u32 safe */ >> + if (ltw_heat >= hot_raw_shift((u64) 1, 32, true)) >> + ltw_heat = 0; >> + else >> + ltw_heat = hot_raw_shift((u64) 1, 32, true) - ltw_heat; >> + >> + /* avr_heat is now guaranteed to be u32 safe */ >> + if (avr_heat >= hot_raw_shift((u64) 1, 32, true)) >> + avr_heat = (u32) -1; >> + >> + /* avw_heat is now guaranteed to be u32 safe */ >> + if (avw_heat >= hot_raw_shift((u64) 1, 32, true)) >> + avw_heat = (u32) -1; >> + >> + nrr_heat = (u32)hot_raw_shift((u64)nrr_heat, >> + (3 - NRR_COEFF_POWER), false); >> + nrw_heat = (u32)hot_raw_shift((u64)nrw_heat, >> + (3 - NRW_COEFF_POWER), false); >> + ltr_heat = hot_raw_shift(ltr_heat, (3 - LTR_COEFF_POWER), false); >> + ltw_heat = hot_raw_shift(ltw_heat, (3 - LTW_COEFF_POWER), false); >> + avr_heat = hot_raw_shift(avr_heat, (3 - AVR_COEFF_POWER), false); >> + avw_heat = hot_raw_shift(avw_heat, (3 - AVW_COEFF_POWER), false); >> + >> + result = nrr_heat + nrw_heat + (u32) ltr_heat + >> + (u32) ltw_heat + (u32) avr_heat + (u32) avw_heat; > > Reading through the function up to here I've got lost in the shifts that > I don't see the meaning of the resulting value and how can I interpet it > if I watch it change over time. What are the expected we