Hi Chao, On Sun, Jan 04, 2015 at 04:24:10PM +0800, Chao Yu wrote: > Hi Jaegeuk, > > > -----Original Message----- > > From: Jaegeuk Kim [mailto:jaeg...@kernel.org] > > Sent: Wednesday, December 31, 2014 4:26 PM > > To: Chao Yu > > Cc: 'Changman Lee'; linux-f2fs-de...@lists.sourceforge.net; > > linux-kernel@vger.kernel.org > > Subject: Re: [RFC PATCH] f2fs: add extent cache base on rb-tree > > > > Hi Chao, > > > > On Tue, Dec 30, 2014 at 06:10:21PM +0800, Chao Yu wrote: > > > Hi Jaegeuk, > > > > > > > -----Original Message----- > > > > From: Jaegeuk Kim [mailto:jaeg...@kernel.org] > > > > Sent: Tuesday, December 30, 2014 5:23 AM > > > > To: Chao Yu > > > > Cc: 'Changman Lee'; linux-f2fs-de...@lists.sourceforge.net; > > > > linux-kernel@vger.kernel.org > > > > Subject: Re: [RFC PATCH] f2fs: add extent cache base on rb-tree > > > > > > > > Hi Chao, > > > > > > > > On Mon, Dec 29, 2014 at 03:19:18PM +0800, Chao Yu wrote: > > > > > > > > [snip] > > > > > > > > Nice draft. :) > > > > > > Thanks! :) > > > > > > > > > > > > > > > > > Please see the draft below. > > > > > > > > > > 1) Extent management: > > > > > If we use global management that managing all extents which are from > > > > > different > > > > > inodes in sbi, we will face with serious lock contention when we > > > > > access these > > > > > extents belong to different inodes concurrently, the loss may > > > > > outweights the > > > > > gain. > > > > > > > > Agreed. > > > > > > > > > So we choose a local management for extent which means all extents are > > > > > managed by inode itself to avoid above lock contention. Addtionlly, > > > > > we manage > > > > > all extents globally by linking all inode into a global lru list for > > > > > extent > > > > > cache shrinker. > > > > > Approach: > > > > > a) build extent tree/rwlock/lru list/extent count in each inode. > > > > > *extent tree: link all extent in rb-tree; > > > > > *rwlock: protect fields when accessing extent cache > > > > > concurrently; > > > > > *lru list: sort all extents in accessing time order; > > > > > *extent count: record total count of extents in cache. > > > > > b) use lru shrink list in sbi to manage all inode which cached > > > > > extents. > > > > > *inode will be added or repostioned in this global list > > > > > whenever > > > > > extent is being access in this inode. > > > > > *use spinlock to protect this shrink list. > > > > > > > > 1. How about adding a data structure with inode number instead of > > > > referring > > > > inode pointer? > > > > > > > > 2. How about managing extent entries globally and setting an upper > > > > bound to > > > > the number of extent entries instead of limiting them per each inode? > > > > (The rb-tree will handle many extents per inode.) > > > > > > [assumption] > > > Approach A: global lru list; > > > Approach B: lru list per inode. > > > > > > I have considered Approach A before writing the draft, although Approach > > > A could > > > provide us higher ratio hit due to global lru, it may make lock > > > contention more > > > intensively and has more memory overhead descripted below. So I choose > > > more > > > conservative Approach B. > > > 1) as extent_entry will split in Cow, we may lock extent_list more times > > > when > > > move these separated extent entries in extent_list, unless we have batch > > > mode > > > interface. > > > 2) lock overhead between shrinker and lookuper and updater. > > > e.g. extent_list: (LRU) A -> B -> C -> D -> E -> F -> G (MRU) > > > (#1) has A, C, E, G in rb-tree > > > (#2) has B, D, F in rb-tree > > > shrinker op: lock list -> trylock #1 -> unlock list -> free A -> unlock #1 > > > lock list -> trylock #2 -> unlock list -> free B -> unlock #2 > > > lock list -> trylock #1 -> unlock list -> free C -> unlock #1 > > > ... > > > > lock_list -> list_del(A) -> unlock_list -> lock #1 -> remove_tree(A) -> > > unlock #1 > > -> free A. > > If unlock_list before lock #1, (A) could be modified by other thread between > ->unlock_list > and lock #1, so one more "lookup_tree(A)" is needed under lock #1: > lock #1 -> lookup_tree(A) -> remove_tree(A) -> unlock #1 > > > > > Or, > > lock_list -> list_del(A, B, C) -> unlock_list -> lock #1 -> remove_tree(A, > > C) -> > > unlock #1 -> free A, C -> lock #2 -> remove_tree(B) -> unlock #2 > > ditto. > To avoid unneeded lookup_tree(A), maybe we can use this series: > > lock list -> get A from head -> trylock #1 -> try to get more node(C, E, G) > of #1 in list > -> unlock list -> remove_tree&free(A, C, E, G) -> unlock #1 > > > > > I think a couple of list operations do not cause severe lock contention. > > Yeah, batch operation is better. > But sometimes random write break extent into everywhere in shrinker's list, > so batch operation can gain less in this condition. > > > And, if we assing minimum length for extent caches, the contention would be > > distributed. > > > > This means that, it needs to manage each of all the extents in the cache > > should > > have the length larger than minimum value. > > If I do not misunderstand, what you mean is that if extent number in one > inode cache > is bigger than minimum value, then, we will start to add all extents of this > inode > into shrinker's list, and reposition them in the list when > f2fs_{lookup,update}_extent_cache? > > > > > > *trylock/unlock* for all free extent entries belong to one inode will be > > > separated > > > to lots of parts, resulting in more contention. > > > 3) it has more memory overhead in each extent entry: > > > Approach A: need add inode number for backref and list_head * > > > > It doesn't need to add ino. Just remain them, and shrinker can remove empty > > ino_entry_for_extents periodically. > > I do not understand why we don't need ino, :(. > Without ino, how can we get the rb-tree root pointer for rb_erase(node, root)?
Let me describe in more detail. For example, - global radix tree in sbi --> ino #1 --> rb_tree (rb_tree, rb_tree_lock, refcount=0) --> ino #2 --> rb_tree --> ino #3 --> rb_tree `-> extent A `-> extent B (fofs, blkaddr, length, *list_head=empty) - global extent list in sbi extent_list: (LRU) A -> B -> C -> D -> E -> F -> G (MRU) (#1) has A, C, E, G in rb-tree (#2) has B, D, F in rb-tree 1) update extents (#1) - lock_radix_tree if (notfound #1) allocate #1; #1's refcount++; - unlock_radix_tree - lock_rb_tree (#1) update A rb_delete E, if E->length is smaller than MIN_LEN. allocate & rb_add K, if K->length is larget than MIN_LEN. list_lock list_move(A) list_del(E) list_add_tail(K) list_unlock - unlock_rb_tree (#1) - #1's rafcount--; 2) shrinker - list_lock list_del A, B, C, ... - list_unlock - gang_look_up_radix_tree_with_lock { #N's refcount++; lock_rb_tree (#N) for_each_rb_node { if (list_empty(extent->list_head)) { remove_rb_node(extent); free(extent); } } unlock_rb_tree (#N) #N's refcount--; - } - for_each_radix_tree_with_lock { if (radix_node->refcount == 0 && rb_tree_is_empty) free(radix_node); - } 3) lookup extents (#1) - lock_radix_tree if (notfound #1) goto out; #1's refcount++; - unlock_radix_tree - lock_rb_tree (#1) found extent A list_lock list_move(A) list_unlock - unlock_rb_tree (#1) - #1's rafcount--; > > > > > > Approach B: need add list_head * > > > > > > Or maybe it's not a problem because we can afford all these above. > > > > > > Anyway, I'm not against. > > > > > > > > > > > 3. It needs to set a minimum length for the candidate of extent cache. > > > > (e.g., 64) > > > > > > Is this used for shrinker? Can you please explain more about this minimum > > > length? > > > > > > > > > > > So, for example, > > > > struct ino_entry_for_extents { > > > > inode number; > > > > rb_tree for extent_entry objects; > > > > rwlock; > > > > }; > > > > > > > > struct extent_entry { > > > > blkaddr, len; > > > > list_head *; > > > > > > We should add one other field 'inode number' here, as through it we can > > > get > > > rb-tree root from ino_entry_for_extents for rb_erase when we free the > > > extent > > > entry in shrinker. > > > > > > > }; > > > > > > > > Something like this. > > > > > > > > [A, B, C, ... are extent entry] > > > > > > > > The sbi has > > > > 1. an extent_list: (LRU) A -> B -> C -> D -> E -> F -> G (MRU) > > > > 2. radix_tree: ino_entry_for_extents (#10) has D, B in rb-tree > > > > ` ino_entry_for_extents (#11) has A, C in rb-tree > > > > ` ino_entry_for_extents (#12) has F in rb-tree > > > > ` ino_entry_for_extents (#13) has G, E in rb-tree > > > > > > > > In f2fs_update_extent_cache and __get_data_block for #10, > > > > ino_entry_for_extents (#10) was founded and updated D or B. > > > > Then, updated entries are moved to MRU. > > > > > > > > In f2fs_evict_inode for #11, A and C are moved to LRU. > > > > But, if this inode is unlinked, all the A, C, and ino_entry_for_extens > > > > (#11) > > > > should be released. > > > > > > > > In f2fs_balance_fs_bg, some LRU extents are released according to the > > > > amount > > > > of consumed memory. Then, it frees any ino_entry_for_extents having no > > > > extent. > > > > > > > > IMO, we don't need to consider readahead for this, since get_data_block > > > > will > > > > be called by VFS readahead. > > > > > > In get_data_block we can cache only one blkaddr once after > > > get_dnode_of_data > > > returned, it seems less efficient. So why not readahead selectively more > > > contiguous valid blkaddr in get_dnode_of_data once? > > > > Why do you think only one blkaddr? > > get_data_block searches extents as many as possible. > > Surely not one blkaddr. > > Actually, what I mean is that: > ->get_dnode_of_data > ->f2fs_ra_extent_cache() > > f2fs_ra_extent_cache(node_page, offset) { > for (i = offset; i < max; i++) { > if (is_valid() && is_contiguous()) > len++; > } > f2fs_update_extent_cache(inode, fofs, blk_addr, len) > } > > But not: > ->__get_data_block > ->get_dnode_of_data > ->f2fs_update_extent_cache(inode, fofs, blk_addr, 1); > > Thanks, > Yu > > > > > Thanks, > > > > > > > > > > > > > Furthermore, we need to think about whether LRU is really best or not. > > > > IMO, the extent cache aims to improve second access speed, rather than > > > > initial > > > > cold misses. So, maybe MRU or another algorithms would be better. > > > > > > Agree, let's rethink about this. :) > > > > > > Thanks, > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > 2) Limitation: > > > > > In one inode, as we split or add extent in extent cache when > > > > > read/write, extent > > > > > number will enlarge, so memory and CPU overhead will increase. > > > > > In order to control the overhead of memory and CPU, we try to set a > > > > > upper bound > > > > > number to limit total extent number in each inode, This number is > > > > > global > > > > > configuration which is visable to all inode. This number will be > > > > > exported to > > > > > sysfs for configuring according to requirement of user. By default, > > > > > designed > > > > > number is 8. > > > > > > > > > > 3) Shrinker: > > > > > There are two shrink paths: > > > > > a) one is triggered when extent count has exceed the upper > > > > > bound of > > > > > inode's extent cache. We will try to release extent(s) from > > > > > head of > > > > > inode's inner extent lru list until extent count is equal to > > > > > upper bound. > > > > > This operation could be in f2fs_update_extent_cache(). > > > > > b) the other one is triggered when memory util exceed > > > > > threshold, we try > > > > > get inode from head of global lru list(s), and release > > > > > extent(s) with > > > > > fixed number (by default: 64 extents) in inode one by one. > > > > > This operation could be in f2fs_balance_fs_bg(). > > > > > > > > > > 4) Revalidation: > > > > > In ->evict(), extent cache will be released, in order to reuse extent > > > > > cache > > > > > of inode when reopen for high hit ratio, a proper way is to add > > > > > cached extent > > > > > tree into a hash table (could be radix tree), revalidate extent tree > > > > > and recover > > > > > it to inode when reopen. > > > > > Besides, a global lru list is needed here for shrinker. > > > > > > > > > > 5) Readahead: > > > > > Expand extent cache by readaheading when we call get_dnode_of_data in > > > > > non-emergency > > > > > path. Note, ra can affect lock contention for both ->rwlock in extent > > > > > cache and > > > > > ->lock of global shrink list. > > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/