On Fri, Nov 18, 2016 at 7:31 PM, Matthew Wilcox <mawil...@microsoft.com> wrote:
> From: Konstantin Khlebnikov [mailto:koc...@gmail.com]
>> On Thu, Nov 17, 2016 at 3:17 AM, Matthew Wilcox
>> <mawil...@linuxonhyperv.com> wrote:
>> > This fixes several interlinked problems with the iterators in the
>> > presence of multiorder entries.
>> >
>> > 1. radix_tree_iter_next() would only advance by one slot, which would
>> > result in the iterators returning the same entry more than once if there
>> > were sibling entries.
>>
>> Is this a problem? Do we have users who cannot evaluate the length of an
>> entry by looking at its head?
>
> At the moment we have no users in tree :-)  The two users I know of are the 
> page cache and DAX.  The page cache stores a pointer to a struct page, which 
> has compound_order() to tell you the size.  DAX uses a couple of bits in the 
> radix tree entry to describe whether this is a PTE/PMD/PUD and so also knows 
> the size of the entry that it found.  We also store swap cache entries in the 
> same radix tree (tagged exceptional).  These currently have no information in 
> them to describe their size because each one represents only one page.  The 
> latest patchset to support swapping huge pages inserts 512 entries into the 
> radix tree instead of taking advantage of the multiorder entry 
> infrastructure.  Hopefully that gets fixed soon, but it will require stealing 
> a bit from either the number of swap files allowed or from the maximum size 
> of each swap file (currently 32/128GB).
>
> I think what you're suggesting is that we introduce a new API:
>
>  slot = radix_tree_iter_save(&iter, order);
>
> where the caller tells us the order of the entry it just consumed.  Or maybe 
> you're suggesting
>
>  slot = radix_tree_iter_advance(&iter, newindex)

Yes, something like that.
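
Something along these lines, perhaps. Just a sketch to pin down what I
mean; the name and exact semantics are made up, and it only reuses the
iterator fields your radix_tree_iter_save() below touches:

static inline void **__must_check
radix_tree_iter_advance(struct radix_tree_iter *iter, unsigned long newindex)
{
	/* restart the next chunk lookup at @newindex */
	iter->next_index = newindex;
	iter->index = 0;
	iter->tags = 0;
	return NULL;
}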

>
> which would allow us to skip to any index.  Although ... isn't that just 
> radix_tree_iter_init()?

The iterator could keep a pointer to the current node and reuse it for the
next iteration when possible.
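
Purely as an illustration (no such field exists in this patch), the
iterator could grow a cached node pointer next to roughly today's fields:

struct radix_tree_iter {
	unsigned long		index;
	unsigned long		next_index;
	unsigned long		tags;
	struct radix_tree_node	*node;	/* node of the current chunk,
					 * reused while still valid */
#ifdef CONFIG_RADIX_TREE_MULTIORDER
	unsigned int		shift;
#endif
};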

>
> It does push a bit of complexity onto the callers.  We have 7 callers of 
> radix_tree_iter_next() in my current tree (after applying this patch set, so 
> range_tag_if_tagged and locate_item have been pushed into their callers): 
> btrfs, khugepaged, page-writeback and shmem.  btrfs knows its objects occupy 
> one slot.  khugepaged knows that its page is order 0 at the time it calls 
> radix_tree_iter_next().  Page-writeback has a struct page and can simply use 
> compound_order().  It's shmem where things get sticky, although it's all 
> solvable with some temporary variables.
>

Users who work only with single-slot entries don't get any complications;
everyone else already manages these multiorder entries somehow, as the
helper sketched below and your shmem hunks show.
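
Something like this hypothetical helper is all it takes for a caller that
mixes pages and exceptional entries (it just restates your shmem hunks):

static unsigned int shmem_entry_order(struct page *page)
{
	if (!page || radix_tree_exception(page))
		return 0;	/* empty slot, retry entry or swap entry */
	return compound_order(page);
}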

> MM people, what do you think of this patch?  It's on top of my current IDR 
> tree, although I'd fold it into patch 20 of the series if it's acceptable.
>
> diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
> index 6d3457a..7f864ec 100644
> --- a/fs/btrfs/tests/btrfs-tests.c
> +++ b/fs/btrfs/tests/btrfs-tests.c
> @@ -162,7 +162,7 @@ void btrfs_free_dummy_fs_info(struct btrfs_fs_info *fs_info)
>                                 slot = radix_tree_iter_retry(&iter);
>                         continue;
>                 }
> -               slot = radix_tree_iter_next(slot, &iter);
> +               slot = radix_tree_iter_save(&iter, 0);
>                 spin_unlock(&fs_info->buffer_lock);
>                 free_extent_buffer_stale(eb);
>                 spin_lock(&fs_info->buffer_lock);
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 4e42d4d..4419325 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -421,15 +421,22 @@ __radix_tree_iter_add(struct radix_tree_iter *iter, unsigned long slots)
>  }
>
>  /**
> - * radix_tree_iter_next - resume iterating when the chunk may be invalid
> - * @iter:      iterator state
> + * radix_tree_iter_save - resume iterating when the chunk may be invalid
> + * @iter: iterator state
> + * @order: order of the entry that was just processed
>   *
> - * If the iterator needs to release then reacquire a lock, the chunk may
> - * have been invalidated by an insertion or deletion.  Call this function
> + * If the iteration needs to release then reacquire a lock, the chunk may
> + * be invalidated by an insertion or deletion.  Call this function
>   * before releasing the lock to continue the iteration from the next index.
>   */
> -void **__must_check radix_tree_iter_next(void **slot,
> -                                       struct radix_tree_iter *iter);
> +static inline void **__must_check
> +radix_tree_iter_save(struct radix_tree_iter *iter, unsigned order)
> +{
> +       iter->next_index = round_up(iter->index, 1 << order);
> +       iter->index = 0;
> +       iter->tags = 0;
> +       return NULL;
> +}
>
>  /**
>   * radix_tree_chunk_size - get current chunk size
> @@ -467,7 +474,7 @@ static inline void ** __radix_tree_next_slot(void **slot,
>   * For tagged lookup it also eats @iter->tags.
>   *
>   * There are several cases where 'slot' can be passed in as NULL to this
> - * function.  These cases result from the use of radix_tree_iter_next() or
> + * function.  These cases result from the use of radix_tree_iter_save() or
>   * radix_tree_iter_retry().  In these cases we don't end up dereferencing
>   * 'slot' because either:
>   * a) we are doing tagged iteration and iter->tags has been set to 0, or
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 7e2469b..fcbadad 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -1245,6 +1245,7 @@ static inline void __set_iter_shift(struct radix_tree_iter *iter,
>  #endif
>  }
>
> +#ifdef CONFIG_RADIX_TREE_MULTIORDER
>  static void ** __radix_tree_iter_next(struct radix_tree_node **nodep,
>                         void **slot, struct radix_tree_iter *iter)
>  {
> @@ -1263,7 +1264,6 @@ static void ** __radix_tree_iter_next(struct radix_tree_node **nodep,
>         return NULL;
>  }
>
> -#ifdef CONFIG_RADIX_TREE_MULTIORDER
>  void ** __radix_tree_next_slot(void **slot, struct radix_tree_iter *iter,
>                                         unsigned flags)
>  {
> @@ -1321,20 +1321,6 @@ void ** __radix_tree_next_slot(void **slot, struct radix_tree_iter *iter,
>  EXPORT_SYMBOL(__radix_tree_next_slot);
>  #endif
>
> -void **radix_tree_iter_next(void **slot, struct radix_tree_iter *iter)
> -{
> -       struct radix_tree_node *node;
> -
> -       slot++;
> -       iter->index = __radix_tree_iter_add(iter, 1);
> -       node = rcu_dereference_raw(*slot);
> -       __radix_tree_iter_next(&node, slot, iter);
> -       iter->next_index = iter->index;
> -       iter->tags = 0;
> -       return NULL;
> -}
> -EXPORT_SYMBOL(radix_tree_iter_next);
> -
>  /**
>   * radix_tree_next_chunk - find next chunk of slots for iteration
>   *
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 46155d1..54446e6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1614,7 +1614,7 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
>                 present++;
>
>                 if (need_resched()) {
> -                       slot = radix_tree_iter_next(slot, &iter);
> +                       slot = radix_tree_iter_save(&iter, 0);
>                         cond_resched_rcu();
>                 }
>         }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index c593715..7d6b870 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2119,7 +2119,7 @@ void tag_pages_for_writeback(struct address_space *mapping,
>                 tagged++;
>                 if ((tagged % WRITEBACK_TAG_BATCH) != 0)
>                         continue;
> -               slot = radix_tree_iter_next(slot, &iter);
> +               slot = radix_tree_iter_save(&iter, compound_order(*slot));
>                 spin_unlock_irq(&mapping->tree_lock);
>                 cond_resched();
>                 spin_lock_irq(&mapping->tree_lock);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8f9c9aa..3f2d07a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -644,6 +644,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>         rcu_read_lock();
>
>         radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> +               unsigned int order = 0;
>                 if (iter.index >= end)
>                         break;
>
> @@ -656,9 +657,11 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
>
>                 if (radix_tree_exceptional_entry(page))
>                         swapped++;
> +               else
> +                       order = compound_order(page);
>
>                 if (need_resched()) {
> -                       slot = radix_tree_iter_next(slot, &iter);
> +                       slot = radix_tree_iter_save(&iter, order);
>                         cond_resched_rcu();
>                 }
>         }
> @@ -1062,7 +1065,7 @@ static unsigned long find_swap_entry(struct radix_tree_root *root, void *item)
>                 checked++;
>                 if ((checked % 4096) != 0)
>                         continue;
> -               slot = radix_tree_iter_next(slot, &iter);
> +               slot = radix_tree_iter_save(&iter, 0);
>                 cond_resched_rcu();
>         }
>
> @@ -2444,21 +2447,25 @@ static void shmem_tag_pins(struct address_space *mapping)
>         rcu_read_lock();
>
>         radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
> +               unsigned int order = 0;
>                 page = radix_tree_deref_slot(slot);
>                 if (!page || radix_tree_exception(page)) {
>                         if (radix_tree_deref_retry(page)) {
>                                 slot = radix_tree_iter_retry(&iter);
>                                 continue;
>                         }
> -               } else if (page_count(page) - page_mapcount(page) > 1) {
> -                       spin_lock_irq(&mapping->tree_lock);
> -                       radix_tree_tag_set(&mapping->page_tree, iter.index,
> -                                          SHMEM_TAG_PINNED);
> -                       spin_unlock_irq(&mapping->tree_lock);
> +               } else {
> +                       order = compound_order(page);
> +                       if (page_count(page) - page_mapcount(page) > 1) {
> +                               spin_lock_irq(&mapping->tree_lock);
> +                               radix_tree_tag_set(&mapping->page_tree,
> +                                               iter.index, SHMEM_TAG_PINNED);
> +                               spin_unlock_irq(&mapping->tree_lock);
> +                       }
>                 }
>
>                 if (need_resched()) {
> -                       slot = radix_tree_iter_next(slot, &iter);
> +                       slot = radix_tree_iter_save(&iter, order);
>                         cond_resched_rcu();
>                 }
>         }
> @@ -2528,7 +2535,10 @@ static int shmem_wait_for_pins(struct address_space *mapping)
>                         spin_unlock_irq(&mapping->tree_lock);
>  continue_resched:
>                         if (need_resched()) {
> -                               slot = radix_tree_iter_next(slot, &iter);
> +                               unsigned int order = 0;
> +                               if (page)
> +                                       order = compound_order(page);
> +                               slot = radix_tree_iter_save(&iter, order);
>                                 cond_resched_rcu();
>                         }
>                 }
