Re: [PATCH v4 1/2] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()
On 02/28/18 12:19, Andy Shevchenko wrote:
> On Wed, Feb 28, 2018 at 9:44 PM, Frank Rowand wrote:
>> On 02/28/18 11:31, Andy Shevchenko wrote:
>>> On Wed, Feb 28, 2018 at 9:04 PM, wrote:
>
>>> The question is why O(1) is so important? O(log(n)) wouldn't work?
>>
>> O(1) is not critical.  It was just a nice side result.
>>
>>> Using radix_tree() I suppose allows to dynamically extend or shrink
>>> the cache which would work with DT overlays.
>>
>> The memory usage of the phandle cache in this patch is fairly small.
>> The memory overhead of a radix_tree() would not be justified.
>
> OTOH the advantage I mentioned isn't a good argument?

No.  Deleting and re-creating the cache to resize it (when applying an
overlay) would be a rare event that would happen as desired by the
overlay application code.  There is no real gain by having extension or
shrinkage occur automatically, and if the overlay application code
desires the resizing it is trivial to implement (a single function call).
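The "single function call" point can be sketched in userspace: a rebuild
routine that frees the old flat cache and allocates a new one sized to the
current phandle count is all the "resize" an overlay needs.  This is a toy
model, not the kernel code; `fake_node`, `populate_cache`, and
`round_up_pow2` are illustrative stand-ins for the patch's structures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative stand-ins for the kernel structures; not the real names. */
struct fake_node { uint32_t phandle; };

static struct fake_node **cache;   /* stand-in for phandle_cache */
static uint32_t cache_mask;        /* stand-in for phandle_cache_mask */

static uint32_t round_up_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Rebuild the cache from the current set of nodes.  Applying an overlay
 * that adds phandles just calls this again: the old array is freed and a
 * new one, sized for the new phandle count, replaces it.
 */
static void populate_cache(struct fake_node *nodes, size_t n_nodes)
{
	uint32_t phandles = 0;
	uint32_t entries;
	size_t i;

	free(cache);
	cache = NULL;

	for (i = 0; i < n_nodes; i++)
		if (nodes[i].phandle)
			phandles++;

	entries = round_up_pow2(phandles);
	cache_mask = entries - 1;
	cache = calloc(entries, sizeof(*cache));
	if (!cache)
		return;

	for (i = 0; i < n_nodes; i++)
		if (nodes[i].phandle)
			cache[nodes[i].phandle & cache_mask] = &nodes[i];
}
```

Calling `populate_cache()` a second time after an "overlay" grows the node
set is the entire resize path, which matches Frank's argument that no
automatic extension or shrinkage is needed.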
Re: [PATCH v4 1/2] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()
On Wed, Feb 28, 2018 at 2:19 PM, Andy Shevchenko wrote:
> On Wed, Feb 28, 2018 at 9:44 PM, Frank Rowand wrote:
>> On 02/28/18 11:31, Andy Shevchenko wrote:
>>> On Wed, Feb 28, 2018 at 9:04 PM, wrote:
>
>>> The question is why O(1) is so important? O(log(n)) wouldn't work?
>>
>> O(1) is not critical.  It was just a nice side result.
>>
>>> Using radix_tree() I suppose allows to dynamically extend or shrink
>>> the cache which would work with DT overlays.
>>
>> The memory usage of the phandle cache in this patch is fairly small.
>> The memory overhead of a radix_tree() would not be justified.
>
> OTOH the advantage I mentioned isn't a good argument?

Yes, but still one that ignores memory usage.  I'll take whatever
solution doesn't undo this[1].

Rob

[1] https://lwn.net/Articles/735839/
Re: [PATCH v4 1/2] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()
On Wed, Feb 28, 2018 at 9:44 PM, Frank Rowand wrote:
> On 02/28/18 11:31, Andy Shevchenko wrote:
>> On Wed, Feb 28, 2018 at 9:04 PM, wrote:

>> The question is why O(1) is so important? O(log(n)) wouldn't work?
>
> O(1) is not critical.  It was just a nice side result.
>
>> Using radix_tree() I suppose allows to dynamically extend or shrink
>> the cache which would work with DT overlays.
>
> The memory usage of the phandle cache in this patch is fairly small.
> The memory overhead of a radix_tree() would not be justified.

OTOH the advantage I mentioned isn't a good argument?

-- 
With Best Regards,
Andy Shevchenko
Re: [PATCH v4 1/2] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()
On 02/28/18 11:31, Andy Shevchenko wrote:
> On Wed, Feb 28, 2018 at 9:04 PM, wrote:
>
>> Create a cache of the nodes that contain a phandle property.  Use this
>> cache to find the node for a given phandle value instead of scanning
>> the devicetree to find the node.  If the phandle value is not found
>> in the cache, of_find_node_by_phandle() will fall back to the tree
>> scan algorithm.
>>
>> The cache is initialized in of_core_init().
>>
>> The cache is freed via a late_initcall_sync() if modules are not
>> enabled.
>>
>> If the devicetree is created by the dtc compiler, with all phandle
>> property values auto generated, then the size required by the cache
>> could be 4 * (1 + number of phandles) bytes.  This results in an O(1)
>> node lookup cost for a given phandle value.  Due to a concern that the
>> phandle property values might not be consistent with what is generated
>> by the dtc compiler, a mask has been added to the cache lookup
>> algorithm.  To maintain the O(1) node lookup cost, the size of the
>> cache has been increased by rounding the number of entries up to the
>> next power of two.
>>
>> The overhead of finding the devicetree node containing a given phandle
>> value has been noted by several people in the recent past, in some
>> cases with a patch to add a hashed index of devicetree nodes, based on
>> the phandle value of the node.  One concern with this approach is the
>> extra space added to each node.  This patch takes advantage of the
>> phandle property values auto generated by the dtc compiler, which
>> begin with one and monotonically increase by one, resulting in a range
>> of 1..n for n phandle values.  This implementation should also provide
>> a good reduction of overhead for any range of phandle values that are
>> mostly in a monotonic range.
>>
>> Performance measurements by Chintan Pandya of several implementations
>> of patches that are similar to this one suggest an expected reduction
>> of boot time by ~400ms for his test system.  If the cache size was
>> decreased to 64 entries, the boot time was reduced by ~340 ms.  The
>> measurements were on a 4.9.73 kernel for
>> arch/arm64/boot/dts/qcom/sda670-mtp.dts, which contains 2371 nodes and
>> 814 phandle values.
>
> The question is why O(1) is so important? O(log(n)) wouldn't work?

O(1) is not critical.  It was just a nice side result.

> Using radix_tree() I suppose allows to dynamically extend or shrink
> the cache which would work with DT overlays.

The memory usage of the phandle cache in this patch is fairly small.
The memory overhead of a radix_tree() would not be justified.
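Frank's "fairly small" claim is easy to check against the figures quoted
in the thread: the flat cache is just one pointer per entry, with the
entry count rounded up to a power of two so the mask works.  A minimal
sketch, assuming 8-byte pointers on the arm64 test system (the helper
names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round up to the next power of two, like the kernel's
 * roundup_pow_of_two() used by the patch. */
static uint32_t round_up_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Footprint of the flat cache: one pointer per (rounded-up) entry. */
static size_t cache_bytes(uint32_t phandles, size_t ptr_size)
{
	return (size_t)round_up_pow2(phandles) * ptr_size;
}
```

For the sda670-mtp example with 814 phandle values, 814 rounds up to 1024
entries, i.e. 8 KiB with 8-byte pointers, which is why a radix tree's
per-node overhead is hard to justify here.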
Re: [PATCH v4 1/2] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()
On Wed, Feb 28, 2018 at 9:04 PM, wrote:
> Create a cache of the nodes that contain a phandle property.  Use this
> cache to find the node for a given phandle value instead of scanning
> the devicetree to find the node.  If the phandle value is not found
> in the cache, of_find_node_by_phandle() will fall back to the tree
> scan algorithm.
>
> The cache is initialized in of_core_init().
>
> The cache is freed via a late_initcall_sync() if modules are not
> enabled.
>
> If the devicetree is created by the dtc compiler, with all phandle
> property values auto generated, then the size required by the cache
> could be 4 * (1 + number of phandles) bytes.  This results in an O(1)
> node lookup cost for a given phandle value.  Due to a concern that the
> phandle property values might not be consistent with what is generated
> by the dtc compiler, a mask has been added to the cache lookup
> algorithm.  To maintain the O(1) node lookup cost, the size of the
> cache has been increased by rounding the number of entries up to the
> next power of two.
>
> The overhead of finding the devicetree node containing a given phandle
> value has been noted by several people in the recent past, in some
> cases with a patch to add a hashed index of devicetree nodes, based on
> the phandle value of the node.  One concern with this approach is the
> extra space added to each node.  This patch takes advantage of the
> phandle property values auto generated by the dtc compiler, which
> begin with one and monotonically increase by one, resulting in a range
> of 1..n for n phandle values.  This implementation should also provide
> a good reduction of overhead for any range of phandle values that are
> mostly in a monotonic range.
>
> Performance measurements by Chintan Pandya of several implementations
> of patches that are similar to this one suggest an expected reduction
> of boot time by ~400ms for his test system.  If the cache size was
> decreased to 64 entries, the boot time was reduced by ~340 ms.  The
> measurements were on a 4.9.73 kernel for
> arch/arm64/boot/dts/qcom/sda670-mtp.dts, which contains 2371 nodes and
> 814 phandle values.

The question is why O(1) is so important? O(log(n)) wouldn't work?

Using radix_tree() I suppose allows to dynamically extend or shrink
the cache which would work with DT overlays.

-- 
With Best Regards,
Andy Shevchenko
[PATCH v4 1/2] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()
From: Frank Rowand

Create a cache of the nodes that contain a phandle property.  Use this
cache to find the node for a given phandle value instead of scanning
the devicetree to find the node.  If the phandle value is not found
in the cache, of_find_node_by_phandle() will fall back to the tree
scan algorithm.

The cache is initialized in of_core_init().

The cache is freed via a late_initcall_sync() if modules are not
enabled.

If the devicetree is created by the dtc compiler, with all phandle
property values auto generated, then the size required by the cache
could be 4 * (1 + number of phandles) bytes.  This results in an O(1)
node lookup cost for a given phandle value.  Due to a concern that the
phandle property values might not be consistent with what is generated
by the dtc compiler, a mask has been added to the cache lookup
algorithm.  To maintain the O(1) node lookup cost, the size of the
cache has been increased by rounding the number of entries up to the
next power of two.

The overhead of finding the devicetree node containing a given phandle
value has been noted by several people in the recent past, in some
cases with a patch to add a hashed index of devicetree nodes, based on
the phandle value of the node.  One concern with this approach is the
extra space added to each node.  This patch takes advantage of the
phandle property values auto generated by the dtc compiler, which
begin with one and monotonically increase by one, resulting in a range
of 1..n for n phandle values.  This implementation should also provide
a good reduction of overhead for any range of phandle values that are
mostly in a monotonic range.

Performance measurements by Chintan Pandya of several implementations
of patches that are similar to this one suggest an expected reduction
of boot time by ~400ms for his test system.  If the cache size was
decreased to 64 entries, the boot time was reduced by ~340 ms.  The
measurements were on a 4.9.73 kernel for
arch/arm64/boot/dts/qcom/sda670-mtp.dts, which contains 2371 nodes and
814 phandle values.

Reported-by: Chintan Pandya
Signed-off-by: Frank Rowand
---

Changes since v3:
  - of_populate_phandle_cache(): add check for failed memory allocation

Changes since v2:
  - add mask to calculation of phandle cache entry, which results in
    better overhead reduction for devicetrees with phandle properties
    not allocated in the monotonically increasing range of 1..n
  - due to mask, number of entries in cache potentially increased to
    next power of two
  - minor fixes as suggested by reviewers
  - no longer using live_tree_max_phandle() so do not move it from
    drivers/of/resolver.c to drivers/of/base.c

Changes since v1:
  - change short description from of: cache phandle nodes to reduce
    cost of of_find_node_by_phandle()
  - rebase on v4.16-rc1
  - reorder new functions in base.c to avoid forward declaration
  - add locking around kfree(phandle_cache) for memory ordering
  - add explicit check for non-null of phandle_cache in
    of_find_node_by_phandle().  There is already a check for !handle,
    which prevents accessing a null phandle_cache, but that dependency
    is not obvious, so this check makes it more apparent.
  - do not free phandle_cache if modules are enabled, so that cached
    phandles will be available when modules are loaded

 drivers/of/base.c       | 86 ++++++++++++++++++++++++++++++++++++++---
 drivers/of/of_private.h |  3 ++
 drivers/of/resolver.c   |  3 --
 3 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index ad28de96e13f..e71d157d7149 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -91,10 +91,72 @@ int __weak of_node_to_nid(struct device_node *np)
 }
 #endif
 
+static struct device_node **phandle_cache;
+static u32 phandle_cache_mask;
+
+/*
+ * Assumptions behind phandle_cache implementation:
+ *   - phandle property values are in a contiguous range of 1..n
+ *
+ * If the assumptions do not hold, then
+ *   - the phandle lookup overhead reduction provided by the cache
+ *     will likely be less
+ */
+static void of_populate_phandle_cache(void)
+{
+	unsigned long flags;
+	u32 cache_entries;
+	struct device_node *np;
+	u32 phandles = 0;
+
+	raw_spin_lock_irqsave(&devtree_lock, flags);
+
+	kfree(phandle_cache);
+	phandle_cache = NULL;
+
+	for_each_of_allnodes(np)
+		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
+			phandles++;
+
+	cache_entries = roundup_pow_of_two(phandles);
+	phandle_cache_mask = cache_entries - 1;
+
+	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
+				GFP_ATOMIC);
+	if (!phandle_cache)
+		goto out;
+
+	for_each_of_allnodes(np)
+		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
+			phandle_cache[np->phandle & phandle_cache_mask] = np;
+
+out:
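The lookup side of the patch (truncated above) can be modeled in
userspace: try the masked cache slot first, and on a miss (empty slot, or
a colliding entry for a different phandle) fall back to a linear scan, as
the commit message says of_find_node_by_phandle() does.  This is a
hedged toy model; `toy_node` and `find_by_phandle` are illustrative
names, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_node { uint32_t phandle; };

/*
 * Try the direct-mapped cache slot first; if the slot is empty or holds
 * a node with a different (colliding) phandle, fall back to scanning
 * the full node list, mimicking the patch's fallback to the tree scan.
 */
static struct toy_node *find_by_phandle(struct toy_node **cache,
					uint32_t mask,
					struct toy_node *nodes, size_t n,
					uint32_t phandle)
{
	struct toy_node *np;
	size_t i;

	if (!phandle)
		return NULL;

	if (cache) {
		np = cache[phandle & mask];
		if (np && np->phandle == phandle)
			return np;		/* O(1) cache hit */
	}

	for (i = 0; i < n; i++)			/* fallback: full scan */
		if (nodes[i].phandle == phandle)
			return &nodes[i];
	return NULL;
}
```

With dtc-generated phandles 1..n and a power-of-two cache, the mask never
aliases two live phandles, so every lookup is a hit; with arbitrary
phandle values, collisions only cost the fallback scan, never a wrong
answer.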