Re: [PATCH -next] tools/testing/nvdimm: Make symbol '__nfit_test_ioremap' static
Hi Zou, Zou Wei writes: > The sparse tool complains as follows: > > tools/testing/nvdimm/test/iomap.c:65:14: warning: > symbol '__nfit_test_ioremap' was not declared. Should it be static? > > This symbol is not used outside of security.c, so this s/security.c/iomap.c/ Thanks, Santosh > commit marks it static. > > Reported-by: Hulk Robot > Signed-off-by: Zou Wei > --- > tools/testing/nvdimm/test/iomap.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tools/testing/nvdimm/test/iomap.c > b/tools/testing/nvdimm/test/iomap.c > index c62d372..ed563bd 100644 > --- a/tools/testing/nvdimm/test/iomap.c > +++ b/tools/testing/nvdimm/test/iomap.c > @@ -62,7 +62,7 @@ struct nfit_test_resource *get_nfit_res(resource_size_t > resource) > } > EXPORT_SYMBOL(get_nfit_res); > > -void __iomem *__nfit_test_ioremap(resource_size_t offset, unsigned long size, > +static void __iomem *__nfit_test_ioremap(resource_size_t offset, unsigned > long size, > void __iomem *(*fallback_fn)(resource_size_t, unsigned long)) > { > struct nfit_test_resource *nfit_res = get_nfit_res(offset); > -- > 2.6.2 > ___ > Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org > To unsubscribe send an email to linux-nvdimm-le...@lists.01.org ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
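For readers unfamiliar with this class of sparse warning, here is a minimal stand-alone sketch of the pattern being applied (hypothetical file and function names, not the actual iomap.c code): sparse flags any externally visible function that has no declaration in a header, and giving the symbol internal linkage with 'static' both documents that it is file-local and silences the warning.

/* util.c - hypothetical example, not from the kernel tree */
#include <stdio.h>

/* Used only inside this file.  Without 'static', sparse would report:
 *   warning: symbol 'local_helper' was not declared. Should it be static?
 */
static int local_helper(int x)
{
	return x * 2;
}

int main(void)
{
	printf("%d\n", local_helper(21));
	return 0;
}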
Re: Re: [PATCH] libnvdimm.h: Remove duplicate struct declaration
>On Mon, Apr 19, 2021 at 07:27:25PM +0800, Wan Jiabing wrote:>> struct device >is declared at 133rd line. >> The declaration here is unnecessary. Remove it. >> >> Signed-off-by: Wan Jiabing >> --- >> include/linux/libnvdimm.h | 1 - >> 1 file changed, 1 deletion(-) >> >> diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h >> index 01f251b6e36c..89b69e645ac7 100644 >> --- a/include/linux/libnvdimm.h >> +++ b/include/linux/libnvdimm.h >> @@ -141,7 +141,6 @@ static inline void __iomem *devm_nvdimm_ioremap(struct >> device *dev, >> >> struct nvdimm_bus; >> struct module; >> -struct device; >> struct nd_blk_region; > >What is the coding style preference for pre-declarations like this? Should >they be placed at the top of the file? > >The patch is reasonable but if the intent is to declare right before use for >clarity, both devm_nvdimm_memremap() and nd_blk_region_desc() use struct >device. So perhaps this duplicate is on purpose? > >Ira OK, my script just caught this duplicate. It reports a duplicate whenever there is no macro dependence involved. I hadn't considered that the duplicate might be kept on purpose as a hint. Sorry, and thanks for your reply. Wan Jiabing >> struct nd_blk_region_desc { >> int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev); >> -- >> 2.25.1 >> ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
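As background to the style question above, a small hypothetical header (no relation to libnvdimm.h) shows both conventions side by side: a forward declaration of an incomplete struct type may sit once near the top, or be repeated immediately before the code that uses it purely as a readability hint; the compiler accepts either, and the duplicate is harmless.

/* example.h - hypothetical header, for illustration only */
#ifndef EXAMPLE_H
#define EXAMPLE_H

struct device;				/* forward declaration near the top */

static inline int example_enable(struct device *dev)
{
	return dev ? 0 : -1;		/* only the pointer type is needed */
}

/* Repeating 'struct device;' here would be legal but redundant; some
 * authors keep the extra declaration so it sits right next to the
 * structure that uses it. */
struct example_desc {
	int (*enable)(struct device *dev);
};

#endif /* EXAMPLE_H */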
RE: [PATCH v1 1/1] libnvdimm: Don't use GUID APIs against raw buffer
+Bob and Rafael > -Original Message- > From: Dan Williams > Sent: Friday, April 16, 2021 3:09 PM > To: Andy Shevchenko > Cc: linux-nvdimm ; Linux Kernel Mailing List > ; Verma, Vishal L > ; Jiang, Dave ; Weiny, Ira > ; Kaneda, Erik > Subject: Re: [PATCH v1 1/1] libnvdimm: Don't use GUID APIs against raw > buffer > > On Fri, Apr 16, 2021 at 1:42 PM Andy Shevchenko > wrote: > > > > On Fri, Apr 16, 2021 at 01:08:06PM -0700, Dan Williams wrote: > > > [ add Erik ] > > > > > > On Fri, Apr 16, 2021 at 10:36 AM Andy Shevchenko > > > wrote: > > > > > > > > On Thu, Apr 15, 2021 at 05:37:54PM +0300, Andy Shevchenko wrote: > > > > > Strictly speaking the comparison between guid_t and raw buffer > > > > > is not correct. Return to plain memcmp() since the data structures > > > > > haven't changed to use uuid_t / guid_t the current state of affairs > > > > > is inconsistent. Either it should be changed altogether or left > > > > > as is. > > > > > > > > Dan, please review this one as well. I think here you may agree with me. > > > > > > You know, this is all a problem because ACPICA is using a raw buffer. > > > > And this is fine. It might be any other representation as well. > > > > > Erik, would it be possible to use the guid_t type in ACPICA? That > > > would allow NFIT to drop some ugly casts. > > > > guid_t is internal kernel type. If we ever decide to deviate from the > > current > > representation it wouldn't be possible in case a 3rd party is using it 1:1 > > (via typedef or so). > Hi Dan, > I'm thinking something like ACPICA defining that space as a union with > the correct typing just for Linux. I'm assuming that you mean using something like the guid_t type for ACPI data table fields like NFIT rather than for objects returned by ACPI namespace objects such as _DSD. For ACPI data tables, we could use macros so that different operating systems can provide their own definitions for a guid. For ACPICA, it will expand to a 16-byte array. Linux can have its own definition that contains a union or its own guid type (guid_t). As long as the OS-supplied definition is 16 bytes, I don't think there would be an issue. Bob, do you have any thoughts on this? Erik ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
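A rough sketch of the approach Erik describes, with entirely made-up names (nothing below exists in ACPICA or Linux): the table layout keeps a fixed 16-byte GUID field, while a build-time hook lets each operating system substitute its own typed representation, such as Linux's guid_t, provided the size stays exactly 16 bytes.

/* guid_size_example.c - hypothetical sketch only */
#include <assert.h>
#include <string.h>

/* An OS may define ACPI_OS_GUID_TYPE to its own 16-byte GUID type
 * (Linux could plug in guid_t here); the default stays a raw byte array
 * wrapped in a struct, which is effectively what ACPICA uses today. */
#ifdef ACPI_OS_GUID_TYPE
typedef ACPI_OS_GUID_TYPE acpi_table_guid;
#else
typedef struct { unsigned char b[16]; } acpi_table_guid;
#endif

/* The table layout is identical either way. */
struct example_table_entry {
	acpi_table_guid range_guid;	/* must remain exactly 16 bytes */
};

int main(void)
{
	/* The whole scheme only holds if every OS honors the size rule. */
	static_assert(sizeof(acpi_table_guid) == 16, "GUID must be 16 bytes");

	struct example_table_entry entry;
	memset(&entry, 0, sizeof(entry));
	return 0;
}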
[PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()
As of now put_unlocked_entry() always wakes up next waiter. In next patches we want to wake up all waiters at one callsite. Hence, add a parameter to the function. This patch does not introduce any change of behavior. Suggested-by: Dan Williams Signed-off-by: Vivek Goyal --- fs/dax.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index 00978d0838b1..f19d76a6a493 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -275,11 +275,12 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry) finish_wait(wq, &ewait.wait); } -static void put_unlocked_entry(struct xa_state *xas, void *entry) +static void put_unlocked_entry(struct xa_state *xas, void *entry, + enum dax_entry_wake_mode mode) { /* If we were the only waiter woken, wake the next one */ if (entry && !dax_is_conflict(entry)) - dax_wake_entry(xas, entry, WAKE_NEXT); + dax_wake_entry(xas, entry, mode); } /* @@ -633,7 +634,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping, entry = get_unlocked_entry(&xas, 0); if (entry) page = dax_busy_page(entry); - put_unlocked_entry(&xas, entry); + put_unlocked_entry(&xas, entry, WAKE_NEXT); if (page) break; if (++scanned % XA_CHECK_SCHED) @@ -675,7 +676,7 @@ static int __dax_invalidate_entry(struct address_space *mapping, mapping->nrexceptional--; ret = 1; out: - put_unlocked_entry(&xas, entry); + put_unlocked_entry(&xas, entry, WAKE_NEXT); xas_unlock_irq(&xas); return ret; } @@ -954,7 +955,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, return ret; put_unlocked: - put_unlocked_entry(xas, entry); + put_unlocked_entry(xas, entry, WAKE_NEXT); return ret; } @@ -1695,7 +1696,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order) /* Did we race with someone splitting entry or so? */ if (!entry || dax_is_conflict(entry) || (order == 0 && !dax_is_pte_entry(entry))) { - put_unlocked_entry(&xas, entry); + put_unlocked_entry(&xas, entry, WAKE_NEXT); xas_unlock_irq(&xas); trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf, VM_FAULT_NOPAGE); -- 2.25.4 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH v3 1/3] dax: Add an enum for specifying dax wakeup mode
Dan mentioned that he is not very fond of passing around a boolean true/false to specify if only next waiter should be woken up or all waiters should be woken up. He instead prefers that we introduce an enum and make it very explicity at the callsite itself. Easier to read code. This patch should not introduce any change of behavior. Suggested-by: Dan Williams Signed-off-by: Vivek Goyal --- fs/dax.c | 23 +-- 1 file changed, 17 insertions(+), 6 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index b3d27fdc6775..00978d0838b1 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -144,6 +144,16 @@ struct wait_exceptional_entry_queue { struct exceptional_entry_key key; }; +/** + * enum dax_entry_wake_mode: waitqueue wakeup toggle + * @WAKE_NEXT: entry was not mutated + * @WAKE_ALL: entry was invalidated, or resized + */ +enum dax_entry_wake_mode { + WAKE_NEXT, + WAKE_ALL, +}; + static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas, void *entry, struct exceptional_entry_key *key) { @@ -182,7 +192,8 @@ static int wake_exceptional_entry_func(wait_queue_entry_t *wait, * The important information it's conveying is whether the entry at * this index used to be a PMD entry. */ -static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all) +static void dax_wake_entry(struct xa_state *xas, void *entry, + enum dax_entry_wake_mode mode) { struct exceptional_entry_key key; wait_queue_head_t *wq; @@ -196,7 +207,7 @@ static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all) * must be in the waitqueue and the following check will see them. */ if (waitqueue_active(wq)) - __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key); + __wake_up(wq, TASK_NORMAL, mode == WAKE_ALL ? 0 : 1, &key); } /* @@ -268,7 +279,7 @@ static void put_unlocked_entry(struct xa_state *xas, void *entry) { /* If we were the only waiter woken, wake the next one */ if (entry && !dax_is_conflict(entry)) - dax_wake_entry(xas, entry, false); + dax_wake_entry(xas, entry, WAKE_NEXT); } /* @@ -286,7 +297,7 @@ static void dax_unlock_entry(struct xa_state *xas, void *entry) old = xas_store(xas, entry); xas_unlock_irq(xas); BUG_ON(!dax_is_locked(old)); - dax_wake_entry(xas, entry, false); + dax_wake_entry(xas, entry, WAKE_NEXT); } /* @@ -524,7 +535,7 @@ static void *grab_mapping_entry(struct xa_state *xas, dax_disassociate_entry(entry, mapping, false); xas_store(xas, NULL); /* undo the PMD join */ - dax_wake_entry(xas, entry, true); + dax_wake_entry(xas, entry, WAKE_ALL); mapping->nrexceptional--; entry = NULL; xas_set(xas, index); @@ -937,7 +948,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev, xas_lock_irq(xas); xas_store(xas, entry); xas_clear_mark(xas, PAGECACHE_TAG_DIRTY); - dax_wake_entry(xas, entry, false); + dax_wake_entry(xas, entry, WAKE_NEXT); trace_dax_writeback_one(mapping->host, index, count); return ret; -- 2.25.4 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
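The readability argument is easiest to see in isolation; the following stand-alone program (hypothetical names, not kernel code) mirrors the change above: an enum-typed mode parameter names the intent at the call site, where a bare bool does not.

/* wake_mode_demo.c - hypothetical illustration of the same pattern */
#include <stdio.h>

enum wake_mode { WAKE_NEXT, WAKE_ALL };

static void wake_waiters(enum wake_mode mode)
{
	printf("waking %s\n", mode == WAKE_ALL ? "all waiters" : "one waiter");
}

int main(void)
{
	/* Compare with a bool API: wake_waiters(false) says nothing about
	 * intent, while the enum spells it out. */
	wake_waiters(WAKE_NEXT);
	wake_waiters(WAKE_ALL);
	return 0;
}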
[PATCH v3 3/3] dax: Wake up all waiters after invalidating dax entry
I am seeing missed wakeups which ultimately lead to a deadlock when I am using virtiofs with DAX enabled and running "make -j". I had to mount virtiofs as rootfs and also reduce the dax window size to 256M to reproduce the problem consistently. So here is the problem. put_unlocked_entry() wakes up waiters only if entry is not null as well as !dax_is_conflict(entry). But if I call multiple instances of invalidate_inode_pages2() in parallel, then I can run into a situation where there are waiters on this index but nobody will wake them. invalidate_inode_pages2() invalidate_inode_pages2_range() invalidate_exceptional_entry2() dax_invalidate_mapping_entry_sync() __dax_invalidate_entry() { xas_lock_irq(&xas); entry = get_unlocked_entry(&xas, 0); ... ... dax_disassociate_entry(entry, mapping, trunc); xas_store(&xas, NULL); ... ... put_unlocked_entry(&xas, entry); xas_unlock_irq(&xas); } Say a fault is in progress and it has locked the entry at offset say "0x1c". Now say three instances of invalidate_inode_pages2() are in progress (A, B, C) and they all try to invalidate the entry at offset "0x1c". Given the dax entry is locked, all three instances A, B, C will wait in the wait queue. When the dax fault finishes, say A is woken up. It will store a NULL entry at index "0x1c" and wake up B. When B comes along it will find "entry=0" at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And this means put_unlocked_entry() will not wake up the next waiter, given the current code. And that means C continues to wait and is not woken up. This patch fixes the issue by waking up all waiters when a dax entry has been invalidated. This seems to fix the deadlock I am facing and I can make forward progress. Reported-by: Sergio Lopez Fixes: ac401cc78242 ("dax: New fault locking") Suggested-by: Dan Williams Signed-off-by: Vivek Goyal --- fs/dax.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index f19d76a6a493..cc497519be83 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -676,7 +676,7 @@ static int __dax_invalidate_entry(struct address_space *mapping, mapping->nrexceptional--; ret = 1; out: - put_unlocked_entry(&xas, entry, WAKE_NEXT); + put_unlocked_entry(&xas, entry, WAKE_ALL); xas_unlock_irq(&xas); return ret; } -- 2.25.4 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH v3 0/3] dax: Fix missed wakeup in put_unlocked_entry()
Hi, This is V3 of the patches. V2 was posted here. https://lore.kernel.org/linux-fsdevel/20210419184516.gc1472...@redhat.com/ Changes since v2: - Broke the patch down into a patch series (Dan) - Added an enum to communicate wake mode (Dan) Thanks Vivek Vivek Goyal (3): dax: Add an enum for specifying dax wakeup mode dax: Add a wakeup mode parameter to put_unlocked_entry() dax: Wake up all waiters after invalidating dax entry fs/dax.c | 34 +++--- 1 file changed, 23 insertions(+), 11 deletions(-) -- 2.25.4 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH][v2] dax: Fix missed wakeup during dax entry invalidation
On Mon, Apr 19, 2021 at 04:39:47PM -0400, Vivek Goyal wrote: > On Mon, Apr 19, 2021 at 12:48:58PM -0700, Dan Williams wrote: > > On Mon, Apr 19, 2021 at 11:45 AM Vivek Goyal wrote: > > > > > > This is V2 of the patch. Posted V1 here. > > > > > > https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/ > > > > > > Based on feedback from Dan and Jan, modified the patch to wake up > > > all waiters when dax entry is invalidated. This solves the issues > > > of missed wakeups. > > > > Care to send a formal patch with this commentary moved below the --- line? > > > > One style fixup below... > > > > > > > > I am seeing missed wakeups which ultimately lead to a deadlock when I am > > > using virtiofs with DAX enabled and running "make -j". I had to mount > > > virtiofs as rootfs and also reduce to dax window size to 256M to reproduce > > > the problem consistently. > > > > > > So here is the problem. put_unlocked_entry() wakes up waiters only > > > if entry is not null as well as !dax_is_conflict(entry). But if I > > > call multiple instances of invalidate_inode_pages2() in parallel, > > > then I can run into a situation where there are waiters on > > > this index but nobody will wait these. > > > > > > invalidate_inode_pages2() > > > invalidate_inode_pages2_range() > > > invalidate_exceptional_entry2() > > > dax_invalidate_mapping_entry_sync() > > > __dax_invalidate_entry() { > > > xas_lock_irq(&xas); > > > entry = get_unlocked_entry(&xas, 0); > > > ... > > > ... > > > dax_disassociate_entry(entry, mapping, trunc); > > > xas_store(&xas, NULL); > > > ... > > > ... > > > put_unlocked_entry(&xas, entry); > > > xas_unlock_irq(&xas); > > > } > > > > > > Say a fault in in progress and it has locked entry at offset say "0x1c". > > > Now say three instances of invalidate_inode_pages2() are in progress > > > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given > > > dax entry is locked, all tree instances A, B, C will wait in wait queue. > > > > > > When dax fault finishes, say A is woken up. It will store NULL entry > > > at index "0x1c" and wake up B. When B comes along it will find "entry=0" > > > at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And > > > this means put_unlocked_entry() will not wake up next waiter, given > > > the current code. And that means C continues to wait and is not woken > > > up. > > > > > > This patch fixes the issue by waking up all waiters when a dax entry > > > has been invalidated. This seems to fix the deadlock I am facing > > > and I can make forward progress. 
> > > > > > Reported-by: Sergio Lopez > > > Signed-off-by: Vivek Goyal > > > --- > > > fs/dax.c | 12 ++-- > > > 1 file changed, 6 insertions(+), 6 deletions(-) > > > > > > Index: redhat-linux/fs/dax.c > > > === > > > --- redhat-linux.orig/fs/dax.c 2021-04-16 14:16:44.332140543 -0400 > > > +++ redhat-linux/fs/dax.c 2021-04-19 11:24:11.465213474 -0400 > > > @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x > > > finish_wait(wq, &ewait.wait); > > > } > > > > > > -static void put_unlocked_entry(struct xa_state *xas, void *entry) > > > +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool > > > wake_all) > > > { > > > /* If we were the only waiter woken, wake the next one */ > > > if (entry && !dax_is_conflict(entry)) > > > - dax_wake_entry(xas, entry, false); > > > + dax_wake_entry(xas, entry, wake_all); > > > } > > > > > > /* > > > @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range( > > > entry = get_unlocked_entry(&xas, 0); > > > if (entry) > > > page = dax_busy_page(entry); > > > - put_unlocked_entry(&xas, entry); > > > + put_unlocked_entry(&xas, entry, false); > > > > I'm not a fan of raw true/false arguments because if you read this > > line in isolation you need to go read put_unlocked_entry() to recall > > what that argument means. So lets add something like: > > > > /** > > * enum dax_entry_wake_mode: waitqueue wakeup toggle > > * @WAKE_NEXT: entry was not mutated > > * @WAKE_ALL: entry was invalidated, or resized > > */ > > enum dax_entry_wake_mode { > > WAKE_NEXT, > > WAKE_ALL, > > } > > > > ...and use that as the arg for dax_wake_entry(). So I'd expect this to > > be a 3 patch series, introduce dax_entry_wake_mode for > > dax_wake_entry(), introduce the argument for put_unlocked_entry() > > without changing the logic, and finally this bug fix. Feel free to add > > 'Fixes: ac401cc78242 ("dax: New fault locking")' in case you feel this > > needs to be backported. > > Hi Dan, > > I will mak
Re: [PATCH][v2] dax: Fix missed wakeup during dax entry invalidation
On Mon, Apr 19, 2021 at 12:48:58PM -0700, Dan Williams wrote: > On Mon, Apr 19, 2021 at 11:45 AM Vivek Goyal wrote: > > > > This is V2 of the patch. Posted V1 here. > > > > https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/ > > > > Based on feedback from Dan and Jan, modified the patch to wake up > > all waiters when dax entry is invalidated. This solves the issues > > of missed wakeups. > > Care to send a formal patch with this commentary moved below the --- line? > > One style fixup below... > > > > > I am seeing missed wakeups which ultimately lead to a deadlock when I am > > using virtiofs with DAX enabled and running "make -j". I had to mount > > virtiofs as rootfs and also reduce to dax window size to 256M to reproduce > > the problem consistently. > > > > So here is the problem. put_unlocked_entry() wakes up waiters only > > if entry is not null as well as !dax_is_conflict(entry). But if I > > call multiple instances of invalidate_inode_pages2() in parallel, > > then I can run into a situation where there are waiters on > > this index but nobody will wait these. > > > > invalidate_inode_pages2() > > invalidate_inode_pages2_range() > > invalidate_exceptional_entry2() > > dax_invalidate_mapping_entry_sync() > > __dax_invalidate_entry() { > > xas_lock_irq(&xas); > > entry = get_unlocked_entry(&xas, 0); > > ... > > ... > > dax_disassociate_entry(entry, mapping, trunc); > > xas_store(&xas, NULL); > > ... > > ... > > put_unlocked_entry(&xas, entry); > > xas_unlock_irq(&xas); > > } > > > > Say a fault in in progress and it has locked entry at offset say "0x1c". > > Now say three instances of invalidate_inode_pages2() are in progress > > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given > > dax entry is locked, all tree instances A, B, C will wait in wait queue. > > > > When dax fault finishes, say A is woken up. It will store NULL entry > > at index "0x1c" and wake up B. When B comes along it will find "entry=0" > > at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And > > this means put_unlocked_entry() will not wake up next waiter, given > > the current code. And that means C continues to wait and is not woken > > up. > > > > This patch fixes the issue by waking up all waiters when a dax entry > > has been invalidated. This seems to fix the deadlock I am facing > > and I can make forward progress. 
> > > > Reported-by: Sergio Lopez > > Signed-off-by: Vivek Goyal > > --- > > fs/dax.c | 12 ++-- > > 1 file changed, 6 insertions(+), 6 deletions(-) > > > > Index: redhat-linux/fs/dax.c > > === > > --- redhat-linux.orig/fs/dax.c 2021-04-16 14:16:44.332140543 -0400 > > +++ redhat-linux/fs/dax.c 2021-04-19 11:24:11.465213474 -0400 > > @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x > > finish_wait(wq, &ewait.wait); > > } > > > > -static void put_unlocked_entry(struct xa_state *xas, void *entry) > > +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool > > wake_all) > > { > > /* If we were the only waiter woken, wake the next one */ > > if (entry && !dax_is_conflict(entry)) > > - dax_wake_entry(xas, entry, false); > > + dax_wake_entry(xas, entry, wake_all); > > } > > > > /* > > @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range( > > entry = get_unlocked_entry(&xas, 0); > > if (entry) > > page = dax_busy_page(entry); > > - put_unlocked_entry(&xas, entry); > > + put_unlocked_entry(&xas, entry, false); > > I'm not a fan of raw true/false arguments because if you read this > line in isolation you need to go read put_unlocked_entry() to recall > what that argument means. So lets add something like: > > /** > * enum dax_entry_wake_mode: waitqueue wakeup toggle > * @WAKE_NEXT: entry was not mutated > * @WAKE_ALL: entry was invalidated, or resized > */ > enum dax_entry_wake_mode { > WAKE_NEXT, > WAKE_ALL, > } > > ...and use that as the arg for dax_wake_entry(). So I'd expect this to > be a 3 patch series, introduce dax_entry_wake_mode for > dax_wake_entry(), introduce the argument for put_unlocked_entry() > without changing the logic, and finally this bug fix. Feel free to add > 'Fixes: ac401cc78242 ("dax: New fault locking")' in case you feel this > needs to be backported. Hi Dan, I will make changes as you suggested and post another version. I am wondering what to do with dax_wake_entry(). It also has a boolean parameter wake_all. Should that be converted as well to make use of enum dax_entry_wake_mode? Thanks Vivek ___ Linux-nvdimm
Re: [PATCH][v2] dax: Fix missed wakeup during dax entry invalidation
On Mon, Apr 19, 2021 at 11:45 AM Vivek Goyal wrote: > > This is V2 of the patch. Posted V1 here. > > https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/ > > Based on feedback from Dan and Jan, modified the patch to wake up > all waiters when dax entry is invalidated. This solves the issues > of missed wakeups. Care to send a formal patch with this commentary moved below the --- line? One style fixup below... > > I am seeing missed wakeups which ultimately lead to a deadlock when I am > using virtiofs with DAX enabled and running "make -j". I had to mount > virtiofs as rootfs and also reduce to dax window size to 256M to reproduce > the problem consistently. > > So here is the problem. put_unlocked_entry() wakes up waiters only > if entry is not null as well as !dax_is_conflict(entry). But if I > call multiple instances of invalidate_inode_pages2() in parallel, > then I can run into a situation where there are waiters on > this index but nobody will wait these. > > invalidate_inode_pages2() > invalidate_inode_pages2_range() > invalidate_exceptional_entry2() > dax_invalidate_mapping_entry_sync() > __dax_invalidate_entry() { > xas_lock_irq(&xas); > entry = get_unlocked_entry(&xas, 0); > ... > ... > dax_disassociate_entry(entry, mapping, trunc); > xas_store(&xas, NULL); > ... > ... > put_unlocked_entry(&xas, entry); > xas_unlock_irq(&xas); > } > > Say a fault in in progress and it has locked entry at offset say "0x1c". > Now say three instances of invalidate_inode_pages2() are in progress > (A, B, C) and they all try to invalidate entry at offset "0x1c". Given > dax entry is locked, all tree instances A, B, C will wait in wait queue. > > When dax fault finishes, say A is woken up. It will store NULL entry > at index "0x1c" and wake up B. When B comes along it will find "entry=0" > at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And > this means put_unlocked_entry() will not wake up next waiter, given > the current code. And that means C continues to wait and is not woken > up. > > This patch fixes the issue by waking up all waiters when a dax entry > has been invalidated. This seems to fix the deadlock I am facing > and I can make forward progress. > > Reported-by: Sergio Lopez > Signed-off-by: Vivek Goyal > --- > fs/dax.c | 12 ++-- > 1 file changed, 6 insertions(+), 6 deletions(-) > > Index: redhat-linux/fs/dax.c > === > --- redhat-linux.orig/fs/dax.c 2021-04-16 14:16:44.332140543 -0400 > +++ redhat-linux/fs/dax.c 2021-04-19 11:24:11.465213474 -0400 > @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x > finish_wait(wq, &ewait.wait); > } > > -static void put_unlocked_entry(struct xa_state *xas, void *entry) > +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool > wake_all) > { > /* If we were the only waiter woken, wake the next one */ > if (entry && !dax_is_conflict(entry)) > - dax_wake_entry(xas, entry, false); > + dax_wake_entry(xas, entry, wake_all); > } > > /* > @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range( > entry = get_unlocked_entry(&xas, 0); > if (entry) > page = dax_busy_page(entry); > - put_unlocked_entry(&xas, entry); > + put_unlocked_entry(&xas, entry, false); I'm not a fan of raw true/false arguments because if you read this line in isolation you need to go read put_unlocked_entry() to recall what that argument means. 
So lets add something like: /** * enum dax_entry_wake_mode: waitqueue wakeup toggle * @WAKE_NEXT: entry was not mutated * @WAKE_ALL: entry was invalidated, or resized */ enum dax_entry_wake_mode { WAKE_NEXT, WAKE_ALL, } ...and use that as the arg for dax_wake_entry(). So I'd expect this to be a 3 patch series, introduce dax_entry_wake_mode for dax_wake_entry(), introduce the argument for put_unlocked_entry() without changing the logic, and finally this bug fix. Feel free to add 'Fixes: ac401cc78242 ("dax: New fault locking")' in case you feel this needs to be backported. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH][v2] dax: Fix missed wakeup during dax entry invalidation
This is V2 of the patch. Posted V1 here. https://lore.kernel.org/linux-fsdevel/20210416173524.ga1379...@redhat.com/ Based on feedback from Dan and Jan, modified the patch to wake up all waiters when dax entry is invalidated. This solves the issues of missed wakeups. I am seeing missed wakeups which ultimately lead to a deadlock when I am using virtiofs with DAX enabled and running "make -j". I had to mount virtiofs as rootfs and also reduce to dax window size to 256M to reproduce the problem consistently. So here is the problem. put_unlocked_entry() wakes up waiters only if entry is not null as well as !dax_is_conflict(entry). But if I call multiple instances of invalidate_inode_pages2() in parallel, then I can run into a situation where there are waiters on this index but nobody will wait these. invalidate_inode_pages2() invalidate_inode_pages2_range() invalidate_exceptional_entry2() dax_invalidate_mapping_entry_sync() __dax_invalidate_entry() { xas_lock_irq(&xas); entry = get_unlocked_entry(&xas, 0); ... ... dax_disassociate_entry(entry, mapping, trunc); xas_store(&xas, NULL); ... ... put_unlocked_entry(&xas, entry); xas_unlock_irq(&xas); } Say a fault in in progress and it has locked entry at offset say "0x1c". Now say three instances of invalidate_inode_pages2() are in progress (A, B, C) and they all try to invalidate entry at offset "0x1c". Given dax entry is locked, all tree instances A, B, C will wait in wait queue. When dax fault finishes, say A is woken up. It will store NULL entry at index "0x1c" and wake up B. When B comes along it will find "entry=0" at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And this means put_unlocked_entry() will not wake up next waiter, given the current code. And that means C continues to wait and is not woken up. This patch fixes the issue by waking up all waiters when a dax entry has been invalidated. This seems to fix the deadlock I am facing and I can make forward progress. Reported-by: Sergio Lopez Signed-off-by: Vivek Goyal --- fs/dax.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) Index: redhat-linux/fs/dax.c === --- redhat-linux.orig/fs/dax.c 2021-04-16 14:16:44.332140543 -0400 +++ redhat-linux/fs/dax.c 2021-04-19 11:24:11.465213474 -0400 @@ -264,11 +264,11 @@ static void wait_entry_unlocked(struct x finish_wait(wq, &ewait.wait); } -static void put_unlocked_entry(struct xa_state *xas, void *entry) +static void put_unlocked_entry(struct xa_state *xas, void *entry, bool wake_all) { /* If we were the only waiter woken, wake the next one */ if (entry && !dax_is_conflict(entry)) - dax_wake_entry(xas, entry, false); + dax_wake_entry(xas, entry, wake_all); } /* @@ -622,7 +622,7 @@ struct page *dax_layout_busy_page_range( entry = get_unlocked_entry(&xas, 0); if (entry) page = dax_busy_page(entry); - put_unlocked_entry(&xas, entry); + put_unlocked_entry(&xas, entry, false); if (page) break; if (++scanned % XA_CHECK_SCHED) @@ -664,7 +664,7 @@ static int __dax_invalidate_entry(struct mapping->nrexceptional--; ret = 1; out: - put_unlocked_entry(&xas, entry); + put_unlocked_entry(&xas, entry, true); xas_unlock_irq(&xas); return ret; } @@ -943,7 +943,7 @@ static int dax_writeback_one(struct xa_s return ret; put_unlocked: - put_unlocked_entry(xas, entry); + put_unlocked_entry(xas, entry, false); return ret; } @@ -1684,7 +1684,7 @@ dax_insert_pfn_mkwrite(struct vm_fault * /* Did we race with someone splitting entry or so? 
*/ if (!entry || dax_is_conflict(entry) || (order == 0 && !dax_is_pte_entry(entry))) { - put_unlocked_entry(&xas, entry); + put_unlocked_entry(&xas, entry, false); xas_unlock_irq(&xas); trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf, VM_FAULT_NOPAGE); ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] secretmem: optimize page_is_secretmem()
On Mon, Apr 19, 2021 at 01:21:56PM +0100, Matthew Wilcox wrote: > On Mon, Apr 19, 2021 at 02:56:17PM +0300, Mike Rapoport wrote: > > > > With that fixed, you'll have a head page that you can use for testing, > > > which means you don't need to test PageCompound() (because you know the > > > page isn't PageTail), you can just test PageHead(). > > > > I can't say I follow you here. page_is_secretmem() is intended to be a > > generic test, so I don't see how it will get PageHead() if it is called > > from other places. > > static inline bool head_is_secretmem(struct page *head) > { > if (PageHead(head)) > return false; > ... > } > > static inline bool page_is_secretmem(struct page *page) > { > if (PageTail(page)) > return false; > return head_is_secretmem(page); > } > > (yes, calling it head is a misnomer, because it's not necessarily a head, > it might be a base page, but until we have a name for pages which might > be a head page or a base page, it'll have to do ...) To me this looks less clean and readable and in the end we have an extra check for PG_Head if that won't get optimized away. Does not seem to me there would be a measurable difference, but if this is important for future conversion to folio I don't mind using {head,page}_is_secretmem. -- Sincerely yours, Mike. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] libnvdimm.h: Remove duplicate struct declaration
On Mon, Apr 19, 2021 at 07:27:25PM +0800, Wan Jiabing wrote: > struct device is declared at 133rd line. > The declaration here is unnecessary. Remove it. > > Signed-off-by: Wan Jiabing > --- > include/linux/libnvdimm.h | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h > index 01f251b6e36c..89b69e645ac7 100644 > --- a/include/linux/libnvdimm.h > +++ b/include/linux/libnvdimm.h > @@ -141,7 +141,6 @@ static inline void __iomem *devm_nvdimm_ioremap(struct > device *dev, > > struct nvdimm_bus; > struct module; > -struct device; > struct nd_blk_region; What is the coding style preference for pre-declarations like this? Should they be placed at the top of the file? The patch is reasonable but if the intent is to declare right before use for clarity, both devm_nvdimm_memremap() and nd_blk_region_desc() use struct device. So perhaps this duplicate is on purpose? Ira > struct nd_blk_region_desc { > int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev); > -- > 2.25.1 > ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] secretmem: optimize page_is_secretmem()
On Mon, Apr 19, 2021 at 02:56:17PM +0300, Mike Rapoport wrote: > On Mon, Apr 19, 2021 at 12:23:02PM +0100, Matthew Wilcox wrote: > > So you're calling page_is_secretmem() on a struct page without having > > a refcount on it. That is definitely not allowed. secretmem seems to > > be full of these kinds of races; I know this isn't the first one I've > > seen in it. I don't think this patchset is ready for this merge window. > > There were races in the older version that did caching of large pages and > those were fixed then, but this is anyway irrelevant because all that code > was dropped in the latest respins. > > I don't think that the fix of the race in gup_pte_range is that significant > to wait 3 more months because of it. I have no particular interest in secretmem, but it seems that every time I come across it while looking at something else, I see these kinds of major mistakes in it. That says to me it's not ready and hasn't seen enough review. > > With that fixed, you'll have a head page that you can use for testing, > > which means you don't need to test PageCompound() (because you know the > > page isn't PageTail), you can just test PageHead(). > > I can't say I follow you here. page_is_secretmem() is intended to be a > generic test, so I don't see how it will get PageHead() if it is called > from other places. static inline bool head_is_secretmem(struct page *head) { if (PageHead(head)) return false; ... } static inline bool page_is_secretmem(struct page *page) { if (PageTail(page)) return false; return head_is_secretmem(page); } (yes, calling it head is a misnomer, because it's not necessarily a head, it might be a base page, but until we have a name for pages which might be a head page or a base page, it'll have to do ...) ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] secretmem: optimize page_is_secretmem()
On Mon, Apr 19, 2021 at 12:23:02PM +0100, Matthew Wilcox wrote: > On Mon, Apr 19, 2021 at 11:42:18AM +0300, Mike Rapoport wrote: > > The perf profile of the test indicated that the regression is caused by > > page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): > > Uhh ... you're calling it in the wrong place! > > VM_BUG_ON(!pfn_valid(pte_pfn(pte))); > page = pte_page(pte); > > if (page_is_secretmem(page)) > goto pte_unmap; > > head = try_grab_compound_head(page, 1, flags); > if (!head) > goto pte_unmap; > > So you're calling page_is_secretmem() on a struct page without having > a refcount on it. That is definitely not allowed. secretmem seems to > be full of these kinds of races; I know this isn't the first one I've > seen in it. I don't think this patchset is ready for this merge window. There were races in the older version that did caching of large pages and those were fixed then, but this is anyway irrelevant because all that code was dropped in the latest respins. I don't think that the fix of the race in gup_pte_range is that significant to wait 3 more months because of it. > With that fixed, you'll have a head page that you can use for testing, > which means you don't need to test PageCompound() (because you know the > page isn't PageTail), you can just test PageHead(). I can't say I follow you here. page_is_secretmem() is intended to be a generic test, so I don't see how it will get PageHead() if it is called from other places. -- Sincerely yours, Mike. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] secretmem: optimize page_is_secretmem()
On Mon, Apr 19, 2021 at 12:36:19PM +0300, Mike Rapoport wrote: > Well, most if the -4.2% of the performance regression kbuild reported were > due to repeated compount_head(page) in page_mapping(). So the whole point > of this patch is to avoid calling page_mapping(). It's quite ludicrous how many times we call compound_head() in page_mapping() today: page = compound_head(page); if (__builtin_expect(!!(PageSlab(page)), 0)) if (__builtin_expect(!!(PageSwapCache(page)), 0)) { TESTPAGEFLAG(Slab, slab, PF_NO_TAIL) expands to: static __always_inline int PageSlab(struct page *page) { PF_POISONED_CHECK(compound_head(page)); return test_bit(PG_slab, &compound_head(page)); } static __always_inline int PageSwapCache(struct page *page) { page = compound_head(page); return PageSwapBacked(page) && test_bit(PG_swapcache, &page->flags); } but then! TESTPAGEFLAG(SwapBacked, swapbacked, PF_NO_TAIL) also expands like Slab does. So that's six calls to compound_head(), depending what Kconfig options you have enabled. And folio_mapping() is one of the functions I add in the first batch of patches, so review, etc will be helpful. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[ndctl PATCH 2/2] Error injection support for PAPR
Add support for error injection on PAPR family of devices. This is particularly useful in running 'make check' on non-nfit platforms. Signed-off-by: Santosh Sivaraj --- ndctl/lib/libndctl.c | 1 + ndctl/lib/papr.c | 134 ++ ndctl/lib/private.h | 1 + ndctl/libndctl-papr.h | 7 +++ 4 files changed, 143 insertions(+) create mode 100644 ndctl/libndctl-papr.h diff --git a/ndctl/lib/libndctl.c b/ndctl/lib/libndctl.c index 2364578..2c2f485 100644 --- a/ndctl/lib/libndctl.c +++ b/ndctl/lib/libndctl.c @@ -904,6 +904,7 @@ static void *add_bus(void *parent, int id, const char *ctl_base) else { bus->has_of_node = 1; bus_name = "papr"; + bus->ops = papr_bus_ops; } sprintf(path, "%s/device/%s/dsm_mask", ctl_base, bus_name); diff --git a/ndctl/lib/papr.c b/ndctl/lib/papr.c index f94f8aa..6ac3d3e 100644 --- a/ndctl/lib/papr.c +++ b/ndctl/lib/papr.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include "papr.h" @@ -38,6 +39,33 @@ /* return the pdsm command */ #define to_pdsm_cmd(C) ((enum papr_pdsm)to_ndcmd(C)->nd_command) +/** + * ndctl_bus_is_papr_cmd_supported - check if command is supported on @bus. + * @bus: ndctl_bus instance + * @cmd: papr command number (defined as PAPR_PDSM_XXX in papr-pdsm.h) + * + * Return 1: command is supported. Return 0: command is not supported. + * + */ +NDCTL_EXPORT int ndctl_bus_is_papr_cmd_supported(struct ndctl_bus *bus, +int cmd) +{ + return !!(bus->nfit_dsm_mask & (1ULL << cmd)); +} + +static int papr_is_errinj_supported(struct ndctl_bus *bus) +{ + if (!ndctl_bus_is_papr_scm(bus)) + return 0; + + if (ndctl_bus_is_papr_cmd_supported(bus, PAPR_PDSM_INJECT_SET) && + ndctl_bus_is_papr_cmd_supported(bus, PAPR_PDSM_INJECT_CLEAR) && + ndctl_bus_is_papr_cmd_supported(bus, PAPR_PDSM_INJECT_GET)) + return 1; + + return 0; +} + static bool papr_cmd_is_supported(struct ndctl_dimm *dimm, int cmd) { /* Handle this separately to support monitor mode */ @@ -559,3 +587,109 @@ struct ndctl_dimm_ops * const papr_dimm_ops = &(struct ndctl_dimm_ops) { = papr_cmd_smart_threshold_set_ctrl_temperature, .smart_threshold_set_spares = papr_cmd_smart_threshold_set_spares, }; + +static u32 bus_get_firmware_status(struct ndctl_cmd *cmd) +{ + struct nd_cmd_bus *cmd_bus = cmd->cmd_bus; + + switch (cmd_bus->gen.nd_command) { + case PAPR_PDSM_INJECT_SET: + return cmd_bus->err_inj.status; + case PAPR_PDSM_INJECT_CLEAR: + return cmd_bus->err_inj_clr.status; + case PAPR_PDSM_INJECT_GET: + return cmd_bus->err_inj_stat.status; + } + + return -1U; +} + +static struct ndctl_cmd *papr_bus_cmd_new_err_inj(struct ndctl_bus *bus) +{ + size_t size, cmd_length; + struct nd_cmd_pkg *pkg; + struct ndctl_cmd *cmd; + + cmd_length = sizeof(struct nd_cmd_ars_err_inj); + size = sizeof(*cmd) + sizeof(*pkg) + cmd_length; + cmd = calloc(1, size); + if (!cmd) + return NULL; + + cmd->bus = bus; + ndctl_cmd_ref(cmd); + cmd->type = ND_CMD_CALL; + cmd->get_firmware_status = bus_get_firmware_status; + cmd->size = size; + cmd->status = 1; + pkg = (struct nd_cmd_pkg *)&cmd->cmd_buf[0]; + pkg->nd_command = PAPR_PDSM_INJECT_SET; + pkg->nd_size_in = offsetof(struct nd_cmd_ars_err_inj, status); + pkg->nd_size_out = cmd_length - pkg->nd_size_in; + pkg->nd_fw_size = pkg->nd_size_out; + + return cmd; +} + +static struct ndctl_cmd *papr_bus_cmd_new_err_inj_clr(struct ndctl_bus *bus) +{ + size_t size, cmd_length; + struct nd_cmd_pkg *pkg; + struct ndctl_cmd *cmd; + + cmd_length = sizeof(struct nd_cmd_ars_err_inj_clr); + size = sizeof(*cmd) + sizeof(*pkg) + cmd_length; + cmd = calloc(1, size); + if (!cmd) + return NULL; + + 
cmd->bus = bus; + ndctl_cmd_ref(cmd); + cmd->type = ND_CMD_CALL; + cmd->get_firmware_status = bus_get_firmware_status; + cmd->size = size; + cmd->status = 1; + pkg = (struct nd_cmd_pkg *)&cmd->cmd_buf[0]; + pkg->nd_command = PAPR_PDSM_INJECT_CLEAR; + pkg->nd_size_in = offsetof(struct nd_cmd_ars_err_inj_clr, status); + pkg->nd_size_out = cmd_length - pkg->nd_size_in; + pkg->nd_fw_size = pkg->nd_size_out; + + return cmd; +} + +static struct ndctl_cmd *papr_bus_cmd_new_err_inj_stat(struct ndctl_bus *bus, + u32 buf_size) +{ + size_t size, cmd_length; + struct nd_cmd_pkg *pkg; + struct ndctl_cmd *cmd; + + + cmd_length = sizeof(struct nd_cmd_ars_err_inj_stat); + size = sizeof(*cmd) + sizeof(*pkg) + cmd_length + buf_siz
[ndctl PATCH 1/2] inject-error: Remove assumptions on error injection support
Currently the code assumes that only nfit supports error injection, remove those assumptions before error injection support for PAPR is added. Add bus operations similar to that of DIMM operations with injection operations in it. Signed-off-by: Santosh Sivaraj --- This patch series is on top of Shiva's SMART test patches for PAPR[1]. [1]: https://lkml.kernel.org/r/161711738576.593.320964839920955692.stgit@9add658da52e ndctl/lib/inject.c | 96 ndctl/lib/libndctl.c | 11 +++-- ndctl/lib/nfit.c | 20 + ndctl/lib/private.h | 12 ++ 4 files changed, 83 insertions(+), 56 deletions(-) diff --git a/ndctl/lib/inject.c b/ndctl/lib/inject.c index d61c02c..35fd11e 100644 --- a/ndctl/lib/inject.c +++ b/ndctl/lib/inject.c @@ -6,19 +6,15 @@ #include #include #include -#include #include #include "private.h" NDCTL_EXPORT int ndctl_bus_has_error_injection(struct ndctl_bus *bus) { - /* Currently, only nfit buses have error injection */ - if (!bus || !ndctl_bus_has_nfit(bus)) + if (!bus) return 0; - if (ndctl_bus_is_nfit_cmd_supported(bus, NFIT_CMD_ARS_INJECT_SET) && - ndctl_bus_is_nfit_cmd_supported(bus, NFIT_CMD_ARS_INJECT_GET) && - ndctl_bus_is_nfit_cmd_supported(bus, NFIT_CMD_ARS_INJECT_CLEAR)) + if (bus->ops->err_inj_supported(bus)) return 1; return 0; @@ -151,7 +147,7 @@ static int ndctl_namespace_inject_one_error(struct ndctl_namespace *ndns, length = clear_unit; } - cmd = ndctl_bus_cmd_new_err_inj(bus); + cmd = bus->ops->new_err_inj(bus); if (!cmd) return -ENOMEM; @@ -185,8 +181,6 @@ NDCTL_EXPORT int ndctl_namespace_inject_error2(struct ndctl_namespace *ndns, if (!ndctl_bus_has_error_injection(bus)) return -EOPNOTSUPP; - if (!ndctl_bus_has_nfit(bus)) - return -EOPNOTSUPP; for (i = 0; i < count; i++) { rc = ndctl_namespace_inject_one_error(ndns, block + i, flags); @@ -231,7 +225,7 @@ static int ndctl_namespace_uninject_one_error(struct ndctl_namespace *ndns, length = clear_unit; } - cmd = ndctl_bus_cmd_new_err_inj_clr(bus); + cmd = bus->ops->new_err_inj_clr(bus); if (!cmd) return -ENOMEM; @@ -263,8 +257,6 @@ NDCTL_EXPORT int ndctl_namespace_uninject_error2(struct ndctl_namespace *ndns, if (!ndctl_bus_has_error_injection(bus)) return -EOPNOTSUPP; - if (!ndctl_bus_has_nfit(bus)) - return -EOPNOTSUPP; for (i = 0; i < count; i++) { rc = ndctl_namespace_uninject_one_error(ndns, block + i, @@ -445,51 +437,49 @@ NDCTL_EXPORT int ndctl_namespace_injection_status(struct ndctl_namespace *ndns) if (!ndctl_bus_has_error_injection(bus)) return -EOPNOTSUPP; - if (ndctl_bus_has_nfit(bus)) { - rc = ndctl_namespace_get_injection_bounds(ndns, &ns_offset, - &ns_size); - if (rc) - return rc; + rc = ndctl_namespace_get_injection_bounds(ndns, &ns_offset, + &ns_size); + if (rc) + return rc; - cmd = ndctl_bus_cmd_new_ars_cap(bus, ns_offset, ns_size); - if (!cmd) { - err(ctx, "%s: failed to create cmd\n", - ndctl_namespace_get_devname(ndns)); - return -ENOTTY; - } - rc = ndctl_cmd_submit(cmd); - if (rc < 0) { - dbg(ctx, "Error submitting ars_cap: %d\n", rc); - goto out; - } - buf_size = ndctl_cmd_ars_cap_get_size(cmd); - if (buf_size == 0) { - dbg(ctx, "Got an invalid max_ars_out from ars_cap\n"); - rc = -EINVAL; - goto out; - } - ndctl_cmd_unref(cmd); + cmd = ndctl_bus_cmd_new_ars_cap(bus, ns_offset, ns_size); + if (!cmd) { + err(ctx, "%s: failed to create cmd\n", + ndctl_namespace_get_devname(ndns)); + return -ENOTTY; + } + rc = ndctl_cmd_submit(cmd); + if (rc < 0) { + dbg(ctx, "Error submitting ars_cap: %d\n", rc); + goto out; + } + buf_size = ndctl_cmd_ars_cap_get_size(cmd); + if (buf_size == 0) { + dbg(ctx, "Got an invalid 
max_ars_out from ars_cap\n"); + rc = -EINVAL; + goto out; + } + ndctl_cmd_unref(cmd); - cmd = ndctl_bus_cmd_new_err_inj_stat(bus, buf_size); - if (!cmd) - return -ENOMEM; + cmd = bus->ops->new_err_inj_stat(bus, buf_size); +
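The private.h hunk is truncated above, but the shape of the new bus operations can be inferred from the call sites in inject.c (bus->ops->err_inj_supported(), ->new_err_inj(), ->new_err_inj_clr(), ->new_err_inj_stat()). A hedged sketch follows; the real definition added to ndctl/lib/private.h may differ in naming and members, and u32 is assumed to be ndctl's usual fixed-width typedef.

/* Sketch only - inferred from the bus->ops call sites in this patch. */
struct ndctl_bus_ops {
	int (*err_inj_supported)(struct ndctl_bus *bus);
	struct ndctl_cmd *(*new_err_inj)(struct ndctl_bus *bus);
	struct ndctl_cmd *(*new_err_inj_clr)(struct ndctl_bus *bus);
	struct ndctl_cmd *(*new_err_inj_stat)(struct ndctl_bus *bus,
					      u32 buf_size);
};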
[PATCH] libnvdimm.h: Remove duplicate struct declaration
struct device is declared at 133rd line. The declaration here is unnecessary. Remove it. Signed-off-by: Wan Jiabing --- include/linux/libnvdimm.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h index 01f251b6e36c..89b69e645ac7 100644 --- a/include/linux/libnvdimm.h +++ b/include/linux/libnvdimm.h @@ -141,7 +141,6 @@ static inline void __iomem *devm_nvdimm_ioremap(struct device *dev, struct nvdimm_bus; struct module; -struct device; struct nd_blk_region; struct nd_blk_region_desc { int (*enable)(struct nvdimm_bus *nvdimm_bus, struct device *dev); -- 2.25.1 ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] secretmem: optimize page_is_secretmem()
On Mon, Apr 19, 2021 at 11:42:18AM +0300, Mike Rapoport wrote: > The perf profile of the test indicated that the regression is caused by > page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): Uhh ... you're calling it in the wrong place! VM_BUG_ON(!pfn_valid(pte_pfn(pte))); page = pte_page(pte); if (page_is_secretmem(page)) goto pte_unmap; head = try_grab_compound_head(page, 1, flags); if (!head) goto pte_unmap; So you're calling page_is_secretmem() on a struct page without having a refcount on it. That is definitely not allowed. secretmem seems to be full of these kinds of races; I know this isn't the first one I've seen in it. I don't think this patchset is ready for this merge window. With that fixed, you'll have a head page that you can use for testing, which means you don't need to test PageCompound() (because you know the page isn't PageTail), you can just test PageHead(). ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
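The reordering Matthew asks for amounts to taking the reference before inspecting the page, and dropping it again if the page turns out to be secretmem. A hedged sketch of gup_pte_range() with that ordering (not necessarily the change that was eventually merged; put_compound_head() is assumed to be the gup-internal release helper of this kernel era):

	head = try_grab_compound_head(page, 1, flags);
	if (!head)
		goto pte_unmap;

	if (unlikely(page_is_secretmem(page))) {
		/* The reference pins the page, so the check is stable;
		 * drop it and fall back to the slow path. */
		put_compound_head(head, 1, flags);
		goto pte_unmap;
	}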
Re: [PATCH] secretmem: optimize page_is_secretmem()
On 19.04.21 12:14, Mike Rapoport wrote: On Mon, Apr 19, 2021 at 11:40:56AM +0200, David Hildenbrand wrote: On 19.04.21 11:38, David Hildenbrand wrote: On 19.04.21 11:36, Mike Rapoport wrote: On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote: On 19.04.21 10:42, Mike Rapoport wrote: From: Mike Rapoport Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops due to commit "mm: introduce memfd_secret system call to create "secret" memory areas". The perf profile of the test indicated that the regression is caused by page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): 27.76 +2.5 30.23 perf-profile.children.cycles-pp.gup_pgd_range 0.00 +3.2 3.19 ± 2% perf-profile.children.cycles-pp.page_mapping 0.00 +3.7 3.66 ± 2% perf-profile.children.cycles-pp.page_is_secretmem Further analysis showed that the slow down happens because neither page_is_secretmem() nor page_mapping() are not inline and moreover, multiple page flags checks in page_mapping() involve calling compound_head() several times for the same page. Make page_is_secretmem() inline and replace page_mapping() with page flag checks that do not imply page-to-head conversion. Reported-by: kernel test robot Signed-off-by: Mike Rapoport --- @Andrew, The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would be added as a fixup to the memfd_secret series. include/linux/secretmem.h | 26 +- mm/secretmem.c| 12 +--- 2 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 907a6734059c..b842b38cbeb1 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -4,8 +4,32 @@ #ifdef CONFIG_SECRETMEM +extern const struct address_space_operations secretmem_aops; + +static inline bool page_is_secretmem(struct page *page) +{ + struct address_space *mapping; + + /* +* Using page_mapping() is quite slow because of the actual call +* instruction and repeated compound_head(page) inside the +* page_mapping() function. +* We know that secretmem pages are not compound and LRU so we can +* save a couple of cycles here. +*/ + if (PageCompound(page) || !PageLRU(page)) + return false; I'd assume secretmem pages are rare in basically every setup out there. So maybe throwing in a couple of likely()/unlikely() might make sense. I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can hardly estimate which pages are going to be checked. + + mapping = (struct address_space *) + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); + Not sure if open-coding page_mapping is really a good idea here -- or even necessary after the fast path above is in place. Anyhow, just my 2 cents. Well, most if the -4.2% of the performance regression kbuild reported were due to repeated compount_head(page) in page_mapping(). So the whole point of this patch is to avoid calling page_mapping(). I would have thought the fast path "(PageCompound(page) || !PageLRU(page))" would already avoid calling page_mapping() in many cases. (and I do wonder if a generic page_mapping() optimization would make sense instead) Not sure. 
Replacing page_mapping() with page_file_mapping() at the call sites at fs/ and mm/ increased the defconfig image by nearly 2k and page_file_mapping() is way simpler than page_mapping() add/remove: 1/0 grow/shrink: 35/0 up/down: 1960/0 (1960) Function old new delta shrink_page_list34143670+256 __set_page_dirty_nobuffers 242 349+107 check_move_unevictable_pages 904 987 +83 move_to_new_page 591 671 +80 shrink_active_list 912 970 +58 move_pages_to_lru911 965 +54 migrate_pages 25002554 +54 shmem_swapin_page 11451197 +52 shmem_undo_range16691719 +50 __test_set_page_writeback620 670 +50 __set_page_dirty_buffers 187 237 +50 __pagevec_lru_add757 807 +50 __munlock_pagevec 11551205 +50 __dump_page 11011151 +50 __cancel_dirty_page 182 232 +50 __remove_mapping 461 510 +49 rmap_walk_file 402 449 +47 isolate_movable_page 240 287 +47 test_clear_page_writeback668 714 +46 page_cache_pipe_buf_try_steal171 217 +46 page_endio
Re: [PATCH] secretmem: optimize page_is_secretmem()
On Mon, Apr 19, 2021 at 11:40:56AM +0200, David Hildenbrand wrote: > On 19.04.21 11:38, David Hildenbrand wrote: > > On 19.04.21 11:36, Mike Rapoport wrote: > > > On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote: > > > > On 19.04.21 10:42, Mike Rapoport wrote: > > > > > From: Mike Rapoport > > > > > > > > > > Kernel test robot reported -4.2% regression of > > > > > will-it-scale.per_thread_ops > > > > > due to commit "mm: introduce memfd_secret system call to create > > > > > "secret" > > > > > memory areas". > > > > > > > > > > The perf profile of the test indicated that the regression is caused > > > > > by > > > > > page_is_secretmem() called from gup_pte_range() (inlined by > > > > > gup_pgd_range): > > > > > > > > > > 27.76 +2.5 30.23 > > > > > perf-profile.children.cycles-pp.gup_pgd_range > > > > > 0.00 +3.2 3.19 ± 2% > > > > > perf-profile.children.cycles-pp.page_mapping > > > > > 0.00 +3.7 3.66 ± 2% > > > > > perf-profile.children.cycles-pp.page_is_secretmem > > > > > > > > > > Further analysis showed that the slow down happens because neither > > > > > page_is_secretmem() nor page_mapping() are not inline and moreover, > > > > > multiple page flags checks in page_mapping() involve calling > > > > > compound_head() several times for the same page. > > > > > > > > > > Make page_is_secretmem() inline and replace page_mapping() with page > > > > > flag > > > > > checks that do not imply page-to-head conversion. > > > > > > > > > > Reported-by: kernel test robot > > > > > Signed-off-by: Mike Rapoport > > > > > --- > > > > > > > > > > @Andrew, > > > > > The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if > > > > > it would > > > > > be added as a fixup to the memfd_secret series. > > > > > > > > > > include/linux/secretmem.h | 26 +- > > > > > mm/secretmem.c| 12 +--- > > > > > 2 files changed, 26 insertions(+), 12 deletions(-) > > > > > > > > > > diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h > > > > > index 907a6734059c..b842b38cbeb1 100644 > > > > > --- a/include/linux/secretmem.h > > > > > +++ b/include/linux/secretmem.h > > > > > @@ -4,8 +4,32 @@ > > > > > #ifdef CONFIG_SECRETMEM > > > > > +extern const struct address_space_operations secretmem_aops; > > > > > + > > > > > +static inline bool page_is_secretmem(struct page *page) > > > > > +{ > > > > > + struct address_space *mapping; > > > > > + > > > > > + /* > > > > > + * Using page_mapping() is quite slow because of the actual call > > > > > + * instruction and repeated compound_head(page) inside the > > > > > + * page_mapping() function. > > > > > + * We know that secretmem pages are not compound and LRU so we > > > > > can > > > > > + * save a couple of cycles here. > > > > > + */ > > > > > + if (PageCompound(page) || !PageLRU(page)) > > > > > + return false; > > > > > > > > I'd assume secretmem pages are rare in basically every setup out there. > > > > So > > > > maybe throwing in a couple of likely()/unlikely() might make sense. > > > > > > I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I > > > can > > > hardly estimate which pages are going to be checked. > > > > > + > > > > > + mapping = (struct address_space *) > > > > > + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); > > > > > + > > > > > > > > Not sure if open-coding page_mapping is really a good idea here -- or > > > > even > > > > necessary after the fast path above is in place. Anyhow, just my 2 > > > > cents. 
> > > > > > Well, most if the -4.2% of the performance regression kbuild reported were > > > due to repeated compount_head(page) in page_mapping(). So the whole point > > > of this patch is to avoid calling page_mapping(). > > > > I would have thought the fast path "(PageCompound(page) || > > !PageLRU(page))" would already avoid calling page_mapping() in many cases. > > (and I do wonder if a generic page_mapping() optimization would make sense > instead) Not sure. Replacing page_mapping() with page_file_mapping() at the call sites at fs/ and mm/ increased the defconfig image by nearly 2k and page_file_mapping() is way simpler than page_mapping() add/remove: 1/0 grow/shrink: 35/0 up/down: 1960/0 (1960) Function old new delta shrink_page_list34143670+256 __set_page_dirty_nobuffers 242 349+107 check_move_unevictable_pages 904 987 +83 move_to_new_page 591 671 +80 shrink_active_list 912 970 +58 move_pages_to_lru911 965 +54 migrate_pages 25002554 +54 shmem_swapin_page 11451197 +52 shmem_undo_range
Re: [PATCH] dax: Fix missed wakeup in put_unlocked_entry()
On Fri 16-04-21 17:24:49, Vivek Goyal wrote:
> On Fri, Apr 16, 2021 at 12:56:05PM -0700, Dan Williams wrote:
> > On Fri, Apr 16, 2021 at 10:35 AM Vivek Goyal wrote:
> > >
> > > I am seeing missed wakeups which ultimately lead to a deadlock when I am
> > > using virtiofs with DAX enabled and running "make -j". I had to mount
> > > virtiofs as rootfs and also reduce the dax window size to 32M to reproduce
> > > the problem consistently.
> > >
> > > This is not a complete patch. I am just proposing this partial fix to
> > > highlight the issue and trying to figure out how it should be fixed.
> > > Should it be fixed in generic dax code, or should the filesystem
> > > (fuse/virtiofs) take care of this?
> > >
> > > So here is the problem. put_unlocked_entry() wakes up waiters only
> > > if entry is not null as well as !dax_is_conflict(entry). But if I
> > > call multiple instances of invalidate_inode_pages2() in parallel,
> > > then I can run into a situation where there are waiters on
> > > this index but nobody will wake them.
> > >
> > > invalidate_inode_pages2()
> > >   invalidate_inode_pages2_range()
> > >     invalidate_exceptional_entry2()
> > >       dax_invalidate_mapping_entry_sync()
> > >         __dax_invalidate_entry() {
> > >           xas_lock_irq(&xas);
> > >           entry = get_unlocked_entry(&xas, 0);
> > >           ...
> > >           ...
> > >           dax_disassociate_entry(entry, mapping, trunc);
> > >           xas_store(&xas, NULL);
> > >           ...
> > >           ...
> > >           put_unlocked_entry(&xas, entry);
> > >           xas_unlock_irq(&xas);
> > >         }
> > >
> > > Say a fault is in progress and it has locked the entry at offset, say, "0x1c".
> > > Now say three instances of invalidate_inode_pages2() are in progress
> > > (A, B, C) and they all try to invalidate the entry at offset "0x1c". Given
> > > the dax entry is locked, all three instances A, B, C will wait in the wait
> > > queue.
> > >
> > > When the dax fault finishes, say A is woken up. It will store a NULL entry
> > > at index "0x1c" and wake up B. When B comes along it will find "entry=0"
> > > at page offset 0x1c and it will call put_unlocked_entry(&xas, 0). And
> > > this means put_unlocked_entry() will not wake up the next waiter, given
> > > the current code. And that means C continues to wait and is not woken
> > > up.
> > >
> > > In my case I am seeing that the dax page fault path itself is waiting
> > > on grab_mapping_entry() and also invalidate_inode_pages2() is
> > > waiting in get_unlocked_entry(), but the entry has already been cleaned
> > > up and nobody woke up these processes. At least I think that's what
> > > is happening.
> > >
> > > This patch wakes up a process even if entry=0. And the deadlock does not
> > > happen. I am running into some OOM issues, which I will debug.
> > >
> > > So my question is: is this a dax issue that should be fixed in the
> > > dax layer, or should it be handled in fuse, by introducing enough
> > > locking to make sure that multiple instances of
> > > invalidate_inode_pages2() on the same inode don't make progress in
> > > parallel?
> > >
> > > Right now fuse_finish_open() calls invalidate_inode_pages2() without
> > > any locking. That allows it to make progress in parallel to the dax
> > > fault path, and also allows multiple instances of invalidate_inode_pages2()
> > > to run in parallel.
> > > Not-yet-signed-off-by: Vivek Goyal
> > > ---
> > >  fs/dax.c |    7 ---
> > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > >
> > > Index: redhat-linux/fs/dax.c
> > > ===
> > > --- redhat-linux.orig/fs/dax.c	2021-04-16 12:50:40.141363317 -0400
> > > +++ redhat-linux/fs/dax.c	2021-04-16 12:51:42.385926390 -0400
> > > @@ -266,9 +266,10 @@ static void wait_entry_unlocked(struct x
> > >
> > >  static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > >  {
> > > -	/* If we were the only waiter woken, wake the next one */
> > > -	if (entry && !dax_is_conflict(entry))
> > > -		dax_wake_entry(xas, entry, false);
> > > +	if (dax_is_conflict(entry))
> > > +		return;
> > > +
> > > +	dax_wake_entry(xas, entry, false);
>
> Hi Dan,
>
> > How does this work if entry is NULL? dax_entry_waitqueue() will not
> > know if it needs to adjust the index.
>
> Wake waiters both at the current index as well as the PMD-adjusted index.
> It feels a little ugly though.
>
> > I think the fix might be to
> > specify that put_unlocked_entry() in the invalidate path needs to do a
> > wake_up_all().
>
> Doing a wake_up_all() when we invalidate an entry sounds good. I will give
> it a try.

Yeah, that's what I'd suggest as well. After invalidating the entry, there's
no point in letting other waiters sleep. Trying to optimize for thundering
herd problems in the face of entry invalidation is really fragile, as you
noticed.
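To make the direction agreed on above concrete, here is a rough sketch (not the merged fix) of put_unlocked_entry() growing a flag so that the invalidate path can wake every waiter; the wake_all parameter name and its plumbing into dax_wake_entry() are assumptions based on the code quoted above:

/* Sketch: let callers choose between waking one waiter and all of them. */
static void put_unlocked_entry(struct xa_state *xas, void *entry, bool wake_all)
{
	/* A conflict entry was never locked, so there is nobody to wake. */
	if (entry && !dax_is_conflict(entry))
		dax_wake_entry(xas, entry, wake_all);
}

/*
 * Fault-path callers would keep passing false (wake only the next waiter),
 * while __dax_invalidate_entry() would pass true after storing NULL, so a
 * waiter that sees the cleared entry cannot strand the waiters behind it.
 */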
Re: [PATCH] secretmem: optimize page_is_secretmem()
On 19.04.21 11:38, David Hildenbrand wrote: On 19.04.21 11:36, Mike Rapoport wrote: On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote: On 19.04.21 10:42, Mike Rapoport wrote: From: Mike Rapoport Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops due to commit "mm: introduce memfd_secret system call to create "secret" memory areas". The perf profile of the test indicated that the regression is caused by page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): 27.76 +2.5 30.23 perf-profile.children.cycles-pp.gup_pgd_range 0.00 +3.2 3.19 ± 2% perf-profile.children.cycles-pp.page_mapping 0.00 +3.7 3.66 ± 2% perf-profile.children.cycles-pp.page_is_secretmem Further analysis showed that the slow down happens because neither page_is_secretmem() nor page_mapping() are not inline and moreover, multiple page flags checks in page_mapping() involve calling compound_head() several times for the same page. Make page_is_secretmem() inline and replace page_mapping() with page flag checks that do not imply page-to-head conversion. Reported-by: kernel test robot Signed-off-by: Mike Rapoport --- @Andrew, The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would be added as a fixup to the memfd_secret series. include/linux/secretmem.h | 26 +- mm/secretmem.c| 12 +--- 2 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 907a6734059c..b842b38cbeb1 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -4,8 +4,32 @@ #ifdef CONFIG_SECRETMEM +extern const struct address_space_operations secretmem_aops; + +static inline bool page_is_secretmem(struct page *page) +{ + struct address_space *mapping; + + /* +* Using page_mapping() is quite slow because of the actual call +* instruction and repeated compound_head(page) inside the +* page_mapping() function. +* We know that secretmem pages are not compound and LRU so we can +* save a couple of cycles here. +*/ + if (PageCompound(page) || !PageLRU(page)) + return false; I'd assume secretmem pages are rare in basically every setup out there. So maybe throwing in a couple of likely()/unlikely() might make sense. I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can hardly estimate which pages are going to be checked. + + mapping = (struct address_space *) + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); + Not sure if open-coding page_mapping is really a good idea here -- or even necessary after the fast path above is in place. Anyhow, just my 2 cents. Well, most if the -4.2% of the performance regression kbuild reported were due to repeated compount_head(page) in page_mapping(). So the whole point of this patch is to avoid calling page_mapping(). I would have thought the fast path "(PageCompound(page) || !PageLRU(page))" would already avoid calling page_mapping() in many cases. (and I do wonder if a generic page_mapping() optimization would make sense instead) Willy can most probably give the best advise here :) -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] secretmem: optimize page_is_secretmem()
On 19.04.21 11:36, Mike Rapoport wrote: On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote: On 19.04.21 10:42, Mike Rapoport wrote: From: Mike Rapoport Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops due to commit "mm: introduce memfd_secret system call to create "secret" memory areas". The perf profile of the test indicated that the regression is caused by page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): 27.76 +2.5 30.23 perf-profile.children.cycles-pp.gup_pgd_range 0.00 +3.2 3.19 ± 2% perf-profile.children.cycles-pp.page_mapping 0.00 +3.7 3.66 ± 2% perf-profile.children.cycles-pp.page_is_secretmem Further analysis showed that the slow down happens because neither page_is_secretmem() nor page_mapping() are not inline and moreover, multiple page flags checks in page_mapping() involve calling compound_head() several times for the same page. Make page_is_secretmem() inline and replace page_mapping() with page flag checks that do not imply page-to-head conversion. Reported-by: kernel test robot Signed-off-by: Mike Rapoport --- @Andrew, The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would be added as a fixup to the memfd_secret series. include/linux/secretmem.h | 26 +- mm/secretmem.c| 12 +--- 2 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 907a6734059c..b842b38cbeb1 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -4,8 +4,32 @@ #ifdef CONFIG_SECRETMEM +extern const struct address_space_operations secretmem_aops; + +static inline bool page_is_secretmem(struct page *page) +{ + struct address_space *mapping; + + /* +* Using page_mapping() is quite slow because of the actual call +* instruction and repeated compound_head(page) inside the +* page_mapping() function. +* We know that secretmem pages are not compound and LRU so we can +* save a couple of cycles here. +*/ + if (PageCompound(page) || !PageLRU(page)) + return false; I'd assume secretmem pages are rare in basically every setup out there. So maybe throwing in a couple of likely()/unlikely() might make sense. I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can hardly estimate which pages are going to be checked. + + mapping = (struct address_space *) + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); + Not sure if open-coding page_mapping is really a good idea here -- or even necessary after the fast path above is in place. Anyhow, just my 2 cents. Well, most if the -4.2% of the performance regression kbuild reported were due to repeated compount_head(page) in page_mapping(). So the whole point of this patch is to avoid calling page_mapping(). I would have thought the fast path "(PageCompound(page) || !PageLRU(page))" would already avoid calling page_mapping() in many cases. -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] secretmem: optimize page_is_secretmem()
On Mon, Apr 19, 2021 at 11:15:02AM +0200, David Hildenbrand wrote: > On 19.04.21 10:42, Mike Rapoport wrote: > > From: Mike Rapoport > > > > Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops > > due to commit "mm: introduce memfd_secret system call to create "secret" > > memory areas". > > > > The perf profile of the test indicated that the regression is caused by > > page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): > > > > 27.76 +2.5 30.23 perf-profile.children.cycles-pp.gup_pgd_range > >0.00 +3.2 3.19 ± 2% perf-profile.children.cycles-pp.page_mapping > >0.00 +3.7 3.66 ± 2% perf-profile.children.cycles-pp.page_is_secretmem > > > > Further analysis showed that the slow down happens because neither > > page_is_secretmem() nor page_mapping() are not inline and moreover, > > multiple page flags checks in page_mapping() involve calling > > compound_head() several times for the same page. > > > > Make page_is_secretmem() inline and replace page_mapping() with page flag > > checks that do not imply page-to-head conversion. > > > > Reported-by: kernel test robot > > Signed-off-by: Mike Rapoport > > --- > > > > @Andrew, > > The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would > > be added as a fixup to the memfd_secret series. > > > > include/linux/secretmem.h | 26 +- > > mm/secretmem.c| 12 +--- > > 2 files changed, 26 insertions(+), 12 deletions(-) > > > > diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h > > index 907a6734059c..b842b38cbeb1 100644 > > --- a/include/linux/secretmem.h > > +++ b/include/linux/secretmem.h > > @@ -4,8 +4,32 @@ > > #ifdef CONFIG_SECRETMEM > > +extern const struct address_space_operations secretmem_aops; > > + > > +static inline bool page_is_secretmem(struct page *page) > > +{ > > + struct address_space *mapping; > > + > > + /* > > +* Using page_mapping() is quite slow because of the actual call > > +* instruction and repeated compound_head(page) inside the > > +* page_mapping() function. > > +* We know that secretmem pages are not compound and LRU so we can > > +* save a couple of cycles here. > > +*/ > > + if (PageCompound(page) || !PageLRU(page)) > > + return false; > > I'd assume secretmem pages are rare in basically every setup out there. So > maybe throwing in a couple of likely()/unlikely() might make sense. I'd say we could do unlikely(page_is_secretmem()) at call sites. Here I can hardly estimate which pages are going to be checked. > > + > > + mapping = (struct address_space *) > > + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); > > + > > Not sure if open-coding page_mapping is really a good idea here -- or even > necessary after the fast path above is in place. Anyhow, just my 2 cents. Well, most if the -4.2% of the performance regression kbuild reported were due to repeated compount_head(page) in page_mapping(). So the whole point of this patch is to avoid calling page_mapping(). > The idea of the patch makes sense to me. -- Sincerely yours, Mike. ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
Re: [PATCH] secretmem: optimize page_is_secretmem()
On 19.04.21 10:42, Mike Rapoport wrote: From: Mike Rapoport Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops due to commit "mm: introduce memfd_secret system call to create "secret" memory areas". The perf profile of the test indicated that the regression is caused by page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range): 27.76 +2.5 30.23 perf-profile.children.cycles-pp.gup_pgd_range 0.00 +3.2 3.19 ± 2% perf-profile.children.cycles-pp.page_mapping 0.00 +3.7 3.66 ± 2% perf-profile.children.cycles-pp.page_is_secretmem Further analysis showed that the slow down happens because neither page_is_secretmem() nor page_mapping() are not inline and moreover, multiple page flags checks in page_mapping() involve calling compound_head() several times for the same page. Make page_is_secretmem() inline and replace page_mapping() with page flag checks that do not imply page-to-head conversion. Reported-by: kernel test robot Signed-off-by: Mike Rapoport --- @Andrew, The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would be added as a fixup to the memfd_secret series. include/linux/secretmem.h | 26 +- mm/secretmem.c| 12 +--- 2 files changed, 26 insertions(+), 12 deletions(-) diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h index 907a6734059c..b842b38cbeb1 100644 --- a/include/linux/secretmem.h +++ b/include/linux/secretmem.h @@ -4,8 +4,32 @@ #ifdef CONFIG_SECRETMEM +extern const struct address_space_operations secretmem_aops; + +static inline bool page_is_secretmem(struct page *page) +{ + struct address_space *mapping; + + /* +* Using page_mapping() is quite slow because of the actual call +* instruction and repeated compound_head(page) inside the +* page_mapping() function. +* We know that secretmem pages are not compound and LRU so we can +* save a couple of cycles here. +*/ + if (PageCompound(page) || !PageLRU(page)) + return false; I'd assume secretmem pages are rare in basically every setup out there. So maybe throwing in a couple of likely()/unlikely() might make sense. + + mapping = (struct address_space *) + ((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS); + Not sure if open-coding page_mapping is really a good idea here -- or even necessary after the fast path above is in place. Anyhow, just my 2 cents. The idea of the patch makes sense to me. -- Thanks, David / dhildenb ___ Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
[PATCH] secretmem: optimize page_is_secretmem()
From: Mike Rapoport

Kernel test robot reported -4.2% regression of will-it-scale.per_thread_ops
due to commit "mm: introduce memfd_secret system call to create "secret"
memory areas".

The perf profile of the test indicated that the regression is caused by
page_is_secretmem() called from gup_pte_range() (inlined by gup_pgd_range):

 27.76 +2.5  30.23       perf-profile.children.cycles-pp.gup_pgd_range
  0.00 +3.2   3.19 ± 2%  perf-profile.children.cycles-pp.page_mapping
  0.00 +3.7   3.66 ± 2%  perf-profile.children.cycles-pp.page_is_secretmem

Further analysis showed that the slow down happens because neither
page_is_secretmem() nor page_mapping() is inline and, moreover, multiple
page flag checks in page_mapping() involve calling compound_head() several
times for the same page.

Make page_is_secretmem() inline and replace page_mapping() with page flag
checks that do not imply page-to-head conversion.

Reported-by: kernel test robot
Signed-off-by: Mike Rapoport
---

@Andrew,
The patch is vs v5.12-rc7-mmots-2021-04-15-16-28, I'd appreciate if it would
be added as a fixup to the memfd_secret series.

 include/linux/secretmem.h | 26 +-
 mm/secretmem.c            | 12 +---
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index 907a6734059c..b842b38cbeb1 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -4,8 +4,32 @@
 #ifdef CONFIG_SECRETMEM
 
+extern const struct address_space_operations secretmem_aops;
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping;
+
+	/*
+	 * Using page_mapping() is quite slow because of the actual call
+	 * instruction and repeated compound_head(page) inside the
+	 * page_mapping() function.
+	 * We know that secretmem pages are not compound and LRU so we can
+	 * save a couple of cycles here.
+	 */
+	if (PageCompound(page) || !PageLRU(page))
+		return false;
+
+	mapping = (struct address_space *)
+		((unsigned long)page->mapping & ~PAGE_MAPPING_FLAGS);
+
+	if (mapping != page->mapping)
+		return false;
+
+	return page->mapping->a_ops == &secretmem_aops;
+}
+
 bool vma_is_secretmem(struct vm_area_struct *vma);
-bool page_is_secretmem(struct page *page);
 bool secretmem_active(void);
 
 #else
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 3b1ba3991964..0bcd15e1b549 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -151,22 +151,12 @@ static void secretmem_freepage(struct page *page)
 	clear_highpage(page);
 }
 
-static const struct address_space_operations secretmem_aops = {
+const struct address_space_operations secretmem_aops = {
 	.freepage	= secretmem_freepage,
 	.migratepage	= secretmem_migratepage,
 	.isolate_page	= secretmem_isolate_page,
 };
 
-bool page_is_secretmem(struct page *page)
-{
-	struct address_space *mapping = page_mapping(page);
-
-	if (!mapping)
-		return false;
-
-	return mapping->a_ops == &secretmem_aops;
-}
-
 static struct vfsmount *secretmem_mnt;
 
 static struct file *secretmem_file_create(unsigned long flags)
-- 
2.28.0
___
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-le...@lists.01.org
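As a side note on the open-coded check in the hunk above: page->mapping carries type tags in its low bits, so masking with ~PAGE_MAPPING_FLAGS and comparing against the raw pointer rejects anon, KSM and movable pages before the a_ops dereference. An equivalent spelling of the same test, written out as a separate helper, might look like the sketch below (the name page_is_secretmem_alt is made up for illustration and is not part of the patch):

/* Sketch: behaviorally equivalent spelling of page_is_secretmem() above. */
static inline bool page_is_secretmem_alt(struct page *page)
{
	/* Same fast-path checks as the patch: not compound, on the LRU. */
	if (PageCompound(page) || !PageLRU(page))
		return false;

	/*
	 * page->mapping carries type tags in its low bits; any tag set
	 * (anon, KSM, movable) means this cannot be a secretmem page.
	 */
	if ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS)
		return false;

	return page->mapping->a_ops == &secretmem_aops;
}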