[PATCH v3] mm: compaction: update the cc->nr_migratepages when allocating or freeing the freepages

2024-02-19 Thread Baolin Wang
Currently we use the comparison 'cc->nr_freepages >= cc->nr_migratepages'
to decide whether enough freepages have been isolated in isolate_freepages().
However, compaction_alloc() only decreases cc->nr_freepages without updating
cc->nr_migratepages, which wastes CPU cycles and causes too many freepages
to be isolated.
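The effect can be seen with a small userspace toy model of the accounting
(assumption: this is a simplified sketch, not the kernel code; the names
mirror struct compact_control, but the isolation loop is reduced to a
counter):

#include <stdio.h>

struct compact_control {
	int nr_freepages;	/* free pages currently isolated */
	int nr_migratepages;	/* migration candidates still to be placed */
};

/* Stand-in for the isolate_freepages() termination check. */
static int enough_free_pages(struct compact_control *cc)
{
	return cc->nr_freepages >= cc->nr_migratepages;
}

/* Stand-in for compaction_alloc(): hand out one isolated free page. */
static void alloc_dst_page(struct compact_control *cc, int patched)
{
	cc->nr_freepages--;
	if (patched)
		cc->nr_migratepages--;	/* the fix: keep the comparison meaningful */
}

int main(void)
{
	for (int patched = 0; patched <= 1; patched++) {
		struct compact_control cc = { .nr_freepages = 32, .nr_migratepages = 32 };
		int extra = 0;

		for (int i = 0; i < 16; i++)	/* migrate half of the candidates */
			alloc_dst_page(&cc, patched);

		while (!enough_free_pages(&cc)) {	/* re-run the termination check */
			cc.nr_freepages++;		/* ...and isolate "missing" freepages */
			extra++;
		}
		printf("%s: extra free pages isolated: %d\n",
		       patched ? "patched" : "unpatched", extra);
	}
	return 0;
}

In the unpatched case the model re-isolates 16 extra freepages for work
that has already been handed out, which is the wasted scanning described
above.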

So we should also update cc->nr_migratepages when allocating or freeing
the freepages, to avoid isolating excess freepages. With this change I can
see fewer free pages scanned and isolated when running thpcompact on my
Arm64 server:
                                          k6.7    k6.7_patched
Ops Compaction pages isolated    120692036.00    118160797.00
Ops Compaction migrate scanned   131210329.00    154093268.00
Ops Compaction free scanned     1090587971.00   1080632536.00
Ops Compact scan efficiency             12.03           14.26

Moreover, I did not see an obvious latency improvement; this is likely because
isolating freepages is not the bottleneck in the thpcompact test case.
                             k6.7                  k6.7_patched
Amean fault-both-1      1089.76 (   0.00%)     1080.16 *   0.88%*
Amean fault-both-3      1616.48 (   0.00%)     1636.65 *  -1.25%*
Amean fault-both-5      2266.66 (   0.00%)     2219.20 *   2.09%*
Amean fault-both-7      2909.84 (   0.00%)     2801.90 *   3.71%*
Amean fault-both-12     4861.26 (   0.00%)     4733.25 *   2.63%*
Amean fault-both-18     7351.11 (   0.00%)     6950.51 *   5.45%*
Amean fault-both-24     9059.30 (   0.00%)     9159.99 *  -1.11%*
Amean fault-both-30    10685.68 (   0.00%)    11399.02 *  -6.68%*

Signed-off-by: Baolin Wang 
Acked-by: Mel Gorman 
---
Hi Andrew, please use this patch to replace the 2 old patches below. Thanks.
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=cb30899d456d64fb83ee3e3d95cd9fbb18735d87
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/commit/?h=mm-everything-2024-02-20-04-19=65713d1c4fc498d35dc1f7c1234ef815aa128429

Changes from v2:
 - Add acked tag from Mel.
 - Fix the mm_compaction_migratepages trace event.
Changes from v1:
 - Rebased on the latest mm-unstable branch.
---
 include/trace/events/compaction.h |  6 +++---
 mm/compaction.c   | 12 ++--
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 2b2a975efd20..d05759d18538 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -78,10 +78,10 @@ DEFINE_EVENT(mm_compaction_isolate_template, mm_compaction_fast_isolate_freepage
 #ifdef CONFIG_COMPACTION
 TRACE_EVENT(mm_compaction_migratepages,
 
-   TP_PROTO(struct compact_control *cc,
+   TP_PROTO(unsigned int nr_migratepages,
unsigned int nr_succeeded),
 
-   TP_ARGS(cc, nr_succeeded),
+   TP_ARGS(nr_migratepages, nr_succeeded),
 
TP_STRUCT__entry(
__field(unsigned long, nr_migrated)
@@ -90,7 +90,7 @@ TRACE_EVENT(mm_compaction_migratepages,
 
TP_fast_assign(
__entry->nr_migrated = nr_succeeded;
-   __entry->nr_failed = cc->nr_migratepages - nr_succeeded;
+   __entry->nr_failed = nr_migratepages - nr_succeeded;
),
 
TP_printk("nr_migrated=%lu nr_failed=%lu",
diff --git a/mm/compaction.c b/mm/compaction.c
index 4494b2914386..218089b29f13 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1798,6 +1798,7 @@ static struct folio *compaction_alloc(struct folio *src, unsigned long data)
dst = list_entry(cc->freepages.next, struct folio, lru);
list_del(&dst->lru);
cc->nr_freepages--;
+   cc->nr_migratepages--;
 
return dst;
 }
@@ -1813,6 +1814,7 @@ static void compaction_free(struct folio *dst, unsigned long data)
 
list_add(&dst->lru, &cc->freepages);
cc->nr_freepages++;
+   cc->nr_migratepages++;
 }
 
 /* possible outcome of isolate_migratepages */
@@ -2435,7 +2437,7 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
unsigned long last_migrated_pfn;
const bool sync = cc->mode != MIGRATE_ASYNC;
bool update_cached;
-   unsigned int nr_succeeded = 0;
+   unsigned int nr_succeeded = 0, nr_migratepages;
 
/*
 * These counters track activities during zone compaction.  Initialize
@@ -2553,11 +2555,17 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
pageblock_start_pfn(cc->migrate_pfn - 1));
}
 
+   /*
+* Record the number of pages to migrate since the
+* compaction_alloc/free() will update cc->nr_migratepages
+* properly.
+*/
+   nr_migratepages = cc->nr_migra
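(The hunk is cut off here in the archive.  Judging from the comment just
above and the trace-event change earlier in this patch, the remainder
presumably snapshots the count before migrate_pages() and feeds the
snapshot to the trace event, roughly as below; this is a reconstruction,
not the literal patch text.)

+		nr_migratepages = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				compaction_free, (unsigned long)cc, cc->mode,
 				MR_COMPACTION, &nr_succeeded);

-		trace_mm_compaction_migratepages(cc, nr_succeeded);
+		trace_mm_compaction_migratepages(nr_migratepages, nr_succeeded);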

Re: [PATCH v2 1/2] fuse: Fix possible deadlock when writing back dirty pages

2021-04-14 Thread Baolin Wang




On 2021/4/14 17:47, Miklos Szeredi wrote:

On Wed, Apr 14, 2021 at 11:22 AM Baolin Wang
 wrote:




On 2021/4/14 17:02, Miklos Szeredi wrote:

On Wed, Apr 14, 2021 at 10:42 AM Baolin Wang
 wrote:


Sorry, I missed this patch before. I've tested it, and it seems it can
solve the deadlock issue I met before.


Great, thanks for testing.


But looking at this patch in detail, I think it only reduces the
deadlock window, and does not completely remove the possible deadlock
scenario I explained in the commit log.

With your patch, fuse_fill_write_pages() can still lock the partial page,
and fuse_send_write_pages() will then wait for that partial page's
writeback to complete if the writeback flag is set.

But at the same time, a writeback worker thread may be waiting to lock
that partial page in order to write a bunch of dirty pages via
fuse_writepages().


As you say, fuse_fill_write_pages() will lock a partial page.  This
page cannot become dirty until it has been read completely, which
first requires the page lock.  So dirtying this page can only happen
after the writeback of the fragment has completed.


What I mean is that the writeback worker has looked up the dirty pages in
write_cache_pages() and stored them into a temporary pagevec, and then
tries to lock the dirty pages one by one and write them.

For example, suppose it looked up 2 dirty pages (call them page 1 and page
2), wrote out page 1 via fuse_writepages_fill(), and unlocked page 1. It
then tries to lock page 2.

At the same time, suppose fuse_fill_write_pages() is writing the same
page 1 and partial page 2: it will lock partial page 2 and wait for page
1's writeback to complete. But page 1's writeback cannot complete, since
the writeback worker is waiting to lock page 2, which has already been
locked by fuse_fill_write_pages().


How would page2 become not uptodate, when it was already collected by
write_cache_pages()?  I.e. page2 is a dirty page, hence it must be
uptodate, and fuse_writepages_fill() will not keep it locked.


Reading your patch carefully again, I now realize you are right, and your
patch can solve the deadlock issue I met. Please feel free to add my
Tested-by tag to your patch. Thanks.


Tested-by: Baolin Wang 


Re: [PATCH v2 1/2] fuse: Fix possible deadlock when writing back dirty pages

2021-04-14 Thread Baolin Wang




On 2021/4/14 17:02, Miklos Szeredi wrote:

On Wed, Apr 14, 2021 at 10:42 AM Baolin Wang
 wrote:


Sorry, I missed this patch before. I've tested it, and it seems it can
solve the deadlock issue I met before.


Great, thanks for testing.


But looking at this patch in detail, I think it only reduces the
deadlock window, and does not completely remove the possible deadlock
scenario I explained in the commit log.

With your patch, fuse_fill_write_pages() can still lock the partial page,
and fuse_send_write_pages() will then wait for that partial page's
writeback to complete if the writeback flag is set.

But at the same time, a writeback worker thread may be waiting to lock
that partial page in order to write a bunch of dirty pages via
fuse_writepages().


As you say, fuse_fill_write_pages() will lock a partial page.  This
page cannot become dirty until it has been read completely, which
first requires the page lock.  So dirtying this page can only happen
after the writeback of the fragment has completed.


What I mean is that the writeback worker has looked up the dirty pages in
write_cache_pages() and stored them into a temporary pagevec, and then
tries to lock the dirty pages one by one and write them.


For example, suppose it looked up 2 dirty pages (call them page 1 and page
2), wrote out page 1 via fuse_writepages_fill(), and unlocked page 1. It
then tries to lock page 2.


At the same time, suppose fuse_fill_write_pages() is writing the same
page 1 and partial page 2: it will lock partial page 2 and wait for page
1's writeback to complete. But page 1's writeback cannot complete, since
the writeback worker is waiting to lock page 2, which has already been
locked by fuse_fill_write_pages().


Does that make sense to you? Or did I miss something else?


Re: [PATCH v2 1/2] fuse: Fix possible deadlock when writing back dirty pages

2021-04-14 Thread Baolin Wang

Hi,

On 2021/4/13 16:57, Miklos Szeredi wrote:

On Mon, Apr 12, 2021 at 3:23 PM Baolin Wang
 wrote:


Hi Miklos,

On 2021/3/27 14:36, Baolin Wang wrote:

We can meet the below deadlock scenario when writing back dirty pages and
writing files at the same time. The deadlock scenario can be reproduced
by:

- A writeback worker thread A is trying to write a bunch of dirty pages via
fuse_writepages(): fuse_writepages() will lock one page (call it page 1),
add it into the rb_tree with the writeback flag set, unlock page 1, and
then try to lock the next page (call it page 2).

- But at the same time a file write can be triggered by another process B,
writing several pages via fuse_perform_write(): fuse_perform_write() will
lock all required pages first, and then wait for the writeback of all of
them to complete via fuse_wait_on_page_writeback().

- Now process B may already hold the locks on page 1 and page 2, and wait
for page 1's writeback to complete (page 1 is under writeback, set by
process A). But process A cannot complete the writeback of page 1, since
it is still waiting to lock page 2, which has already been locked by
process B.

A deadlock occurs.

To fix this issue, we should make sure each page's writeback has completed
right after locking the page in fuse_fill_write_pages(), and then write
the pages together once they are all stable.

[1450578.772896] INFO: task kworker/u259:6:119885 blocked for more than 120 
seconds.
[1450578.796179] kworker/u259:6  D0 119885  2 0x0028
[1450578.796185] Workqueue: writeback wb_workfn (flush-0:78)
[1450578.796188] Call trace:
[1450578.798804]  __switch_to+0xd8/0x148
[1450578.802458]  __schedule+0x280/0x6a0
[1450578.806112]  schedule+0x34/0xe8
[1450578.809413]  io_schedule+0x20/0x40
[1450578.812977]  __lock_page+0x164/0x278
[1450578.816718]  write_cache_pages+0x2b0/0x4a8
[1450578.820986]  fuse_writepages+0x84/0x100 [fuse]
[1450578.825592]  do_writepages+0x58/0x108
[1450578.829412]  __writeback_single_inode+0x48/0x448
[1450578.834217]  writeback_sb_inodes+0x220/0x520
[1450578.838647]  __writeback_inodes_wb+0x50/0xe8
[1450578.843080]  wb_writeback+0x294/0x3b8
[1450578.846906]  wb_do_writeback+0x2ec/0x388
[1450578.850992]  wb_workfn+0x80/0x1e0
[1450578.854472]  process_one_work+0x1bc/0x3f0
[1450578.858645]  worker_thread+0x164/0x468
[1450578.862559]  kthread+0x108/0x138
[1450578.865960] INFO: task doio:207752 blocked for more than 120 seconds.
[1450578.888321] doioD0 207752 207740 0x
[1450578.888329] Call trace:
[1450578.890945]  __switch_to+0xd8/0x148
[1450578.894599]  __schedule+0x280/0x6a0
[1450578.898255]  schedule+0x34/0xe8
[1450578.901568]  fuse_wait_on_page_writeback+0x8c/0xc8 [fuse]
[1450578.907128]  fuse_perform_write+0x240/0x4e0 [fuse]
[1450578.912082]  fuse_file_write_iter+0x1dc/0x290 [fuse]
[1450578.917207]  do_iter_readv_writev+0x110/0x188
[1450578.921724]  do_iter_write+0x90/0x1c8
[1450578.925598]  vfs_writev+0x84/0xf8
[1450578.929071]  do_writev+0x70/0x110
[1450578.932552]  __arm64_sys_writev+0x24/0x30
[1450578.936727]  el0_svc_common.constprop.0+0x80/0x1f8
[1450578.941694]  el0_svc_handler+0x30/0x80
[1450578.945606]  el0_svc+0x10/0x14

Suggested-by: Peng Tao 
Signed-off-by: Baolin Wang 


Do you have any comments for this patch set? Thanks.


Hi,

I guess this is related:

https://lore.kernel.org/linux-fsdevel/20210209100115.gb1208...@miu.piliscsaba.redhat.com/

Can you verify that the patch at the above link fixes your issue?


Sorry, I missed this patch before. I've tested it, and it seems it can
solve the deadlock issue I met before.


But looking at this patch in detail, I think it only reduces the
deadlock window, and does not completely remove the possible deadlock
scenario I explained in the commit log.


With your patch, fuse_fill_write_pages() can still lock the partial page,
and fuse_send_write_pages() will then wait for that partial page's
writeback to complete if the writeback flag is set.


But at the same time, a writeback worker thread may be waiting to lock
that partial page in order to write a bunch of dirty pages via
fuse_writepages().


Then the deadlock issue can still occur. And I think the deadlock issue
I met is not the same as the deadlock issue solved by your patch.







Re: [PATCH v2 1/2] fuse: Fix possible deadlock when writing back dirty pages

2021-04-12 Thread Baolin Wang

Hi Miklos,

On 2021/3/27 14:36, Baolin Wang wrote:

We can meet the below deadlock scenario when writing back dirty pages and
writing files at the same time. The deadlock scenario can be reproduced
by:

- A writeback worker thread A is trying to write a bunch of dirty pages via
fuse_writepages(): fuse_writepages() will lock one page (call it page 1),
add it into the rb_tree with the writeback flag set, unlock page 1, and
then try to lock the next page (call it page 2).

- But at the same time a file write can be triggered by another process B,
writing several pages via fuse_perform_write(): fuse_perform_write() will
lock all required pages first, and then wait for the writeback of all of
them to complete via fuse_wait_on_page_writeback().

- Now process B may already hold the locks on page 1 and page 2, and wait
for page 1's writeback to complete (page 1 is under writeback, set by
process A). But process A cannot complete the writeback of page 1, since
it is still waiting to lock page 2, which has already been locked by
process B.

A deadlock occurs.

To fix this issue, we should make sure each page's writeback has completed
right after locking the page in fuse_fill_write_pages(), and then write
the pages together once they are all stable.

[1450578.772896] INFO: task kworker/u259:6:119885 blocked for more than 120 
seconds.
[1450578.796179] kworker/u259:6  D0 119885  2 0x0028
[1450578.796185] Workqueue: writeback wb_workfn (flush-0:78)
[1450578.796188] Call trace:
[1450578.798804]  __switch_to+0xd8/0x148
[1450578.802458]  __schedule+0x280/0x6a0
[1450578.806112]  schedule+0x34/0xe8
[1450578.809413]  io_schedule+0x20/0x40
[1450578.812977]  __lock_page+0x164/0x278
[1450578.816718]  write_cache_pages+0x2b0/0x4a8
[1450578.820986]  fuse_writepages+0x84/0x100 [fuse]
[1450578.825592]  do_writepages+0x58/0x108
[1450578.829412]  __writeback_single_inode+0x48/0x448
[1450578.834217]  writeback_sb_inodes+0x220/0x520
[1450578.838647]  __writeback_inodes_wb+0x50/0xe8
[1450578.843080]  wb_writeback+0x294/0x3b8
[1450578.846906]  wb_do_writeback+0x2ec/0x388
[1450578.850992]  wb_workfn+0x80/0x1e0
[1450578.854472]  process_one_work+0x1bc/0x3f0
[1450578.858645]  worker_thread+0x164/0x468
[1450578.862559]  kthread+0x108/0x138
[1450578.865960] INFO: task doio:207752 blocked for more than 120 seconds.
[1450578.888321] doioD0 207752 207740 0x
[1450578.888329] Call trace:
[1450578.890945]  __switch_to+0xd8/0x148
[1450578.894599]  __schedule+0x280/0x6a0
[1450578.898255]  schedule+0x34/0xe8
[1450578.901568]  fuse_wait_on_page_writeback+0x8c/0xc8 [fuse]
[1450578.907128]  fuse_perform_write+0x240/0x4e0 [fuse]
[1450578.912082]  fuse_file_write_iter+0x1dc/0x290 [fuse]
[1450578.917207]  do_iter_readv_writev+0x110/0x188
[1450578.921724]  do_iter_write+0x90/0x1c8
[1450578.925598]  vfs_writev+0x84/0xf8
[1450578.929071]  do_writev+0x70/0x110
[1450578.932552]  __arm64_sys_writev+0x24/0x30
[1450578.936727]  el0_svc_common.constprop.0+0x80/0x1f8
[1450578.941694]  el0_svc_handler+0x30/0x80
[1450578.945606]  el0_svc+0x10/0x14

Suggested-by: Peng Tao 
Signed-off-by: Baolin Wang 


Do you have any comments for this patch set? Thanks.


---
Changes from v1:
  - Use fuse_wait_on_page_writeback() instead to wait for page stable.
---
  fs/fuse/file.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb..9a30093 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1101,9 +1101,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args 
*ia,
unsigned int offset, i;
int err;
  
-	for (i = 0; i < ap->num_pages; i++)

-   fuse_wait_on_page_writeback(inode, ap->pages[i]->index);
-
fuse_write_args_fill(ia, ff, pos, count);
ia->write.in.flags = fuse_write_flags(iocb);
if (fm->fc->handle_killpriv_v2 && !capable(CAP_FSETID))
@@ -1140,6 +1137,7 @@ static ssize_t fuse_fill_write_pages(struct 
fuse_args_pages *ap,
 unsigned int max_pages)
  {
struct fuse_conn *fc = get_fuse_conn(mapping->host);
+   struct inode *inode = mapping->host;
unsigned offset = pos & (PAGE_SIZE - 1);
size_t count = 0;
int err;
@@ -1166,6 +1164,8 @@ static ssize_t fuse_fill_write_pages(struct 
fuse_args_pages *ap,
if (!page)
break;
  
+		fuse_wait_on_page_writeback(inode, page->index);

+
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);
  



[PATCH v2 2/2] fuse: Remove unused parameter

2021-03-27 Thread Baolin Wang
Since we moved the fuse_wait_on_page_writeback() call into fuse_fill_write_pages(),
remove the now-unused 'inode' parameter of fuse_send_write_pages().

Signed-off-by: Baolin Wang 
---
Changes from v1:
 - New patch.
---
 fs/fuse/file.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 9a30093..40e2902 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1091,7 +1091,7 @@ bool fuse_write_update_size(struct inode *inode, loff_t 
pos)
 }
 
 static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
-struct kiocb *iocb, struct inode *inode,
+struct kiocb *iocb,
 loff_t pos, size_t count)
 {
struct fuse_args_pages *ap = &ia->ap;
@@ -1238,8 +1238,7 @@ static ssize_t fuse_perform_write(struct kiocb *iocb,
if (count <= 0) {
err = count;
} else {
-   err = fuse_send_write_pages(&ia, iocb, inode,
-   pos, count);
+   err = fuse_send_write_pages(&ia, iocb, pos, count);
if (!err) {
size_t num_written = ia.write.out.size;
 
-- 
1.8.3.1



[PATCH v2 1/2] fuse: Fix possible deadlock when writing back dirty pages

2021-03-27 Thread Baolin Wang
We can meet the below deadlock scenario when writing back dirty pages and
writing files at the same time. The deadlock scenario can be reproduced
by:

- A writeback worker thread A is trying to write a bunch of dirty pages via
fuse_writepages(): fuse_writepages() will lock one page (call it page 1),
add it into the rb_tree with the writeback flag set, unlock page 1, and
then try to lock the next page (call it page 2).

- But at the same time a file write can be triggered by another process B,
writing several pages via fuse_perform_write(): fuse_perform_write() will
lock all required pages first, and then wait for the writeback of all of
them to complete via fuse_wait_on_page_writeback().

- Now process B may already hold the locks on page 1 and page 2, and wait
for page 1's writeback to complete (page 1 is under writeback, set by
process A). But process A cannot complete the writeback of page 1, since
it is still waiting to lock page 2, which has already been locked by
process B.

A deadlock occurs.
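For illustration, the circular wait can be modelled in userspace with
POSIX threads.  This is only a sketch of the ordering (page locks become a
mutex, the writeback state becomes a flag plus condition variable), not
fuse code:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t page2_lock = PTHREAD_MUTEX_INITIALIZER; /* lock_page(page 2) */
static pthread_mutex_t wb_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wb_done = PTHREAD_COND_INITIALIZER;
static int page1_under_writeback;                              /* PageWriteback(page 1) */

/* Thread A: the writeback worker (fuse_writepages() path). */
static void *writeback_worker(void *arg)
{
	/* Page 1 was locked, queued and marked under writeback, then unlocked. */
	pthread_mutex_lock(&wb_lock);
	page1_under_writeback = 1;
	pthread_mutex_unlock(&wb_lock);

	sleep(1);                        /* give B time to grab the page 2 lock */

	pthread_mutex_lock(&page2_lock); /* blocks forever: B holds page 2 */

	/* Never reached: this is where page 1's writeback would complete. */
	pthread_mutex_lock(&wb_lock);
	page1_under_writeback = 0;
	pthread_cond_broadcast(&wb_done);
	pthread_mutex_unlock(&wb_lock);
	pthread_mutex_unlock(&page2_lock);
	return NULL;
}

/* Thread B: the buffered writer (fuse_perform_write() path). */
static void *buffered_writer(void *arg)
{
	pthread_mutex_lock(&page2_lock); /* locks all required pages up front */

	/* ...then waits for page 1's writeback, which A can never finish. */
	pthread_mutex_lock(&wb_lock);
	while (page1_under_writeback)
		pthread_cond_wait(&wb_done, &wb_lock);
	pthread_mutex_unlock(&wb_lock);

	pthread_mutex_unlock(&page2_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, writeback_worker, NULL);
	usleep(100 * 1000);              /* let A mark page 1 under writeback */
	pthread_create(&b, NULL, buffered_writer, NULL);

	sleep(3);
	printf("both threads are still blocked: circular wait reproduced\n");
	return 0;
}

Built with -pthread, the program leaves both threads blocked on each
other, mirroring the hung tasks in the traces below.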

To fix this issue, we should make sure each page's writeback has completed
right after locking the page in fuse_fill_write_pages(), and then write
the pages together once they are all stable.

[1450578.772896] INFO: task kworker/u259:6:119885 blocked for more than 120 
seconds.
[1450578.796179] kworker/u259:6  D0 119885  2 0x0028
[1450578.796185] Workqueue: writeback wb_workfn (flush-0:78)
[1450578.796188] Call trace:
[1450578.798804]  __switch_to+0xd8/0x148
[1450578.802458]  __schedule+0x280/0x6a0
[1450578.806112]  schedule+0x34/0xe8
[1450578.809413]  io_schedule+0x20/0x40
[1450578.812977]  __lock_page+0x164/0x278
[1450578.816718]  write_cache_pages+0x2b0/0x4a8
[1450578.820986]  fuse_writepages+0x84/0x100 [fuse]
[1450578.825592]  do_writepages+0x58/0x108
[1450578.829412]  __writeback_single_inode+0x48/0x448
[1450578.834217]  writeback_sb_inodes+0x220/0x520
[1450578.838647]  __writeback_inodes_wb+0x50/0xe8
[1450578.843080]  wb_writeback+0x294/0x3b8
[1450578.846906]  wb_do_writeback+0x2ec/0x388
[1450578.850992]  wb_workfn+0x80/0x1e0
[1450578.854472]  process_one_work+0x1bc/0x3f0
[1450578.858645]  worker_thread+0x164/0x468
[1450578.862559]  kthread+0x108/0x138
[1450578.865960] INFO: task doio:207752 blocked for more than 120 seconds.
[1450578.888321] doioD0 207752 207740 0x
[1450578.888329] Call trace:
[1450578.890945]  __switch_to+0xd8/0x148
[1450578.894599]  __schedule+0x280/0x6a0
[1450578.898255]  schedule+0x34/0xe8
[1450578.901568]  fuse_wait_on_page_writeback+0x8c/0xc8 [fuse]
[1450578.907128]  fuse_perform_write+0x240/0x4e0 [fuse]
[1450578.912082]  fuse_file_write_iter+0x1dc/0x290 [fuse]
[1450578.917207]  do_iter_readv_writev+0x110/0x188
[1450578.921724]  do_iter_write+0x90/0x1c8
[1450578.925598]  vfs_writev+0x84/0xf8
[1450578.929071]  do_writev+0x70/0x110
[1450578.932552]  __arm64_sys_writev+0x24/0x30
[1450578.936727]  el0_svc_common.constprop.0+0x80/0x1f8
[1450578.941694]  el0_svc_handler+0x30/0x80
[1450578.945606]  el0_svc+0x10/0x14

Suggested-by: Peng Tao 
Signed-off-by: Baolin Wang 
---
Changes from v1:
 - Use fuse_wait_on_page_writeback() instead to wait for page stable.
---
 fs/fuse/file.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb..9a30093 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1101,9 +1101,6 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args 
*ia,
unsigned int offset, i;
int err;
 
-   for (i = 0; i < ap->num_pages; i++)
-   fuse_wait_on_page_writeback(inode, ap->pages[i]->index);
-
fuse_write_args_fill(ia, ff, pos, count);
ia->write.in.flags = fuse_write_flags(iocb);
if (fm->fc->handle_killpriv_v2 && !capable(CAP_FSETID))
@@ -1140,6 +1137,7 @@ static ssize_t fuse_fill_write_pages(struct 
fuse_args_pages *ap,
 unsigned int max_pages)
 {
struct fuse_conn *fc = get_fuse_conn(mapping->host);
+   struct inode *inode = mapping->host;
unsigned offset = pos & (PAGE_SIZE - 1);
size_t count = 0;
int err;
@@ -1166,6 +1164,8 @@ static ssize_t fuse_fill_write_pages(struct 
fuse_args_pages *ap,
if (!page)
break;
 
+   fuse_wait_on_page_writeback(inode, page->index);
+
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);
 
-- 
1.8.3.1



Re: [PATCH] fuse: Fix possible deadlock when writing back dirty pages

2021-03-26 Thread Baolin Wang

Hi,


We can meet the below deadlock scenario when writing back dirty pages and
writing files at the same time. The deadlock scenario can be reproduced
by:

- A writeback worker thread A is trying to write a bunch of dirty pages via
fuse_writepages(): fuse_writepages() will lock one page (call it page 1),
add it into the rb_tree with the writeback flag set, unlock page 1, and
then try to lock the next page (call it page 2).

- But at the same time a file write can be triggered by another process B,
writing several pages via fuse_perform_write(): fuse_perform_write() will
lock all required pages first, and then wait for the writeback of all of
them to complete via fuse_wait_on_page_writeback().

- Now process B may already hold the locks on page 1 and page 2, and wait
for page 1's writeback to complete (page 1 is under writeback, set by
process A). But process A cannot complete the writeback of page 1, since
it is still waiting to lock page 2, which has already been locked by
process B.

A deadlock occurs.

To fix this issue, we should make sure each page's writeback has completed
after locking the page in fuse_fill_write_pages(), and then write the
pages together once they are all stable.

[1450578.772896] INFO: task kworker/u259:6:119885 blocked for more than 120 
seconds.
[1450578.796179] kworker/u259:6  D0 119885  2 0x0028
[1450578.796185] Workqueue: writeback wb_workfn (flush-0:78)
[1450578.796188] Call trace:
[1450578.798804]  __switch_to+0xd8/0x148
[1450578.802458]  __schedule+0x280/0x6a0
[1450578.806112]  schedule+0x34/0xe8
[1450578.809413]  io_schedule+0x20/0x40
[1450578.812977]  __lock_page+0x164/0x278
[1450578.816718]  write_cache_pages+0x2b0/0x4a8
[1450578.820986]  fuse_writepages+0x84/0x100 [fuse]
[1450578.825592]  do_writepages+0x58/0x108
[1450578.829412]  __writeback_single_inode+0x48/0x448
[1450578.834217]  writeback_sb_inodes+0x220/0x520
[1450578.838647]  __writeback_inodes_wb+0x50/0xe8
[1450578.843080]  wb_writeback+0x294/0x3b8
[1450578.846906]  wb_do_writeback+0x2ec/0x388
[1450578.850992]  wb_workfn+0x80/0x1e0
[1450578.854472]  process_one_work+0x1bc/0x3f0
[1450578.858645]  worker_thread+0x164/0x468
[1450578.862559]  kthread+0x108/0x138
[1450578.865960] INFO: task doio:207752 blocked for more than 120 seconds.
[1450578.888321] doioD0 207752 207740 0x
[1450578.888329] Call trace:
[1450578.890945]  __switch_to+0xd8/0x148
[1450578.894599]  __schedule+0x280/0x6a0
[1450578.898255]  schedule+0x34/0xe8
[1450578.901568]  fuse_wait_on_page_writeback+0x8c/0xc8 [fuse]
[1450578.907128]  fuse_perform_write+0x240/0x4e0 [fuse]
[1450578.912082]  fuse_file_write_iter+0x1dc/0x290 [fuse]
[1450578.917207]  do_iter_readv_writev+0x110/0x188
[1450578.921724]  do_iter_write+0x90/0x1c8
[1450578.925598]  vfs_writev+0x84/0xf8
[1450578.929071]  do_writev+0x70/0x110
[1450578.932552]  __arm64_sys_writev+0x24/0x30
[1450578.936727]  el0_svc_common.constprop.0+0x80/0x1f8
[1450578.941694]  el0_svc_handler+0x30/0x80
[1450578.945606]  el0_svc+0x10/0x14

Suggested-by: Peng Tao 
Signed-off-by: Baolin Wang 
---
  fs/fuse/file.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb..af082b6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1166,6 +1166,8 @@ static ssize_t fuse_fill_write_pages(struct 
fuse_args_pages *ap,
if (!page)
break;
  
+		wait_on_page_writeback(page);


After talking with Tao, I should use fuse_wait_on_page_writeback() instead
to wait for each page to become stable in fuse, and will also remove the
fuse_wait_on_page_writeback() calls in fuse_send_write_pages().

I will send a V2, please ignore this patch. Thanks.


[PATCH] fuse: Fix possible deadlock when writing back dirty pages

2021-03-26 Thread Baolin Wang
We can meet the below deadlock scenario when writing back dirty pages and
writing files at the same time. The deadlock scenario can be reproduced
by:

- A writeback worker thread A is trying to write a bunch of dirty pages via
fuse_writepages(): fuse_writepages() will lock one page (call it page 1),
add it into the rb_tree with the writeback flag set, unlock page 1, and
then try to lock the next page (call it page 2).

- But at the same time a file write can be triggered by another process B,
writing several pages via fuse_perform_write(): fuse_perform_write() will
lock all required pages first, and then wait for the writeback of all of
them to complete via fuse_wait_on_page_writeback().

- Now process B may already hold the locks on page 1 and page 2, and wait
for page 1's writeback to complete (page 1 is under writeback, set by
process A). But process A cannot complete the writeback of page 1, since
it is still waiting to lock page 2, which has already been locked by
process B.

A deadlock occurs.

To fix this issue, we should make sure each page's writeback has completed
after locking the page in fuse_fill_write_pages(), and then write the
pages together once they are all stable.

[1450578.772896] INFO: task kworker/u259:6:119885 blocked for more than 120 
seconds.
[1450578.796179] kworker/u259:6  D0 119885  2 0x0028
[1450578.796185] Workqueue: writeback wb_workfn (flush-0:78)
[1450578.796188] Call trace:
[1450578.798804]  __switch_to+0xd8/0x148
[1450578.802458]  __schedule+0x280/0x6a0
[1450578.806112]  schedule+0x34/0xe8
[1450578.809413]  io_schedule+0x20/0x40
[1450578.812977]  __lock_page+0x164/0x278
[1450578.816718]  write_cache_pages+0x2b0/0x4a8
[1450578.820986]  fuse_writepages+0x84/0x100 [fuse]
[1450578.825592]  do_writepages+0x58/0x108
[1450578.829412]  __writeback_single_inode+0x48/0x448
[1450578.834217]  writeback_sb_inodes+0x220/0x520
[1450578.838647]  __writeback_inodes_wb+0x50/0xe8
[1450578.843080]  wb_writeback+0x294/0x3b8
[1450578.846906]  wb_do_writeback+0x2ec/0x388
[1450578.850992]  wb_workfn+0x80/0x1e0
[1450578.854472]  process_one_work+0x1bc/0x3f0
[1450578.858645]  worker_thread+0x164/0x468
[1450578.862559]  kthread+0x108/0x138
[1450578.865960] INFO: task doio:207752 blocked for more than 120 seconds.
[1450578.888321] doioD0 207752 207740 0x
[1450578.888329] Call trace:
[1450578.890945]  __switch_to+0xd8/0x148
[1450578.894599]  __schedule+0x280/0x6a0
[1450578.898255]  schedule+0x34/0xe8
[1450578.901568]  fuse_wait_on_page_writeback+0x8c/0xc8 [fuse]
[1450578.907128]  fuse_perform_write+0x240/0x4e0 [fuse]
[1450578.912082]  fuse_file_write_iter+0x1dc/0x290 [fuse]
[1450578.917207]  do_iter_readv_writev+0x110/0x188
[1450578.921724]  do_iter_write+0x90/0x1c8
[1450578.925598]  vfs_writev+0x84/0xf8
[1450578.929071]  do_writev+0x70/0x110
[1450578.932552]  __arm64_sys_writev+0x24/0x30
[1450578.936727]  el0_svc_common.constprop.0+0x80/0x1f8
[1450578.941694]  el0_svc_handler+0x30/0x80
[1450578.945606]  el0_svc+0x10/0x14

Suggested-by: Peng Tao 
Signed-off-by: Baolin Wang 
---
 fs/fuse/file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb..af082b6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1166,6 +1166,8 @@ static ssize_t fuse_fill_write_pages(struct 
fuse_args_pages *ap,
if (!page)
break;
 
+   wait_on_page_writeback(page);
+
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);
 
-- 
1.8.3.1



[PATCH] mm: cma: Use pr_err_ratelimited for CMA warning

2021-03-09 Thread Baolin Wang
If we did not reserve extra CMA memory, the log buffer can easily be
filled up by CMA allocation failure warnings when devices call
dmam_alloc_coherent() to allocate DMA memory. Thus use
pr_err_ratelimited() instead to reduce the duplicate CMA warnings.

Signed-off-by: Baolin Wang 
---
 mm/cma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 54eee21..d101bdb 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -500,8 +500,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
}
 
if (ret && !no_warn) {
-   pr_err("%s: %s: alloc failed, req-size: %zu pages, ret: %d\n",
-  __func__, cma->name, count, ret);
+   pr_err_ratelimited("%s: %s: alloc failed, req-size: %zu pages, ret: %d\n",
+  __func__, cma->name, count, ret);
cma_debug_show_areas(cma);
}
 
-- 
1.8.3.1



Re: [PATCH v2 3/3] mailbox: sprd: Add supplementary inbox support

2021-03-07 Thread Baolin Wang
On Sat, Feb 13, 2021 at 8:23 PM Orson Zhai  wrote:
>
> From: Orson Zhai 
>
> Some sensors connected to the Unisoc mailbox send data very frequently.
> This makes channel 0 very busy, so the messages from other remote cores
> cannot be handled as soon as possible.
>
> It's an (un-documented) trick from Unisoc ASIC designers to resolve this
> special requirement: an inbox previously assigned to one of the remote
> cores was modified to be exposed to the host CPU core.
>
> Then, from the host side, a supplementary inbox is added for transferring
> mass but non-urgent messages from the remote cores, such as the step
> counting sensor, with an independent FIFO and interrupt that are the same
> as channel 0's. Meanwhile, the inbox part of this channel is still kept
> for the original remote core to use.
>
> Signed-off-by: Orson Zhai 

Reviewed-by: Baolin Wang 

> ---
>  drivers/mailbox/sprd-mailbox.c | 88 +++---
>  1 file changed, 71 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/mailbox/sprd-mailbox.c b/drivers/mailbox/sprd-mailbox.c
> index 920de7b9dce1..7abd6c6d655d 100644
> --- a/drivers/mailbox/sprd-mailbox.c
> +++ b/drivers/mailbox/sprd-mailbox.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>
> @@ -50,13 +51,17 @@
>  #define SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ BIT(0)
>  #define SPRD_OUTBOX_FIFO_IRQ_MASK  GENMASK(4, 0)
>
> +#define SPRD_OUTBOX_BASE_SPAN  0x1000
>  #define SPRD_MBOX_CHAN_MAX 8
> +#define SPRD_SUPP_INBOX_ID_SC9863A 7
>
>  struct sprd_mbox_priv {
> struct mbox_controller  mbox;
> struct device   *dev;
> void __iomem*inbox_base;
> void __iomem*outbox_base;
> +   /*  Base register address for supplementary outbox */
> +   void __iomem*supp_base;
> struct clk  *clk;
> u32 outbox_fifo_depth;
>
> @@ -96,14 +101,13 @@ static u32 sprd_mbox_get_fifo_len(struct sprd_mbox_priv 
> *priv, u32 fifo_sts)
> return fifo_len;
>  }
>
> -static irqreturn_t sprd_mbox_outbox_isr(int irq, void *data)
> +static irqreturn_t do_outbox_isr(void __iomem *base, struct sprd_mbox_priv 
> *priv)
>  {
> -   struct sprd_mbox_priv *priv = data;
> struct mbox_chan *chan;
> u32 fifo_sts, fifo_len, msg[2];
> int i, id;
>
> -   fifo_sts = readl(priv->outbox_base + SPRD_MBOX_FIFO_STS);
> +   fifo_sts = readl(base + SPRD_MBOX_FIFO_STS);
>
> fifo_len = sprd_mbox_get_fifo_len(priv, fifo_sts);
> if (!fifo_len) {
> @@ -112,9 +116,9 @@ static irqreturn_t sprd_mbox_outbox_isr(int irq, void 
> *data)
> }
>
> for (i = 0; i < fifo_len; i++) {
> -   msg[0] = readl(priv->outbox_base + SPRD_MBOX_MSG_LOW);
> -   msg[1] = readl(priv->outbox_base + SPRD_MBOX_MSG_HIGH);
> -   id = readl(priv->outbox_base + SPRD_MBOX_ID);
> +   msg[0] = readl(base + SPRD_MBOX_MSG_LOW);
> +   msg[1] = readl(base + SPRD_MBOX_MSG_HIGH);
> +   id = readl(base + SPRD_MBOX_ID);
>
> chan = &priv->chan[id];
> if (chan->cl)
> @@ -124,15 +128,29 @@ static irqreturn_t sprd_mbox_outbox_isr(int irq, void 
> *data)
> "message's been dropped at ch[%d]\n", id);
>
> /* Trigger to update outbox FIFO pointer */
> -   writel(0x1, priv->outbox_base + SPRD_MBOX_TRIGGER);
> +   writel(0x1, base + SPRD_MBOX_TRIGGER);
> }
>
> /* Clear irq status after reading all message. */
> -   writel(SPRD_MBOX_IRQ_CLR, priv->outbox_base + SPRD_MBOX_IRQ_STS);
> +   writel(SPRD_MBOX_IRQ_CLR, base + SPRD_MBOX_IRQ_STS);
>
> return IRQ_HANDLED;
>  }
>
> +static irqreturn_t sprd_mbox_outbox_isr(int irq, void *data)
> +{
> +   struct sprd_mbox_priv *priv = data;
> +
> +   return do_outbox_isr(priv->outbox_base, priv);
> +}
> +
> +static irqreturn_t sprd_mbox_supp_isr(int irq, void *data)
> +{
> +   struct sprd_mbox_priv *priv = data;
> +
> +   return do_outbox_isr(priv->supp_base, priv);
> +}
> +
>  static irqreturn_t sprd_mbox_inbox_isr(int irq, void *data)
>  {
> struct sprd_mbox_priv *priv = data;
> @@ -235,6 +253,14 @@ static int sprd_mbox_startup(struct mbox_chan *chan)
> val = readl(priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> val &= ~SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ;
> writel(val, priv-

Re: [PATCH v2 1/3] mailbox: sprd: Introduce refcnt when clients requests/free channels

2021-03-07 Thread Baolin Wang
On Sat, Feb 13, 2021 at 8:22 PM Orson Zhai  wrote:
>
> From: Orson Zhai 
>
> The Unisoc mailbox has no way to enable/disable any single channel;
> the channels can only be started up or shut down as a whole device at
> the same time.
>
> Add a variable to count references, to avoid the mailbox FIFO being reset
> unexpectedly while clients are requesting or freeing channels.
>
> Also add a lock to avoid possible conflicts from register reads/writes in
> different startup or shutdown threads, and fix the crash problem when
> early interrupts come from a channel which has not been requested by a
> client yet.
>
> Fixes: ca27fc26cd22 ("mailbox: sprd: Add Spreadtrum mailbox driver")
> Signed-off-by: Orson Zhai 

Sorry for the late reply. LGTM.
Reviewed-by: Baolin Wang 

> ---
>  drivers/mailbox/sprd-mailbox.c | 43 +++---
>  1 file changed, 29 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/mailbox/sprd-mailbox.c b/drivers/mailbox/sprd-mailbox.c
> index f6fab24ae8a9..920de7b9dce1 100644
> --- a/drivers/mailbox/sprd-mailbox.c
> +++ b/drivers/mailbox/sprd-mailbox.c
> @@ -60,6 +60,8 @@ struct sprd_mbox_priv {
> struct clk  *clk;
> u32 outbox_fifo_depth;
>
> +   struct mutexlock;
> +   u32 refcnt;
> struct mbox_chanchan[SPRD_MBOX_CHAN_MAX];
>  };
>
> @@ -115,7 +117,11 @@ static irqreturn_t sprd_mbox_outbox_isr(int irq, void 
> *data)
> id = readl(priv->outbox_base + SPRD_MBOX_ID);
>
> chan = &priv->chan[id];
> -   mbox_chan_received_data(chan, (void *)msg);
> +   if (chan->cl)
> +   mbox_chan_received_data(chan, (void *)msg);
> +   else
> +   dev_warn_ratelimited(priv->dev,
> +   "message's been dropped at ch[%d]\n", id);
>
> /* Trigger to update outbox FIFO pointer */
> writel(0x1, priv->outbox_base + SPRD_MBOX_TRIGGER);
> @@ -215,18 +221,22 @@ static int sprd_mbox_startup(struct mbox_chan *chan)
> struct sprd_mbox_priv *priv = to_sprd_mbox_priv(chan->mbox);
> u32 val;
>
> -   /* Select outbox FIFO mode and reset the outbox FIFO status */
> -   writel(0x0, priv->outbox_base + SPRD_MBOX_FIFO_RST);
> +   mutex_lock(&priv->lock);
> +   if (priv->refcnt++ == 0) {
> +   /* Select outbox FIFO mode and reset the outbox FIFO status */
> +   writel(0x0, priv->outbox_base + SPRD_MBOX_FIFO_RST);
>
> -   /* Enable inbox FIFO overflow and delivery interrupt */
> -   val = readl(priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> -   val &= ~(SPRD_INBOX_FIFO_OVERFLOW_IRQ | SPRD_INBOX_FIFO_DELIVER_IRQ);
> -   writel(val, priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> +   /* Enable inbox FIFO overflow and delivery interrupt */
> +   val = readl(priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> +   val &= ~(SPRD_INBOX_FIFO_OVERFLOW_IRQ | 
> SPRD_INBOX_FIFO_DELIVER_IRQ);
> +   writel(val, priv->inbox_base + SPRD_MBOX_IRQ_MSK);
>
> -   /* Enable outbox FIFO not empty interrupt */
> -   val = readl(priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> -   val &= ~SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ;
> -   writel(val, priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> +   /* Enable outbox FIFO not empty interrupt */
> +   val = readl(priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> +   val &= ~SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ;
> +   writel(val, priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> +   }
> +   mutex_unlock(&priv->lock);
>
> return 0;
>  }
> @@ -235,9 +245,13 @@ static void sprd_mbox_shutdown(struct mbox_chan *chan)
>  {
> struct sprd_mbox_priv *priv = to_sprd_mbox_priv(chan->mbox);
>
> -   /* Disable inbox & outbox interrupt */
> -   writel(SPRD_INBOX_FIFO_IRQ_MASK, priv->inbox_base + 
> SPRD_MBOX_IRQ_MSK);
> -   writel(SPRD_OUTBOX_FIFO_IRQ_MASK, priv->outbox_base + 
> SPRD_MBOX_IRQ_MSK);
> +   mutex_lock(&priv->lock);
> +   if (--priv->refcnt == 0) {
> +   /* Disable inbox & outbox interrupt */
> +   writel(SPRD_INBOX_FIFO_IRQ_MASK, priv->inbox_base + 
> SPRD_MBOX_IRQ_MSK);
> +   writel(SPRD_OUTBOX_FIFO_IRQ_MASK, priv->outbox_base + 
> SPRD_MBOX_IRQ_MSK);
> +   }
> +   mutex_unlock(&priv->lock);
>  }
>
>  static const struct mbox_chan_ops sprd_mbox_ops = {
> @@ -266,6 +280,7 @@ static int sprd_mbox_probe(struct platform_device *pdev)
> return -ENOMEM;
>
> priv->dev = dev;
> +   mutex_init(&priv->lock);
>
> /*
>  * The Spreadtrum mailbox uses an inbox to send messages to the target
> --
> 2.17.1
>


-- 
Baolin Wang


Re: [PATCH 2/3] mailbox: sprd: Add supplementary inbox support

2021-02-09 Thread Baolin Wang
On Tue, Feb 9, 2021 at 12:09 PM Orson Zhai  wrote:
>
> On Mon, Feb 08, 2021 at 10:27:47PM +0800, Baolin Wang wrote:
> > On Mon, Feb 8, 2021 at 7:52 PM Orson Zhai  wrote:
> > >
> > > From: Orson Zhai 
> > >
> > > Some sensors connected to Unisoc mailbox will send data very frequently.
> > > This makes channel 0 very busy and the messages from other remote cores
> > > not able to be handled as soon as possible.
> > >
> > > Then a supplementary inbox is added to the host core side for transferring
> > > mass but not emergency messages from the remote cores, such as step
> > > counting sensor, with an independent FIFO and interrupt.
> >
> > So this is another part of the mailbox hardware, containing a batch of
> > hardware channels?
>
> No. Actually it is an inbox that was assigned to one of the remote cores
> before, but is being exposed to the host CPU core now.
>
> > I did not see it before, its function is similar
> > with inbox/outbox?
>
> Exactly the same as any other channel.
> But only the outbox part is exposed to the host side. The inbox part of
> this channel is still kept for the original remote core to use.
>
> It's a trick (un-documented) from our ASIC designers to resolve some special 
> requirements.
> I was also shocked when hearing it :)

Understood :)

>
> I guess other vendors will add another mailbox module to resolve this,
> but our guys might consider the hardware cost...

OK. Thanks for your explanation. It would be helpful if you could add
this background into the comments, in case someone else gets confused
again by the new supplementary inbox :)

> > >
> > > Signed-off-by: Orson Zhai 
> > > ---
> > >  drivers/mailbox/sprd-mailbox.c | 93 
> > > ++
> > >  1 file changed, 75 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/drivers/mailbox/sprd-mailbox.c 
> > > b/drivers/mailbox/sprd-mailbox.c
> > > index e606f52..74648db 100644
> > > --- a/drivers/mailbox/sprd-mailbox.c
> > > +++ b/drivers/mailbox/sprd-mailbox.c
> > > @@ -11,6 +11,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >
> > > @@ -50,13 +51,17 @@
> > >  #define SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ BIT(0)
> > >  #define SPRD_OUTBOX_FIFO_IRQ_MASK  GENMASK(4, 0)
> > >
> > > +#define SPRD_OUTBOX_BASE_SPAN  0x1000
> > >  #define SPRD_MBOX_CHAN_MAX 8
> > > +#define SPRD_SUPP_INBOX_ID_SC9860  6
> > >
> > >  struct sprd_mbox_priv {
> > > struct mbox_controller  mbox;
> > > struct device   *dev;
> > > void __iomem*inbox_base;
> > > void __iomem*outbox_base;
> > > +   /*  Base register address for supplementary outbox */
> > > +   void __iomem*supp_base;
> > > struct clk  *clk;
> > > u32 outbox_fifo_depth;
> > >
> > > @@ -96,14 +101,13 @@ static u32 sprd_mbox_get_fifo_len(struct 
> > > sprd_mbox_priv *priv, u32 fifo_sts)
> > > return fifo_len;
> > >  }
> > >
> > > -static irqreturn_t sprd_mbox_outbox_isr(int irq, void *data)
> > > +static inline irqreturn_t do_outbox_isr(void __iomem *base, struct 
> > > sprd_mbox_priv *priv)
> >
> > No need to add an explicit 'inline' tag; the compiler can do smarter
> > things than us.
>
> I thought it would help to increase the performance of ISR execution.
>
> Will fix at next.
>
> >
> > >  {
> > > -   struct sprd_mbox_priv *priv = data;
> > > struct mbox_chan *chan;
> > > u32 fifo_sts, fifo_len, msg[2];
> > > int i, id;
> > >
> > > -   fifo_sts = readl(priv->outbox_base + SPRD_MBOX_FIFO_STS);
> > > +   fifo_sts = readl(base + SPRD_MBOX_FIFO_STS);
> > >
> > > fifo_len = sprd_mbox_get_fifo_len(priv, fifo_sts);
> > > if (!fifo_len) {
> > > @@ -112,23 +116,41 @@ static irqreturn_t sprd_mbox_outbox_isr(int irq, 
> > > void *data)
> > > }
> > >
> > > for (i = 0; i < fifo_len; i++) {
> > > -   msg[0] = readl(priv->outbox_base + SPRD_MBOX_MSG_LOW);
> > > -   msg[1] = readl(priv->outbox_base + SPRD_MBOX

Re: [PATCH 1/3] mailbox: sprd: Introduce refcnt when clients requests/free channels

2021-02-09 Thread Baolin Wang
On Tue, Feb 9, 2021 at 11:28 AM Orson Zhai  wrote:
>
> On Mon, Feb 08, 2021 at 10:06:47PM +0800, Baolin Wang wrote:
> > Hi Orson,
> >
> > On Mon, Feb 8, 2021 at 7:52 PM Orson Zhai  wrote:
> > >
> > > From: Orson Zhai 
> > >
> > > Unisoc mailbox has no way to be enabled/disabled for any single channel.
> > > They can only be set to startup or shutdown as a whole device at same 
> > > time.
> > >
> > > Add a variable to count references to avoid mailbox FIFO being reset
> > > unexpectedly when clients are requesting or freeing channels.
> > >
> > > Also add a lock to dismiss possible conflicts from register r/w in
> > > different startup or shutdown threads.
> > >
> > > Fixes: ca27fc26cd22 ("mailbox: sprd: Add Spreadtrum mailbox driver")
> > > Signed-off-by: Orson Zhai 
> > > ---
> > >  drivers/mailbox/sprd-mailbox.c | 38 
> > > +-
> > >  1 file changed, 25 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/drivers/mailbox/sprd-mailbox.c 
> > > b/drivers/mailbox/sprd-mailbox.c
> > > index f6fab24..e606f52 100644
> > > --- a/drivers/mailbox/sprd-mailbox.c
> > > +++ b/drivers/mailbox/sprd-mailbox.c
> > > @@ -60,6 +60,8 @@ struct sprd_mbox_priv {
> > > struct clk  *clk;
> > > u32 outbox_fifo_depth;
> > >
> > > +   struct mutexlock;
> > > +   u32 refcnt;
> > > struct mbox_chanchan[SPRD_MBOX_CHAN_MAX];
> > >  };
> > >
> > > @@ -215,18 +217,22 @@ static int sprd_mbox_startup(struct mbox_chan *chan)
> > > struct sprd_mbox_priv *priv = to_sprd_mbox_priv(chan->mbox);
> > > u32 val;
> > >
> > > -   /* Select outbox FIFO mode and reset the outbox FIFO status */
> > > -   writel(0x0, priv->outbox_base + SPRD_MBOX_FIFO_RST);
> > > +   mutex_lock(&priv->lock);
> > > +   if (priv->refcnt++ == 0) {
> > > +   /* Select outbox FIFO mode and reset the outbox FIFO 
> > > status */
> > > +   writel(0x0, priv->outbox_base + SPRD_MBOX_FIFO_RST);
> > >
> > > -   /* Enable inbox FIFO overflow and delivery interrupt */
> > > -   val = readl(priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> > > -   val &= ~(SPRD_INBOX_FIFO_OVERFLOW_IRQ | 
> > > SPRD_INBOX_FIFO_DELIVER_IRQ);
> > > -   writel(val, priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> > > +   /* Enable inbox FIFO overflow and delivery interrupt */
> > > +   val = readl(priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> > > +   val &= ~(SPRD_INBOX_FIFO_OVERFLOW_IRQ | 
> > > SPRD_INBOX_FIFO_DELIVER_IRQ);
> > > +   writel(val, priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> > >
> > > -   /* Enable outbox FIFO not empty interrupt */
> > > -   val = readl(priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> > > -   val &= ~SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ;
> > > -   writel(val, priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> > > +   /* Enable outbox FIFO not empty interrupt */
> > > +   val = readl(priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> > > +   val &= ~SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ;
> > > +   writel(val, priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> > > +   }
> > > +   mutex_unlock(&priv->lock);
> >
> > I think using the atomic_add/sub_and_test() related APIs can remove
> > the mutex lock.
>
> Yes, atomic could make refcnt itself safe. But mutex lock is to make whole 
> processing of
> reading/writing registers safe.
>
> Consider case like this:
>
>   channel #1 channel #2
> -
>startup
>.
>shutdown   startup
>  |-refcnt==0|
>  |  |-retcnt+1
>  |  |-enable mailbox
>  |-disable mailbox
>
> Mailbox will be wrongly disabled after client requests channel #2's startup.

Yeah, you are right. Sorry for the noise.
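To make the window concrete, here is a small compilable sketch of the racy
variant (C11 atomics, with the register writes replaced by printf() stubs;
it illustrates the pattern rather than the actual driver code):

#include <stdatomic.h>
#include <stdio.h>

static atomic_uint refcnt;

static void enable_hw(void)  { printf("reset FIFO, unmask IRQs\n"); }
static void disable_hw(void) { printf("mask IRQs\n"); }

/*
 * Racy variant: the counter itself is atomic, but the 0->1 / 1->0 check
 * and the register writes are not one critical section.  Channel #1's
 * shutdown can observe the count dropping to zero, get preempted, and
 * only then mask the IRQs -- after channel #2's startup has already
 * re-enabled the mailbox, which is exactly the interleaving shown above.
 */
static void startup_racy(void)
{
	if (atomic_fetch_add(&refcnt, 1) == 0)
		enable_hw();
}

static void shutdown_racy(void)
{
	if (atomic_fetch_sub(&refcnt, 1) == 1)
		disable_hw();	/* may run after a concurrent startup_racy() */
}

int main(void)
{
	startup_racy();
	shutdown_racy();	/* single-threaded here; the race needs two CPUs */
	return 0;
}

The mutex in the patch makes the counter check and the register accesses
one critical section, which closes exactly this window.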

> >
> > >
> > > return 0;
> > >  }
> > > @@ -235,9 +241,13 @@ static void sprd_mbox_shutdown(struct mbox_chan 
> > > *chan)
> > >  {
> > > struct sprd_mbox_priv *priv = to_sprd_mbox_priv(chan->mbox);
> > >
> > > -

[PATCH] mm: backing-dev: Remove duplicated macro definition

2021-02-09 Thread Baolin Wang
Move the K() macro definition a little earlier so that the duplicated
definitions can be removed.

Signed-off-by: Baolin Wang 
---
 mm/backing-dev.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 71a2bf4..576220a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -33,6 +33,8 @@
 /* bdi_wq serves all asynchronous writeback tasks */
 struct workqueue_struct *bdi_wq;
 
+#define K(x) ((x) << (PAGE_SHIFT - 10))
+
 #ifdef CONFIG_DEBUG_FS
 #include 
 #include 
@@ -70,7 +72,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
global_dirty_limits(&background_thresh, &dirty_thresh);
wb_thresh = wb_calc_thresh(wb, dirty_thresh);
 
-#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
   "BdiWriteback:   %10lu kB\n"
   "BdiReclaimable: %10lu kB\n"
@@ -99,7 +100,6 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
   nr_more_io,
   nr_dirty_time,
   !list_empty(&bdi->bdi_list), bdi->wb.state);
-#undef K
 
return 0;
 }
@@ -147,8 +147,6 @@ static ssize_t read_ahead_kb_store(struct device *dev,
return count;
 }
 
-#define K(pages) ((pages) << (PAGE_SHIFT - 10))
-
 #define BDI_SHOW(name, expr)   \
 static ssize_t name##_show(struct device *dev, \
   struct device_attribute *attr, char *buf)\
-- 
1.8.3.1



Re: [PATCH 2/3] mailbox: sprd: Add supplementary inbox support

2021-02-08 Thread Baolin Wang
> writel(val, priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> +
> +   /* Enable supplementary outbox as the fundamental one */
> +   if (priv->supp_base) {
> +   writel(0x0, priv->supp_base + SPRD_MBOX_FIFO_RST);
> +   val = readl(priv->supp_base + SPRD_MBOX_IRQ_MSK);
> +   val &= ~SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ;
> +   writel(val, priv->supp_base + SPRD_MBOX_IRQ_MSK);
> +   }
> }
> mutex_unlock(&priv->lock);
>
> @@ -246,6 +276,10 @@ static void sprd_mbox_shutdown(struct mbox_chan *chan)
> /* Disable inbox & outbox interrupt */
> writel(SPRD_INBOX_FIFO_IRQ_MASK, priv->inbox_base + 
> SPRD_MBOX_IRQ_MSK);
> writel(SPRD_OUTBOX_FIFO_IRQ_MASK, priv->outbox_base + 
> SPRD_MBOX_IRQ_MSK);
> +
> +   if (priv->supp_base)
> +   writel(SPRD_OUTBOX_FIFO_IRQ_MASK,
> +  priv->supp_base + SPRD_MBOX_IRQ_MSK);
> }
> mutex_unlock(&priv->lock);
>  }
> @@ -268,8 +302,8 @@ static int sprd_mbox_probe(struct platform_device *pdev)
>  {
> struct device *dev = >dev;
> struct sprd_mbox_priv *priv;
> -   int ret, inbox_irq, outbox_irq;
> -   unsigned long id;
> +   int ret, inbox_irq, outbox_irq, supp_irq;
> +   unsigned long id, supp;
>
> priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
> if (!priv)
> @@ -280,11 +314,15 @@ static int sprd_mbox_probe(struct platform_device *pdev)
> mutex_init(&priv->lock);
>
> /*
> -* The Spreadtrum mailbox uses an inbox to send messages to the target
> -* core, and uses an outbox to receive messages from other cores.
> +* Unisoc mailbox uses an inbox to send messages to the target
> +* core, and uses (an) outbox(es) to receive messages from other
> +* cores.
> +*
> +* Thus in general the mailbox controller supplies 2 different
> +* register addresses and IRQ numbers for inbox and outbox.
>  *
> -* Thus the mailbox controller supplies 2 different register addresses
> -* and IRQ numbers for inbox and outbox.
> +* If necessary, a supplementary inbox could be enabled optionally
> +* with an independent FIFO and an extra interrupt.
>  */
> priv->inbox_base = devm_platform_ioremap_resource(pdev, 0);
> if (IS_ERR(priv->inbox_base))
> @@ -310,7 +348,7 @@ static int sprd_mbox_probe(struct platform_device *pdev)
> return ret;
> }
>
> -   inbox_irq = platform_get_irq(pdev, 0);
> +   inbox_irq = platform_get_irq_byname(pdev, "inbox");

I think you should put the dt changes before this patch.

> if (inbox_irq < 0)
> return inbox_irq;
>
> @@ -321,7 +359,7 @@ static int sprd_mbox_probe(struct platform_device *pdev)
> return ret;
> }
>
> -   outbox_irq = platform_get_irq(pdev, 1);
> +   outbox_irq = platform_get_irq_byname(pdev, "outbox");
> if (outbox_irq < 0)
> return outbox_irq;
>
> @@ -332,6 +370,24 @@ static int sprd_mbox_probe(struct platform_device *pdev)
> return ret;
> }
>
> +   /* Supplementary outbox IRQ is optional */
> +   supp_irq = platform_get_irq_byname(pdev, "supp-outbox");
> +   if (supp_irq > 0) {
> +   ret = devm_request_irq(dev, supp_irq, sprd_mbox_supp_isr,
> +  IRQF_NO_SUSPEND, dev_name(dev), priv);
> +   if (ret) {
> +   dev_err(dev, "failed to request outbox IRQ: %d\n", 
> ret);
> +   return ret;
> +   }
> +
> +   supp = (unsigned long) of_device_get_match_data(dev);
> +   if (!supp) {
> +   dev_err(dev, "no supplementary outbox specified\n");
> +   return -ENODEV;
> +   }
> +   priv->supp_base = priv->outbox_base + (SPRD_OUTBOX_BASE_SPAN 
> * supp);
> +   }
> +
> /* Get the default outbox FIFO depth */
> priv->outbox_fifo_depth =
> readl(priv->outbox_base + SPRD_MBOX_FIFO_DEPTH) + 1;
> @@ -354,7 +410,8 @@ static int sprd_mbox_probe(struct platform_device *pdev)
>  }
>
>  static const struct of_device_id sprd_mbox_of_match[] = {
> -   { .compatible = "sprd,sc9860-mailbox", },
> +   { .compatible = "sprd,sc9860-mailbox",
> + .data = (void *)SPRD_SUPP_INBOX_ID_SC9860 },
> { },
>  };
>  MODULE_DEVICE_TABLE(of, sprd_mbox_of_match);
> --
> 2.7.4
>


-- 
Baolin Wang


Re: [PATCH 1/3] mailbox: sprd: Introduce refcnt when clients requests/free channels

2021-02-08 Thread Baolin Wang
Hi Orson,

On Mon, Feb 8, 2021 at 7:52 PM Orson Zhai  wrote:
>
> From: Orson Zhai 
>
> The Unisoc mailbox has no way to enable/disable any single channel;
> the channels can only be started up or shut down as a whole device at
> the same time.
>
> Add a variable to count references, to avoid the mailbox FIFO being reset
> unexpectedly while clients are requesting or freeing channels.
>
> Also add a lock to avoid possible conflicts from register reads/writes in
> different startup or shutdown threads.
>
> Fixes: ca27fc26cd22 ("mailbox: sprd: Add Spreadtrum mailbox driver")
> Signed-off-by: Orson Zhai 
> ---
>  drivers/mailbox/sprd-mailbox.c | 38 +-
>  1 file changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/mailbox/sprd-mailbox.c b/drivers/mailbox/sprd-mailbox.c
> index f6fab24..e606f52 100644
> --- a/drivers/mailbox/sprd-mailbox.c
> +++ b/drivers/mailbox/sprd-mailbox.c
> @@ -60,6 +60,8 @@ struct sprd_mbox_priv {
> struct clk  *clk;
> u32 outbox_fifo_depth;
>
> +   struct mutexlock;
> +   u32 refcnt;
> struct mbox_chanchan[SPRD_MBOX_CHAN_MAX];
>  };
>
> @@ -215,18 +217,22 @@ static int sprd_mbox_startup(struct mbox_chan *chan)
> struct sprd_mbox_priv *priv = to_sprd_mbox_priv(chan->mbox);
> u32 val;
>
> -   /* Select outbox FIFO mode and reset the outbox FIFO status */
> -   writel(0x0, priv->outbox_base + SPRD_MBOX_FIFO_RST);
> +   mutex_lock(&priv->lock);
> +   if (priv->refcnt++ == 0) {
> +   /* Select outbox FIFO mode and reset the outbox FIFO status */
> +   writel(0x0, priv->outbox_base + SPRD_MBOX_FIFO_RST);
>
> -   /* Enable inbox FIFO overflow and delivery interrupt */
> -   val = readl(priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> -   val &= ~(SPRD_INBOX_FIFO_OVERFLOW_IRQ | SPRD_INBOX_FIFO_DELIVER_IRQ);
> -   writel(val, priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> +   /* Enable inbox FIFO overflow and delivery interrupt */
> +   val = readl(priv->inbox_base + SPRD_MBOX_IRQ_MSK);
> +   val &= ~(SPRD_INBOX_FIFO_OVERFLOW_IRQ | 
> SPRD_INBOX_FIFO_DELIVER_IRQ);
> +   writel(val, priv->inbox_base + SPRD_MBOX_IRQ_MSK);
>
> -   /* Enable outbox FIFO not empty interrupt */
> -   val = readl(priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> -   val &= ~SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ;
> -   writel(val, priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> +   /* Enable outbox FIFO not empty interrupt */
> +   val = readl(priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> +   val &= ~SPRD_OUTBOX_FIFO_NOT_EMPTY_IRQ;
> +   writel(val, priv->outbox_base + SPRD_MBOX_IRQ_MSK);
> +   }
> +   mutex_unlock(&priv->lock);

I think using the atomic_add/sub_and_test() related APIs can remove
the mutex lock.

>
> return 0;
>  }
> @@ -235,9 +241,13 @@ static void sprd_mbox_shutdown(struct mbox_chan *chan)
>  {
> struct sprd_mbox_priv *priv = to_sprd_mbox_priv(chan->mbox);
>
> -   /* Disable inbox & outbox interrupt */
> -   writel(SPRD_INBOX_FIFO_IRQ_MASK, priv->inbox_base + 
> SPRD_MBOX_IRQ_MSK);
> -   writel(SPRD_OUTBOX_FIFO_IRQ_MASK, priv->outbox_base + 
> SPRD_MBOX_IRQ_MSK);
> +   mutex_lock(&priv->lock);
> +   if (--priv->refcnt == 0) {
> +   /* Disable inbox & outbox interrupt */
> +   writel(SPRD_INBOX_FIFO_IRQ_MASK, priv->inbox_base + 
> SPRD_MBOX_IRQ_MSK);
> +   writel(SPRD_OUTBOX_FIFO_IRQ_MASK, priv->outbox_base + 
> SPRD_MBOX_IRQ_MSK);
> +   }
> +   mutex_unlock(&priv->lock);
>  }
>
>  static const struct mbox_chan_ops sprd_mbox_ops = {
> @@ -266,6 +276,8 @@ static int sprd_mbox_probe(struct platform_device *pdev)
> return -ENOMEM;
>
> priv->dev = dev;
> +   priv->refcnt = 0;

No need to do this, the priv structure is already cleared to 0.

> +   mutex_init(&priv->lock);
>
> /*
>  * The Spreadtrum mailbox uses an inbox to send messages to the target
> --
> 2.7.4
>


-- 
Baolin Wang


Re: [PATCH] mailbox: sprd: correct definition of SPRD_OUTBOX_FIFO_FULL

2021-02-04 Thread Baolin Wang
On Thu, Feb 4, 2021 at 4:18 PM Chunyan Zhang  wrote:
>
> From: Magnum Shan 
>
> According to the specification, bit[2] represents SPRD_OUTBOX_FIFO_FULL,
> not bit[0], so correct it.
>
> Fixes: ca27fc26cd22 ("mailbox: sprd: Add Spreadtrum mailbox driver")
> Signed-off-by: Magnum Shan 
> Signed-off-by: Chunyan Zhang 

LGTM.
Reviewed-by: Baolin Wang 

> ---
>  drivers/mailbox/sprd-mailbox.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/mailbox/sprd-mailbox.c b/drivers/mailbox/sprd-mailbox.c
> index f6fab24ae8a9..4c325301a2fe 100644
> --- a/drivers/mailbox/sprd-mailbox.c
> +++ b/drivers/mailbox/sprd-mailbox.c
> @@ -35,7 +35,7 @@
>  #define SPRD_MBOX_IRQ_CLR  BIT(0)
>
>  /* Bit and mask definiation for outbox's SPRD_MBOX_FIFO_STS register */
> -#define SPRD_OUTBOX_FIFO_FULL  BIT(0)
> +#define SPRD_OUTBOX_FIFO_FULL  BIT(2)
>  #define SPRD_OUTBOX_FIFO_WR_SHIFT  16
>  #define SPRD_OUTBOX_FIFO_RD_SHIFT  24
>  #define SPRD_OUTBOX_FIFO_POS_MASK  GENMASK(7, 0)
> --
> 2.25.1
>


-- 
Baolin Wang


[PATCH] blk-cgroup: Remove obsolete macro

2021-01-27 Thread Baolin Wang
Remove the obsolete 'MAX_KEY_LEN' macro.

Signed-off-by: Baolin Wang 
---
 block/blk-cgroup.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4b4fcb5..a317c03 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -32,8 +32,6 @@
 #include 
 #include "blk.h"
 
-#define MAX_KEY_LEN 100
-
 /*
  * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
  * blkcg_pol_register_mutex nests outside of it and synchronizes entire
-- 
1.8.3.1



[PATCH v3] blk-cgroup: Use cond_resched() when destroy blkgs

2021-01-27 Thread Baolin Wang
On !PREEMPT kernel, we can get below softlockup when doing stress
testing that repeatedly creates and destroys block cgroups. The
reason is it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), or the system can accumulate a
huge number of blkgs in pathological cases. We can add a need_resched()
check on each loop and release locks and do cond_resched() if true
to avoid this issue, since the blkcg_destroy_blkgs() is not called
from atomic contexts.

[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701]  cgwb_release_workfn+0x104/0x158
[ 4757.010702]  process_one_work+0x1bc/0x3f0
[ 4757.010704]  worker_thread+0x164/0x468
[ 4757.010705]  kthread+0x108/0x138

Suggested-by: Tejun Heo 
Signed-off-by: Baolin Wang 
---
Changes from v2:
 - Simplify the logic as suggested by Jens.

Changes from v1:
 - Add might_sleep() in blkcg_destroy_blkgs().
 - Add an explicit need_resched() check before releasing the lock.
 - Add some comments.
---
 block/blk-cgroup.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 02ce205..4b4fcb5 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1016,6 +1016,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state 
*css)
  */
 void blkcg_destroy_blkgs(struct blkcg *blkcg)
 {
+   might_sleep();
+
spin_lock_irq(&blkcg->lock);
 
while (!hlist_empty(&blkcg->blkg_list)) {
@@ -1023,14 +1025,20 @@ void blkcg_destroy_blkgs(struct blkcg *blkcg)
struct blkcg_gq, blkcg_node);
struct request_queue *q = blkg->q;
 
-   if (spin_trylock(&q->queue_lock)) {
-   blkg_destroy(blkg);
-   spin_unlock(&q->queue_lock);
-   } else {
+   if (need_resched() || !spin_trylock(&q->queue_lock)) {
+   /*
+* Given that the system can accumulate a huge number
+* of blkgs in pathological cases, check to see if we
+* need to reschedule to avoid a softlockup.
+*/
spin_unlock_irq(&blkcg->lock);
-   cpu_relax();
+   cond_resched();
spin_lock_irq(&blkcg->lock);
+   continue;
}
+
+   blkg_destroy(blkg);
+   spin_unlock(&q->queue_lock);
}
 
spin_unlock_irq(&blkcg->lock);
-- 
1.8.3.1



Re: [PATCH v2] blk-cgroup: Use cond_resched() when destroy blkgs

2021-01-27 Thread Baolin Wang




On 2021/1/28 11:41, Jens Axboe wrote:

On 1/27/21 8:22 PM, Baolin Wang wrote:

On !PREEMPT kernel, we can get below softlockup when doing stress
testing that repeatedly creates and destroys block cgroups. The
reason is it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), or the system can accumulate a
huge number of blkgs in pathological cases. We can add a need_resched()
check on each loop and release locks and do cond_resched() if true
to avoid this issue, since the blkcg_destroy_blkgs() is not called
from atomic contexts.

[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701]  cgwb_release_workfn+0x104/0x158
[ 4757.010702]  process_one_work+0x1bc/0x3f0
[ 4757.010704]  worker_thread+0x164/0x468
[ 4757.010705]  kthread+0x108/0x138


Kind of ugly with the two clauses for dropping the blkcg lock, one
being a cpu_relax() and the other a resched. How about something
like this:


diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 031114d454a6..4221a1539391 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1016,6 +1016,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state 
*css)
   */
  void blkcg_destroy_blkgs(struct blkcg *blkcg)
  {
+   might_sleep();
+
spin_lock_irq(&blkcg->lock);
  
  	while (!hlist_empty(&blkcg->blkg_list)) {

@@ -1023,14 +1025,20 @@ void blkcg_destroy_blkgs(struct blkcg *blkcg)
struct blkcg_gq, blkcg_node);
struct request_queue *q = blkg->q;
  
-		if (spin_trylock(&q->queue_lock)) {

-   blkg_destroy(blkg);
-   spin_unlock(&q->queue_lock);
-   } else {
+   if (need_resched() || !spin_trylock(&q->queue_lock)) {
+   /*
+* Given that the system can accumulate a huge number
+* of blkgs in pathological cases, check to see if we
+* need to reschedule to avoid a softlockup.
+*/
spin_unlock_irq(&blkcg->lock);
-   cpu_relax();
+   cond_resched();
spin_lock_irq(&blkcg->lock);
+   continue;
}
+
+   blkg_destroy(blkg);
+   spin_unlock(&q->queue_lock);
}
  
  	spin_unlock_irq(&blkcg->lock);




Looks better to me. Do I need to resend with your suggestion? Thanks.


[PATCH v2] blk-cgroup: Use cond_resched() when destroy blkgs

2021-01-27 Thread Baolin Wang
On !PREEMPT kernel, we can get below softlockup when doing stress
testing that repeatedly creates and destroys block cgroups. The
reason is it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), or the system can accumulate a
huge number of blkgs in pathological cases. We can add a need_resched()
check on each loop and release locks and do cond_resched() if true
to avoid this issue, since the blkcg_destroy_blkgs() is not called
from atomic contexts.

[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701]  cgwb_release_workfn+0x104/0x158
[ 4757.010702]  process_one_work+0x1bc/0x3f0
[ 4757.010704]  worker_thread+0x164/0x468
[ 4757.010705]  kthread+0x108/0x138

Suggested-by: Tejun Heo 
Signed-off-by: Baolin Wang 
---
Changes from v1:
 - Add might_sleep() in blkcg_destroy_blkgs().
 - Add an explicit need_resched() check before releasing the lock.
 - Add some comments.
---
 block/blk-cgroup.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3465d6e..94eeed7 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1016,6 +1016,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state 
*css)
  */
 void blkcg_destroy_blkgs(struct blkcg *blkcg)
 {
+   might_sleep();
+
spin_lock_irq(&blkcg->lock);
 
while (!hlist_empty(&blkcg->blkg_list)) {
@@ -1031,6 +1033,17 @@ void blkcg_destroy_blkgs(struct blkcg *blkcg)
cpu_relax();
spin_lock_irq(>lock);
}
+
+   /*
+* Given that the system can accumulate a huge number
+* of blkgs in pathological cases, check to see if we
+* need to reschedule to avoid a softlockup.
+*/
+   if (need_resched()) {
+   spin_unlock_irq(&blkcg->lock);
+   cond_resched();
+   spin_lock_irq(&blkcg->lock);
+   }
}
 
spin_unlock_irq(&blkcg->lock);
-- 
1.8.3.1



Re: [PATCH] blk-cgroup: Use cond_resched() when destroy blkgs

2021-01-27 Thread Baolin Wang

Hi Tejun,


Hello, Baolin.

On Tue, Jan 26, 2021 at 09:33:25PM +0800, Baolin Wang wrote:

On !PREEMPT kernel, we can get below softlockup when doing stress
testing that repeatedly creates and destroys block cgroups. The
reason is it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), thus we can use cond_resched()
instead of cpu_relax() to avoid this issue, since the
blkcg_destroy_blkgs() is not called from atomic contexts.

[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701]  cgwb_release_workfn+0x104/0x158
[ 4757.010702]  process_one_work+0x1bc/0x3f0
[ 4757.010704]  worker_thread+0x164/0x468
[ 4757.010705]  kthread+0x108/0x138

Signed-off-by: Baolin Wang 


* Can you please add might_sleep() at the top of the function?


Sure.



* Given that the system can accumulate a huge number of blkgs in
   pathological cases, I wonder whether a better way to go about it is
   explicitly testing need_resched() on each loop and release locks and do
   cond_resched() if true?


Yes, sounds better to me and I will update it in the next version. Thanks for
your suggestion.


[PATCH] blk-cgroup: Use cond_resched() when destroy blkgs

2021-01-26 Thread Baolin Wang
On !PREEMPT kernel, we can get below softlockup when doing stress
testing that repeatedly creates and destroys block cgroups. The
reason is it may take a long time to acquire the queue's lock in
the loop of blkcg_destroy_blkgs(), thus we can use cond_resched()
instead of cpu_relax() to avoid this issue, since the
blkcg_destroy_blkgs() is not called from atomic contexts.

[ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
[ 4757.010698] Call trace:
[ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
[ 4757.010701]  cgwb_release_workfn+0x104/0x158
[ 4757.010702]  process_one_work+0x1bc/0x3f0
[ 4757.010704]  worker_thread+0x164/0x468
[ 4757.010705]  kthread+0x108/0x138

Signed-off-by: Baolin Wang 
---
 block/blk-cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 3465d6e..af7c0ce 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1028,7 +1028,7 @@ void blkcg_destroy_blkgs(struct blkcg *blkcg)
spin_unlock(&q->queue_lock);
} else {
spin_unlock_irq(&blkcg->lock);
-   cpu_relax();
+   cond_resched();
spin_lock_irq(&blkcg->lock);
}
}
-- 
1.8.3.1



Re: [PATCH] mm/filemap: Remove redundant variable's assignment

2021-01-26 Thread Baolin Wang




On 2021/1/26 7:22, Andrew Morton wrote:

On Mon, 25 Jan 2021 11:20:02 +0800 Baolin Wang  
wrote:


We've already set the initial value of variable 'i' before using it,
thus remove the redundant earlier assignment of 'i'.

...

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2472,7 +2472,6 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
if ((iocb->ki_flags & IOCB_WAITQ) && written)
iocb->ki_flags |= IOCB_NOWAIT;
  
-		i = 0;

pg_nr = generic_file_buffered_read_get_pages(iocb, iter,
 pages, nr_pages);
if (pg_nr < 0) {


Matthew's patch series "Refactor generic_file_buffered_read "
(https://lkml.kernel.org/r/20210122160140.223228-1-wi...@infradead.org)
changes all this code - I don't think the patch is relevant after that,
so I'll skip it.


Ah, right. Thanks for letting me know.


[PATCH] mm/filemap: Remove redundant variable's assignment

2021-01-24 Thread Baolin Wang
We've already set the initial value of variable 'i' before using it,
thus remove the redundant earlier assignment of 'i'.

Signed-off-by: Baolin Wang 
---
 mm/filemap.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index e4906f5..07b02f3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2472,7 +2472,6 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
if ((iocb->ki_flags & IOCB_WAITQ) && written)
iocb->ki_flags |= IOCB_NOWAIT;
 
-   i = 0;
pg_nr = generic_file_buffered_read_get_pages(iocb, iter,
 pages, nr_pages);
if (pg_nr < 0) {
-- 
1.8.3.1



Re: [PATCH] i2c: sprd:: depend on COMMON_CLK to fix compile tests

2021-01-19 Thread Baolin Wang
Hi,

On Sun, Jan 17, 2021 at 7:43 PM Krzysztof Kozlowski  wrote:
>
> The I2C_SPRD uses Common Clock Framework thus it cannot be built on
> platforms without it (e.g. compile test on MIPS with LANTIQ):
>
> /usr/bin/mips-linux-gnu-ld: drivers/i2c/busses/i2c-sprd.o: in function 
> `sprd_i2c_probe':
> i2c-sprd.c:(.text.sprd_i2c_probe+0x254): undefined reference to 
> `clk_set_parent'
>
> Fixes: 4a2d5f663dab ("i2c: Enable compile testing for more drivers")
> Reported-by: kernel test robot 
> Signed-off-by: Krzysztof Kozlowski 

LGTM. Thanks.
Reviewed-by: Baolin Wang 

> ---
>  drivers/i2c/busses/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig
> index d4d60ad0eda0..ab1f39ac39f4 100644
> --- a/drivers/i2c/busses/Kconfig
> +++ b/drivers/i2c/busses/Kconfig
> @@ -1013,6 +1013,7 @@ config I2C_SIRF
>  config I2C_SPRD
> tristate "Spreadtrum I2C interface"
> depends on I2C=y && (ARCH_SPRD || COMPILE_TEST)
> +   depends on COMMON_CLK
> help
>   If you say yes to this option, support will be included for the
>   Spreadtrum I2C interface.
> --
> 2.25.1
>


-- 
Baolin Wang


Re: [PATCH v2] pinctrl: sprd: Simplify bool comparison

2021-01-13 Thread Baolin Wang
On Wed, Jan 13, 2021 at 11:43 AM Yang Li  wrote:
>
> Fix the following coccicheck warning:
> ./drivers/pinctrl/sprd/pinctrl-sprd.c:690:8-23: WARNING: Comparison to
> bool
>
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 

You should keep other guy's reviewed-by or acked-by tag for the
following version if no other big changes. So again
Reviewed-by: Baolin Wang 

> ---
> Changes in v2:
> - make "pinctrl: sprd:" as subject prefix
>
>  drivers/pinctrl/sprd/pinctrl-sprd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pinctrl/sprd/pinctrl-sprd.c 
> b/drivers/pinctrl/sprd/pinctrl-sprd.c
> index 08dc193..dca7a50 100644
> --- a/drivers/pinctrl/sprd/pinctrl-sprd.c
> +++ b/drivers/pinctrl/sprd/pinctrl-sprd.c
> @@ -687,7 +687,7 @@ static int sprd_pinconf_set(struct pinctrl_dev *pctldev, 
> unsigned int pin_id,
> shift = INPUT_SCHMITT_SHIFT;
> break;
> case PIN_CONFIG_BIAS_PULL_UP:
> -   if (is_sleep_config == true) {
> +   if (is_sleep_config) {
> val |= SLEEP_PULL_UP;
> mask = SLEEP_PULL_UP_MASK;
>     shift = SLEEP_PULL_UP_SHIFT;
> --
> 1.8.3.1
>


-- 
Baolin Wang


Re: [PATCH] pinctrl: sprd: style: Simplify bool comparison

2021-01-12 Thread Baolin Wang
On Tue, Jan 12, 2021 at 4:28 PM YANG LI  wrote:
>
> Fix the following coccicheck warning:
> ./drivers/pinctrl/sprd/pinctrl-sprd.c:690:8-23: WARNING: Comparison to
> bool
>
> Reported-by: Abaci Robot 
> Signed-off-by: YANG LI 

The subject line should be "pinctrl: sprd: Simplify xxx", otherwise
Reviewed-by: Baolin Wang 

> ---
>  drivers/pinctrl/sprd/pinctrl-sprd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pinctrl/sprd/pinctrl-sprd.c 
> b/drivers/pinctrl/sprd/pinctrl-sprd.c
> index 08dc193..dca7a50 100644
> --- a/drivers/pinctrl/sprd/pinctrl-sprd.c
> +++ b/drivers/pinctrl/sprd/pinctrl-sprd.c
> @@ -687,7 +687,7 @@ static int sprd_pinconf_set(struct pinctrl_dev *pctldev, 
> unsigned int pin_id,
> shift = INPUT_SCHMITT_SHIFT;
> break;
> case PIN_CONFIG_BIAS_PULL_UP:
> -   if (is_sleep_config == true) {
> +   if (is_sleep_config) {
> val |= SLEEP_PULL_UP;
> mask = SLEEP_PULL_UP_MASK;
>     shift = SLEEP_PULL_UP_SHIFT;
> --
> 1.8.3.1
>


-- 
Baolin Wang


[PATCH] mm/filemap: Remove unused parameter and change to void type for replace_page_cache_page()

2021-01-07 Thread Baolin Wang
Since commit 74d609585d8b ("page cache: Add and replace pages using the XArray")
was merged, replace_page_cache_page() cannot fail and always returns 0, so we
can remove the redundant return value and make the function void. Moreover,
remove the unused gfp_mask.

Signed-off-by: Baolin Wang 
---
 fs/fuse/dev.c   | 6 +-
 include/linux/pagemap.h | 2 +-
 mm/filemap.c| 7 +--
 3 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 588f8d1..c6636b4 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -844,11 +844,7 @@ static int fuse_try_move_page(struct fuse_copy_state *cs, 
struct page **pagep)
if (WARN_ON(PageMlocked(oldpage)))
goto out_fallback_unlock;
 
-   err = replace_page_cache_page(oldpage, newpage, GFP_KERNEL);
-   if (err) {
-   unlock_page(newpage);
-   goto out_put_old;
-   }
+   replace_page_cache_page(oldpage, newpage);
 
get_page(newpage);
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d5570de..74e466e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -757,7 +757,7 @@ int add_to_page_cache_lru(struct page *page, struct 
address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
 extern void delete_from_page_cache(struct page *page);
 extern void __delete_from_page_cache(struct page *page, void *shadow);
-int replace_page_cache_page(struct page *old, struct page *new, gfp_t 
gfp_mask);
+void replace_page_cache_page(struct page *old, struct page *new);
 void delete_from_page_cache_batch(struct address_space *mapping,
  struct pagevec *pvec);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 5c9d564..e4906f5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -775,7 +775,6 @@ int file_write_and_wait_range(struct file *file, loff_t 
lstart, loff_t lend)
  * replace_page_cache_page - replace a pagecache page with a new one
  * @old:   page to be replaced
  * @new:   page to replace with
- * @gfp_mask:  allocation mode
  *
  * This function replaces a page in the pagecache with a new one.  On
  * success it acquires the pagecache reference for the new page and
@@ -784,10 +783,8 @@ int file_write_and_wait_range(struct file *file, loff_t 
lstart, loff_t lend)
  * caller must do that.
  *
  * The remove + add is atomic.  This function cannot fail.
- *
- * Return: %0
  */
-int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
+void replace_page_cache_page(struct page *old, struct page *new)
 {
struct address_space *mapping = old->mapping;
void (*freepage)(struct page *) = mapping->a_ops->freepage;
@@ -822,8 +819,6 @@ int replace_page_cache_page(struct page *old, struct page 
*new, gfp_t gfp_mask)
if (freepage)
freepage(old);
put_page(old);
-
-   return 0;
 }
 EXPORT_SYMBOL_GPL(replace_page_cache_page);
 
-- 
1.8.3.1



Re: [PATCH 1/2] blk-iocost: Add iocg idle state tracepoint

2020-12-16 Thread Baolin Wang

Hi Jens,


It will be helpful to trace the iocg's whole state, including active and
idle state. And we can easily expand the original iocost_iocg_activate
trace event to support a state trace class, including active and idle
state tracing.

Signed-off-by: Baolin Wang 


Could you pick up patch 1 which was already acked by Tejun. Thanks.


---
  block/blk-iocost.c|  3 +++
  include/trace/events/iocost.h | 16 +++-
  2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index ffa418c..ac6078a 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2185,6 +2185,9 @@ static int ioc_check_iocgs(struct ioc *ioc, struct 
ioc_now *now)
WEIGHT_ONE);
}
  
+			TRACE_IOCG_PATH(iocg_idle, iocg, now,

+   atomic64_read(&iocg->active_period),
+   atomic64_read(&ioc->cur_period), vtime);
__propagate_weights(iocg, 0, 0, false, now);
list_del_init(&iocg->active_list);
}
diff --git a/include/trace/events/iocost.h b/include/trace/events/iocost.h
index 0b68699..e282ce0 100644
--- a/include/trace/events/iocost.h
+++ b/include/trace/events/iocost.h
@@ -11,7 +11,7 @@
  
  #include 
  
-TRACE_EVENT(iocost_iocg_activate,

+DECLARE_EVENT_CLASS(iocost_iocg_state,
  
  	TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,

u64 last_period, u64 cur_period, u64 vtime),
@@ -59,6 +59,20 @@
)
  );
  
+DEFINE_EVENT(iocost_iocg_state, iocost_iocg_activate,

+   TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
+u64 last_period, u64 cur_period, u64 vtime),
+
+   TP_ARGS(iocg, path, now, last_period, cur_period, vtime)
+);
+
+DEFINE_EVENT(iocost_iocg_state, iocost_iocg_idle,
+   TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
+u64 last_period, u64 cur_period, u64 vtime),
+
+   TP_ARGS(iocg, path, now, last_period, cur_period, vtime)
+);
+
  DECLARE_EVENT_CLASS(iocg_inuse_update,
  
  	TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,




Re: [PATCH 2/2] blk-iocost: Use alloc_percpu_gfp() to simplify the code

2020-12-16 Thread Baolin Wang

Hi Tejun,


Hello,

On Fri, Dec 11, 2020 at 03:13:29PM +0800, Baolin Wang wrote:

Thanks for teaching me this; I did not get this from the local_ops
documentation before. Just out of curiosity: these local[64]_t variables are
also ultimately allocated from the buddy allocator, so why can they not be
initialized to zeros on some architectures with __GFP_ZERO? Could you
elaborate on this restriction? Thanks.


* It's highly unlikely but theoretically possible that some arch might need
   more than raw value to implement local semantics.

* People might wanna add debug annotations which may require more than just
   the raw value.


Thanks for your explanation. It would be helpful to add these points as
comments in the code, in case someone else wants to do the same thing as
this patch does in the future. I can send a patch to add such comments if
you have no objection.



By the way, seems the kyber-iosched has the same issue, since the 'struct
kyber_cpu_latency' also contains an atomic_t variable.

kqd->cpu_latency = alloc_percpu_gfp(struct kyber_cpu_latency,
GFP_KERNEL | __GFP_ZERO);
if (!kqd->cpu_latency)
goto err_kqd;


Yeah, the lines become blurry when all existing usages are fine with zeroing
and we do end up needing to clean up those when the need arises (e.g. we
used to zero some spinlocks too). It's just a better form to stick with
initializers when they are provided.


OK. It sounds worth sending a patch to initialize this structure
explicitly, to keep things consistent.
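
For reference, a minimal sketch of that kind of explicit initialization (the
struct and field names here are illustrative, not the actual kyber
definitions):

struct example_cpu_stat {
	atomic_t nr_events;	/* stands in for the atomic counters in the real struct */
};

static struct example_cpu_stat __percpu *example_alloc_stats(void)
{
	struct example_cpu_stat __percpu *stats;
	int cpu;

	stats = alloc_percpu(struct example_cpu_stat);
	if (!stats)
		return NULL;

	/* Use the type's initializer instead of relying on a zeroed raw value. */
	for_each_possible_cpu(cpu)
		atomic_set(&per_cpu_ptr(stats, cpu)->nr_events, 0);

	return stats;
}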


Re: [PATCH 2/2] blk-iocost: Use alloc_percpu_gfp() to simplify the code

2020-12-10 Thread Baolin Wang

Hi Tejun,


Hello,

On Thu, Dec 10, 2020 at 06:56:45PM +0800, Baolin Wang wrote:

Use alloc_percpu_gfp() with __GFP_ZERO flag, which can remove
some explicit initialization code.


__GFP_ZERO is implicit for percpu allocations and local[64]_t's initial
states aren't guaranteed to be all zeros on different archs.


Thanks for teaching me this; I did not get this from the local_ops
documentation before. Just out of curiosity: these local[64]_t
variables are also ultimately allocated from the buddy allocator, so why
can they not be initialized to zeros on some architectures with __GFP_ZERO?
Could you elaborate on this restriction? Thanks.


By the way, seems the kyber-iosched has the same issue, since the 
'struct kyber_cpu_latency' also contains an atomic_t variable.


kqd->cpu_latency = alloc_percpu_gfp(struct kyber_cpu_latency,
GFP_KERNEL | __GFP_ZERO);
if (!kqd->cpu_latency)
goto err_kqd;


[PATCH 1/2] blk-iocost: Add iocg idle state tracepoint

2020-12-10 Thread Baolin Wang
It will be helpful to trace the iocg's whole state, including active and
idle state. And we can easily expand the original iocost_iocg_activate
trace event to support a state trace class, including active and idle
state tracing.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c|  3 +++
 include/trace/events/iocost.h | 16 +++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index ffa418c..ac6078a 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2185,6 +2185,9 @@ static int ioc_check_iocgs(struct ioc *ioc, struct 
ioc_now *now)
WEIGHT_ONE);
}
 
+   TRACE_IOCG_PATH(iocg_idle, iocg, now,
+   atomic64_read(&iocg->active_period),
+   atomic64_read(&ioc->cur_period), vtime);
__propagate_weights(iocg, 0, 0, false, now);
list_del_init(&iocg->active_list);
}
diff --git a/include/trace/events/iocost.h b/include/trace/events/iocost.h
index 0b68699..e282ce0 100644
--- a/include/trace/events/iocost.h
+++ b/include/trace/events/iocost.h
@@ -11,7 +11,7 @@
 
 #include 
 
-TRACE_EVENT(iocost_iocg_activate,
+DECLARE_EVENT_CLASS(iocost_iocg_state,
 
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
u64 last_period, u64 cur_period, u64 vtime),
@@ -59,6 +59,20 @@
)
 );
 
+DEFINE_EVENT(iocost_iocg_state, iocost_iocg_activate,
+   TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
+u64 last_period, u64 cur_period, u64 vtime),
+
+   TP_ARGS(iocg, path, now, last_period, cur_period, vtime)
+);
+
+DEFINE_EVENT(iocost_iocg_state, iocost_iocg_idle,
+   TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
+u64 last_period, u64 cur_period, u64 vtime),
+
+   TP_ARGS(iocg, path, now, last_period, cur_period, vtime)
+);
+
 DECLARE_EVENT_CLASS(iocg_inuse_update,
 
TP_PROTO(struct ioc_gq *iocg, const char *path, struct ioc_now *now,
-- 
1.8.3.1



[PATCH 2/2] blk-iocost: Use alloc_percpu_gfp() to simplify the code

2020-12-10 Thread Baolin Wang
Use alloc_percpu_gfp() with __GFP_ZERO flag, which can remove
some explicit initialization code.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index ac6078a..52ce2e3 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2819,28 +2819,19 @@ static int blk_iocost_init(struct request_queue *q)
 {
struct ioc *ioc;
struct rq_qos *rqos;
-   int i, cpu, ret;
+   int ret;
+   gfp_t gfp = GFP_KERNEL | __GFP_ZERO;
 
ioc = kzalloc(sizeof(*ioc), GFP_KERNEL);
if (!ioc)
return -ENOMEM;
 
-   ioc->pcpu_stat = alloc_percpu(struct ioc_pcpu_stat);
+   ioc->pcpu_stat = alloc_percpu_gfp(struct ioc_pcpu_stat, gfp);
if (!ioc->pcpu_stat) {
kfree(ioc);
return -ENOMEM;
}
 
-   for_each_possible_cpu(cpu) {
-   struct ioc_pcpu_stat *ccs = per_cpu_ptr(ioc->pcpu_stat, cpu);
-
-   for (i = 0; i < ARRAY_SIZE(ccs->missed); i++) {
-   local_set(&ccs->missed[i].nr_met, 0);
-   local_set(&ccs->missed[i].nr_missed, 0);
-   }
-   local64_set(&ccs->rq_wait_ns, 0);
-   }
-
rqos = &ioc->rqos;
rqos->id = RQ_QOS_COST;
rqos->ops = &ioc_rqos_ops;
-- 
1.8.3.1



Re: [PATCH v2 0/5] Some cleanups and improvements for blk-iocost

2020-12-06 Thread Baolin Wang

Hi Jens,


Hi,

This patch set did some cleanups and improvements for blk-iocost, and
no big functional changes. Please help to review. Thanks.

Changes from v1:
  - Add acked-by tag from Tejun.
  - Drop 2 unnecessary patches.
  - Move the related variable declarations inside the block together
  with the code in patch 3.
  - Move the commit_weights() into ioc_check_iocgs().
  - Move more related logics of adjusting base vrate into the
  ioc_adjust_base_vrate().
  - Rename the new functions.


Could you take this patch set if no objection from your side? Thanks.



Baolin Wang (5):
   blk-iocost: Fix some typos in comments
   blk-iocost: Remove unnecessary advance declaration
   blk-iocost: Move the usage ratio calculation to the correct place
   blk-iocost: Factor out the active iocgs' state check into a separate
 function
   blk-iocost: Factor out the base vrate change into a separate function

  block/blk-iocost.c | 251 +
  1 file changed, 137 insertions(+), 114 deletions(-)



Re: [RFC PATCH] blk-iocost: Optimize the ioc_refresh_vrate() function

2020-12-03 Thread Baolin Wang




On 2020/12/3 4:32, Tejun Heo wrote:

On Sun, Nov 29, 2020 at 10:37:18AM +0800, Baolin Wang wrote:

The ioc_refresh_vrate() will only be called in ioc_timer_fn() after
starting a new period or stopping the period.

So when starting a new period, the variable 'pleft' in ioc_refresh_vrate()
is always the period's time, which means if the abs(ioc->vtime_err)
is less than the period's time, the vcomp is 0, and we do not need
compensate the vtime_rate in this case, just set it as the base vrate
and return.

When stopping the period, the ioc->vtime_err will be cleared to 0,
and we also do not need to compensate the vtime_rate, just set it as
the base vrate and return.


Before, the function did something which is conceptually discrete and
describable. After, its operation is intertwined with when it's called. I
don't think this sort of micro optimizations are beneficial in cold paths.


OK. I understood your concern. Thanks for your input.


[PATCH] mm/vmalloc: Remove unnecessary return statement

2020-12-01 Thread Baolin Wang
Remove unnecessary return statement for void function.

Signed-off-by: Baolin Wang 
---
 mm/vmalloc.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6ae491a..c290fc9 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2275,7 +2275,6 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
}
 
kfree(area);
-   return;
 }
 
 static inline void __vfree_deferred(const void *addr)
-- 
1.8.3.1



[RFC PATCH] blk-iocost: Optimize the ioc_refresh_vrate() function

2020-11-28 Thread Baolin Wang
The ioc_refresh_vrate() will only be called in ioc_timer_fn() after
starting a new period or stopping the period.

So when starting a new period, the variable 'pleft' in ioc_refresh_vrate()
is always the period's time, which means if the abs(ioc->vtime_err)
is less than the period's time, the vcomp is 0, and we do not need
compensate the vtime_rate in this case, just set it as the base vrate
and return.

When stopping the period, the ioc->vtime_err will be cleared to 0,
and we also do not need to compensate the vtime_rate, just set it as
the base vrate and return.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 8348db4..58c9533 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -943,30 +943,29 @@ static bool ioc_refresh_params(struct ioc *ioc, bool 
force)
  */
 static void ioc_refresh_vrate(struct ioc *ioc, struct ioc_now *now)
 {
-   s64 pleft = ioc->period_at + ioc->period_us - now->now;
s64 vperiod = ioc->period_us * ioc->vtime_base_rate;
s64 vcomp, vcomp_min, vcomp_max;
 
lockdep_assert_held(&ioc->lock);
 
-   /* we need some time left in this period */
-   if (pleft <= 0)
-   goto done;
+   if (abs(ioc->vtime_err) < ioc->period_us) {
+   atomic64_set(&ioc->vtime_rate, ioc->vtime_base_rate);
+   return;
+   }
 
/*
 * Calculate how much vrate should be adjusted to offset the error.
 * Limit the amount of adjustment and deduct the adjusted amount from
 * the error.
 */
-   vcomp = -div64_s64(ioc->vtime_err, pleft);
+   vcomp = -div64_s64(ioc->vtime_err, ioc->period_us);
vcomp_min = -(ioc->vtime_base_rate >> 1);
vcomp_max = ioc->vtime_base_rate;
vcomp = clamp(vcomp, vcomp_min, vcomp_max);
 
-   ioc->vtime_err += vcomp * pleft;
+   ioc->vtime_err += vcomp * ioc->period_us;
 
atomic64_set(&ioc->vtime_rate, ioc->vtime_base_rate + vcomp);
-done:
/* bound how much error can accumulate */
ioc->vtime_err = clamp(ioc->vtime_err, -vperiod, vperiod);
 }
-- 
1.8.3.1



[PATCH v2 2/5] blk-iocost: Remove unnecessary advance declaration

2020-11-26 Thread Baolin Wang
Remove unnecessary advance declaration of struct ioc_gq.

Signed-off-by: Baolin Wang 
Acked-by: Tejun Heo 
---
 block/blk-iocost.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 4ffde36..103ccbd 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -370,8 +370,6 @@ enum {
AUTOP_SSD_FAST,
 };
 
-struct ioc_gq;
-
 struct ioc_params {
u32 qos[NR_QOS_PARAMS];
u64 i_lcoefs[NR_I_LCOEFS];
-- 
1.8.3.1



[PATCH v2 0/5] Some cleanups and improvements for blk-iocost

2020-11-26 Thread Baolin Wang
Hi,

This patch set did some cleanups and improvements for blk-iocost, and
no big functional changes. Please help to review. Thanks.

Changes from v1:
 - Add acked-by tag from Tejun.
 - Drop 2 unnecessary patches.
 - Move the related variable declarations inside the block together
 with the code in patch 3.
 - Move the commit_weights() into ioc_check_iocgs().
 - Move more related logics of adjusting base vrate into the
 ioc_adjust_base_vrate().
 - Rename the new functions.

Baolin Wang (5):
  blk-iocost: Fix some typos in comments
  blk-iocost: Remove unnecessary advance declaration
  blk-iocost: Move the usage ratio calculation to the correct place
  blk-iocost: Factor out the active iocgs' state check into a separate
function
  blk-iocost: Factor out the base vrate change into a separate function

 block/blk-iocost.c | 251 +
 1 file changed, 137 insertions(+), 114 deletions(-)

-- 
1.8.3.1



[PATCH v2 4/5] blk-iocost: Factor out the active iocgs' state check into a separate function

2020-11-26 Thread Baolin Wang
Factor out the iocgs' state check into a separate function to
simplify the ioc_timer_fn().

No functional change.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 94 +++---
 1 file changed, 54 insertions(+), 40 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index a926179..93abfe0 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2069,40 +2069,21 @@ static void ioc_forgive_debts(struct ioc *ioc, u64 
usage_us_sum, int nr_debtors,
}
 }
 
-static void ioc_timer_fn(struct timer_list *timer)
+/*
+ * Check the active iocgs' state to avoid oversleeping and deactivate
+ * idle iocgs.
+ *
+ * Since waiters determine the sleep durations based on the vrate
+ * they saw at the time of sleep, if vrate has increased, some
+ * waiters could be sleeping for too long. Wake up tardy waiters
+ * which should have woken up in the last period and expire idle
+ * iocgs.
+ */
+static int ioc_check_iocgs(struct ioc *ioc, struct ioc_now *now)
 {
-   struct ioc *ioc = container_of(timer, struct ioc, timer);
+   int nr_debtors = 0;
struct ioc_gq *iocg, *tiocg;
-   struct ioc_now now;
-   LIST_HEAD(surpluses);
-   int nr_debtors = 0, nr_shortages = 0, nr_lagging = 0;
-   u64 usage_us_sum = 0;
-   u32 ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
-   u32 ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
-   u32 missed_ppm[2], rq_wait_pct;
-   u64 period_vtime;
-   int prev_busy_level;
-
-   /* how were the latencies during the period? */
-   ioc_lat_stat(ioc, missed_ppm, &rq_wait_pct);
 
-   /* take care of active iocgs */
-   spin_lock_irq(&ioc->lock);
-
-   ioc_now(ioc, &now);
-
-   period_vtime = now.vnow - ioc->period_at_vtime;
-   if (WARN_ON_ONCE(!period_vtime)) {
-   spin_unlock_irq(&ioc->lock);
-   return;
-   }
-
-   /*
-* Waiters determine the sleep durations based on the vrate they
-* saw at the time of sleep.  If vrate has increased, some waiters
-* could be sleeping for too long.  Wake up tardy waiters which
-* should have woken up in the last period and expire idle iocgs.
-*/
list_for_each_entry_safe(iocg, tiocg, &ioc->active_iocgs, active_list) {
if (!waitqueue_active(&iocg->waitq) && !iocg->abs_vdebt &&
!iocg->delay && !iocg_is_idle(iocg))
@@ -2112,24 +2093,24 @@ static void ioc_timer_fn(struct timer_list *timer)
 
/* flush wait and indebt stat deltas */
if (iocg->wait_since) {
-   iocg->local_stat.wait_us += now.now - iocg->wait_since;
-   iocg->wait_since = now.now;
+   iocg->local_stat.wait_us += now->now - iocg->wait_since;
+   iocg->wait_since = now->now;
}
if (iocg->indebt_since) {
iocg->local_stat.indebt_us +=
-   now.now - iocg->indebt_since;
-   iocg->indebt_since = now.now;
+   now->now - iocg->indebt_since;
+   iocg->indebt_since = now->now;
}
if (iocg->indelay_since) {
iocg->local_stat.indelay_us +=
-   now.now - iocg->indelay_since;
-   iocg->indelay_since = now.now;
+   now->now - iocg->indelay_since;
+   iocg->indelay_since = now->now;
}
 
if (waitqueue_active(&iocg->waitq) || iocg->abs_vdebt ||
iocg->delay) {
/* might be oversleeping vtime / hweight changes, kick 
*/
-   iocg_kick_waitq(iocg, true, &now);
+   iocg_kick_waitq(iocg, true, now);
if (iocg->abs_vdebt || iocg->delay)
nr_debtors++;
} else if (iocg_is_idle(iocg)) {
@@ -2143,7 +2124,7 @@ static void ioc_timer_fn(struct timer_list *timer)
 * error and throw away. On reactivation, it'll start
 * with the target budget.
 */
-   excess = now.vnow - vtime - ioc->margins.target;
+   excess = now->vnow - vtime - ioc->margins.target;
if (excess > 0) {
u32 old_hwi;
 
@@ -2152,13 +2133,46 @@ static void ioc_timer_fn(struct timer_list *timer)
WEIGHT_ONE);
}
 
-   __propagate_weights(iocg, 0, 0, false, &now);
+   __propagate_weights(iocg, 0, 0, false, now);
list

[PATCH v2 3/5] blk-iocost: Move the usage ratio calculation to the correct place

2020-11-26 Thread Baolin Wang
We only use the hweight based usage ratio to calculate the new
hweight_inuse of the iocg to decide if this iocg can donate some
surplus vtime.

Thus move the usage ratio calculation to the correct place to
avoid unnecessary calculation for some vtime shortage iocgs.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 44 +++-
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 103ccbd..a926179 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2168,8 +2168,8 @@ static void ioc_timer_fn(struct timer_list *timer)
 
/* calc usage and see whether some weights need to be moved around */
list_for_each_entry(iocg, &ioc->active_iocgs, active_list) {
-   u64 vdone, vtime, usage_us, usage_dur;
-   u32 usage, hw_active, hw_inuse;
+   u64 vdone, vtime, usage_us;
+   u32 hw_active, hw_inuse;
 
/*
 * Collect unused and wind vtime closer to vnow to prevent
@@ -2200,30 +2200,32 @@ static void ioc_timer_fn(struct timer_list *timer)
usage_us = iocg->usage_delta_us;
usage_us_sum += usage_us;
 
-   if (vdone != vtime) {
-   u64 inflight_us = DIV64_U64_ROUND_UP(
-   cost_to_abs_cost(vtime - vdone, hw_inuse),
-   ioc->vtime_base_rate);
-   usage_us = max(usage_us, inflight_us);
-   }
-
-   /* convert to hweight based usage ratio */
-   if (time_after64(iocg->activated_at, ioc->period_at))
-   usage_dur = max_t(u64, now.now - iocg->activated_at, 1);
-   else
-   usage_dur = max_t(u64, now.now - ioc->period_at, 1);
-
-   usage = clamp_t(u32,
-   DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE,
-  usage_dur),
-   1, WEIGHT_ONE);
-
/* see whether there's surplus vtime */
WARN_ON_ONCE(!list_empty(&iocg->surplus_list));
if (hw_inuse < hw_active ||
(!waitqueue_active(&iocg->waitq) &&
 time_before64(vtime, now.vnow - ioc->margins.low))) {
-   u32 hwa, old_hwi, hwm, new_hwi;
+   u32 hwa, old_hwi, hwm, new_hwi, usage;
+   u64 usage_dur;
+
+   if (vdone != vtime) {
+   u64 inflight_us = DIV64_U64_ROUND_UP(
+   cost_to_abs_cost(vtime - vdone, 
hw_inuse),
+   ioc->vtime_base_rate);
+
+   usage_us = max(usage_us, inflight_us);
+   }
+
+   /* convert to hweight based usage ratio */
+   if (time_after64(iocg->activated_at, ioc->period_at))
+   usage_dur = max_t(u64, now.now - 
iocg->activated_at, 1);
+   else
+   usage_dur = max_t(u64, now.now - 
ioc->period_at, 1);
+
+   usage = clamp_t(u32,
+   DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE,
+  usage_dur),
+   1, WEIGHT_ONE);
 
/*
 * Already donating or accumulated enough to start.
-- 
1.8.3.1



[PATCH v2 5/5] blk-iocost: Factor out the base vrate change into a separate function

2020-11-26 Thread Baolin Wang
Factor out the base vrate change code into a separate function
to simplify the ioc_timer_fn().

No functional change.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 99 +-
 1 file changed, 54 insertions(+), 45 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 93abfe0..8348db4 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -971,6 +971,58 @@ static void ioc_refresh_vrate(struct ioc *ioc, struct 
ioc_now *now)
ioc->vtime_err = clamp(ioc->vtime_err, -vperiod, vperiod);
 }
 
+static void ioc_adjust_base_vrate(struct ioc *ioc, u32 rq_wait_pct,
+ int nr_lagging, int nr_shortages,
+ int prev_busy_level, u32 *missed_ppm)
+{
+   u64 vrate = ioc->vtime_base_rate;
+   u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max;
+
+   if (!ioc->busy_level || (ioc->busy_level < 0 && nr_lagging)) {
+   if (ioc->busy_level != prev_busy_level || nr_lagging)
+   trace_iocost_ioc_vrate_adj(ioc, 
atomic64_read(&ioc->vtime_rate),
+  missed_ppm, rq_wait_pct,
+  nr_lagging, nr_shortages);
+
+   return;
+   }
+
+   /* rq_wait signal is always reliable, ignore user vrate_min */
+   if (rq_wait_pct > RQ_WAIT_BUSY_PCT)
+   vrate_min = VRATE_MIN;
+
+   /*
+* If vrate is out of bounds, apply clamp gradually as the
+* bounds can change abruptly.  Otherwise, apply busy_level
+* based adjustment.
+*/
+   if (vrate < vrate_min) {
+   vrate = div64_u64(vrate * (100 + VRATE_CLAMP_ADJ_PCT), 100);
+   vrate = min(vrate, vrate_min);
+   } else if (vrate > vrate_max) {
+   vrate = div64_u64(vrate * (100 - VRATE_CLAMP_ADJ_PCT), 100);
+   vrate = max(vrate, vrate_max);
+   } else {
+   int idx = min_t(int, abs(ioc->busy_level),
+   ARRAY_SIZE(vrate_adj_pct) - 1);
+   u32 adj_pct = vrate_adj_pct[idx];
+
+   if (ioc->busy_level > 0)
+   adj_pct = 100 - adj_pct;
+   else
+   adj_pct = 100 + adj_pct;
+
+   vrate = clamp(DIV64_U64_ROUND_UP(vrate * adj_pct, 100),
+ vrate_min, vrate_max);
+   }
+
+   trace_iocost_ioc_vrate_adj(ioc, vrate, missed_ppm, rq_wait_pct,
+  nr_lagging, nr_shortages);
+
+   ioc->vtime_base_rate = vrate;
+   ioc_refresh_margins(ioc);
+}
+
 /* take a snapshot of the current [v]time and vrate */
 static void ioc_now(struct ioc *ioc, struct ioc_now *now)
 {
@@ -2323,51 +2375,8 @@ static void ioc_timer_fn(struct timer_list *timer)
 
ioc->busy_level = clamp(ioc->busy_level, -1000, 1000);
 
-   if (ioc->busy_level > 0 || (ioc->busy_level < 0 && !nr_lagging)) {
-   u64 vrate = ioc->vtime_base_rate;
-   u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max;
-
-   /* rq_wait signal is always reliable, ignore user vrate_min */
-   if (rq_wait_pct > RQ_WAIT_BUSY_PCT)
-   vrate_min = VRATE_MIN;
-
-   /*
-* If vrate is out of bounds, apply clamp gradually as the
-* bounds can change abruptly.  Otherwise, apply busy_level
-* based adjustment.
-*/
-   if (vrate < vrate_min) {
-   vrate = div64_u64(vrate * (100 + VRATE_CLAMP_ADJ_PCT),
- 100);
-   vrate = min(vrate, vrate_min);
-   } else if (vrate > vrate_max) {
-   vrate = div64_u64(vrate * (100 - VRATE_CLAMP_ADJ_PCT),
- 100);
-   vrate = max(vrate, vrate_max);
-   } else {
-   int idx = min_t(int, abs(ioc->busy_level),
-   ARRAY_SIZE(vrate_adj_pct) - 1);
-   u32 adj_pct = vrate_adj_pct[idx];
-
-   if (ioc->busy_level > 0)
-   adj_pct = 100 - adj_pct;
-   else
-   adj_pct = 100 + adj_pct;
-
-   vrate = clamp(DIV64_U64_ROUND_UP(vrate * adj_pct, 100),
- vrate_min, vrate_max);
-   }
-
-   trace_iocost_ioc_vrate_adj(ioc, vrate, missed_ppm, rq_wait_pct,
-  nr_lagging, nr_shortages);
-
-   ioc->vtime_base_rate = vrate;
-   ioc_refresh_margins(ioc);
-   } else if (ioc->bu

[PATCH v2 1/5] blk-iocost: Fix some typos in comments

2020-11-26 Thread Baolin Wang
Fix some typos in comments.

Signed-off-by: Baolin Wang 
Acked-by: Tejun Heo 
---
 block/blk-iocost.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index bbe86d1..4ffde36 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -39,7 +39,7 @@
  * On top of that, a size cost proportional to the length of the IO is
  * added.  While simple, this model captures the operational
  * characteristics of a wide varienty of devices well enough.  Default
- * paramters for several different classes of devices are provided and the
+ * parameters for several different classes of devices are provided and the
  * parameters can be configured from userspace via
  * /sys/fs/cgroup/io.cost.model.
  *
@@ -77,7 +77,7 @@
  *
  * This constitutes the basis of IO capacity distribution.  Each cgroup's
  * vtime is running at a rate determined by its hweight.  A cgroup tracks
- * the vtime consumed by past IOs and can issue a new IO iff doing so
+ * the vtime consumed by past IOs and can issue a new IO if doing so
  * wouldn't outrun the current device vtime.  Otherwise, the IO is
  * suspended until the vtime has progressed enough to cover it.
  *
@@ -155,7 +155,7 @@
  * Instead of debugfs or other clumsy monitoring mechanisms, this
  * controller uses a drgn based monitoring script -
  * tools/cgroup/iocost_monitor.py.  For details on drgn, please see
- * https://github.com/osandov/drgn.  The ouput looks like the following.
+ * https://github.com/osandov/drgn.  The output looks like the following.
  *
  *  sdb RUN   per=300ms cur_per=234.218:v203.695 busy= +1 vrate= 62.12%
  * active  weight  hweight% inflt% dbt  delay usages%
@@ -492,7 +492,7 @@ struct ioc_gq {
/*
 * `vtime` is this iocg's vtime cursor which progresses as IOs are
 * issued.  If lagging behind device vtime, the delta represents
-* the currently available IO budget.  If runnning ahead, the
+* the currently available IO budget.  If running ahead, the
 * overage.
 *
 * `vtime_done` is the same but progressed on completion rather
@@ -1046,7 +1046,7 @@ static void __propagate_weights(struct ioc_gq *iocg, u32 
active, u32 inuse,
 
/*
 * The delta between inuse and active sums indicates that
-* that much of weight is being given away.  Parent's inuse
+* much of weight is being given away.  Parent's inuse
 * and active should reflect the ratio.
 */
if (parent->child_active_sum) {
@@ -2400,7 +2400,7 @@ static u64 adjust_inuse_and_calc_cost(struct ioc_gq 
*iocg, u64 vtime,
return cost;
 
/*
-* We only increase inuse during period and do so iff the margin has
+* We only increase inuse during period and do so if the margin has
 * deteriorated since the previous adjustment.
 */
if (margin >= iocg->saved_margin || margin >= margins->low ||
-- 
1.8.3.1



Re: [PATCH 4/7] blk-iocost: Add a flag to indicate if need update hwi

2020-11-25 Thread Baolin Wang




Hello,

On Tue, Nov 24, 2020 at 11:33:33AM +0800, Baolin Wang wrote:

@@ -1445,7 +1447,8 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, bool 
pay_debt,
 * after the above debt payment.
 */
ctx.vbudget = vbudget;
-   current_hweight(iocg, NULL, &hw_inuse);
+   if (need_update_hwi)
+   current_hweight(iocg, NULL, &hw_inuse);


So, if you look at the implementation of current_hweight(), it's

1. If nothing has changed, read out the cached values.
2. If something has changed, recalculate.


Yes, correct.



and the "something changed" test is single memory read (most likely L1 hot
at this point) and testing for equality. IOW, the change you're suggesting
isn't much of an optimization. Maybe the compiler can do a somewhat better
job of arranging the code and it's a register load than memory load but
given that it's already a relatively cold wait path, this is unlikely to
make any actual difference. And that's how current_hweight() is meant to be
used.


What I want to avoid is the 'atomic_read(&ioc->hweight_gen)' in
current_hweight(), because this is not a register load and is always a
memory load. But the introduced flag can be cached and is lighter than a
memory load.


But after thinking about it more, I think we can just move the
"current_hweight(iocg, NULL, &hw_inuse);" call to the correct place to
optimize the code, without introducing a new flag. What do you think of
the code below?


diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index bbe86d1..db29200 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1413,7 +1413,7 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, 
bool pay_debt,


lockdep_assert_held(&iocg->waitq.lock);

-   current_hweight(iocg, &hwa, NULL);
+   current_hweight(iocg, &hwa, &hw_inuse);
vbudget = now->vnow - atomic64_read(&iocg->vtime);

/* pay off debt */
@@ -1428,6 +1428,11 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, 
bool pay_debt,

atomic64_add(vpay, &iocg->done_vtime);
iocg_pay_debt(iocg, abs_vpay, now);
vbudget -= vpay;
+   /*
+* As paying off debt restores hw_inuse, it must be read 
after

+* the above debt payment.
+*/
+   current_hweight(iocg, NULL, &hw_inuse);
}

if (iocg->abs_vdebt || iocg->delay)
@@ -1446,11 +1451,9 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, 
bool pay_debt,


/*
 * Wake up the ones which are due and see how much vtime we'll 
need for
-* the next one. As paying off debt restores hw_inuse, it must 
be read

-* after the above debt payment.
+* the next one.
 */
ctx.vbudget = vbudget;
-   current_hweight(iocg, NULL, &hw_inuse);

__wake_up_locked_key(&iocg->waitq, TASK_NORMAL, &ctx);



So, I'm not sure this is an improvement. It increases complication without
actually gaining anything.

Thanks.
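
To illustrate the caching pattern discussed above — a simplified,
hypothetical sketch of generation-based caching, not the actual
current_hweight() implementation (recompute_active()/recompute_inuse() are
placeholder helpers, not real blk-iocost functions):

struct hw_cache {
	atomic_t *gen;		/* shared generation counter, bumped on weight changes */
	int cached_gen;		/* generation the cached values were computed at */
	u32 cached_active;
	u32 cached_inuse;
};

static void hw_cache_read(struct hw_cache *c, u32 *activep, u32 *inusep)
{
	int cur_gen = atomic_read(c->gen);

	if (c->cached_gen != cur_gen) {
		/* Something changed: recompute (placeholders for the real math). */
		c->cached_active = recompute_active();
		c->cached_inuse = recompute_inuse();
		c->cached_gen = cur_gen;
	}

	/* Otherwise this is just an equality test plus cached reads. */
	if (activep)
		*activep = c->cached_active;
	if (inusep)
		*inusep = c->cached_inuse;
}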



Re: [PATCH 7/7] blk-iocost: Factor out the base vrate change into a separate function

2020-11-25 Thread Baolin Wang




Hello,

On Tue, Nov 24, 2020 at 11:33:36AM +0800, Baolin Wang wrote:

@@ -2320,45 +2358,11 @@ static void ioc_timer_fn(struct timer_list *timer)
ioc->busy_level = clamp(ioc->busy_level, -1000, 1000);
  
  	if (ioc->busy_level > 0 || (ioc->busy_level < 0 && !nr_lagging)) {

-   u64 vrate = ioc->vtime_base_rate;
-   u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max;

...

+   trace_iocost_ioc_vrate_adj(ioc, ioc->vtime_base_rate,
+  missed_ppm, rq_wait_pct,
   nr_lagging, nr_shortages);
-
-   ioc->vtime_base_rate = vrate;
-   ioc_refresh_margins(ioc);
} else if (ioc->busy_level != prev_busy_level || nr_lagging) {
trace_iocost_ioc_vrate_adj(ioc, atomic64_read(&ioc->vtime_rate),
   missed_ppm, rq_wait_pct, nr_lagging,


I think it'd be better to factor out the surrounding if/else block together


OK.


(as early exit if blocks). Also, how about ioc_adjust_base_vrate()?


Sure, will rename it in next version. Thanks.


Re: [PATCH 6/7] blk-iocost: Factor out the active iocgs' state check into a separate function

2020-11-25 Thread Baolin Wang




Hello,

On Tue, Nov 24, 2020 at 11:33:35AM +0800, Baolin Wang wrote:

-static void ioc_timer_fn(struct timer_list *timer)
+/*
+ * Waiters determine the sleep durations based on the vrate they
+ * saw at the time of sleep.  If vrate has increased, some waiters
+ * could be sleeping for too long.  Wake up tardy waiters which
+ * should have woken up in the last period and expire idle iocgs.
+ */


Please reflow the comment.


Sure.



...

+   nr_debtors = ioc_check_iocg_state(ioc, &now);


How about ioc_check_iocgs()?


Yes, sounds reasonable to me.




+
commit_weights(ioc);


I think it'd make more sense to move commit_weights() inside the new
function.


OK. Thanks.


Re: [PATCH 5/7] blk-iocost: Move the usage ratio calculation to the correct place

2020-11-25 Thread Baolin Wang




On 2020/11/25 20:19, Tejun Heo wrote:

Hello,

@@ -2225,6 +2207,25 @@ static void ioc_timer_fn(struct timer_list *timer)
 time_before64(vtime, now.vnow - ioc->margins.low))) {
u32 hwa, old_hwi, hwm, new_hwi;
  
+			if (vdone != vtime) {

+   u64 inflight_us = DIV64_U64_ROUND_UP(
+   cost_to_abs_cost(vtime - vdone, 
hw_inuse),
+   ioc->vtime_base_rate);
+
+   usage_us = max(usage_us, inflight_us);
+   }
+
+   /* convert to hweight based usage ratio */
+   if (time_after64(iocg->activated_at, ioc->period_at))
+   usage_dur = max_t(u64, now.now - 
iocg->activated_at, 1);
+   else
+   usage_dur = max_t(u64, now.now - 
ioc->period_at, 1);
+
+   usage = clamp_t(u32,
+   DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE,
+  usage_dur),
+   1, WEIGHT_ONE);


Can you please move the variable declarations inside the block together with
the code?


Yes, sure. Will do in next version. Thanks.


Re: [PATCH 3/7] blk-iocost: Just open code the q_name()

2020-11-25 Thread Baolin Wang





On Tue, Nov 24, 2020 at 11:33:32AM +0800, Baolin Wang wrote:

The simple q_name() function is only called from ioc_name(),
just open code it to make code more readable.

Signed-off-by: Baolin Wang 


I'm not sure this is an improvement. Either way seems fine to me. So, why
change?


Yes, this may not be called an 'improvement'. Just from my taste in
reading code, there is no need to factor out a separate function that only
contains 4 lines of code and is called from a single place. Anyway, I can
drop this patch if you think it is unnecessary. Thanks.


[PATCH 2/7] blk-iocost: Remove unnecessary advance declaration

2020-11-23 Thread Baolin Wang
Remove unnecessary advance declaration of struct ioc_gq.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 4ffde36..103ccbd 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -370,8 +370,6 @@ enum {
AUTOP_SSD_FAST,
 };
 
-struct ioc_gq;
-
 struct ioc_params {
u32 qos[NR_QOS_PARAMS];
u64 i_lcoefs[NR_I_LCOEFS];
-- 
1.8.3.1



[PATCH 7/7] blk-iocost: Factor out the base vrate change into a separate function

2020-11-23 Thread Baolin Wang
Factor out the base vrate change code into a separate function
to simplify the ioc_timer_fn().

No functional change.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 78 --
 1 file changed, 41 insertions(+), 37 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index db4f894..739c8d4 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -968,6 +968,44 @@ static void ioc_refresh_vrate(struct ioc *ioc, struct 
ioc_now *now)
ioc->vtime_err = clamp(ioc->vtime_err, -vperiod, vperiod);
 }
 
+static void ioc_refresh_base_vrate(struct ioc *ioc, u32 rq_wait_pct)
+{
+   u64 vrate = ioc->vtime_base_rate;
+   u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max;
+
+   /* rq_wait signal is always reliable, ignore user vrate_min */
+   if (rq_wait_pct > RQ_WAIT_BUSY_PCT)
+   vrate_min = VRATE_MIN;
+
+   /*
+* If vrate is out of bounds, apply clamp gradually as the
+* bounds can change abruptly.  Otherwise, apply busy_level
+* based adjustment.
+*/
+   if (vrate < vrate_min) {
+   vrate = div64_u64(vrate * (100 + VRATE_CLAMP_ADJ_PCT), 100);
+   vrate = min(vrate, vrate_min);
+   } else if (vrate > vrate_max) {
+   vrate = div64_u64(vrate * (100 - VRATE_CLAMP_ADJ_PCT), 100);
+   vrate = max(vrate, vrate_max);
+   } else {
+   int idx = min_t(int, abs(ioc->busy_level),
+   ARRAY_SIZE(vrate_adj_pct) - 1);
+   u32 adj_pct = vrate_adj_pct[idx];
+
+   if (ioc->busy_level > 0)
+   adj_pct = 100 - adj_pct;
+   else
+   adj_pct = 100 + adj_pct;
+
+   vrate = clamp(DIV64_U64_ROUND_UP(vrate * adj_pct, 100),
+ vrate_min, vrate_max);
+   }
+
+   ioc->vtime_base_rate = vrate;
+   ioc_refresh_margins(ioc);
+}
+
 /* take a snapshot of the current [v]time and vrate */
 static void ioc_now(struct ioc *ioc, struct ioc_now *now)
 {
@@ -2320,45 +2358,11 @@ static void ioc_timer_fn(struct timer_list *timer)
ioc->busy_level = clamp(ioc->busy_level, -1000, 1000);
 
if (ioc->busy_level > 0 || (ioc->busy_level < 0 && !nr_lagging)) {
-   u64 vrate = ioc->vtime_base_rate;
-   u64 vrate_min = ioc->vrate_min, vrate_max = ioc->vrate_max;
-
-   /* rq_wait signal is always reliable, ignore user vrate_min */
-   if (rq_wait_pct > RQ_WAIT_BUSY_PCT)
-   vrate_min = VRATE_MIN;
-
-   /*
-* If vrate is out of bounds, apply clamp gradually as the
-* bounds can change abruptly.  Otherwise, apply busy_level
-* based adjustment.
-*/
-   if (vrate < vrate_min) {
-   vrate = div64_u64(vrate * (100 + VRATE_CLAMP_ADJ_PCT),
- 100);
-   vrate = min(vrate, vrate_min);
-   } else if (vrate > vrate_max) {
-   vrate = div64_u64(vrate * (100 - VRATE_CLAMP_ADJ_PCT),
- 100);
-   vrate = max(vrate, vrate_max);
-   } else {
-   int idx = min_t(int, abs(ioc->busy_level),
-   ARRAY_SIZE(vrate_adj_pct) - 1);
-   u32 adj_pct = vrate_adj_pct[idx];
-
-   if (ioc->busy_level > 0)
-   adj_pct = 100 - adj_pct;
-   else
-   adj_pct = 100 + adj_pct;
+   ioc_refresh_base_vrate(ioc, rq_wait_pct);
 
-   vrate = clamp(DIV64_U64_ROUND_UP(vrate * adj_pct, 100),
- vrate_min, vrate_max);
-   }
-
-   trace_iocost_ioc_vrate_adj(ioc, vrate, missed_ppm, rq_wait_pct,
+   trace_iocost_ioc_vrate_adj(ioc, ioc->vtime_base_rate,
+  missed_ppm, rq_wait_pct,
   nr_lagging, nr_shortages);
-
-   ioc->vtime_base_rate = vrate;
-   ioc_refresh_margins(ioc);
} else if (ioc->busy_level != prev_busy_level || nr_lagging) {
trace_iocost_ioc_vrate_adj(ioc, atomic64_read(&ioc->vtime_rate),
   missed_ppm, rq_wait_pct, nr_lagging,
-- 
1.8.3.1



[PATCH 0/7] Some cleanups and improvements for blk-iocost

2020-11-23 Thread Baolin Wang
Hi,

This patch set did some cleanups and improvements for blk-iocost, and
no big functional changes. Please help to review. Thanks.

Baolin Wang (7):
  blk-iocost: Fix some typos in comments
  blk-iocost: Remove unnecessary advance declaration
  blk-iocost: Just open code the q_name()
  blk-iocost: Add a flag to indicate if need update hwi
  blk-iocost: Move the usage ratio calculation to the correct place
  blk-iocost: Factor out the active iocgs' state check into a separate  
  function
  blk-iocost: Factor out the base vrate change into a separate function

 block/blk-iocost.c | 238 -
 1 file changed, 126 insertions(+), 112 deletions(-)

-- 
1.8.3.1



[PATCH 5/7] blk-iocost: Move the usage ratio calculation to the correct place

2020-11-23 Thread Baolin Wang
We only use the hweight based usage ratio to calculate the new
hweight_inuse of the iocg to decide if this iocg can donate some
surplus vtime.

Thus move the usage ratio calculation to the correct place to
avoid unnecessary calculation for some vtime shortage iocgs.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 37 +++--
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 5305afd..e36cd8e 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2200,24 +2200,6 @@ static void ioc_timer_fn(struct timer_list *timer)
usage_us = iocg->usage_delta_us;
usage_us_sum += usage_us;
 
-   if (vdone != vtime) {
-   u64 inflight_us = DIV64_U64_ROUND_UP(
-   cost_to_abs_cost(vtime - vdone, hw_inuse),
-   ioc->vtime_base_rate);
-   usage_us = max(usage_us, inflight_us);
-   }
-
-   /* convert to hweight based usage ratio */
-   if (time_after64(iocg->activated_at, ioc->period_at))
-   usage_dur = max_t(u64, now.now - iocg->activated_at, 1);
-   else
-   usage_dur = max_t(u64, now.now - ioc->period_at, 1);
-
-   usage = clamp_t(u32,
-   DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE,
-  usage_dur),
-   1, WEIGHT_ONE);
-
/* see whether there's surplus vtime */
WARN_ON_ONCE(!list_empty(>surplus_list));
if (hw_inuse < hw_active ||
@@ -2225,6 +2207,25 @@ static void ioc_timer_fn(struct timer_list *timer)
 time_before64(vtime, now.vnow - ioc->margins.low))) {
u32 hwa, old_hwi, hwm, new_hwi;
 
+   if (vdone != vtime) {
+   u64 inflight_us = DIV64_U64_ROUND_UP(
+   cost_to_abs_cost(vtime - vdone, 
hw_inuse),
+   ioc->vtime_base_rate);
+
+   usage_us = max(usage_us, inflight_us);
+   }
+
+   /* convert to hweight based usage ratio */
+   if (time_after64(iocg->activated_at, ioc->period_at))
+   usage_dur = max_t(u64, now.now - 
iocg->activated_at, 1);
+   else
+   usage_dur = max_t(u64, now.now - 
ioc->period_at, 1);
+
+   usage = clamp_t(u32,
+   DIV64_U64_ROUND_UP(usage_us * WEIGHT_ONE,
+  usage_dur),
+   1, WEIGHT_ONE);
+
/*
 * Already donating or accumulated enough to start.
 * Determine the donation amount.
-- 
1.8.3.1



[PATCH 3/7] blk-iocost: Just open code the q_name()

2020-11-23 Thread Baolin Wang
The simple q_name() function is only called from ioc_name(), so just
open code it to make the code more readable.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 103ccbd..089f3fe 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -665,17 +665,14 @@ static struct ioc *q_to_ioc(struct request_queue *q)
return rqos_to_ioc(rq_qos_id(q, RQ_QOS_COST));
 }
 
-static const char *q_name(struct request_queue *q)
+static const char __maybe_unused *ioc_name(struct ioc *ioc)
 {
+   struct request_queue *q = ioc->rqos.q;
+
if (blk_queue_registered(q))
return kobject_name(q->kobj.parent);
-   else
-   return "";
-}
 
-static const char __maybe_unused *ioc_name(struct ioc *ioc)
-{
-   return q_name(ioc->rqos.q);
+   return "";
 }
 
 static struct ioc_gq *pd_to_iocg(struct blkg_policy_data *pd)
-- 
1.8.3.1



[PATCH 6/7] blk-iocost: Factor out the active iocgs' state check into a separate function

2020-11-23 Thread Baolin Wang
Factor out the iocgs' state check into a separate function to
simplify the ioc_timer_fn().

No functional change.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 91 ++
 1 file changed, 51 insertions(+), 40 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index e36cd8e..db4f894 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -2069,40 +2069,17 @@ static void ioc_forgive_debts(struct ioc *ioc, u64 
usage_us_sum, int nr_debtors,
}
 }
 
-static void ioc_timer_fn(struct timer_list *timer)
+/*
+ * Waiters determine the sleep durations based on the vrate they
+ * saw at the time of sleep.  If vrate has increased, some waiters
+ * could be sleeping for too long.  Wake up tardy waiters which
+ * should have woken up in the last period and expire idle iocgs.
+ */
+static int ioc_check_iocg_state(struct ioc *ioc, struct ioc_now *now)
 {
-   struct ioc *ioc = container_of(timer, struct ioc, timer);
+   int nr_debtors = 0;
struct ioc_gq *iocg, *tiocg;
-   struct ioc_now now;
-   LIST_HEAD(surpluses);
-   int nr_debtors = 0, nr_shortages = 0, nr_lagging = 0;
-   u64 usage_us_sum = 0;
-   u32 ppm_rthr = MILLION - ioc->params.qos[QOS_RPPM];
-   u32 ppm_wthr = MILLION - ioc->params.qos[QOS_WPPM];
-   u32 missed_ppm[2], rq_wait_pct;
-   u64 period_vtime;
-   int prev_busy_level;
 
-   /* how were the latencies during the period? */
-   ioc_lat_stat(ioc, missed_ppm, _wait_pct);
-
-   /* take care of active iocgs */
-   spin_lock_irq(>lock);
-
-   ioc_now(ioc, );
-
-   period_vtime = now.vnow - ioc->period_at_vtime;
-   if (WARN_ON_ONCE(!period_vtime)) {
-   spin_unlock_irq(>lock);
-   return;
-   }
-
-   /*
-* Waiters determine the sleep durations based on the vrate they
-* saw at the time of sleep.  If vrate has increased, some waiters
-* could be sleeping for too long.  Wake up tardy waiters which
-* should have woken up in the last period and expire idle iocgs.
-*/
list_for_each_entry_safe(iocg, tiocg, >active_iocgs, active_list) {
if (!waitqueue_active(>waitq) && !iocg->abs_vdebt &&
!iocg->delay && !iocg_is_idle(iocg))
@@ -2112,24 +2089,24 @@ static void ioc_timer_fn(struct timer_list *timer)
 
/* flush wait and indebt stat deltas */
if (iocg->wait_since) {
-   iocg->local_stat.wait_us += now.now - iocg->wait_since;
-   iocg->wait_since = now.now;
+   iocg->local_stat.wait_us += now->now - iocg->wait_since;
+   iocg->wait_since = now->now;
}
if (iocg->indebt_since) {
iocg->local_stat.indebt_us +=
-   now.now - iocg->indebt_since;
-   iocg->indebt_since = now.now;
+   now->now - iocg->indebt_since;
+   iocg->indebt_since = now->now;
}
if (iocg->indelay_since) {
iocg->local_stat.indelay_us +=
-   now.now - iocg->indelay_since;
-   iocg->indelay_since = now.now;
+   now->now - iocg->indelay_since;
+   iocg->indelay_since = now->now;
}
 
if (waitqueue_active(>waitq) || iocg->abs_vdebt ||
iocg->delay) {
/* might be oversleeping vtime / hweight changes, kick 
*/
-   iocg_kick_waitq(iocg, true, );
+   iocg_kick_waitq(iocg, true, now);
if (iocg->abs_vdebt || iocg->delay)
nr_debtors++;
} else if (iocg_is_idle(iocg)) {
@@ -2143,7 +2120,7 @@ static void ioc_timer_fn(struct timer_list *timer)
 * error and throw away. On reactivation, it'll start
 * with the target budget.
 */
-   excess = now.vnow - vtime - ioc->margins.target;
+   excess = now->vnow - vtime - ioc->margins.target;
if (excess > 0) {
u32 old_hwi;
 
@@ -2152,12 +2129,46 @@ static void ioc_timer_fn(struct timer_list *timer)
WEIGHT_ONE);
}
 
-   __propagate_weights(iocg, 0, 0, false, );
+   __propagate_weights(iocg, 0, 0, false, now);
list_del_init(>active_list);
}
 
spin_unlock(&iocg->waitq.lock);

[PATCH 4/7] blk-iocost: Add a flag to indicate if need update hwi

2020-11-23 Thread Baolin Wang
We can get the hwa and hwi at one time if there is no debt to pay off,
so add a flag to indicate whether the hw_inuse has been changed and
needs to be refreshed, which avoids calling current_hweight() twice
for iocgs without debt.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index 089f3fe..5305afd 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -1405,10 +1405,11 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, bool 
pay_debt,
u64 vshortage, expires, oexpires;
s64 vbudget;
u32 hwa;
+   bool need_update_hwi = false;
 
lockdep_assert_held(>waitq.lock);
 
-   current_hweight(iocg, , NULL);
+   current_hweight(iocg, , _inuse);
vbudget = now->vnow - atomic64_read(>vtime);
 
/* pay off debt */
@@ -1423,6 +1424,7 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, bool 
pay_debt,
atomic64_add(vpay, >done_vtime);
iocg_pay_debt(iocg, abs_vpay, now);
vbudget -= vpay;
+   need_update_hwi = true;
}
 
if (iocg->abs_vdebt || iocg->delay)
@@ -1445,7 +1447,8 @@ static void iocg_kick_waitq(struct ioc_gq *iocg, bool 
pay_debt,
 * after the above debt payment.
 */
ctx.vbudget = vbudget;
-   current_hweight(iocg, NULL, _inuse);
+   if (need_update_hwi)
+   current_hweight(iocg, NULL, _inuse);
 
__wake_up_locked_key(>waitq, TASK_NORMAL, );
 
-- 
1.8.3.1



[PATCH 1/7] blk-iocost: Fix some typos in comments

2020-11-23 Thread Baolin Wang
Fix some typos in comments.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index bbe86d1..4ffde36 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -39,7 +39,7 @@
  * On top of that, a size cost proportional to the length of the IO is
  * added.  While simple, this model captures the operational
  * characteristics of a wide varienty of devices well enough.  Default
- * paramters for several different classes of devices are provided and the
+ * parameters for several different classes of devices are provided and the
  * parameters can be configured from userspace via
  * /sys/fs/cgroup/io.cost.model.
  *
@@ -77,7 +77,7 @@
  *
  * This constitutes the basis of IO capacity distribution.  Each cgroup's
  * vtime is running at a rate determined by its hweight.  A cgroup tracks
- * the vtime consumed by past IOs and can issue a new IO iff doing so
+ * the vtime consumed by past IOs and can issue a new IO if doing so
  * wouldn't outrun the current device vtime.  Otherwise, the IO is
  * suspended until the vtime has progressed enough to cover it.
  *
@@ -155,7 +155,7 @@
  * Instead of debugfs or other clumsy monitoring mechanisms, this
  * controller uses a drgn based monitoring script -
  * tools/cgroup/iocost_monitor.py.  For details on drgn, please see
- * https://github.com/osandov/drgn.  The ouput looks like the following.
+ * https://github.com/osandov/drgn.  The output looks like the following.
  *
  *  sdb RUN   per=300ms cur_per=234.218:v203.695 busy= +1 vrate= 62.12%
  * active  weight  hweight% inflt% dbt  delay usages%
@@ -492,7 +492,7 @@ struct ioc_gq {
/*
 * `vtime` is this iocg's vtime cursor which progresses as IOs are
 * issued.  If lagging behind device vtime, the delta represents
-* the currently available IO budget.  If runnning ahead, the
+* the currently available IO budget.  If running ahead, the
 * overage.
 *
 * `vtime_done` is the same but progressed on completion rather
@@ -1046,7 +1046,7 @@ static void __propagate_weights(struct ioc_gq *iocg, u32 
active, u32 inuse,
 
/*
 * The delta between inuse and active sums indicates that
-* that much of weight is being given away.  Parent's inuse
+* much of weight is being given away.  Parent's inuse
 * and active should reflect the ratio.
 */
if (parent->child_active_sum) {
@@ -2400,7 +2400,7 @@ static u64 adjust_inuse_and_calc_cost(struct ioc_gq 
*iocg, u64 vtime,
return cost;
 
/*
-* We only increase inuse during period and do so iff the margin has
+* We only increase inuse during period and do so if the margin has
 * deteriorated since the previous adjustment.
 */
if (margin >= iocg->saved_margin || margin >= margins->low ||
-- 
1.8.3.1



Re: [PATCH] arm64: PCI: Validate the node before setting node id for root bus

2020-11-09 Thread Baolin Wang

Hi Lorenzo,




On Tue, Sep 29, 2020 at 11:41:29PM +0800, Baolin Wang wrote:

Hi,

On 2020/9/28 23:23, Lorenzo Pieralisi wrote:

On Mon, Sep 28, 2020 at 10:49:57PM +0800, Baolin Wang wrote:

On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:

[+ Lorenzo]

On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:

If the BIOS disables the NUMA configuration but does not change the
proximity domain description in the SRAT table, the PCI root bus
device may get an incorrect node id from acpi_get_node().


How "incorrect" are we talking here? What actually goes wrong? At 
some

point, we have to trust what the firmware is telling us.


What I mean is, if we disable the NUMA from BIOS


Please define what this means ie are you removing SRAT from ACPI static
tables ?


Yes.




but we did not change the PXM for the PCI devices,


If a _PXM maps to a proximity domain that is not described in the SRAT
your firmware is buggy.


Sorry for the confusion, that's not what I mean. When the BIOS disables
NUMA (removes the SRAT table), the PCI devices' _PXM description is still
available, which means we can still get the pxm from acpi_evaluate_integer()
in this case.


There should not be a _PXM object if the SRAT is not available, that's
a firmware bug.


So we can get below inconsistent log on ARM platform:
"No NUMA configuration found
PCI_bus :00 on NUMA node 0
...
PCI_bus :e3 on NUMA node 1"

On X86, the pci_acpi_root_get_node() will validate the node before 
setting

the node id for root bus. So I think we can add this validation for ARM
platform. Or anything else I missed?


We are not adding checks because x86 does it, it is certainly to paper
over a firmware bug that you hopefully still have a chance to fix,
let's do that instead of adding code that is not necessary.


Thanks for your input, and I will check this issue with our firmware 
colleagues again.


Sorry for the late reply.

I did some investigation into this issue. I am sorry the description in
the commit message was misleading. The issue is, when we want to disable
NUMA from the firmware, we usually just remove the SRAT table from the
BIOS, but the devices' proximity domain information still remains in the
ACPI tables.

For example, the IORT table still contains the proximity domain
information for the SMMU devices, so in this case the SMMU devices can
still get incorrect NUMA nodes if we remove the SRAT table. But the SMMU
devices validate the NUMA node in arm_smmu_v3_set_proximity() to avoid
this issue.


static int  __init arm_smmu_v3_set_proximity(struct device *dev,
  struct acpi_iort_node *node)
{
struct acpi_iort_smmu_v3 *smmu;

smmu = (struct acpi_iort_smmu_v3 *)node->node_data;
if (smmu->flags & ACPI_IORT_SMMU_V3_PXM_VALID) {
int dev_node = pxm_to_node(smmu->pxm);

if (dev_node != NUMA_NO_NODE && !node_online(dev_node))
return -EINVAL;

set_dev_node(dev, dev_node);
pr_info("SMMU-v3[%llx] Mapped to Proximity domain %d\n",
smmu->base_address,
smmu->pxm);
}
return 0;
}

So similarly to the SMMU devices, the DSDT table will still contain the
PCI root host devices' proximity domain even though we have removed the
SRAT table. So I think we still need this validation in
pcibios_root_bridge_prepare() to avoid this issue, as other devices do.

I can update the commit message in the next version if you think this is
reasonable. Thanks.


Re: [PATCH] nvme: Simplify the nvme_req_qid()

2020-11-02 Thread Baolin Wang

Hi,


Use the request's '->mq_hctx->queue_num' directly to simplify the
nvme_req_qid() function.

Signed-off-by: Baolin Wang 


Gentle ping?


---
  drivers/nvme/host/nvme.h | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cc3..0b62b62 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -178,7 +178,8 @@ static inline u16 nvme_req_qid(struct request *req)
  {
if (!req->q->queuedata)
return 0;
-   return blk_mq_unique_tag_to_hwq(blk_mq_unique_tag(req)) + 1;
+
+   return req->mq_hctx->queue_num + 1;
  }
  
  /* The below value is the specific amount of delay needed before checking




[PATCH] nvme: Simplify the nvme_req_qid()

2020-10-27 Thread Baolin Wang
Use the request's '->mq_hctx->queue_num' directly to simplify the
nvme_req_qid() function.
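
For reference, this works because the unique tag already encodes the hw
queue index in its upper bits, so unpacking it again is just a round
trip. A rough sketch of the helper being bypassed (an approximation of
blk_mq_unique_tag(), not a new definition; the body here is only
illustrative):

static inline u32 blk_mq_unique_tag_sketch(struct request *rq)
{
	/* hw queue index in the upper 16 bits, per-queue tag below */
	return (rq->mq_hctx->queue_num << BLK_MQ_UNIQUE_TAG_BITS) |
	       (rq->tag & BLK_MQ_UNIQUE_TAG_MASK);
}

So blk_mq_unique_tag_to_hwq(blk_mq_unique_tag(req)) simply recovers
req->mq_hctx->queue_num, which the patch now reads directly.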

Signed-off-by: Baolin Wang 
---
 drivers/nvme/host/nvme.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index cc3..0b62b62 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -178,7 +178,8 @@ static inline u16 nvme_req_qid(struct request *req)
 {
if (!req->q->queuedata)
return 0;
-   return blk_mq_unique_tag_to_hwq(blk_mq_unique_tag(req)) + 1;
+
+   return req->mq_hctx->queue_num + 1;
 }
 
 /* The below value is the specific amount of delay needed before checking
-- 
1.8.3.1



[PATCH v2 8/8] blk-throttle: Re-use the throtl_set_slice_end()

2020-10-07 Thread Baolin Wang
Re-use throtl_set_slice_end() to remove duplicate code.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fc5c14f..b771c42 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -808,7 +808,7 @@ static inline void throtl_set_slice_end(struct throtl_grp 
*tg, bool rw,
 static inline void throtl_extend_slice(struct throtl_grp *tg, bool rw,
   unsigned long jiffy_end)
 {
-   tg->slice_end[rw] = roundup(jiffy_end, tg->td->throtl_slice);
+   throtl_set_slice_end(tg, rw, jiffy_end);
throtl_log(>service_queue,
   "[%c] extend slice start=%lu end=%lu jiffies=%lu",
   rw == READ ? 'R' : 'W', tg->slice_start[rw],
-- 
1.8.3.1



[PATCH v2 0/8] Some improvements for blk throttle

2020-10-07 Thread Baolin Wang
Hi,

This patch set did some improvements for blk throttle, please
help to review. Thanks.

Changes from v1:
 - Add another 4 new patches in this patch set.

Baolin Wang (8):
  blk-throttle: Remove a meaningless parameter for
throtl_downgrade_state()
  blk-throttle: Avoid getting the current time if tg->last_finish_time
is 0
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Move the list operation after list validation
  blk-throttle: Move service tree validation out of the
throtl_rb_first()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Re-use the throtl_set_slice_end()

 block/blk-throttle.c | 69 ++--
 1 file changed, 35 insertions(+), 34 deletions(-)

-- 
1.8.3.1



[PATCH v2 7/8] blk-throttle: Open code __throtl_de/enqueue_tg()

2020-10-07 Thread Baolin Wang
The __throtl_de/enqueue_tg() functions are only called by
throtl_de/enqueue_tg(), thus we can just open code them to
make the code more readable.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 26 +-
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 38aed8b..fc5c14f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -691,29 +691,21 @@ static void tg_service_queue_add(struct throtl_grp *tg)
   leftmost);
 }
 
-static void __throtl_enqueue_tg(struct throtl_grp *tg)
-{
-   tg_service_queue_add(tg);
-   tg->flags |= THROTL_TG_PENDING;
-   tg->service_queue.parent_sq->nr_pending++;
-}
-
 static void throtl_enqueue_tg(struct throtl_grp *tg)
 {
-   if (!(tg->flags & THROTL_TG_PENDING))
-   __throtl_enqueue_tg(tg);
-}
-
-static void __throtl_dequeue_tg(struct throtl_grp *tg)
-{
-   throtl_rb_erase(>rb_node, tg->service_queue.parent_sq);
-   tg->flags &= ~THROTL_TG_PENDING;
+   if (!(tg->flags & THROTL_TG_PENDING)) {
+   tg_service_queue_add(tg);
+   tg->flags |= THROTL_TG_PENDING;
+   tg->service_queue.parent_sq->nr_pending++;
+   }
 }
 
 static void throtl_dequeue_tg(struct throtl_grp *tg)
 {
-   if (tg->flags & THROTL_TG_PENDING)
-   __throtl_dequeue_tg(tg);
+   if (tg->flags & THROTL_TG_PENDING) {
+   throtl_rb_erase(>rb_node, tg->service_queue.parent_sq);
+   tg->flags &= ~THROTL_TG_PENDING;
+   }
 }
 
 /* Call with queue lock held */
-- 
1.8.3.1



[PATCH v2 3/8] blk-throttle: Avoid tracking latency if low limit is invalid

2020-10-07 Thread Baolin Wang
The IO latency tracking is only used for the LOW limit, so we should
add a check to avoid redundant latency tracking if the LOW limit
is not valid.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 7e72102..b0d8f7c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2100,7 +2100,7 @@ static void throtl_update_latency_buckets(struct 
throtl_data *td)
unsigned long last_latency[2] = { 0 };
unsigned long latency[2];
 
-   if (!blk_queue_nonrot(td->queue))
+   if (!blk_queue_nonrot(td->queue) || !td->limit_valid[LIMIT_LOW])
return;
if (time_before(jiffies, td->last_calculate_time + HZ))
return;
@@ -2338,6 +2338,8 @@ void blk_throtl_bio_endio(struct bio *bio)
if (!blkg)
return;
tg = blkg_to_tg(blkg);
+   if (!tg->td->limit_valid[LIMIT_LOW])
+   return;
 
finish_time_ns = ktime_get_ns();
tg->last_finish_time = finish_time_ns >> 10;
-- 
1.8.3.1



[PATCH v2 4/8] blk-throttle: Fix IO hang for a corner case

2020-10-07 Thread Baolin Wang
throtl_adjusted_limit() cannot scale the limit up if bps or iops is set
to 1, which will cause an IO hang when the low limit is enabled. Thus we
should treat 1 as an illegal value to avoid this issue.
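
To see why, here is a minimal sketch of the scale-up arithmetic (an
approximation of throtl_adjusted_limit(), not the exact kernel code; the
shift-by-3 scaling below is an assumption for illustration):

static u64 adjusted_limit_sketch(u64 low, unsigned int scale)
{
	/* the low limit grows over time roughly as low + (low >> 3) * scale */
	return low + (low >> 3) * scale;
}

With low == 1, (low >> 3) == 0, so the adjusted limit stays at 1 no matter
how large the scale grows, and the throttled IO can never speed up again.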

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b0d8f7c..0649bce 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1687,13 +1687,13 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
goto out_finish;
 
ret = -EINVAL;
-   if (!strcmp(tok, "rbps"))
+   if (!strcmp(tok, "rbps") && val > 1)
v[0] = val;
-   else if (!strcmp(tok, "wbps"))
+   else if (!strcmp(tok, "wbps") && val > 1)
v[1] = val;
-   else if (!strcmp(tok, "riops"))
+   else if (!strcmp(tok, "riops") && val > 1)
v[2] = min_t(u64, val, UINT_MAX);
-   else if (!strcmp(tok, "wiops"))
+   else if (!strcmp(tok, "wiops") && val > 1)
v[3] = min_t(u64, val, UINT_MAX);
else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
idle_time = val;
-- 
1.8.3.1



[PATCH v2 2/8] blk-throttle: Avoid getting the current time if tg->last_finish_time is 0

2020-10-07 Thread Baolin Wang
We only update tg->last_finish_time when the low limitation is
enabled, so we can move the tg->last_finish_time check a little
earlier to avoid getting the current timestamp unnecessarily if the
low limitation is not enabled.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4007b26..7e72102 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2077,10 +2077,14 @@ static void throtl_downgrade_check(struct throtl_grp 
*tg)
 
 static void blk_throtl_update_idletime(struct throtl_grp *tg)
 {
-   unsigned long now = ktime_get_ns() >> 10;
+   unsigned long now;
unsigned long last_finish_time = tg->last_finish_time;
 
-   if (now <= last_finish_time || last_finish_time == 0 ||
+   if (last_finish_time == 0)
+   return;
+
+   now = ktime_get_ns() >> 10;
+   if (now <= last_finish_time ||
last_finish_time == tg->checked_last_finish_time)
return;
 
-- 
1.8.3.1



[PATCH v2 6/8] blk-throttle: Move service tree validation out of the throtl_rb_first()

2020-10-07 Thread Baolin Wang
throtl_schedule_next_dispatch() checks whether the service queue is
empty before calling update_min_dispatch_time(), and
update_min_dispatch_time() calls throtl_rb_first(), which validates
the service queue again.

Thus we can move the service queue validation out of
throtl_rb_first() to remove the redundant validation in the fast path.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index f1bcb5c..38aed8b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -638,9 +638,6 @@ static void throtl_pd_free(struct blkg_policy_data *pd)
 throtl_rb_first(struct throtl_service_queue *parent_sq)
 {
struct rb_node *n;
-   /* Service tree is empty */
-   if (!parent_sq->nr_pending)
-   return NULL;
 
n = rb_first_cached(_sq->pending_tree);
WARN_ON_ONCE(!n);
@@ -1224,9 +1221,13 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
unsigned int nr_disp = 0;
 
while (1) {
-   struct throtl_grp *tg = throtl_rb_first(parent_sq);
+   struct throtl_grp *tg;
struct throtl_service_queue *sq;
 
+   if (!parent_sq->nr_pending)
+   break;
+
+   tg = throtl_rb_first(parent_sq);
if (!tg)
break;
 
-- 
1.8.3.1



[PATCH v2 1/8] blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()

2020-10-07 Thread Baolin Wang
throtl_downgrade_state() is always used to change to the LIMIT_LOW
limitation, thus remove the meaningless second parameter which
indicates the limitation index.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 36ba61c..4007b26 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1970,7 +1970,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
queue_work(kthrotld_workqueue, >dispatch_work);
 }
 
-static void throtl_downgrade_state(struct throtl_data *td, int new)
+static void throtl_downgrade_state(struct throtl_data *td)
 {
td->scale /= 2;
 
@@ -1980,7 +1980,7 @@ static void throtl_downgrade_state(struct throtl_data 
*td, int new)
return;
}
 
-   td->limit_index = new;
+   td->limit_index = LIMIT_LOW;
td->low_downgrade_time = jiffies;
 }
 
@@ -2067,7 +2067,7 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 * cgroups
 */
if (throtl_hierarchy_can_downgrade(tg))
-   throtl_downgrade_state(tg->td, LIMIT_LOW);
+   throtl_downgrade_state(tg->td);
 
tg->last_bytes_disp[READ] = 0;
tg->last_bytes_disp[WRITE] = 0;
-- 
1.8.3.1



[PATCH v2 5/8] blk-throttle: Move the list operation after list validation

2020-10-07 Thread Baolin Wang
Move the list_first_entry() call after the list_empty() check, so we do
not fetch an entry pointer from a possibly empty list.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0649bce..f1bcb5c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -423,12 +423,13 @@ static void throtl_qnode_add_bio(struct bio *bio, struct 
throtl_qnode *qn,
  */
 static struct bio *throtl_peek_queued(struct list_head *queued)
 {
-   struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, 
node);
+   struct throtl_qnode *qn;
struct bio *bio;
 
if (list_empty(queued))
return NULL;
 
+   qn = list_first_entry(queued, struct throtl_qnode, node);
bio = bio_list_peek(>bios);
WARN_ON_ONCE(!bio);
return bio;
@@ -451,12 +452,13 @@ static struct bio *throtl_peek_queued(struct list_head 
*queued)
 static struct bio *throtl_pop_queued(struct list_head *queued,
 struct throtl_grp **tg_to_put)
 {
-   struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, 
node);
+   struct throtl_qnode *qn;
struct bio *bio;
 
if (list_empty(queued))
return NULL;
 
+   qn = list_first_entry(queued, struct throtl_qnode, node);
bio = bio_list_pop(>bios);
WARN_ON_ONCE(!bio);
 
-- 
1.8.3.1



Re: [PATCH] block: Remove redundant 'return' statement

2020-10-07 Thread Baolin Wang
Hi,

On Mon, Sep 28, 2020 at 08:42:26AM +0800, Baolin Wang wrote:
> Remove redundant 'return' statement for 'void' functions.
> 
> Signed-off-by: Baolin Wang 

Gentle ping?

> ---
>  block/blk-iocost.c| 2 +-
>  block/blk-iolatency.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-iocost.c b/block/blk-iocost.c
> index ef9476f..e38c406 100644
> --- a/block/blk-iocost.c
> +++ b/block/blk-iocost.c
> @@ -3343,7 +3343,7 @@ static int __init ioc_init(void)
>  
>  static void __exit ioc_exit(void)
>  {
> - return blkcg_policy_unregister(_policy_iocost);
> + blkcg_policy_unregister(_policy_iocost);
>  }
>  
>  module_init(ioc_init);
> diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
> index f90429c..81be009 100644
> --- a/block/blk-iolatency.c
> +++ b/block/blk-iolatency.c
> @@ -1046,7 +1046,7 @@ static int __init iolatency_init(void)
>  
>  static void __exit iolatency_exit(void)
>  {
> - return blkcg_policy_unregister(_policy_iolatency);
> + blkcg_policy_unregister(_policy_iolatency);
>  }
>  
>  module_init(iolatency_init);
> -- 
> 1.8.3.1


Re: [PATCH] arm64: PCI: Validate the node before setting node id for root bus

2020-10-03 Thread Baolin Wang




On Tue, Sep 29, 2020 at 11:41:29PM +0800, Baolin Wang wrote:

Hi,

On 2020/9/28 23:23, Lorenzo Pieralisi wrote:

On Mon, Sep 28, 2020 at 10:49:57PM +0800, Baolin Wang wrote:

On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:

[+ Lorenzo]

On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:

If the BIOS disables the NUMA configuration but does not change the
proximity domain description in the SRAT table, the PCI root bus
device may get an incorrect node id from acpi_get_node().


How "incorrect" are we talking here? What actually goes wrong? At some
point, we have to trust what the firmware is telling us.


What I mean is, if we disable the NUMA from BIOS


Please define what this means ie are you removing SRAT from ACPI static
tables ?


Yes.




but we did not change the PXM for the PCI devices,


If a _PXM maps to a proximity domain that is not described in the SRAT
your firmware is buggy.


Sorry for the confusion, that's not what I mean. When the BIOS disables NUMA
(removes the SRAT table), the PCI devices' _PXM description is still
available, which means we can still get the pxm from acpi_evaluate_integer()
in this case.


There should not be a _PXM object if the SRAT is not available, that's
a firmware bug.


So we can get below inconsistent log on ARM platform:
"No NUMA configuration found
PCI_bus :00 on NUMA node 0
...
PCI_bus :e3 on NUMA node 1"

On X86, the pci_acpi_root_get_node() will validate the node before setting
the node id for root bus. So I think we can add this validation for ARM
platform. Or anything else I missed?


We are not adding checks because x86 does it, it is certainly to paper
over a firmware bug that you hopefully still have a chance to fix,
let's do that instead of adding code that is not necessary.


Thanks for your input, and I will check this issue with our firmware 
colleagues again.


Re: [PATCH] arm64: PCI: Validate the node before setting node id for root bus

2020-09-29 Thread Baolin Wang

Hi,

On 2020/9/28 23:23, Lorenzo Pieralisi wrote:

On Mon, Sep 28, 2020 at 10:49:57PM +0800, Baolin Wang wrote:

On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:

[+ Lorenzo]

On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:

If the BIOS disables the NUMA configuration but does not change the
proximity domain description in the SRAT table, the PCI root bus
device may get an incorrect node id from acpi_get_node().


How "incorrect" are we talking here? What actually goes wrong? At some
point, we have to trust what the firmware is telling us.


What I mean is, if we disable the NUMA from BIOS


Please define what this means ie are you removing SRAT from ACPI static
tables ?


Yes.




but we did not change the PXM for the PCI devices,


If a _PXM maps to a proximity domain that is not described in the SRAT
your firmware is buggy.


Sorry for the confusion, that's not what I mean. When the BIOS disables
NUMA (removes the SRAT table), the PCI devices' _PXM description is
still available, which means we can still get the pxm from
acpi_evaluate_integer() in this case.


So we can get below inconsistent log on ARM platform:
"No NUMA configuration found
PCI_bus :00 on NUMA node 0
...
PCI_bus :e3 on NUMA node 1"

On X86, the pci_acpi_root_get_node() will validate the node before 
setting the node id for root bus. So I think we can add this validation 
for ARM platform. Or anything else I missed?





so the PCI devices can still get a numa node id from acpi_get_node().
For example, we can still get the numa node id = 1 in this case from
acpi_get_node(), but the numa_nodes_parsed is empty, which means the
node id 1 is invalid.  We should add a validation for the node id when
setting the root bus node id.


The kernel is not a firmware validation test suite, so fix the firmware
please.

Having said that, please provide a trace log of the issue this is
causing, if any.


See above.


Re: [PATCH 0/4] Some improvements for blk throttle

2020-09-29 Thread Baolin Wang

Hi Jens,


Hi,

This patch set did some improvements for blk throttle, please
help to review. Thanks.


Do you have any comments for this patch set? Thanks.



Baolin Wang (4):
   blk-throttle: Remove a meaningless parameter for
 throtl_downgrade_state()
   blk-throttle: Avoid getting the current time if tg->last_finish_time
 is 0
   blk-throttle: Avoid tracking latency if low limit is invalid
   blk-throttle: Fix IO hang for a corner case

  block/blk-throttle.c | 26 --
  1 file changed, 16 insertions(+), 10 deletions(-)



Re: [PATCH] arm64: PCI: Validate the node before setting node id for root bus

2020-09-28 Thread Baolin Wang
On Mon, Sep 28, 2020 at 03:00:55PM +0100, Will Deacon wrote:
> [+ Lorenzo]
> 
> On Tue, Sep 22, 2020 at 06:33:24PM +0800, Baolin Wang wrote:
> > If the BIOS disables the NUMA configuration but does not change the
> > proximity domain description in the SRAT table, the PCI root bus
> > device may get an incorrect node id from acpi_get_node().
> 
> How "incorrect" are we talking here? What actually goes wrong? At some
> point, we have to trust what the firmware is telling us.

What I mean is, if we disable NUMA in the BIOS but do not change the
_PXM for the PCI devices, the PCI devices can still get a NUMA node id
from acpi_get_node(). For example, we can still get node id 1 from
acpi_get_node() in this case, but numa_nodes_parsed is empty, which
means node id 1 is invalid. We should validate the node id before
setting it on the root bus.

> 
> > Thus better to add a numa node validation before setting numa node
> > for the PCI root bus, like pci_acpi_root_get_node() does for X86
> > architecture.
> > 
> > Signed-off-by: Baolin Wang 
> > ---
> >  arch/arm64/kernel/pci.c | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
> > index 1006ed2..24fe2bd 100644
> > --- a/arch/arm64/kernel/pci.c
> > +++ b/arch/arm64/kernel/pci.c
> > @@ -86,9 +86,13 @@ int pcibios_root_bridge_prepare(struct pci_host_bridge 
> > *bridge)
> > struct pci_config_window *cfg = bridge->bus->sysdata;
> > struct acpi_device *adev = to_acpi_device(cfg->parent);
> > struct device *bus_dev = >bus->dev;
> > +   int node = acpi_get_node(acpi_device_handle(adev));
> > +
> > +   if (node != NUMA_NO_NODE && !node_online(node))
> > +   node = NUMA_NO_NODE;
> 
> Hmm. afaict, acpi_get_node() tries quite hard to return a valid node when
> it gets back NUMA_NO_NODE in acpi_map_pxm_to_node(). Seems like we're
> undoing all of that here, which worries me because NUMA_NO_NODE is a bit
> of a loaded gun if you interpret it as a valid node.

I did not treat NUMA_NO_NODE as a valid node, I just added a check to
verify that the node is valid before setting it. See my previous
comments, I hope that makes things clear. Thanks.

> 
> Anyway, I defer to Lorenzo on this.
> 
> Will


[PATCH] block: Remove redundant 'return' statement

2020-09-27 Thread Baolin Wang
Remove redundant 'return' statement for 'void' functions.

Signed-off-by: Baolin Wang 
---
 block/blk-iocost.c| 2 +-
 block/blk-iolatency.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index ef9476f..e38c406 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -3343,7 +3343,7 @@ static int __init ioc_init(void)
 
 static void __exit ioc_exit(void)
 {
-   return blkcg_policy_unregister(_policy_iocost);
+   blkcg_policy_unregister(_policy_iocost);
 }
 
 module_init(ioc_init);
diff --git a/block/blk-iolatency.c b/block/blk-iolatency.c
index f90429c..81be009 100644
--- a/block/blk-iolatency.c
+++ b/block/blk-iolatency.c
@@ -1046,7 +1046,7 @@ static int __init iolatency_init(void)
 
 static void __exit iolatency_exit(void)
 {
-   return blkcg_policy_unregister(_policy_iolatency);
+   blkcg_policy_unregister(_policy_iolatency);
 }
 
 module_init(iolatency_init);
-- 
1.8.3.1



[PATCH] arm64: PCI: Validate the node before setting node id for root bus

2020-09-22 Thread Baolin Wang
If the BIOS disables the NUMA configuration but does not change the
proximity domain description in the SRAT table, the PCI root bus
device may get an incorrect node id from acpi_get_node().

Thus it is better to validate the NUMA node before setting it for the
PCI root bus, as pci_acpi_root_get_node() does for the x86
architecture.

Signed-off-by: Baolin Wang 
---
 arch/arm64/kernel/pci.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/pci.c b/arch/arm64/kernel/pci.c
index 1006ed2..24fe2bd 100644
--- a/arch/arm64/kernel/pci.c
+++ b/arch/arm64/kernel/pci.c
@@ -86,9 +86,13 @@ int pcibios_root_bridge_prepare(struct pci_host_bridge 
*bridge)
struct pci_config_window *cfg = bridge->bus->sysdata;
struct acpi_device *adev = to_acpi_device(cfg->parent);
struct device *bus_dev = >bus->dev;
+   int node = acpi_get_node(acpi_device_handle(adev));
+
+   if (node != NUMA_NO_NODE && !node_online(node))
+   node = NUMA_NO_NODE;
 
ACPI_COMPANION_SET(>dev, adev);
-   set_dev_node(bus_dev, acpi_get_node(acpi_device_handle(adev)));
+   set_dev_node(bus_dev, node);
}
 
return 0;
-- 
1.8.3.1



[PATCH 3/4] blk-throttle: Avoid tracking latency if low limit is invalid

2020-09-20 Thread Baolin Wang
The IO latency tracking is only used for the LOW limit, so we should
add a check to avoid redundant latency tracking if the LOW limit
is not valid.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 7e72102..b0d8f7c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2100,7 +2100,7 @@ static void throtl_update_latency_buckets(struct 
throtl_data *td)
unsigned long last_latency[2] = { 0 };
unsigned long latency[2];
 
-   if (!blk_queue_nonrot(td->queue))
+   if (!blk_queue_nonrot(td->queue) || !td->limit_valid[LIMIT_LOW])
return;
if (time_before(jiffies, td->last_calculate_time + HZ))
return;
@@ -2338,6 +2338,8 @@ void blk_throtl_bio_endio(struct bio *bio)
if (!blkg)
return;
tg = blkg_to_tg(blkg);
+   if (!tg->td->limit_valid[LIMIT_LOW])
+   return;
 
finish_time_ns = ktime_get_ns();
tg->last_finish_time = finish_time_ns >> 10;
-- 
1.8.3.1



[PATCH 4/4] blk-throttle: Fix IO hang for a corner case

2020-09-20 Thread Baolin Wang
throtl_adjusted_limit() cannot scale the limit up if bps or iops is set
to 1, which will cause an IO hang when the low limit is enabled. Thus we
should treat 1 as an illegal value to avoid this issue.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b0d8f7c..0649bce 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1687,13 +1687,13 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
goto out_finish;
 
ret = -EINVAL;
-   if (!strcmp(tok, "rbps"))
+   if (!strcmp(tok, "rbps") && val > 1)
v[0] = val;
-   else if (!strcmp(tok, "wbps"))
+   else if (!strcmp(tok, "wbps") && val > 1)
v[1] = val;
-   else if (!strcmp(tok, "riops"))
+   else if (!strcmp(tok, "riops") && val > 1)
v[2] = min_t(u64, val, UINT_MAX);
-   else if (!strcmp(tok, "wiops"))
+   else if (!strcmp(tok, "wiops") && val > 1)
v[3] = min_t(u64, val, UINT_MAX);
else if (off == LIMIT_LOW && !strcmp(tok, "idle"))
idle_time = val;
-- 
1.8.3.1



[PATCH 2/4] blk-throttle: Avoid getting the current time if tg->last_finish_time is 0

2020-09-20 Thread Baolin Wang
We only update tg->last_finish_time when the low limitation is
enabled, so we can move the tg->last_finish_time check a little
earlier to avoid getting the current timestamp unnecessarily if the
low limitation is not enabled.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 4007b26..7e72102 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2077,10 +2077,14 @@ static void throtl_downgrade_check(struct throtl_grp 
*tg)
 
 static void blk_throtl_update_idletime(struct throtl_grp *tg)
 {
-   unsigned long now = ktime_get_ns() >> 10;
+   unsigned long now;
unsigned long last_finish_time = tg->last_finish_time;
 
-   if (now <= last_finish_time || last_finish_time == 0 ||
+   if (last_finish_time == 0)
+   return;
+
+   now = ktime_get_ns() >> 10;
+   if (now <= last_finish_time ||
last_finish_time == tg->checked_last_finish_time)
return;
 
-- 
1.8.3.1



[PATCH 0/4] Some improvements for blk throttle

2020-09-20 Thread Baolin Wang
Hi,

This patch set did some improvements for blk throttle, please
help to review. Thanks.

Baolin Wang (4):
  blk-throttle: Remove a meaningless parameter for
throtl_downgrade_state()
  blk-throttle: Avoid getting the current time if tg->last_finish_time
is 0
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Fix IO hang for a corner case

 block/blk-throttle.c | 26 --
 1 file changed, 16 insertions(+), 10 deletions(-)

-- 
1.8.3.1



[PATCH 1/4] blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()

2020-09-20 Thread Baolin Wang
throtl_downgrade_state() is always used to change to the LIMIT_LOW
limitation, thus remove the meaningless second parameter which
indicates the limitation index.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 36ba61c..4007b26 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1970,7 +1970,7 @@ static void throtl_upgrade_state(struct throtl_data *td)
queue_work(kthrotld_workqueue, >dispatch_work);
 }
 
-static void throtl_downgrade_state(struct throtl_data *td, int new)
+static void throtl_downgrade_state(struct throtl_data *td)
 {
td->scale /= 2;
 
@@ -1980,7 +1980,7 @@ static void throtl_downgrade_state(struct throtl_data 
*td, int new)
return;
}
 
-   td->limit_index = new;
+   td->limit_index = LIMIT_LOW;
td->low_downgrade_time = jiffies;
 }
 
@@ -2067,7 +2067,7 @@ static void throtl_downgrade_check(struct throtl_grp *tg)
 * cgroups
 */
if (throtl_hierarchy_can_downgrade(tg))
-   throtl_downgrade_state(tg->td, LIMIT_LOW);
+   throtl_downgrade_state(tg->td);
 
tg->last_bytes_disp[READ] = 0;
tg->last_bytes_disp[WRITE] = 0;
-- 
1.8.3.1



Re: [PATCH 0/5] Some improvements for blk-throttle

2020-09-15 Thread Baolin Wang
Hi Jens,

On Mon, Sep 14, 2020 at 07:37:53PM -0600, Jens Axboe wrote:
> On 9/7/20 2:10 AM, Baolin Wang wrote:
> > Hi All,
> > 
> > This patch set did some clean-ups, as well as removing some unnecessary
> > bps/iops limitation calculation when checking if can dispatch a bio or
> > not for a tg. Please help to review. Thanks.
> > 
> > Baolin Wang (5):
> >   blk-throttle: Fix some comments' typos
> >   blk-throttle: Use readable READ/WRITE macros
> >   blk-throttle: Define readable macros instead of static variables
> >   blk-throttle: Avoid calculating bps/iops limitation repeatedly
> >   blk-throttle: Avoid checking bps/iops limitation if bps or iops is
> > unlimited
> > 
> >  block/blk-throttle.c | 59 
> > 
> >  1 file changed, 36 insertions(+), 23 deletions(-)
> 
> Looks reasonable to me, I've applied it.

Thanks.

> 
> Out of curiosity, are you using blk-throttle in production, or are these
> just fixes/cleanups that you came across?

Yes, we're using it in some old products, and I am trying to do some
cleanups and optimization while testing it.



Re: [PATCH 0/5] Some improvements for blk-throttle

2020-09-13 Thread Baolin Wang
Hi Tejun and Jens,

On Mon, Sep 07, 2020 at 04:10:12PM +0800, Baolin Wang wrote:
> Hi All,
> 
> This patch set did some clean-ups, as well as removing some unnecessary
> bps/iops limitation calculation when checking if can dispatch a bio or
> not for a tg. Please help to review. Thanks.

Any comments for this patch set?

> 
> Baolin Wang (5):
>   blk-throttle: Fix some comments' typos
>   blk-throttle: Use readable READ/WRITE macros
>   blk-throttle: Define readable macros instead of static variables
>   blk-throttle: Avoid calculating bps/iops limitation repeatedly
>   blk-throttle: Avoid checking bps/iops limitation if bps or iops is
> unlimited
> 
>  block/blk-throttle.c | 59 
> 
>  1 file changed, 36 insertions(+), 23 deletions(-)
> 
> -- 
> 1.8.3.1


[PATCH] block: Remove unused blk_mq_sched_free_hctx_data()

2020-09-07 Thread Baolin Wang
Now the hctx->sched_data is usually freed by e->type->ops.exit_hctx(),
and nothing uses the blk_mq_sched_free_hctx_data() function any more.
Remove it.
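
As a hypothetical illustration (the function name and body below are
assumptions modeled on how an elevator typically releases its per-hctx
data, not code from this patch), the freeing now lives in each
scheduler's own exit_hctx callback:

static void example_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
{
	/* release whatever this scheduler attached to the hw queue */
	kfree(hctx->sched_data);
	hctx->sched_data = NULL;
}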

Signed-off-by: Baolin Wang 
---
 block/blk-mq-sched.c | 15 ---
 block/blk-mq-sched.h |  3 ---
 2 files changed, 18 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 86d5545..3e95967 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -18,21 +18,6 @@
 #include "blk-mq-tag.h"
 #include "blk-wbt.h"
 
-void blk_mq_sched_free_hctx_data(struct request_queue *q,
-void (*exit)(struct blk_mq_hw_ctx *))
-{
-   struct blk_mq_hw_ctx *hctx;
-   int i;
-
-   queue_for_each_hw_ctx(q, hctx, i) {
-   if (exit && hctx->sched_data)
-   exit(hctx);
-   kfree(hctx->sched_data);
-   hctx->sched_data = NULL;
-   }
-}
-EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
-
 void blk_mq_sched_assign_ioc(struct request *rq)
 {
struct request_queue *q = rq->q;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..fe62e7c 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -5,9 +5,6 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
-void blk_mq_sched_free_hctx_data(struct request_queue *q,
-void (*exit)(struct blk_mq_hw_ctx *));
-
 void blk_mq_sched_assign_ioc(struct request *rq);
 
 void blk_mq_sched_request_inserted(struct request *rq);
-- 
1.8.3.1



[PATCH 4/5] blk-throttle: Avoid calculating bps/iops limitation repeatedly

2020-09-07 Thread Baolin Wang
tg_may_dispatch() calls tg_with_in_bps_limit() and
tg_with_in_iops_limit() to check whether we can dispatch a bio or
not, which calculates the bps/iops limits multiple times.
But tg_may_dispatch() is always called under the queue lock, which
means the bps/iops limits cannot change within tg_may_dispatch().

So we can calculate the bps/iops limits only once, and pass
them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to
avoid calculating them repeatedly.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 19 ++-
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index ca9002d..8719e37 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -894,7 +894,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, 
bool rw)
 }
 
 static bool tg_with_in_iops_limit(struct throtl_grp *tg, struct bio *bio,
- unsigned long *wait)
+ u32 iops_limit, unsigned long *wait)
 {
bool rw = bio_data_dir(bio);
unsigned int io_allowed;
@@ -913,7 +913,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, 
struct bio *bio,
 * have been trimmed.
 */
 
-   tmp = (u64)tg_iops_limit(tg, rw) * jiffy_elapsed_rnd;
+   tmp = (u64)iops_limit * jiffy_elapsed_rnd;
do_div(tmp, HZ);
 
if (tmp > UINT_MAX)
@@ -936,7 +936,7 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, 
struct bio *bio,
 }
 
 static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
-unsigned long *wait)
+u64 bps_limit, unsigned long *wait)
 {
bool rw = bio_data_dir(bio);
u64 bytes_allowed, extra_bytes, tmp;
@@ -951,7 +951,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, 
struct bio *bio,
 
jiffy_elapsed_rnd = roundup(jiffy_elapsed_rnd, tg->td->throtl_slice);
 
-   tmp = tg_bps_limit(tg, rw) * jiffy_elapsed_rnd;
+   tmp = bps_limit * jiffy_elapsed_rnd;
do_div(tmp, HZ);
bytes_allowed = tmp;
 
@@ -963,7 +963,7 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, 
struct bio *bio,
 
/* Calc approx time to dispatch */
extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
-   jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));
+   jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit);
 
if (!jiffy_wait)
jiffy_wait = 1;
@@ -987,6 +987,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct 
bio *bio,
 {
bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
+   u64 bps_limit = tg_bps_limit(tg, rw);
+   u32 iops_limit = tg_iops_limit(tg, rw);
 
/*
 * Currently whole state machine of group depends on first bio
@@ -998,8 +1000,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct 
bio *bio,
   bio != throtl_peek_queued(>service_queue.queued[rw]));
 
/* If tg->bps = -1, then BW is unlimited */
-   if (tg_bps_limit(tg, rw) == U64_MAX &&
-   tg_iops_limit(tg, rw) == UINT_MAX) {
+   if (bps_limit == U64_MAX && iops_limit == UINT_MAX) {
if (wait)
*wait = 0;
return true;
@@ -1021,8 +1022,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct 
bio *bio,
jiffies + tg->td->throtl_slice);
}
 
-   if (tg_with_in_bps_limit(tg, bio, _wait) &&
-   tg_with_in_iops_limit(tg, bio, _wait)) {
+   if (tg_with_in_bps_limit(tg, bio, bps_limit, _wait) &&
+   tg_with_in_iops_limit(tg, bio, iops_limit, _wait)) {
if (wait)
*wait = 0;
return true;
-- 
1.8.3.1



[PATCH 5/5] blk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimited

2020-09-07 Thread Baolin Wang
There is no need to check the bps or iops limitation if bps or iops is unlimited.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8719e37..36ba61c 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -901,6 +901,12 @@ static bool tg_with_in_iops_limit(struct throtl_grp *tg, 
struct bio *bio,
unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
u64 tmp;
 
+   if (iops_limit == UINT_MAX) {
+   if (wait)
+   *wait = 0;
+   return true;
+   }
+
jiffy_elapsed = jiffies - tg->slice_start[rw];
 
/* Round up to the next throttle slice, wait time must be nonzero */
@@ -943,6 +949,12 @@ static bool tg_with_in_bps_limit(struct throtl_grp *tg, 
struct bio *bio,
unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
unsigned int bio_size = throtl_bio_data_size(bio);
 
+   if (bps_limit == U64_MAX) {
+   if (wait)
+   *wait = 0;
+   return true;
+   }
+
jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
 
/* Slice has just started. Consider one slice interval */
-- 
1.8.3.1



[PATCH 3/5] blk-throttle: Define readable macros instead of static variables

2020-09-07 Thread Baolin Wang
The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only
variables, so it is better to use readable macros instead of static
variables, which also saves a few bytes of static storage.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 06e73ed..ca9002d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -15,10 +15,10 @@
 #include "blk-cgroup-rwstat.h"
 
 /* Max dispatch from a group in 1 round */
-static int throtl_grp_quantum = 8;
+#define THROTL_GRP_QUANTUM 8
 
 /* Total max dispatch from all groups in one round */
-static int throtl_quantum = 32;
+#define THROTL_QUANTUM 32
 
 /* Throttling is performed over a slice and after that slice is renewed */
 #define DFL_THROTL_SLICE_HD (HZ / 10)
@@ -1175,8 +1175,8 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
 {
struct throtl_service_queue *sq = >service_queue;
unsigned int nr_reads = 0, nr_writes = 0;
-   unsigned int max_nr_reads = throtl_grp_quantum*3/4;
-   unsigned int max_nr_writes = throtl_grp_quantum - max_nr_reads;
+   unsigned int max_nr_reads = THROTL_GRP_QUANTUM * 3 / 4;
+   unsigned int max_nr_writes = THROTL_GRP_QUANTUM - max_nr_reads;
struct bio *bio;
 
/* Try to dispatch 75% READS and 25% WRITES */
@@ -1226,7 +1226,7 @@ static int throtl_select_dispatch(struct 
throtl_service_queue *parent_sq)
if (sq->nr_queued[0] || sq->nr_queued[1])
tg_update_disptime(tg);
 
-   if (nr_disp >= throtl_quantum)
+   if (nr_disp >= THROTL_QUANTUM)
break;
}
 
-- 
1.8.3.1



[PATCH 1/5] blk-throttle: Fix some comments' typos

2020-09-07 Thread Baolin Wang
Fix some comments' typos.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index fee3325..2fc6c3e 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -150,7 +150,7 @@ struct throtl_grp {
/* user configured IOPS limits */
unsigned int iops_conf[2][LIMIT_CNT];
 
-   /* Number of bytes disptached in current slice */
+   /* Number of bytes dispatched in current slice */
uint64_t bytes_disp[2];
/* Number of bio's dispatched in current slice */
unsigned int io_disp[2];
@@ -852,7 +852,7 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, 
bool rw)
/*
 * A bio has been dispatched. Also adjust slice_end. It might happen
 * that initially cgroup limit was very low resulting in high
-* slice_end, but later limit was bumped up and bio was dispached
+* slice_end, but later limit was bumped up and bio was dispatched
 * sooner, then we need to reduce slice_end. A high bogus slice_end
 * is bad because it does not allow new slice to start.
 */
@@ -1082,7 +1082,7 @@ static void throtl_add_bio_tg(struct bio *bio, struct 
throtl_qnode *qn,
 * If @tg doesn't currently have any bios queued in the same
 * direction, queueing @bio can change when @tg should be
 * dispatched.  Mark that @tg was empty.  This is automatically
-* cleaered on the next tg_update_disptime().
+* cleared on the next tg_update_disptime().
 */
if (!sq->nr_queued[rw])
tg->flags |= THROTL_TG_WAS_EMPTY;
@@ -1303,7 +1303,7 @@ static void throtl_pending_timer_fn(struct timer_list *t)
}
}
} else {
-   /* reached the top-level, queue issueing */
+   /* reached the top-level, queue issuing */
queue_work(kthrotld_workqueue, >dispatch_work);
}
 out_unlock:
@@ -1314,8 +1314,8 @@ static void throtl_pending_timer_fn(struct timer_list *t)
  * blk_throtl_dispatch_work_fn - work function for throtl_data->dispatch_work
  * @work: work item being executed
  *
- * This function is queued for execution when bio's reach the bio_lists[]
- * of throtl_data->service_queue.  Those bio's are ready and issued by this
+ * This function is queued for execution when bios reach the bio_lists[]
+ * of throtl_data->service_queue.  Those bios are ready and issued by this
  * function.
  */
 static void blk_throtl_dispatch_work_fn(struct work_struct *work)
@@ -2230,7 +2230,7 @@ bool blk_throtl_bio(struct bio *bio)
 
/*
 * @bio passed through this layer without being throttled.
-* Climb up the ladder.  If we''re already at the top, it
+* Climb up the ladder.  If we're already at the top, it
 * can be executed directly.
 */
qn = >qnode_on_parent[rw];
-- 
1.8.3.1



[PATCH 2/5] blk-throttle: Use readable READ/WRITE macros

2020-09-07 Thread Baolin Wang
Use readable READ/WRITE macros instead of magic numbers.

Signed-off-by: Baolin Wang 
---
 block/blk-throttle.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 2fc6c3e..06e73ed 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1428,8 +1428,8 @@ static void tg_conf_updated(struct throtl_grp *tg, bool 
global)
 * that a group's limit are dropped suddenly and we don't want to
 * account recently dispatched IO with new low rate.
 */
-   throtl_start_new_slice(tg, 0);
-   throtl_start_new_slice(tg, 1);
+   throtl_start_new_slice(tg, READ);
+   throtl_start_new_slice(tg, WRITE);
 
if (tg->flags & THROTL_TG_PENDING) {
tg_update_disptime(tg);
-- 
1.8.3.1



[PATCH 0/5] Some improvements for blk-throttle

2020-09-07 Thread Baolin Wang
Hi All,

This patch set does some clean-ups, as well as removing some unnecessary
bps/iops limit calculations when checking whether a bio can be dispatched
for a tg. Please help to review. Thanks.

Baolin Wang (5):
  blk-throttle: Fix some comments' typos
  blk-throttle: Use readable READ/WRITE macros
  blk-throttle: Define readable macros instead of static variables
  blk-throttle: Avoid calculating bps/iops limitation repeatedly
  blk-throttle: Avoid checking bps/iops limitation if bps or iops is
unlimited

 block/blk-throttle.c | 59 
 1 file changed, 36 insertions(+), 23 deletions(-)

-- 
1.8.3.1



[PATCH] block: Remove a duplicative condition

2020-09-01 Thread Baolin Wang
Remove a duplicate condition to fix the cppcheck warnings below:

"warning: Redundant condition: sched_allow_merge. '!A || (A && B)' is
equivalent to '!A || B' [redundantCondition]"
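
A quick way to convince yourself of the identity cppcheck points out
(standalone snippet for illustration, not part of the patch):

#include <assert.h>

int main(void)
{
	/* exhaustively check that !a || (a && b) equals !a || b */
	for (int a = 0; a <= 1; a++)
		for (int b = 0; b <= 1; b++)
			assert((!a || (a && b)) == (!a || b));
	return 0;
}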

Reported-by: kernel test robot 
Signed-off-by: Baolin Wang 
---
 block/blk-merge.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 80c9744..6ed7158 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -996,13 +996,11 @@ static enum bio_merge_status blk_attempt_bio_merge(struct 
request_queue *q,
 
switch (blk_try_merge(rq, bio)) {
case ELEVATOR_BACK_MERGE:
-   if (!sched_allow_merge ||
-   (sched_allow_merge && blk_mq_sched_allow_merge(q, rq, bio)))
+   if (!sched_allow_merge || blk_mq_sched_allow_merge(q, rq, bio))
return bio_attempt_back_merge(rq, bio, nr_segs);
break;
case ELEVATOR_FRONT_MERGE:
-   if (!sched_allow_merge ||
-   (sched_allow_merge && blk_mq_sched_allow_merge(q, rq, bio)))
+   if (!sched_allow_merge || blk_mq_sched_allow_merge(q, rq, bio))
return bio_attempt_front_merge(rq, bio, nr_segs);
break;
case ELEVATOR_DISCARD_MERGE:
-- 
1.8.3.1



[PATCH] block: Fix building error

2020-09-01 Thread Baolin Wang
The disk argument was removed by commit 3b843c0bda28
("block: remove the disk argument to delete_partition"), so fix
a build error caused by passing an incorrect argument.

Fixes: 3b843c0bda28 ("block: remove the disk argument to delete_partition")
Signed-off-by: Baolin Wang 
---
 block/partitions/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/partitions/core.c b/block/partitions/core.c
index cf6229b..dd68114 100644
--- a/block/partitions/core.c
+++ b/block/partitions/core.c
@@ -557,7 +557,7 @@ int bdev_del_partition(struct block_device *bdev, int partno)
sync_blockdev(bdevp);
invalidate_bdev(bdevp);
 
-   delete_partition(bdev->bd_disk);
+   delete_partition(part);
ret = 0;
 out_unlock:
mutex_unlock(&bdev->bd_mutex);
-- 
1.8.3.1



[PATCH v3 1/4] block: Move bio merge related functions into blk-merge.c

2020-08-27 Thread Baolin Wang
It's better to move bio merge related functions into blk-merge.c,
which contains all merge related functions.

Signed-off-by: Baolin Wang 
Reviewed-by: Christoph Hellwig 
---
 block/blk-core.c  | 156 -
 block/blk-merge.c | 157 ++
 2 files changed, 157 insertions(+), 156 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d9d6326..ed79109 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -642,162 +642,6 @@ void blk_put_request(struct request *req)
 }
 EXPORT_SYMBOL(blk_put_request);
 
-static void blk_account_io_merge_bio(struct request *req)
-{
-   if (!blk_do_io_stat(req))
-   return;
-
-   part_stat_lock();
-   part_stat_inc(req->part, merges[op_stat_group(req_op(req))]);
-   part_stat_unlock();
-}
-
-bool bio_attempt_back_merge(struct request *req, struct bio *bio,
-   unsigned int nr_segs)
-{
-   const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
-
-   if (!ll_back_merge_fn(req, bio, nr_segs))
-   return false;
-
-   trace_block_bio_backmerge(req->q, req, bio);
-   rq_qos_merge(req->q, req, bio);
-
-   if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
-   blk_rq_set_mixed_merge(req);
-
-   req->biotail->bi_next = bio;
-   req->biotail = bio;
-   req->__data_len += bio->bi_iter.bi_size;
-
-   bio_crypt_free_ctx(bio);
-
-   blk_account_io_merge_bio(req);
-   return true;
-}
-
-bool bio_attempt_front_merge(struct request *req, struct bio *bio,
-   unsigned int nr_segs)
-{
-   const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
-
-   if (!ll_front_merge_fn(req, bio, nr_segs))
-   return false;
-
-   trace_block_bio_frontmerge(req->q, req, bio);
-   rq_qos_merge(req->q, req, bio);
-
-   if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff)
-   blk_rq_set_mixed_merge(req);
-
-   bio->bi_next = req->bio;
-   req->bio = bio;
-
-   req->__sector = bio->bi_iter.bi_sector;
-   req->__data_len += bio->bi_iter.bi_size;
-
-   bio_crypt_do_front_merge(req, bio);
-
-   blk_account_io_merge_bio(req);
-   return true;
-}
-
-bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
-   struct bio *bio)
-{
-   unsigned short segments = blk_rq_nr_discard_segments(req);
-
-   if (segments >= queue_max_discard_segments(q))
-   goto no_merge;
-   if (blk_rq_sectors(req) + bio_sectors(bio) >
-   blk_rq_get_max_sectors(req, blk_rq_pos(req)))
-   goto no_merge;
-
-   rq_qos_merge(q, req, bio);
-
-   req->biotail->bi_next = bio;
-   req->biotail = bio;
-   req->__data_len += bio->bi_iter.bi_size;
-   req->nr_phys_segments = segments + 1;
-
-   blk_account_io_merge_bio(req);
-   return true;
-no_merge:
-   req_set_nomerge(q, req);
-   return false;
-}
-
-/**
- * blk_attempt_plug_merge - try to merge with %current's plugged list
- * @q: request_queue new bio is being queued at
- * @bio: new bio being queued
- * @nr_segs: number of segments in @bio
- * @same_queue_rq: pointer to &struct request that gets filled in when
- * another request associated with @q is found on the plug list
- * (optional, may be %NULL)
- *
- * Determine whether @bio being queued on @q can be merged with a request
- * on %current's plugged list.  Returns %true if merge was successful,
- * otherwise %false.
- *
- * Plugging coalesces IOs from the same issuer for the same purpose without
- * going through @q->queue_lock.  As such it's more of an issuing mechanism
- * than scheduling, and the request, while may have elvpriv data, is not
- * added on the elevator at this point.  In addition, we don't have
- * reliable access to the elevator outside queue lock.  Only check basic
- * merging parameters without querying the elevator.
- *
- * Caller must ensure !blk_queue_nomerges(q) beforehand.
- */
-bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
-   unsigned int nr_segs, struct request **same_queue_rq)
-{
-   struct blk_plug *plug;
-   struct request *rq;
-   struct list_head *plug_list;
-
-   plug = blk_mq_plug(q, bio);
-   if (!plug)
-   return false;
-
-   plug_list = &plug->mq_list;
-
-   list_for_each_entry_reverse(rq, plug_list, queuelist) {
-   bool merged = false;
-
-   if (rq->q == q && same_queue_rq) {
-   /*
-* Only blk-mq multiple hardware queues case checks the
-* rq in the same queue, there should be only one such
-* rq in a queue
-**/
-   *same_queue_rq = rq;
-   }
-
-   if (rq

[PATCH v3 3/4] block: Add a new helper to attempt to merge a bio

2020-08-27 Thread Baolin Wang
There is a lot of duplicated code when trying to merge a bio from
the plug list and the sw queue, so introduce a new helper to attempt
to merge a bio, which simplifies blk_bio_list_merge()
and blk_attempt_plug_merge().

Signed-off-by: Baolin Wang 
---
 block/blk-merge.c| 104 ++-
 block/blk-mq-sched.c |   6 +--
 block/blk.h  |  21 ---
 3 files changed, 71 insertions(+), 60 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index b09e9fc..80c9744 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -907,13 +907,14 @@ static void blk_account_io_merge_bio(struct request *req)
part_stat_unlock();
 }
 
-bool bio_attempt_back_merge(struct request *req, struct bio *bio,
-   unsigned int nr_segs)
+enum bio_merge_status bio_attempt_back_merge(struct request *req,
+struct bio *bio,
+unsigned int nr_segs)
 {
const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
 
if (!ll_back_merge_fn(req, bio, nr_segs))
-   return false;
+   return BIO_MERGE_FAILED;
 
trace_block_bio_backmerge(req->q, req, bio);
rq_qos_merge(req->q, req, bio);
@@ -928,16 +929,17 @@ bool bio_attempt_back_merge(struct request *req, struct bio *bio,
bio_crypt_free_ctx(bio);
 
blk_account_io_merge_bio(req);
-   return true;
+   return BIO_MERGE_OK;
 }
 
-bool bio_attempt_front_merge(struct request *req, struct bio *bio,
-   unsigned int nr_segs)
+enum bio_merge_status bio_attempt_front_merge(struct request *req,
+ struct bio *bio,
+ unsigned int nr_segs)
 {
const int ff = bio->bi_opf & REQ_FAILFAST_MASK;
 
if (!ll_front_merge_fn(req, bio, nr_segs))
-   return false;
+   return BIO_MERGE_FAILED;
 
trace_block_bio_frontmerge(req->q, req, bio);
rq_qos_merge(req->q, req, bio);
@@ -954,11 +956,12 @@ bool bio_attempt_front_merge(struct request *req, struct bio *bio,
bio_crypt_do_front_merge(req, bio);
 
blk_account_io_merge_bio(req);
-   return true;
+   return BIO_MERGE_OK;
 }
 
-bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
-   struct bio *bio)
+enum bio_merge_status bio_attempt_discard_merge(struct request_queue *q,
+   struct request *req,
+   struct bio *bio)
 {
unsigned short segments = blk_rq_nr_discard_segments(req);
 
@@ -976,10 +979,39 @@ bool bio_attempt_discard_merge(struct request_queue *q, struct request *req,
req->nr_phys_segments = segments + 1;
 
blk_account_io_merge_bio(req);
-   return true;
+   return BIO_MERGE_OK;
 no_merge:
req_set_nomerge(q, req);
-   return false;
+   return BIO_MERGE_FAILED;
+}
+
+static enum bio_merge_status blk_attempt_bio_merge(struct request_queue *q,
+  struct request *rq,
+  struct bio *bio,
+  unsigned int nr_segs,
+  bool sched_allow_merge)
+{
+   if (!blk_rq_merge_ok(rq, bio))
+   return BIO_MERGE_NONE;
+
+   switch (blk_try_merge(rq, bio)) {
+   case ELEVATOR_BACK_MERGE:
+   if (!sched_allow_merge ||
+   (sched_allow_merge && blk_mq_sched_allow_merge(q, rq, bio)))
+   return bio_attempt_back_merge(rq, bio, nr_segs);
+   break;
+   case ELEVATOR_FRONT_MERGE:
+   if (!sched_allow_merge ||
+   (sched_allow_merge && blk_mq_sched_allow_merge(q, rq, bio)))
+   return bio_attempt_front_merge(rq, bio, nr_segs);
+   break;
+   case ELEVATOR_DISCARD_MERGE:
+   return bio_attempt_discard_merge(q, rq, bio);
+   default:
+   return BIO_MERGE_NONE;
+   }
+
+   return BIO_MERGE_FAILED;
 }
 
 /**
@@ -1018,8 +1050,6 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
plug_list = &plug->mq_list;
 
list_for_each_entry_reverse(rq, plug_list, queuelist) {
-   bool merged = false;
-
if (rq->q == q && same_queue_rq) {
/*
 * Only blk-mq multiple hardware queues case checks the
@@ -1029,24 +1059,11 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
*same_queue_rq = rq;
}
 
-   if (rq->q != q || !blk_rq_merge_ok(rq, bio))
+   if (rq->q != q)
continue;
 
-

[PATCH v3 4/4] block: Remove blk_mq_attempt_merge() function

2020-08-27 Thread Baolin Wang
The small blk_mq_attempt_merge() function is only called by
__blk_mq_sched_bio_merge(), so just open code it.

Signed-off-by: Baolin Wang 
Reviewed-by: Christoph Hellwig 
---
 block/blk-mq-sched.c | 44 
 1 file changed, 16 insertions(+), 28 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 94db0c9..205d971 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -391,28 +391,6 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
 
-/*
- * Reverse check our software queue for entries that we could potentially
- * merge with. Currently includes a hand-wavy stop count of 8, to not spend
- * too much time checking for merges.
- */
-static bool blk_mq_attempt_merge(struct request_queue *q,
-struct blk_mq_hw_ctx *hctx,
-struct blk_mq_ctx *ctx, struct bio *bio,
-unsigned int nr_segs)
-{
-   enum hctx_type type = hctx->type;
-
-   lockdep_assert_held(&ctx->lock);
-
-   if (blk_bio_list_merge(q, &ctx->rq_lists[type], bio, nr_segs)) {
-   ctx->rq_merged++;
-   return true;
-   }
-
-   return false;
-}
-
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
unsigned int nr_segs)
 {
@@ -426,14 +404,24 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio,
return e->type->ops.bio_merge(hctx, bio, nr_segs);
 
type = hctx->type;
-   if ((hctx->flags & BLK_MQ_F_SHOULD_MERGE) &&
-   !list_empty_careful(&ctx->rq_lists[type])) {
-   /* default per sw-queue merge */
-   spin_lock(&ctx->lock);
-   ret = blk_mq_attempt_merge(q, hctx, ctx, bio, nr_segs);
-   spin_unlock(&ctx->lock);
+   if (!(hctx->flags & BLK_MQ_F_SHOULD_MERGE) ||
+   list_empty_careful(&ctx->rq_lists[type]))
+   return false;
+
+   /* default per sw-queue merge */
+   spin_lock(&ctx->lock);
+   /*
+* Reverse check our software queue for entries that we could
+* potentially merge with. Currently includes a hand-wavy stop
+* count of 8, to not spend too much time checking for merges.
+*/
+   if (blk_bio_list_merge(q, &ctx->rq_lists[type], bio, nr_segs)) {
+   ctx->rq_merged++;
+   ret = true;
}
 
+   spin_unlock(&ctx->lock);
+
return ret;
 }
 
-- 
1.8.3.1



[PATCH v3 2/4] block: Move blk_mq_bio_list_merge() into blk-merge.c

2020-08-27 Thread Baolin Wang
Move blk_mq_bio_list_merge() into blk-merge.c and
give it a more generic name.

Signed-off-by: Baolin Wang 
---
 block/blk-merge.c  | 44 
 block/blk-mq-sched.c   | 46 +-
 block/blk.h|  2 ++
 block/kyber-iosched.c  |  2 +-
 include/linux/blk-mq.h |  2 --
 5 files changed, 48 insertions(+), 48 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3aa2de5..b09e9fc 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -1052,3 +1052,47 @@ bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 
return false;
 }
+
+/*
+ * Iterate list of requests and see if we can merge this bio with any
+ * of them.
+ */
+bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
+   struct bio *bio, unsigned int nr_segs)
+{
+   struct request *rq;
+   int checked = 8;
+
+   list_for_each_entry_reverse(rq, list, queuelist) {
+   bool merged = false;
+
+   if (!checked--)
+   break;
+
+   if (!blk_rq_merge_ok(rq, bio))
+   continue;
+
+   switch (blk_try_merge(rq, bio)) {
+   case ELEVATOR_BACK_MERGE:
+   if (blk_mq_sched_allow_merge(q, rq, bio))
+   merged = bio_attempt_back_merge(rq, bio,
+   nr_segs);
+   break;
+   case ELEVATOR_FRONT_MERGE:
+   if (blk_mq_sched_allow_merge(q, rq, bio))
+   merged = bio_attempt_front_merge(rq, bio,
+   nr_segs);
+   break;
+   case ELEVATOR_DISCARD_MERGE:
+   merged = bio_attempt_discard_merge(q, rq, bio);
+   break;
+   default:
+   continue;
+   }
+
+   return merged;
+   }
+
+   return false;
+}
+EXPORT_SYMBOL_GPL(blk_bio_list_merge);
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index d2790e5..82acff9 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -392,50 +392,6 @@ bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
 
 /*
- * Iterate list of requests and see if we can merge this bio with any
- * of them.
- */
-bool blk_mq_bio_list_merge(struct request_queue *q, struct list_head *list,
-  struct bio *bio, unsigned int nr_segs)
-{
-   struct request *rq;
-   int checked = 8;
-
-   list_for_each_entry_reverse(rq, list, queuelist) {
-   bool merged = false;
-
-   if (!checked--)
-   break;
-
-   if (!blk_rq_merge_ok(rq, bio))
-   continue;
-
-   switch (blk_try_merge(rq, bio)) {
-   case ELEVATOR_BACK_MERGE:
-   if (blk_mq_sched_allow_merge(q, rq, bio))
-   merged = bio_attempt_back_merge(rq, bio,
-   nr_segs);
-   break;
-   case ELEVATOR_FRONT_MERGE:
-   if (blk_mq_sched_allow_merge(q, rq, bio))
-   merged = bio_attempt_front_merge(rq, bio,
-   nr_segs);
-   break;
-   case ELEVATOR_DISCARD_MERGE:
-   merged = bio_attempt_discard_merge(q, rq, bio);
-   break;
-   default:
-   continue;
-   }
-
-   return merged;
-   }
-
-   return false;
-}
-EXPORT_SYMBOL_GPL(blk_mq_bio_list_merge);
-
-/*
  * Reverse check our software queue for entries that we could potentially
  * merge with. Currently includes a hand-wavy stop count of 8, to not spend
  * too much time checking for merges.
@@ -449,7 +405,7 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
 
lockdep_assert_held(&ctx->lock);
 
-   if (blk_mq_bio_list_merge(q, &ctx->rq_lists[type], bio, nr_segs)) {
+   if (blk_bio_list_merge(q, >rq_lists[type], bio, nr_segs)) {
ctx->rq_merged++;
return true;
}
diff --git a/block/blk.h b/block/blk.h
index 49e2928..d6152d2 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -177,6 +177,8 @@ bool bio_attempt_discard_merge(struct request_queue *q, 
struct request *req,
struct bio *bio);
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
unsigned int nr_segs, struct request **same_queue_rq);
+bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
+   struct bio *bio, unsigned int nr_segs);
 
 void blk_account_io_start(struct request *req);
 void blk_acc

[PATCH v3 0/4] Some clean-ups for bio merge

2020-08-27 Thread Baolin Wang
Hi,

There is some duplicated code when trying to merge a bio from the plugged
list and the software queue, so this patch set does some clean-ups when
merging a bio. Any comments are welcome. Thanks.

Changes from v2:
 - Split blk_mq_bio_list_merge() moving into a separate patch.
 - Add reviewed-by tag from Christoph.
 - Coding style improvement.

Changes from v1:
 - Drop patch 2 and patch 5 in v1 patch set.
 - Add reviewed-by tag from Christoph.
 - Move blk_mq_bio_list_merge() into blk-merge.c and rename it.
 - Some coding style improvements.


Baolin Wang (4):
  block: Move bio merge related functions into blk-merge.c
  block: Move blk_mq_bio_list_merge() into blk-merge.c
  block: Add a new helper to attempt to merge a bio
  block: Remove blk_mq_attempt_merge() function

 block/blk-core.c   | 156 -
 block/blk-merge.c  | 203 +
 block/blk-mq-sched.c   |  94 +--
 block/blk.h|  23 --
 block/kyber-iosched.c  |   2 +-
 include/linux/blk-mq.h |   2 -
 6 files changed, 240 insertions(+), 240 deletions(-)

-- 
1.8.3.1



Re: [PATCH v2 2/3] block: Add a new helper to attempt to merge a bio

2020-08-27 Thread Baolin Wang
On Thu, Aug 27, 2020 at 11:44:15AM +0200, Christoph Hellwig wrote:
> On Tue, Aug 18, 2020 at 01:45:29PM +0800, Baolin Wang wrote:
> > Meanwhile move the blk_mq_bio_list_merge() into blk-merge.c and
> > rename it as a generic name.
> 
> That should probably a separate patch.

Sure.

> 
> > +   if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) == BIO_MERGE_OK)
> > +   return true;
> 
> This adds an overly long line.

checkpatch.pl has increased the default line-length limit to 100 characters,
so I did not get a long-line warning. Anyway, I will break the line if you
are concerned about it.

> > -   if (merged)
> > +   switch (blk_attempt_bio_merge(q, rq, bio, nr_segs, true)) {
> > +   default:
> > +   case BIO_MERGE_NONE:
> > +   continue;
> > +   case BIO_MERGE_OK:
> > return true;
> > +   case BIO_MERGE_FAILED:
> > +   return false;
> > +   }
> 
> I don't think we need a default statement here as we handle all
> possible values of the enum.

OK. Will remove it in next version. Thanks.
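
For reference, a standalone sketch of the pattern being suggested (the
enumerator names match the series; the handler body is illustrative only).
With every enumerator handled explicitly and no default case, -Wswitch can
warn if a new bio_merge_status value is added later but not handled:

enum bio_merge_status {
	BIO_MERGE_OK,
	BIO_MERGE_NONE,
	BIO_MERGE_FAILED,
};

/* All enumerators are covered, so no default case is needed. */
static int handle_merge_result(enum bio_merge_status status)
{
	switch (status) {
	case BIO_MERGE_NONE:
		return 0;	/* keep scanning the list */
	case BIO_MERGE_OK:
		return 1;	/* merged, stop */
	case BIO_MERGE_FAILED:
		return -1;	/* cannot merge, stop */
	}
	return 0;		/* not reached when all values are handled */
}

int main(void)
{
	return handle_merge_result(BIO_MERGE_OK) == 1 ? 0 : 1;
}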


Re: [PATCH v2 0/3] Some clean-ups for bio merge

2020-08-25 Thread Baolin Wang
Hi,

On Tue, Aug 18, 2020 at 01:45:27PM +0800, Baolin Wang wrote:
> Hi,
> 
> There are some duplicated code when trying to merge bio from pluged list
> and software queue, thus this patch set did some clean-ups when merging
> a bio. Any comments are welcome. Thanks.
> 
> Changes from v1:
>  - Drop patch 2 and patch 5 in v1 patch set.
>  - Add reviewed-by tag from Christoph.
>  - Move blk_mq_bio_list_merge() into blk-merge.c and rename it.
>  - Some coding style improvements.

Any comments for this patch set? Thanks.

> 
> Baolin Wang (3):
>   block: Move bio merge related functions into blk-merge.c
>   block: Add a new helper to attempt to merge a bio
>   block: Remove blk_mq_attempt_merge() function
> 
>  block/blk-core.c   | 156 --
>  block/blk-merge.c  | 202 
> +
>  block/blk-mq-sched.c   |  94 +--
>  block/blk.h|  23 --
>  block/kyber-iosched.c  |   2 +-
>  include/linux/blk-mq.h |   2 -
>  6 files changed, 239 insertions(+), 240 deletions(-)
> 
> -- 
> 1.8.3.1

