Hi Shameer,

On 13/12/2018 10:59, Shameer Kolothum wrote:
From: Shanker Donthineni <shank...@codeaurora.org>

The NUMA node information is visible to the ITS driver, but is not used
for anything other than handling hardware errata. ITS/GICR hardware
accesses to the local NUMA node are usually quicker than accesses to a
remote NUMA node; how much slower the remote accesses are depends on the
implementation details.

This patch allocates the memory for the ITS management tables and the
command queue from the corresponding NUMA node using the appropriate
NUMA-aware functions. This improves ITS table read latency on systems
that have more than one ITS block and slower inter-node accesses.

Apache web server benchmarking using the ab tool on a HiSilicon D06
board with multiple NUMA memory nodes shows Time per request and
Transfer rate improvements of ~3.6% with this patch.

Signed-off-by: Shanker Donthineni <shank...@codeaurora.org>
Signed-off-by: Hanjun Guo <guohan...@huawei.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.th...@huawei.com>
---

This is to revive the patch originally sent by Shanker[1] and
to back it up with a benchmark test. Any further testing of
this is most welcome.

v2-->v3
  -Addressed comments to use page_address().
  -Added Benchmark results to commit log.
  -Removed T-by from Ganapatrao for now.

v1-->v2
  -Edited commit text.
  -Added Ganapatrao's tested-by.

Benchmark test details:
--------------------------------
Test Setup:
-D06 with DIMMs on nodes 0 (Sock#0) and 3 (Sock#1).
-ITS belongs to NUMA node 0.
-Filesystem mounted on a PCIe NVMe based disk.
-Apache server installed on D06.
-Running the ab benchmark test in concurrency mode from a remote machine
  connected to D06 via an hns3 (PCIe) network port:
  "ab -k -c 750 -n 2000000 http://10.202.225.188/"

Test results are the average of 15 runs.

For 4.20-rc1  Kernel,
----------------------------
Time per request(mean, concurrent)  = 0.02753[ms]
Transfer Rate = 416501[Kbytes/sec]

For 4.20-rc1 +  this patch,
----------------------------------
Time per request(mean, concurrent)  = 0.02653[ms]
Transfer Rate = 431954[Kbytes/sec]

% improvement ~3.6%

vmstat shows around 170K-200K interrupts per second.

~# vmstat 1 -w
procs --------------memory--------------- -system-
  r  b         swpd         free            in
  5  0            0     30166724          102794
  9  0            0     30141828          171148
  5  0            0     30150160          207185
13  0            0     30145924          175691
15  0            0     30140792          145250
13  0            0     30135556          201879
13  0            0     30134864          192391
10  0            0     30133632          168880
....

[1] https://patchwork.kernel.org/patch/9833339/

  drivers/irqchip/irq-gic-v3-its.c | 20 ++++++++++++--------
  1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
index db20e99..ab01061 100644
--- a/drivers/irqchip/irq-gic-v3-its.c
+++ b/drivers/irqchip/irq-gic-v3-its.c
@@ -1749,7 +1749,8 @@ static int its_setup_baser(struct its_node *its, struct its_baser *baser,
                order = get_order(GITS_BASER_PAGES_MAX * psz);
        }
-       base = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+       base = (void *)page_address(alloc_pages_node(its->numa_node,
+                                   GFP_KERNEL | __GFP_ZERO, order));

If alloc_pages_node() fails, page_address() will be called on a NULL page and could crash the system.

-       its->cmd_base = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
-                                               get_order(ITS_CMD_QUEUE_SZ));
+       its->cmd_base = (void *)page_address(alloc_pages_node(its->numa_node,
+                                            GFP_KERNEL | __GFP_ZERO,
+                                            get_order(ITS_CMD_QUEUE_SZ)));

Similarly here. We may want to handle the allocation failure properly in both places.
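
Something along these lines would avoid dereferencing a NULL page (only a
sketch against the quoted hunks; the surrounding error handling of
its_setup_baser()/its_probe_one() is assumed rather than shown, and the
-ENOMEM returns are illustrative):

        struct page *page;

        /*
         * its_setup_baser(): fail gracefully instead of crashing in
         * page_address() when the node-local allocation fails.
         */
        page = alloc_pages_node(its->numa_node, GFP_KERNEL | __GFP_ZERO, order);
        if (!page)
                return -ENOMEM;
        base = page_address(page);

        /* its_probe_one(): same pattern for the command queue. */
        page = alloc_pages_node(its->numa_node, GFP_KERNEL | __GFP_ZERO,
                                get_order(ITS_CMD_QUEUE_SZ));
        if (!page)
                return -ENOMEM;
        its->cmd_base = page_address(page);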

Cheers
Suzuki
