Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-12-03 Thread Tejun Heo
Hello,

On Fri, Nov 30, 2018 at 04:13:07PM -0800, Daniel Jordan wrote:
> On Fri, Nov 30, 2018 at 11:18:19AM -0800, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Nov 05, 2018 at 11:55:45AM -0500, Daniel Jordan wrote:
> > > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > > ktask threads now run at the lowest priority on the system to avoid disturbing
> > > busy CPUs (more details in patches 4 and 5).  Does this address your concern?
> > > The plan to address your other comments is explained below.
> > 
> > Have you tested what kind of impact this has on the bandwidth of a system
> > in addition to latency?  The thing is, while this would make better use of
> > a system that has idle capacity, it does so by doing more total work.  It'd
> > be really interesting to see how this affects the bandwidth of a system too.
> 
> I guess you mean something like comparing aggregate CPU time across threads to
> the base single thread time for some job or set of jobs?  Then no, I haven't
> measured that, but I can for next time.

Yeah, I'm primarily curious how expensive this is on an already loaded
system, so something like loading the system with a workload that can
saturate it and comparing the bandwidth impact of serial and parallel
page clearing at the same frequency.
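
Spelled out as a metric (notation assumed here, not taken from the thread),
that comparison is the parallel run's aggregate CPU cost relative to the
serial baseline:

    cost(n) = (t_1 + t_2 + ... + t_n) / t_serial

where t_i is the CPU time consumed by the i-th of n ktask threads and
t_serial is the CPU time of the single-threaded run.  cost(n) = 1 would mean
the parallel version is free in bandwidth terms; anything above 1 is extra
work that a saturated system has to absorb.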

Thanks.

-- 
tejun


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-30 Thread Daniel Jordan
On Fri, Nov 30, 2018 at 11:18:19AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Mon, Nov 05, 2018 at 11:55:45AM -0500, Daniel Jordan wrote:
> > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > ktask threads now run at the lowest priority on the system to avoid disturbing
> > busy CPUs (more details in patches 4 and 5).  Does this address your concern?
> > The plan to address your other comments is explained below.
> 
> Have you tested what kind of impact this has on the bandwidth of a system
> in addition to latency?  The thing is, while this would make better use of
> a system that has idle capacity, it does so by doing more total work.  It'd
> be really interesting to see how this affects the bandwidth of a system too.

I guess you mean something like comparing aggregate CPU time across threads to
the base single thread time for some job or set of jobs?  Then no, I haven't
measured that, but I can for next time.


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-30 Thread Tejun Heo
Hello,

On Mon, Nov 05, 2018 at 11:55:45AM -0500, Daniel Jordan wrote:
> Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> ktask threads now run at the lowest priority on the system to avoid disturbing
> busy CPUs (more details in patches 4 and 5).  Does this address your concern?
> The plan to address your other comments is explained below.

Have you tested what kind of impact this has on the bandwidth of a system
in addition to latency?  The thing is, while this would make better use of
a system that has idle capacity, it does so by doing more total work.  It'd
be really interesting to see how this affects the bandwidth of a system too.

Thanks.

-- 
tejun


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-07 Thread Daniel Jordan
On Tue, Nov 06, 2018 at 10:21:45AM +0100, Michal Hocko wrote:
> On Mon 05-11-18 17:29:55, Daniel Jordan wrote:
> > On Mon, Nov 05, 2018 at 06:29:31PM +0100, Michal Hocko wrote:
> > > On Mon 05-11-18 11:55:45, Daniel Jordan wrote:
> > > > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > > > ktask threads now run at the lowest priority on the system to avoid disturbing
> > > > busy CPUs (more details in patches 4 and 5).  Does this address your concern?
> > > > The plan to address your other comments is explained below.
> > > 
> > > I have only glanced through the documentation patch and it looks like it
> > > will be much less disruptive than the previous attempts. Now the obvious
> > > question is how this behaves on a moderately loaded or even busy system
> > > compared to single-threaded execution. Some numbers about best/worst case
> > > execution would be really helpful.
> > 
> > Patches 4 and 5 have some numbers where a ktask and non-ktask workload compete
> > against each other.  Those show either 8 ktask threads on 8 CPUs (worst
> > case) or no ktask threads (best case).
> > 
> > By single threaded execution, I guess you mean 1 ktask thread.  I'll run the
> > experiments that way too and post the numbers.
> 
> I mean a comparison of how much time it takes to accomplish the same
> amount of work done single-threaded versus with ktask-based distribution,
> on an idle system (the best case for both) and a fully contended system
> (the worst case). It would also be great to get some numbers on a
> partially contended system to see how the priority handover etc. behaves
> under different levels of CPU contention.

Ok, thanks for clarifying.

Testing notes
 - The two workloads used were confined to run anywhere within an 8-CPU cpumask
 - The vfio workload started a 64G VM using THP
 - usemem was enlisted to create CPU load doing page clearing, just as the vfio
   case is doing, so the two compete for the same system resources.  usemem ran
   four times, with each of its threads allocating and freeing 30G of memory
   each time.  Four usemem threads simulate Michal's partially contended system.
 - ktask helpers always run at MAX_NICE
 - renice?=yes means run with patch 5, renice?=no means without
 - CPU: 2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
   Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz

        vfio  usemem
        thr   thr     renice?   ktask sec          usemem sec
        ----  ------  -------   ----------------   ---------------
          -     4       n/a                        24.0 ( ± 0.1% )
          -     8       n/a                        25.3 ( ± 0.0% )

          1     0       n/a     13.5 ( ±  0.0% )
          1     4       n/a     14.2 ( ±  0.4% )   24.1 ( ± 0.3% )
  ***     1     8       n/a     17.3 ( ± 10.4% )   29.7 ( ± 0.4% )

          8     0        no      2.8 ( ±  1.5% )
          8     4        no      4.7 ( ±  0.8% )   24.1 ( ± 0.2% )
          8     8        no     13.7 ( ±  8.8% )   27.2 ( ± 1.2% )

          8     0       yes      2.8 ( ±  1.0% )
          8     4       yes      4.7 ( ±  1.4% )   24.1 ( ± 0.0% )
  ***     8     8       yes      9.2 ( ±  2.2% )   27.0 ( ± 0.4% )

Renicing under partial contention (usemem nthr=4) doesn't affect vfio, but
renicing under heavy contention (usemem nthr=8) does: the 8-thread vfio case
is slower when the ktask master thread doesn't will its priority to each
helper in turn.
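
To make "will its priority" concrete, a sketch of the handover as I read it
(helpers shown as plain task_structs for simplicity; wait_for_helper() is
hypothetical, and none of this is lifted from patch 5):

/*
 * Illustration only, assumed from the discussion rather than taken from
 * patch 5.  Helpers are created at MAX_NICE; the master lends its own,
 * likely higher, priority to one unfinished helper at a time so the job
 * keeps making progress at the caller's priority on a busy system.
 */
static void will_priority_to_helpers(struct task_struct **helper, int nr)
{
	long master_nice = task_nice(current);
	int i;

	for (i = 0; i < nr; i++) {
		set_user_nice(helper[i], master_nice);	/* boost one helper */
		wait_for_helper(helper[i]);		/* hypothetical wait */
	}
}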

Comparing the ***'d lines, using 8 vfio threads instead of 1 causes the threads
of both workloads to finish sooner under heavy contention.


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-06 Thread Daniel Jordan
On Mon, Nov 05, 2018 at 09:48:56PM -0500, Zi Yan wrote:
> On 5 Nov 2018, at 21:20, Daniel Jordan wrote:
> 
> > Hi Zi,
> >
> > On Mon, Nov 05, 2018 at 01:49:14PM -0500, Zi Yan wrote:
> >> On 5 Nov 2018, at 11:55, Daniel Jordan wrote:
> >>
> >> Do you think it makes sense to use ktask for huge page migration (the data
> >> copy part)?
> >
> > It certainly could.
> >
> >> I did some experiments back in 2016[1], which showed that migrating one 2MB page
> >> with 8 threads could achieve 2.8x the throughput of the existing single-threaded
> >> method.  The problem with my parallel page migration patchset at that time was
> >> that it has no CPU-utilization awareness, which is solved by your patches now.
> >
> > Did you run with fewer than 8 threads?  I'd want a bigger speedup than 2.8x for
> > 8, and a smaller thread count might improve thread utilization.
> 
> Yes. When migrating one 2MB THP with the migrate_pages() system call on a
> two-socket server with 2 E5-2650 v3 CPUs (10 cores per socket) across two
> sockets, here are the page migration throughput numbers:
> 
>             throughput   factor
> 1 thread    2.15 GB/s    1x
> 2 threads   3.05 GB/s    1.42x
> 4 threads   4.50 GB/s    2.09x
> 8 threads   5.98 GB/s    2.78x

Thanks.  Looks like in your patches you start a worker for every piece of the
huge page copy and have the main thread wait.  I'm curious what the workqueue
overhead is like on your machine.  On a newer Xeon it's ~50usec from queueing a
work to starting to execute it and another ~20usec to flush a work
(barrier_func), which could happen after the work is already done.  That's a
pretty significant piece of the copy time for part of a THP.

bash 60728 [087] 155865.157116: probe:ktask_run: (b7ee7a80)
bash 60728 [087] 155865.157119: workqueue:workqueue_queue_work: work struct=0x95fb73276000
bash 60728 [087] 155865.157119: workqueue:workqueue_activate_work: work struct 0x95fb73276000
kworker/u194:3- 86730 [095] 155865.157168: workqueue:workqueue_execute_start: work struct 0x95fb73276000: function ktask_thread
kworker/u194:3- 86730 [095] 155865.157170: workqueue:workqueue_execute_end: work struct 0x95fb73276000
kworker/u194:3- 86730 [095] 155865.157171: workqueue:workqueue_execute_start: work struct 0xa676995bfb90: function wq_barrier_func
kworker/u194:3- 86730 [095] 155865.157190: workqueue:workqueue_execute_end: work struct 0xa676995bfb90
bash 60728 [087] 155865.157207: probe:ktask_run_ret__return: (b7ee7a80 <- b7ee7b7b)
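
For reference, the pattern described above -- one work item per chunk of the
copy, flushed by the main thread -- reduces to something like the sketch
below (a standalone illustration, not code from Zi's patches; the chunk
count is arbitrary).  Each flush_work() queues a barrier work behind the
copy work, which is where wq_barrier_func and its ~20usec show up in the
trace:

#include <linux/workqueue.h>
#include <linux/string.h>

#define NR_COPY_CHUNKS 8

struct copy_chunk {
	struct work_struct work;
	void *dst, *src;
	size_t len;
};

static void copy_chunk_fn(struct work_struct *work)
{
	struct copy_chunk *c = container_of(work, struct copy_chunk, work);

	memcpy(c->dst, c->src, c->len);
}

static void parallel_page_copy(void *dst, void *src, size_t size)
{
	struct copy_chunk chunks[NR_COPY_CHUNKS];
	size_t chunk = size / NR_COPY_CHUNKS;
	int i;

	for (i = 0; i < NR_COPY_CHUNKS; i++) {
		chunks[i].dst = dst + i * chunk;
		chunks[i].src = src + i * chunk;
		chunks[i].len = chunk;
		INIT_WORK(&chunks[i].work, copy_chunk_fn);
		queue_work(system_unbound_wq, &chunks[i].work);
	}
	/* the main thread waits; each flush inserts a wq_barrier work */
	for (i = 0; i < NR_COPY_CHUNKS; i++)
		flush_work(&chunks[i].work);
}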

> >
> > It would be nice to multithread at a higher granularity than 2M, too: a range
> > of THPs might also perform better than a single page.
> 
> Sure. But the kernel currently does not copy multiple pages altogether even if a
> range of THPs is migrated. Page copy function is interleaved with page table
> operations for every single page.
> 
> I also did some study and modified the kernel to improve this, which I called
> concurrent page migration in https://lwn.net/Articles/714991/. It further
> improves page migration throughput.

Ok, over 4x with 8 threads for 16 THPs.  Is 16 a typical number for migration,
or does it get larger?  What workloads do you have in mind with this change?


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-06 Thread Michal Hocko
On Mon 05-11-18 17:29:55, Daniel Jordan wrote:
> On Mon, Nov 05, 2018 at 06:29:31PM +0100, Michal Hocko wrote:
> > On Mon 05-11-18 11:55:45, Daniel Jordan wrote:
> > > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > > ktask threads now run at the lowest priority on the system to avoid disturbing
> > > busy CPUs (more details in patches 4 and 5).  Does this address your concern?
> > > The plan to address your other comments is explained below.
> > 
> > I have only glanced through the documentation patch and it looks like it
> > will be much less disruptive than the previous attempts. Now the obvious
> > question is how this behaves on a moderately loaded or even busy system
> > compared to single-threaded execution. Some numbers about best/worst case
> > execution would be really helpful.
> 
> Patches 4 and 5 have some numbers where a ktask and non-ktask workload compete
> against each other.  Those show either 8 ktask threads on 8 CPUs (worst case) 
> or no ktask threads (best case).
> 
> By single threaded execution, I guess you mean 1 ktask thread.  I'll run the
> experiments that way too and post the numbers.

I mean a comparison of how much time it takes to accomplish the same
amount of work done single-threaded versus with ktask-based distribution,
on an idle system (the best case for both) and a fully contended system
(the worst case). It would also be great to get some numbers on a
partially contended system to see how the priority handover etc. behaves
under different levels of CPU contention.
-- 
Michal Hocko
SUSE Labs


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-05 Thread Zi Yan
On 5 Nov 2018, at 21:20, Daniel Jordan wrote:

> Hi Zi,
>
> On Mon, Nov 05, 2018 at 01:49:14PM -0500, Zi Yan wrote:
>> On 5 Nov 2018, at 11:55, Daniel Jordan wrote:
>>
>> Do you think it makes sense to use ktask for huge page migration (the data
>> copy part)?
>
> It certainly could.
>
>> I did some experiments back in 2016[1], which showed that migrating one 2MB page
>> with 8 threads could achieve 2.8x the throughput of the existing single-threaded
>> method.
>> The problem with my parallel page migration patchset at that time was that it
>> has no CPU-utilization awareness, which is solved by your patches now.
>
> Did you run with fewer than 8 threads?  I'd want a bigger speedup than 2.8x for
> 8, and a smaller thread count might improve thread utilization.

Yes. When migrating one 2MB THP with the migrate_pages() system call on a
two-socket server with 2 E5-2650 v3 CPUs (10 cores per socket) across two
sockets, here are the page migration throughput numbers:

            throughput   factor
1 thread    2.15 GB/s    1x
2 threads   3.05 GB/s    1.42x
4 threads   4.50 GB/s    2.09x
8 threads   5.98 GB/s    2.78x
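
Worked out as parallel efficiency, E(n) = speedup(n) / n:

E(2) = 1.42 / 2 = 0.71
E(4) = 2.09 / 4 = 0.52
E(8) = 2.78 / 8 = 0.35

so each thread is copying at roughly a third of its single-threaded rate by
n = 8, which is what you would expect if the copy is approaching the
cross-socket memory bandwidth limit rather than being CPU-bound.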

>
> It would be nice to multithread at a higher granularity than 2M, too: a range
> of THPs might also perform better than a single page.

Sure. But the kernel currently does not copy multiple pages altogether even if a
range of THPs is migrated. Page copy function is interleaved with page table
operations for every single page.

I also did some study and modified the kernel to improve this, which I called
concurrent page migration in https://lwn.net/Articles/714991/. It further
improves page migration throughput.


—
Best Regards,
Yan Zi


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-05 Thread Daniel Jordan
Hi Zi,

On Mon, Nov 05, 2018 at 01:49:14PM -0500, Zi Yan wrote:
> On 5 Nov 2018, at 11:55, Daniel Jordan wrote:
>
> Do you think it makes sense to use ktask for huge page migration (the data
> copy part)?

It certainly could.

> I did some experiments back in 2016[1], which showed that migrating one 2MB page
> with 8 threads could achieve 2.8x the throughput of the existing single-threaded
> method.
> The problem with my parallel page migration patchset at that time was that it
> has no CPU-utilization awareness, which is solved by your patches now.

Did you run with fewer than 8 threads?  I'd want a bigger speedup than 2.8x for
8, and a smaller thread count might improve thread utilization.

It would be nice to multithread at a higher granularity than 2M, too: a range
of THPs might also perform better than a single page.

Thanks for your comments.

> [1]https://lkml.org/lkml/2016/11/22/457


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-05 Thread Daniel Jordan
On Mon, Nov 05, 2018 at 06:29:31PM +0100, Michal Hocko wrote:
> On Mon 05-11-18 11:55:45, Daniel Jordan wrote:
> > Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> > ktask threads now run at the lowest priority on the system to avoid disturbing
> > busy CPUs (more details in patches 4 and 5).  Does this address your concern?
> > The plan to address your other comments is explained below.
> 
> I have only glanced through the documentation patch and it looks like it
> will be much less disruptive than the previous attempts. Now the obvious
> question is how this behaves on a moderately loaded or even busy system
> compared to single-threaded execution. Some numbers about best/worst case
> execution would be really helpful.

Patches 4 and 5 have some numbers where a ktask and non-ktask workload compete
against each other.  Those show either 8 ktask threads on 8 CPUs (worst case) 
or no ktask threads (best case).

By single threaded execution, I guess you mean 1 ktask thread.  I'll run the
experiments that way too and post the numbers.

> I will look closer later.

Great!  Thanks for your comment.

Daniel


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-05 Thread Zi Yan
Hi Daniel,

On 5 Nov 2018, at 11:55, Daniel Jordan wrote:

> Hi,
>
> This version addresses some of the feedback from Andrew and Michal last year
> and describes the plan for tackling the rest.  I'm posting now since I'll be
> presenting ktask at Plumbers next week.
>
> Andrew, you asked about parallelizing in more places[0].  This version adds
> multithreading for VFIO page pinning, and there are more planned users listed
> below.
>
> Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> ktask threads now run at the lowest priority on the system to avoid disturbing
> busy CPUs (more details in patches 4 and 5).  Does this address your concern?
> The plan to address your other comments is explained below.
>
> Alex, any thoughts about the VFIO changes in patches 6-9?
>
> Tejun and Lai, what do you think of patch 5?
>
> And for everyone, questions and comments welcome.  Any suggestions for more
> users?
>
>  Thanks,
> Daniel
>
> P.S.  This series is big to address the above feedback, but I can send patches
> 7 and 8 separately.
>
>
> TODO
> ----
>
>  - Implement cgroup-aware unbound workqueues in a separate series, picking up
>    Bandan Das's effort from two years ago[2].  This should hopefully address
>    Michal's comment about running ktask threads within the limits of the calling
>    context[1].
>
>  - Make ktask aware of power management.  A starting point is to disable the
>    framework when energy-conscious cpufreq settings are enabled (e.g.
>    powersave, conservative scaling governors).  This should address another
>    comment from Michal about keeping CPUs under power constraints idle[1].
>
>  - Add more users.  On my list:
>     - __ib_umem_release in IB core, which Jason Gunthorpe mentioned[3]
>     - XFS quotacheck and online repair, as suggested by Darrick Wong
>     - vfs object teardown at umount time, as Andrew mentioned[0]
>     - page freeing in munmap/exit, as Aaron Lu posted[4]
>     - page freeing in shmem
>    The last three will benefit from scaling zone->lock and lru_lock.
>
>  - CPU hotplug support for ktask to adjust its per-CPU data and resource
>    limits.
>
>  - Check with IOMMU folks that iommu_map is safe for all IOMMU backend
>    implementations (it is for x86).
>
>
> Summary
> -------
>
> A single CPU can spend an excessive amount of time in the kernel operating
> on large amounts of data.  Often these situations arise during initialization-
> and destruction-related tasks, where the data involved scales with system size.
> These long-running jobs can slow startup and shutdown of applications and the
> system itself while extra CPUs sit idle.
>
> To ensure that applications and the kernel continue to perform well as core
> counts and memory sizes increase, harness these idle CPUs to complete such jobs
> more quickly.
>
> ktask is a generic framework for parallelizing CPU-intensive work in the
> kernel.  The API is generic enough to add concurrency to many different kinds
> of tasks--for example, zeroing a range of pages or evicting a list of
> inodes--and aims to save its clients the trouble of splitting up the work,
> choosing the number of threads to use, maintaining an efficient concurrency
> level, starting these threads, and load balancing the work between them.
>
> The first patch has more documentation, and the second patch has the interface.
>
> Current users:
>  1) VFIO page pinning before kvm guest startup (others hitting slowness too[5])
>  2) deferred struct page initialization at boot time
>  3) clearing gigantic pages
>  4) fallocate for HugeTLB pages
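
(For illustration, a minimal sketch of what a client of this framework might
look like, based only on the description above -- every identifier here is an
assumption, not the actual interface from patch 2:)

#include <linux/ktask.h>	/* added by this series */
#include <linux/highmem.h>

/* thread function: process the items in [start, end) */
static int zero_pages_chunk(void *start, void *end, void *arg)
{
	unsigned long pfn = (unsigned long)start;
	unsigned long end_pfn = (unsigned long)end;

	for (; pfn < end_pfn; pfn++)
		clear_highpage(pfn_to_page(pfn));
	return 0;
}

static int zero_pfn_range(unsigned long start_pfn, unsigned long nr_pages)
{
	/* the framework splits the range, picks the thread count, and
	 * load-balances; the caller only describes the work */
	DEFINE_KTASK_CTL(ctl, zero_pages_chunk, NULL, 512 /* min chunk */);

	return ktask_run((void *)start_pfn, nr_pages, &ctl);
}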

Do you think it makes sense to use ktask for huge page migration (the data
copy part)?

I did some experiments back in 2016[1], which showed that migrating one 2MB page
with 8 threads could achieve 2.8x the throughput of the existing single-threaded
method.
The problem with my parallel page migration patchset at that time was that it
has no CPU-utilization awareness, which is solved by your patches now.

Thanks.

[1]https://lkml.org/lkml/2016/11/22/457

--
Best Regards
Yan Zi


Re: [RFC PATCH v4 00/13] ktask: multithread CPU-intensive kernel work

2018-11-05 Thread Michal Hocko
On Mon 05-11-18 11:55:45, Daniel Jordan wrote:
> Michal, you mentioned that ktask should be sensitive to CPU utilization[1].
> ktask threads now run at the lowest priority on the system to avoid disturbing
> busy CPUs (more details in patches 4 and 5).  Does this address your concern?
> The plan to address your other comments is explained below.

I have only glanced through the documentation patch and it looks like it
will be much less disruptive than the previous attempts. Now the obvious
question is how this behaves on a moderately loaded or even busy system
compared to single-threaded execution. Some numbers about best/worst case
execution would be really helpful.

I will look closer later.

-- 
Michal Hocko
SUSE Labs

