On 08/24/2017 01:44 PM, Mel Gorman wrote:
> On Thu, Aug 24, 2017 at 11:16:15AM -0700, Linus Torvalds wrote:
>> On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen
>> wrote:
>>>
>>> These changes look fine. We are testing them now.
>>> Does the second patch in the series look okay to you?
>>
>> I didn't really have any reaction to that one, as long as Mel&co are
>> ok with it, I'm fine with it.
On Thu, Aug 24, 2017 at 11:16:15AM -0700, Linus Torvalds wrote:
> On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen wrote:
> >
> > These changes look fine. We are testing them now.
> > Does the second patch in the series look okay to you?
>
> I didn't really have any reaction to that one, as long as Mel&co are
> ok with it, I'm fine with it.
On Thu, Aug 24, 2017 at 10:49 AM, Tim Chen wrote:
>
> These changes look fine. We are testing them now.
> Does the second patch in the series look okay to you?
I didn't really have any reaction to that one, as long as Mel&co are
ok with it, I'm fine with it.
Linus
On 08/23/2017 04:30 PM, Linus Torvalds wrote:
> On Wed, Aug 23, 2017 at 11:17 AM, Linus Torvalds
> wrote:
>> On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen wrote:
>>>
>>> Will you still consider the original patch as a fail safe mechanism?
>>
>> I don't think we have much choice, although I would *really* want to
>> get this root-caused rather than just papering over the symptoms.
On Wed, Aug 23, 2017 at 11:17 AM, Linus Torvalds
wrote:
> On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen wrote:
>>
>> Will you still consider the original patch as a fail safe mechanism?
>
> I don't think we have much choice, although I would *really* want to
> get this root-caused rather than just papering over the symptoms.
>
> On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen
> wrote:
> >
> > Will you still consider the original patch as a fail safe mechanism?
>
> I don't think we have much choice, although I would *really* want to get this
> root-caused rather than just papering over the symptoms.
>
> Maybe still worth
On Wed, Aug 23, 2017 at 8:58 AM, Tim Chen wrote:
>
> Will you still consider the original patch as a fail safe mechanism?
I don't think we have much choice, although I would *really* want to
get this root-caused rather than just papering over the symptoms.
Maybe still worth testing that "sched/n
On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote:
> On Tue, Aug 22, 2017 at 10:23 AM, Liang, Kan wrote:
> >
> > Although the patch doesn't trigger watchdog, the spin lock wait time
> > is not small (0.45s).
> > It may get worse again on larger systems.
>
> Yeah, I don't think Mel's patch is great - because I think we could do
> so much better.
On 08/23/2017 07:49 AM, Liang, Kan wrote:
>> On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan wrote:
>>>
So I propose testing the attached trivial patch.
>>>
>>> It doesn’t work.
>>> The call stack is the same.
>>
>> So I would have expected the stack trace to be the same, and I would even
>> exp
> Subject: Re: [PATCH 1/2] sched/wait: Break up long wake list walk
>
> On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen wrote:
> >
> > I believe in this case it's used by threads, so a reference count
> > limit wouldn't help.
>
> For the first migration
> On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan wrote:
> >
> >> So I propose testing the attached trivial patch.
> >
> > It doesn’t work.
> > The call stack is the same.
>
> So I would have expected the stack trace to be the same, and I would even
> expect the CPU usage to be fairly similar, becau
On Tue, Aug 22, 2017 at 3:52 PM, Linus Torvalds
wrote:
>
> The *other* memory policies look fairly sane. They basically have a
> fairly well-defined preferred node for the policy (although the
> "MPOL_INTERLEAVE" looks wrong for a hugepage). But
> MPOL_PREFERRED/MPOL_F_LOCAL really looks complete
On Tue, Aug 22, 2017 at 2:24 PM, Andi Kleen wrote:
>
> I believe in this case it's used by threads, so a reference count limit
> wouldn't help.
For the first migration try, yes. But if it's some kind of "try and
try again" pattern, the second time you try and there are people
waiting for the page
On Tue, Aug 22, 2017 at 04:08:52PM -0500, Christopher Lameter wrote:
> On Tue, 22 Aug 2017, Andi Kleen wrote:
>
> > We only see it on 4S+ today. But systems are always getting larger,
> > so what's a large system today, will be a normal medium scale system
> > tomorrow.
> >
> > BTW we also collected PT traces for the long hang cases, but it was
> > hard to find a consistent pattern
On Tue, 22 Aug 2017, Andi Kleen wrote:
> We only see it on 4S+ today. But systems are always getting larger,
> so what's a large system today, will be a normal medium scale system
> tomorrow.
>
> BTW we also collected PT traces for the long hang cases, but it was
> hard to find a consistent pattern.
On Tue, Aug 22, 2017 at 1:53 PM, Peter Zijlstra wrote:
>
> So _the_ problem with yield() is when you hit this with a RT task it
> will busy spin and possibly not allow the task that actually has the
> lock to make progress at all.
I thought we had explicitly defined yield() to not do that.
But I
On Tue, Aug 22, 2017 at 01:42:13PM -0700, Linus Torvalds wrote:
> +void wait_on_page_bit_or_yield(struct page *page, int bit_nr)
> +{
> +	if (PageWaiters(page)) {
> +		yield();
> +		return;
> +	}
> +	wait_on_page_bit(page, bit_nr);
> +}
So _the_ problem with yield() is when you hit this with a RT task it
will busy spin and possibly not allow the task that actually has the
lock to make progress at all.
On Tue, Aug 22, 2017 at 12:55 PM, Liang, Kan wrote:
>
>> So I propose testing the attached trivial patch.
>
> It doesn’t work.
> The call stack is the same.
So I would have expected the stack trace to be the same, and I would
even expect the CPU usage to be fairly similar, because you'd see
repea
> So I propose testing the attached trivial patch.
It doesn’t work.
The call stack is the same.
100.00%  (821af140)
         |
         ---wait_on_page_bit
            __migration_entry_wait
            migration_entry_wait
            do_swap_page
            __h
> > Still, generating such a migration storm would be fairly tricky I think.
>
> Well, Mel seems to have been unable to generate a load that reproduces
> the long page waitqueues. And I don't think we've had any other
> reports of this either.
It could be that it requires a fairly large system. O
On Tue, Aug 22, 2017 at 12:08 PM, Peter Zijlstra wrote:
>
> So that migration stuff has a filter on, we need two consecutive numa
> faults from the same page_cpupid 'hash', see
> should_numa_migrate_memory().
Hmm. That is only called for MPOL_F_MORON.
We don't actually know what policy the probl
On Tue, Aug 22, 2017 at 11:56 AM, Peter Zijlstra wrote:
>
> Won't we now prematurely terminate the wait when we get a spurious
> wakeup?
I think there's two answers to that:
(a) do we even care?
(b) what spurious wakeup?
The "do we even care" question is because wait_on_page_bit by
definition
On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote:
> And since everybody touches it, as a result everybody eventually
> thinks that page should be migrated to their NUMA node.
So that migration stuff has a filter on, we need two consecutive numa
faults from the same page_cpupid 'hash'
On Tue, Aug 22, 2017 at 11:19:12AM -0700, Linus Torvalds wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a49702445ce0..75c29a1f90fb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -991,13 +991,11 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
>
On Tue, Aug 22, 2017 at 11:19 AM, Linus Torvalds
wrote:
>
> So I propose testing the attached trivial patch. It may not do
> anything at all. But the existing code is actually doing extra work
> just to be fragile, in case the scenario above can happen.
Side note: the patch compiles for me. But t
On Tue, Aug 22, 2017 at 10:23 AM, Liang, Kan wrote:
>
> Although the patch doesn't trigger watchdog, the spin lock wait time
> is not small (0.45s).
> It may get worse again on larger systems.
Yeah, I don't think Mel's patch is great - because I think we could do
so much better.
What I like abou
> > Covering both paths would be something like the patch below which
> > spins until the page is unlocked or it should reschedule. It's not
> > even boot tested as I spent what time I had on the test case that I
> > hoped would be able to prove it really works.
>
> I will give it a try.
Although the patch doesn't trigger watchdog, the spin lock wait time
is not small (0.45s). It may get worse again on larger systems.
> > Because that code sequence doesn't actually depend on
> > "wait_on_page_lock()" for _correctness_ anyway, afaik. Anybody who
> > does "migration_entry_wait()" _has_ to retry anyway, since the page
> > table contents may have changed by waiting.
> >
> > So I'm not proud of the attached patch, an
On Fri, Aug 18, 2017 at 12:14:12PM -0700, Linus Torvalds wrote:
> On Fri, Aug 18, 2017 at 11:54 AM, Mel Gorman
> wrote:
> >
> > One option to mitigate (but not eliminate) the problem is to record when
> > the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to
> > task_numa_fault()
On Fri, Aug 18, 2017 at 1:29 PM, Liang, Kan wrote:
> Here is the profiling with THP disabled for wait_on_page_bit_common and
> wake_up_page_bit.
>
>
> The call stack of wait_on_page_bit_common
> # Overhead Trace output
> # ..
> #
>100.00% (821aefca)
>
On Fri, Aug 18, 2017 at 1:05 PM, Andi Kleen wrote:
>
> I think what's happening is that it allows more parallelism during wakeup:
>
> Normally it's like
>
> CPU 1                         CPU 2          CPU 3
>
> LOCK
> wake up tasks on other CPUs   woken up
> >>
> >> That indicates that it may be a hot page and it's possible that the
> >> page is locked for a short time but waiters accumulate. What happens
> >> if you leave NUMA balancing enabled but disable THP?
> >
> > No, disabling THP doesn't help the case.
>
> Interesting. That particular code
On Fri, Aug 18, 2017 at 12:58 PM, Andi Kleen wrote:
>> which is hacky, but there's a rationale for it:
>>
>> (a) avoid the crazy long wait queues ;)
>>
>> (b) we know that migration is *supposed* to be CPU-bound (not IO
>> bound), so yielding the CPU and retrying may just be the right thing
>> to do.
> I was really hoping that we'd root-cause this and have a solution (and
> then apply Tim's patch as a "belt and suspenders" kind of thing), but
One thing I wanted to point out is that Tim's patch seems to make
several schedule intensive micro benchmarks faster.
I think what's happening is that it allows more parallelism during wakeup.
> which is hacky, but there's a rationale for it:
>
> (a) avoid the crazy long wait queues ;)
>
> (b) we know that migration is *supposed* to be CPU-bound (not IO
> bound), so yielding the CPU and retrying may just be the right thing
> to do.
So this would degenerate into a spin when the conte
On Fri, Aug 18, 2017 at 11:54 AM, Mel Gorman
wrote:
>
> One option to mitigate (but not eliminate) the problem is to record when
> the page lock is contended and pass in TNF_PAGE_CONTENDED (new flag) to
> task_numa_fault().
Well, finding it contended is fairly easy - just look at the page wait
queue
On Fri, Aug 18, 2017 at 10:48:23AM -0700, Linus Torvalds wrote:
> On Fri, Aug 18, 2017 at 9:53 AM, Liang, Kan wrote:
> >
> >> On Fri, Aug 18, 2017 Mel Gorman wrote:
> >>
> >> That indicates that it may be a hot page and it's possible that the page is
> >> locked for a short time but waiters accumulate. What happens if you leave
> >> NUMA balancing enabled but disable THP?
On Fri, Aug 18, 2017 at 9:53 AM, Liang, Kan wrote:
>
>> On Fri, Aug 18, 2017 Mel Gorman wrote:
>>
>> That indicates that it may be a hot page and it's possible that the page is
>> locked for a short time but waiters accumulate. What happens if you leave
>> NUMA balancing enabled but disable THP?
On Fri, Aug 18, 2017 at 5:23 AM, Mel Gorman wrote:
>
> 	new_page = alloc_pages_node(node,
> -		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
> +		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & ~__GFP_DIRECT_RECLAIM,
> 		HPAGE_PMD_ORDER);
That can't make any difference
> On Fri, Aug 18, 2017 at 02:20:38PM +0000, Liang, Kan wrote:
> > > Nothing fancy other than needing a comment if it works.
> > >
> >
> > No, the patch doesn't work.
> >
>
> That indicates that it may be a hot page and it's possible that the page is
> locked for a short time but waiters accumulate. What happens if you leave
> NUMA balancing enabled but disable THP?
> Still, I don't think this problem is THP specific. If there is a hot
> regular page getting migrated, we'll also see many threads get
> queued up quickly. THP may have made the problem worse as migrating
> it takes a longer time, meaning more threads could get queued up.
Also THP probably make
On 08/18/2017 07:46 AM, Mel Gorman wrote:
> On Fri, Aug 18, 2017 at 02:20:38PM +0000, Liang, Kan wrote:
>>> Nothing fancy other than needing a comment if it works.
>>>
>>
>> No, the patch doesn't work.
>>
>
> That indicates that it may be a hot page and it's possible that the page is
> locked for a short time but waiters accumulate. What happens if you leave
> NUMA balancing enabled but disable THP?
On Fri, Aug 18, 2017 at 02:20:38PM +0000, Liang, Kan wrote:
> > Nothing fancy other than needing a comment if it works.
> >
>
> No, the patch doesn't work.
>
That indicates that it may be a hot page and it's possible that the page is
locked for a short time but waiters accumulate. What happens if you leave
NUMA balancing enabled but disable THP?
> On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote:
> > On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote:
> > >
> > > Here is the call stack of wait_on_page_bit_common when the queue is
> > > long (entries >1000).
> > >
> > > # Overhead Trace output
> > > #
> On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote:
> >
> > Here is the call stack of wait_on_page_bit_common when the queue is
> > long (entries >1000).
> >
> > # Overhead Trace output
> > # ..
> > #
> >100.00% (931aefca)
> > |
> >
On Thu, Aug 17, 2017 at 01:44:40PM -0700, Linus Torvalds wrote:
> On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote:
> >
> > Here is the call stack of wait_on_page_bit_common
> > when the queue is long (entries >1000).
> >
> > # Overhead Trace output
> > # ..
> > #
> >
On Thu, Aug 17, 2017 at 1:18 PM, Liang, Kan wrote:
>
> Here is the call stack of wait_on_page_bit_common
> when the queue is long (entries >1000).
>
> # Overhead Trace output
> # ..
> #
>100.00% (931aefca)
> |
> ---wait_on_page_bit
>
> > Here is the wake_up_page_bit call stack when the workaround is running, which
> > is collected by perf record -g -a -e probe:wake_up_page_bit -- sleep 10
>
> It's actually not really wake_up_page_bit() that is all that
> interesting, it would be more interesting to see which path it is tha
On Thu, Aug 17, 2017 at 9:17 AM, Liang, Kan wrote:
>
> We tried both 12 and 16 bit table and that didn't make a difference.
> The long wake ups are mostly on the same page when we do instrumentation
Ok.
> Here is the wake_up_page_bit call stack when the workaround is running, which
> is collected by perf record -g -a -e probe:wake_up_page_bit -- sleep 10
> On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen
> wrote:
> > We encountered workloads that have very long wake up list on large
> > systems. A waker takes a long time to traverse the entire wake list
> > and execute all the wake functions.
> >
> > We saw page wait list that are up to 3700+ entries long
Linus Torvalds writes:
> On Tue, Aug 15, 2017 at 3:57 PM, Linus Torvalds
> wrote:
>>
>> Oh, and the page wait-queue really needs that key argument too, which
>> is another thing that swait queue code got rid of in the name of
>> simplicity.
>
> Actually, it gets worse.
>
> Because the page wait queues are hashed, it's not an all-or-nothing
On Tue, Aug 15, 2017 at 3:57 PM, Linus Torvalds
wrote:
>
> Oh, and the page wait-queue really needs that key argument too, which
> is another thing that swait queue code got rid of in the name of
> simplicity.
Actually, it gets worse.
Because the page wait queues are hashed, it's not an all-or-nothing
On Tue, Aug 15, 2017 at 3:56 PM, Linus Torvalds
wrote:
>
> Except they really don't actually work for this case, exactly because
> they also simplify away "minor" details like exclusive vs
> non-exclusive etc.
>
> The page wait-queue very much has a mix of "wake all" and "wake one"
> semantics.
On Tue, Aug 15, 2017 at 3:47 PM, Davidlohr Bueso wrote:
>
> Or you can always use wake_qs; which exists _exactly_ for the issues you
> are running into
Except they really don't actually work for this case, exactly because
they also simplify away "minor" details like exclusive vs
non-exclusive etc
On Mon, 14 Aug 2017, Linus Torvalds wrote:
On Mon, Aug 14, 2017 at 8:15 PM, Andi Kleen wrote:
But what should we do when some other (non page) wait queue runs into the
same problem?
Hopefully the same: root-cause it.
Or you can always use wake_qs; which exists _exactly_ for the issues you
On Tue, Aug 15, 2017 at 12:41 PM, Linus Torvalds
wrote:
>
> So if we have unnecessarily collisions because we have waiters looking
> at different bits of the same page, we could just hash in the bit
> number that we're waiting for too.
Oh, nope, we can't do that, because we only have one "PageWaiters" bit.
On Tue, Aug 15, 2017 at 12:05 PM, Tim Chen wrote:
>
> We have a test case but it is a customer workload. We'll try to get
> a bit more info.
Ok. Being a customer workload is lovely in the sense that it is
actually a real load, not just a microbenchmark.
But yeah, it makes it harder to describe
On 08/14/2017 08:28 PM, Linus Torvalds wrote:
> On Mon, Aug 14, 2017 at 8:15 PM, Andi Kleen wrote:
>> But what should we do when some other (non page) wait queue runs into the
>> same problem?
>
> Hopefully the same: root-cause it.
>
> Once you have a test-case, it should generally be fairly simple to do
> with profiles, just seeing who the caller is
On Mon, Aug 14, 2017 at 8:15 PM, Andi Kleen wrote:
> But what should we do when some other (non page) wait queue runs into the
> same problem?
Hopefully the same: root-cause it.
Once you have a test-case, it should generally be fairly simple to do
with profiles, just seeing who the caller is whe
> That is what the patch does now, and that is why I dislike the patch.
>
> So I _am_ NAK'ing the patch if nobody is willing to even try alternatives.
Ok, perhaps larger hash table is the right solution for this one.
But what should we do when some other (non page) wait queue runs into the
same
On Mon, Aug 14, 2017 at 7:27 PM, Andi Kleen wrote:
>
> We could try it and it may even help in this case and it may
> be a good idea in any case on such a system, but:
>
> - Even with a large hash table it might be that by chance all CPUs
> will be queued up on the same page
> - There are a lot of
On Mon, Aug 14, 2017 at 06:48:06PM -0700, Linus Torvalds wrote:
> On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen wrote:
> > We encountered workloads that have very long wake up list on large
> > systems. A waker takes a long time to traverse the entire wake list and
> > execute all the wake functions.
On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen wrote:
> We encountered workloads that have very long wake up list on large
> systems. A waker takes a long time to traverse the entire wake list and
> execute all the wake functions.
>
> We saw page wait list that are up to 3700+ entries long in tests of