Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-09 Thread Sergey Senozhatsky
On (09/06/19 16:01), Peter Zijlstra wrote:
> In fact, I've gotten output that is plain impossible with
> the current junk.

Peter, can you post any of those backtraces? Very curious.

-ss


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-08 Thread Peter Zijlstra
On Fri, Sep 06, 2019 at 04:01:26PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 06, 2019 at 02:42:11PM +0200, Petr Mladek wrote:

> > 7. People would complain when continuous lines become less
> >reliable. It might be most visible when mixing backtraces
> >from all CPUs. Simple sorting by prefix will not make
> >it readable. The historic way was to synchronize CPUs
> >by a spin lock. But then the cpu_lock() could cause
> >deadlock.
> 
> Why? I'm running with that thing on, I've never seen a deadlock ever
> because of it. In fact, I've gotten output that is plain impossible with
> the current junk.
> 
> The cpu-lock is inside the all-backtrace spinlock, not outside. And as I
> said yesterday, only the lockless console has any wait-loops while
> holding the cpu-lock. It _will_ make progress.

So I've been a huge flaming idiot.. while I'm not particularly
sympathetic to NMIs that block, there are a number of really trivial
deadlocks possible -- and it is a minor miracle I've not actually hit
them (I suppose because printk() isn't really all that common).

The whole cpu-lock thing I had needs to go. But not having it makes
lockless console output unreadable and unusable garbage.

I've got some ideas on a replacement, but I need to further consider it.

:-/


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread John Ogness
On 2019-09-06, Peter Zijlstra  wrote:
>> I wish it was that simple. It is possible that I see it too
>> complicated. But this comes to my mind:
>> 
>> 1. The simple printk_buffer_store(buf, n) is not NMI safe. For this,
>>we might need the reserve-store approach.
>
> Of course it is, and sure it has a reserve+commit internally. I'm sure
> I posted an implementation of something like this at some point.
>
> It is lockless (wait-free in fact, which is stronger) and supports
> multi-readers. I'm sure I posted something like that before, and ISTR
> John has something like that around somewhere too.

Yes. It was called RFCv1[0].

> The only thing I'm omitting is doing vscnprintf() twice, first to
> determine the length, and then into the reservation. Partly because I
> think that is silly and 256 chars should be plenty for everyone,
> partly because that avoids having vscnprintf() inside the cpu_lock()
> and partly because it is simpler to not do that.

Yes, this approach is more straightforward and was suggested in the
feedback to RFCv1, although I think the current limit (1024) should
still be OK. That gives us 1 dedicated page per CPU for vscnprintf().
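
As a rough sketch only (the array name is made up, and assuming the four
nesting levels from Peter's snippet), the per-CPU scratch space for
vscnprintf() works out to exactly one 4K page per CPU:

#define PRINTK_NEST_MAX         4       /* nesting levels, as in Peter's example */
#define PRINTK_LINE_MAX         1024    /* current line-length limit */

/* 4 x 1024 bytes = one 4K page of vscnprintf() scratch space per CPU */
static DEFINE_PER_CPU(char, printk_sprint_buf[PRINTK_NEST_MAX][PRINTK_LINE_MAX]);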

>> 2. The simple approach works only with lockless consoles. We need
>>something else for the rest at least for NMI. Simple offloading
>>to a kthread has been blocked for years. People wanted the
>>trylock-and-flush-immediately approach.
>
> Have an irq_work to wake up a kthread that will print to shit
> consoles.

This is the approach in all the RFC versions.

>> 5. John planned to use the cpu_lock in the lockless consoles.
>>I wonder if it was only in the console->write() callback
>>or if it would spread the lock more widely.

The 8250 driver in RFCv1 uses the cpu-lock in console->write() on a
per-character basis and in console->write_atomic() on a per-line
basis. This is necessary because the 8250 driver cannot run lockless. It
requires synchronization for its UART_IER clearing/setting before/after
transmit.
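
For illustration only (this is not the RFCv1 code; uart_in()/uart_out()
stand in for the real port accessors, and cpu_lock()/cpu_unlock() are the
primitives discussed in this thread), a per-line write_atomic() for an
8250-style UART would look roughly like:

static void sketch_8250_write_atomic(struct console *con,
                                     const char *s, unsigned int count)
{
        unsigned int i;
        unsigned char ier;
        int old;

        old = cpu_lock();                       /* serialize CPUs for the whole line */

        ier = uart_in(UART_IER);                /* mask the port's interrupts ... */
        uart_out(UART_IER, 0);

        for (i = 0; i < count; i++) {
                while (!(uart_in(UART_LSR) & UART_LSR_THRE))
                        cpu_relax();            /* poll until the TX holding reg is empty */
                uart_out(UART_TX, s[i]);
        }

        uart_out(UART_IER, ier);                /* ... and restore them afterwards */

        cpu_unlock(old);
}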

IMO the existing early console implementations are _not_ safe for
preemption. This was the reason for the new write_atomic() callback in
RFCv1.

> Right, I'm saying that since you need it anyway, lift it up one layer.
> It makes everything simpler. More simpler is more better.

This was my reasoning for using the cpu-lock in RFCv1. The move to a
lockless ringbuffer in RFCv2 was made because there was too much
resistance/concern surrounding the cpu-lock. But yes, if we want to
support atomic consoles, the cpu-lock will still be needed.

The cpu-lock (and the related concerns) were discussed here[1].

>> 7. People would complain when continuous lines become less
>>reliable. It might be most visible when mixing backtraces
>>from all CPUs. Simple sorting by prefix will not make
>>it readable. The historic way was to synchronize CPUs
>>by a spin lock. But then the cpu_lock() could cause
>>deadlock.
>
> Why? I'm running with that thing on, I've never seen a deadlock ever
> because of it.

As was discussed in the thread I just mentioned, introducing the
cpu-lock means that _all_ NMI functions taking spinlocks need to use the
cpu-lock. Even though Peter has never seen a deadlock, a deadlock is
possible if a BUG is triggered while one such spinlock is held. Also
note that it is not allowed to have 2 cpu-locks in the system. This is
where the BKL references started showing up.

Spinlocks in NMI context are rare, but they have existed in the past and
could exist again in the future. My suggestion was to create the policy
that any needed locking in NMI context must be done using the one
cpu-lock.
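
For reference, a minimal sketch of such a re-entrant cpu-lock (not code
from any posted patch; interrupt disabling and memory-ordering details are
elided, and callers are assumed to have preemption disabled):

static atomic_t printk_cpu_lock_owner = ATOMIC_INIT(-1);

static int cpu_lock(void)
{
        int cpu = smp_processor_id();
        int old = atomic_read(&printk_cpu_lock_owner);

        if (old == cpu)
                return old;             /* nested acquire (e.g. from NMI): already ours */

        while (atomic_cmpxchg(&printk_cpu_lock_owner, -1, cpu) != -1)
                cpu_relax();            /* spin until the lock is free */

        return -1;                      /* we are the outermost owner */
}

static void cpu_unlock(int old)
{
        if (old == -1)                  /* only the outermost caller releases */
                atomic_set(&printk_cpu_lock_owner, -1);
}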

John Ogness

[0] https://lkml.kernel.org/r/20190212143003.48446-1-john.ogn...@linutronix.de
[1] https://lkml.kernel.org/r/20190227094655.ecdwhsc2bf5sp...@pathway.suse.cz


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread Sergey Senozhatsky
On (09/06/19 16:01), Peter Zijlstra wrote:
> > 2. The simple approach works only with lockless consoles. We need
> >something else for the rest at least for NMI. Simple offloading
> >to a kthread has been blocked for years. People wanted the
> >trylock-and-flush-immediately approach.
> 
> Have an irq_work to wake up a kthread that will print to shit consoles.

Do we need the sched dependency? We can print a batch of pending
logbuf messages and queue another irq_work if there are more
pending messages, right?
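
Something along these lines, as a sketch (flush_one_record() is a made-up
helper; irq_work_queue() is the real API):

#define PRINTK_IRQ_WORK_BATCH   16      /* bounded work per irq_work invocation */

static void printk_irq_work_func(struct irq_work *work)
{
        int i;

        for (i = 0; i < PRINTK_IRQ_WORK_BATCH; i++) {
                if (!flush_one_record())        /* hypothetical: false when logbuf is empty */
                        return;
        }

        /* Budget used up but records remain: run again later, no scheduler needed. */
        irq_work_queue(work);
}

static struct irq_work printk_irq_work; /* init_irq_work(&printk_irq_work, printk_irq_work_func) */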

-ss


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread Peter Zijlstra
On Fri, Sep 06, 2019 at 04:01:26PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 06, 2019 at 02:42:11PM +0200, Petr Mladek wrote:
> > 7. People would complain when continuous lines become less
> >reliable. It might be most visible when mixing backtraces
> >from all CPUs. Simple sorting by prefix will not make
> >it readable. The historic way was to synchronize CPUs
> >by a spin lock. But then the cpu_lock() could cause
> >deadlock.
> 
> Why? I'm running with that thing on, I've never seen a deadlock ever
> because of it. In fact, I've gotten output that is plain impossible with
> the current junk.
> 
> The cpu-lock is inside the all-backtrace spinlock, not outside. And as I
> said yesterday, only the lockless console has any wait-loops while
> holding the cpu-lock. It _will_ make progress.

Oooh, I think I see. So one solution would be to pass the NMI along in
a chain: send it to a single CPU at a time and, when it has finished,
send it to the next.
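
Roughly like the sketch below: NMI one CPU, wait until that CPU has
finished printing, then move on (the wait helper is invented here; the
single-CPU trigger is the existing nmi.h helper):

static void sketch_chained_backtrace(void)
{
        int this_cpu = get_cpu();       /* disable preemption while we walk the CPUs */
        int cpu;

        for_each_online_cpu(cpu) {
                if (cpu == this_cpu) {
                        dump_stack();                   /* print our own trace directly */
                        continue;
                }
                trigger_single_cpu_backtrace(cpu);      /* NMI exactly one remote CPU */
                wait_for_backtrace_done(cpu);           /* hypothetical: poll a per-CPU flag */
        }

        put_cpu();
}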


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread Peter Zijlstra
On Fri, Sep 06, 2019 at 02:42:11PM +0200, Petr Mladek wrote:

> I wish it was that simple. It is possible that I see it too
> complicated. But this comes to my mind:
> 
> 1. The simple printk_buffer_store(buf, n) is not NMI safe. For this,
>we might need the reserve-store approach.

Of course it is, and sure it has a reserve+commit internally. I'm sure I
posted an implementation of something like this at some point.

It is lockless (wait-free in fact, which is stronger) and supports
multi-readers. I'm sure I posted something like that before, and ISTR
John has something like that around somewhere too.

The only thing I'm omitting is doing vscnprintf() twice, first to
determine the length, and then into the reservation. Partly because I
think that is silly and 256 chars should be plenty for everyone, partly
because that avoids having vscnprintf() inside the cpu_lock() and partly
because it is simpler to not do that.

> 2. The simple approach works only with lockless consoles. We need
> >something else for the rest at least for NMI. Simple offloading
>to a kthread has been blocked for years. People wanted the
>trylock-and-flush-immediately approach.

Have an irq_work to wake up a kthread that will print to shit consoles.
Seriously.. the trylock and flush stuff is horrific crap. You guys have
been piling on the hacks for years now, surely you're tired of that gunk?

(and if you _really_ care, build a flush function that 'works'
mostly and waits for the kthread of choice to finish printing to the
'important' shit console).
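
Concretely, the wake-a-kthread part could look roughly like this (a
sketch, not an implementation; pending_records() and flush_to_consoles()
are invented names):

static struct task_struct *printk_kthread;

static void wake_printk_kthread(struct irq_work *work)
{
        wake_up_process(printk_kthread);        /* safe from IRQ context */
}

static int printk_kthread_func(void *unused)
{
        while (!kthread_should_stop()) {
                set_current_state(TASK_INTERRUPTIBLE);
                if (!pending_records())         /* hypothetical: anything left in the ringbuffer? */
                        schedule();
                __set_current_state(TASK_RUNNING);

                flush_to_consoles();            /* hypothetical: may sleep, may be slow */
        }
        return 0;
}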

> 3. console_lock works in tty as a big kernel lock. I do not know
>much details. But people familiar with the code said that
>it was a disaster. I assume that tty is still rather
>important console. I am not sure how it would fit into the
>simple approach.

The kernel thread in charge of printing doesn't care.

> 4. The console handling has got non-synchronous (console_trylock)
>quite early (ver 2.4.10, year 2001). The reason was to do not
>serialize CPUs by the speed of the console.
> 
>Serialized output could remove many troubles. The logic in
>console_unlock() is really crazy. It might be acceptable
>for debugging. But is it acceptable on production systems?

The kernel thread doesn't care. If you care about independent consoles,
have a kernel thread per console. That way a fast console can print fast
while a slow console will print slow and everybody is happy.

> 5. John planned to use the cpu_lock in the lockless consoles.
>I wonder if it was only in the console->write() callback
>or if it would spread the lock more widely.

Right, I'm saying that since you need it anyway, lift it up one layer.
It makes everything simpler. More simpler is more better.

> 6. One huge nightmare is panic() and code called from there.
>It is a maze of hacks, including arch-specific code, to
>prevent deadlocks and get the messages out.
> 
>Any lock might be blocked on any CPU at the moment. Or it
> >might become blocked when CPUs are stopped by NMI.
> 
>Fully lock-less log buffer might save us some headache.
>I am not sure whether a single lock shared between printk()
>writers and console drivers will make the situation easier
>or more complicated.

So panic is a non issue for the lockless console.

It only matters if you care to get something out of the crap consoles.
So print everything to the lockless buffer and lockless consoles, then
try and force as much as you can out of the crap consoles.

If you die, tough luck; at least the lockless consoles and kdump image
have the whole message.
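
In sketch form (illustrative only, using the existing trylock primitives):
the lockless consoles were already written synchronously from printk(), so
the panic path only makes a non-blocking attempt at the rest:

static void sketch_panic_console_flush(void)
{
        /* Lockless/atomic consoles: nothing to do, output already went out. */

        /* Legacy consoles: try, but never wait on a lock during panic. */
        if (console_trylock())
                console_unlock();       /* prints any pending records */
}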

> 7. People would complain when continuous lines become less
>reliable. It might be most visible when mixing backtraces
>from all CPUs. Simple sorting by prefix will not make
>it readable. The historic way was to synchronize CPUs
>by a spin lock. But then the cpu_lock() could cause
>deadlock.

Why? I'm running with that thing on, I've never seen a deadlock ever
because of it. In fact, I've gotten output that is plain impossible with
the current junk.

The cpu-lock is inside the all-backtrace spinlock, not outside. And as I
said yesterday, only the lockless console has any wait-loops while
holding the cpu-lock. It _will_ make progress.

> I would be really happy when we could ignore some of the problems
> or find an easy solution. I just want to make sure that we take
> into account all the known aspects.
> 
> I am sure that we could do better than we do now. I do not want
> to block any improvements. I am just a bit lost in the many
> black corners.

I hope the above helps. Also note that Linus' memory buffer is a
lockless console.


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread Sergey Senozhatsky
On (09/06/19 12:49), Peter Zijlstra wrote:
> On Fri, Sep 06, 2019 at 07:09:43PM +0900, Sergey Senozhatsky wrote:
> 
> > ---
> > diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
> > index 139c310049b1..9c73eb6259ce 100644
> > --- a/kernel/printk/printk_safe.c
> > +++ b/kernel/printk/printk_safe.c
> > @@ -103,7 +103,10 @@ static __printf(2, 0) int printk_safe_log_store(struct 
> > printk_safe_seq_buf *s,
> > if (atomic_cmpxchg(&s->len, len, len + add) != len)
> > goto again;
> >  
> > -   queue_flush_work(s);
> > +   if (early_console)
> > +   early_console->write(early_console, s->buffer + len, add);
> > +   else
> > +   queue_flush_work(s);
> > return add;
> >  }
> 
> You've not been following along, that generates absolutely unreadable
> garbage.

This was more of a joke/reference to "Those NMI buffers are a trainwreck
and need to die a horrible death". Of course this needs a re-entrant cpu
lock to serialize access to atomic/early consoles. But there is one more
missing thing - we need atomic/early consoles on a separate, sort of
immutable, list, and we should probably forbid any modifications of such
console drivers (PM, etc.). If we can do this then we don't need to take
console_sem while we iterate that list, which removes the sched/timekeeping
locks from the fast printk() path.
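
Roughly like this, as a sketch (struct atomic_console and the list name
are invented for illustration):

struct atomic_console {
        void (*write)(struct atomic_console *con, const char *s, unsigned int n);
        struct atomic_console *next;
};

/* Append-only during early boot, never modified afterwards. */
static struct atomic_console *atomic_console_list;

static void atomic_console_write_all(const char *s, unsigned int n)
{
        struct atomic_console *con;

        /* No console_sem, no sched/timekeeping locks: a plain pointer walk. */
        for (con = READ_ONCE(atomic_console_list); con; con = con->next)
                con->write(con, s, n);
}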

At the same time, we don't have that many options on systems without
atomic/early consoles. Move printing to NMI (e.g. up to X pending logbuf
lines per NMI)? Move printing to IPI (again, up to X pending logbuf lines
per IPI)? printk() softirqs?

-ss


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread Petr Mladek
On Fri 2019-09-06 11:06:27, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 04:31:18PM +0200, Peter Zijlstra wrote:
> > So I have something roughly like the below; I'm suggesting you add the
> > line with + on:
> > 
> >   int early_vprintk(const char *fmt, va_list args)
> >   {
> > char buf[256]; // teh suck!
> > int old, n = vscnprintf(buf, sizeof(buf), fmt, args);
> > 
> > old = cpu_lock();
> > +   printk_buffer_store(buf, n);
> > early_console->write(early_console, buf, n);
> > cpu_unlock(old);
> > 
> > return n;
> >   }
> > 
> > (yes, yes, we can get rid of the on-stack @buf thing with a
> > reserve+commit API, but who cares :-))
> 
> Another approach is something like:
> 
> DEFINE_PER_CPU(int, printk_nest);
> DEFINE_PER_CPU(char, printk_line[4][256]);
> 
> int vprintk(const char *fmt, va_list args)
> {
>   int c, n, i;
>   char *buf;
> 
>   preempt_disable();
>   i = min(3, this_cpu_inc_return(printk_nest) - 1);
>   buf = this_cpu_ptr(printk_line[i]);
>   n = vscnprintf(buf, 256, fmt, args);
> 
>   c = cpu_lock();
>   printk_buffer_store(buf, n);
>   if (early_console)
>   early_console->write(early_console, buf, n);
>   cpu_unlock(c);
> 
>   this_cpu_dec(printk_nest);
>   preempt_enable();
> 
>   return n;
> }
> 
> Again, simple and straight forward (and I'm sure it's been mentioned
> before too).
> 
> We really should not be making this stuff harder than it needs to be
> (and anybody whining about lines longer than 256 characters can just go
> away, those are unreadable anyway).

I wish it was that simple. It is possible that I see it too
complicated. But this comes to my mind:

1. The simple printk_buffer_store(buf, n) is not NMI safe. For this,
   we might need the reserve-store approach.


2. The simple approach works only with lockless consoles. We need
   something else for the rest at least for NMI. Simple offloading
   to a kthread has been blocked for years. People wanted the
   trylock-and-flush-immediately approach.


3. console_lock works in tty as a big kernel lock. I do not know
   many details. But people familiar with the code said that
   it was a disaster. I assume that tty is still rather
   important console. I am not sure how it would fit into the
   simple approach.


4. The console handling became non-synchronous (console_trylock)
   quite early (ver 2.4.10, year 2001). The reason was to not
   serialize CPUs by the speed of the console.

   Serialized output could remove many troubles. The logic in
   console_unlock() is really crazy. It might be acceptable
   for debugging. But is it acceptable on production systems?


5. John planned to use the cpu_lock in the lockless consoles.
   I wonder if it was only in the console->write() callback
   or if it would spread the lock more widely.


6. One huge nightmare is panic() and code called from there.
   It is a maze of hacks, including arch-specific code, to
   prevent deadlocks and get the messages out.

   Any lock might be blocked on any CPU at the moment. Or it
   might become blocked when CPUs are stopped by NMI.

   Fully lock-less log buffer might save us some headache.
   I am not sure whether a single lock shared between printk()
   writers and console drivers will make the situation easier
   or more complicated.


7. People would complain when continuous lines become less
   reliable. It might be most visible when mixing backtraces
   from all CPUs. Simple sorting by prefix will not make
   it readable. The historic way was to synchronize CPUs
   by a spin lock. But then the cpu_lock() could cause
   deadlock.


I would be really happy when we could ignore some of the problems
or find an easy solution. I just want to make sure that we take
into account all the known aspects.

I am sure that we could do better than we do now. I do not want
to block any improvements. I am just a bit lost in the many
dark corners.

Best Regards,
Petr


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread Peter Zijlstra
On Fri, Sep 06, 2019 at 07:09:43PM +0900, Sergey Senozhatsky wrote:

> ---
> diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
> index 139c310049b1..9c73eb6259ce 100644
> --- a/kernel/printk/printk_safe.c
> +++ b/kernel/printk/printk_safe.c
> @@ -103,7 +103,10 @@ static __printf(2, 0) int printk_safe_log_store(struct 
> printk_safe_seq_buf *s,
> if (atomic_cmpxchg(&s->len, len, len + add) != len)
> goto again;
>  
> -   queue_flush_work(s);
> +   if (early_console)
> +   early_console->write(early_console, s->buffer + len, add);
> +   else
> +   queue_flush_work(s);
> return add;
>  }

You've not been following along, that generates absolutely unreadable
garbage.


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread Sergey Senozhatsky
On (09/06/19 11:06), Peter Zijlstra wrote:
> Another approach is something like:
> 
> DEFINE_PER_CPU(int, printk_nest);
> DEFINE_PER_CPU(char, printk_line[4][256]);
>
> int vprintk(const char *fmt, va_list args)
> {
>   int c, n, i;
>   char *buf;
> 
>   preempt_disable();
>   i = min(3, this_cpu_inc_return(printk_nest) - 1);
>   buf = this_cpu_ptr(printk_line[i]);
>   n = vscnprintf(buf, 256, fmt, args);
> 
>   c = cpu_lock();
>   printk_buffer_store(buf, n);
>   if (early_console)
>   early_console->write(early_console, buf, n);
>   cpu_unlock(c);
>
>   this_cpu_dec(printk_nest);
>   preempt_enable();
> 
>   return n;
> }
> 
> Again, simple and straight forward (and I'm sure it's been mentioned
> before too).

 :)

---
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index 139c310049b1..9c73eb6259ce 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -103,7 +103,10 @@ static __printf(2, 0) int printk_safe_log_store(struct 
printk_safe_seq_buf *s,
if (atomic_cmpxchg(&s->len, len, len + add) != len)
goto again;
 
-   queue_flush_work(s);
+   if (early_console)
+   early_console->write(early_console, s->buffer + len, add);
+   else
+   queue_flush_work(s);
return add;
 }
---

-ss


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread Petr Mladek
On Thu 2019-09-05 12:11:01, Steven Rostedt wrote:
> 
> [ Added Ted and Linux Plumbers ]
> 
> On Thu, 5 Sep 2019 17:38:21 +0200 (CEST)
> Thomas Gleixner  wrote:
> 
> > On Thu, 5 Sep 2019, Peter Zijlstra wrote:
> > > On Thu, Sep 05, 2019 at 03:05:13PM +0200, Petr Mladek wrote:  
> > > > The alternative lockless approach is still more complicated than
> > > > the serialized one. But I think that it is manageable thanks to
> > > > the simplified state tracking. And it might save us some pain
> > > > in the long term.  
> > > 
> > > I've not looked at it yet, sorry. But per the above argument of needing
> > > the CPU serialization _anyway_, I don't see a compelling reason not to
> > > use it.
> > > 
> > > It is simple, it works. Let's use it.
> > > 
> > > If you really fancy a multi-writer buffer, you can always switch to one
> > > later, if you can convince someone it actually brings benefits and not
> > > just head-aches.  
> > 
> > Can we please grab one of the TBD slots at kernel summit next week, sit
> > down in a room and hash that out?
> >
> 
> We should definitely be able to find a room that will be available next
> week.

Sounds great. I am blocked only during the Livepatching miniconference,
which is scheduled on Wednesday, Sep 11 at 15:00
(basically the very last slot).

Best Regards,
Petr


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-06 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 04:31:18PM +0200, Peter Zijlstra wrote:
> So I have something roughly like the below; I'm suggesting you add the
> line with + on:
> 
>   int early_vprintk(const char *fmt, va_list args)
>   {
>   char buf[256]; // teh suck!
>   int old, n = vscnprintf(buf, sizeof(buf), fmt, args);
> 
>   old = cpu_lock();
> + printk_buffer_store(buf, n);
>   early_console->write(early_console, buf, n);
>   cpu_unlock(old);
> 
>   return n;
>   }
> 
> (yes, yes, we can get rid of the on-stack @buf thing with a
> reserve+commit API, but who cares :-))

Another approach is something like:

DEFINE_PER_CPU(int, printk_nest);
DEFINE_PER_CPU(char, printk_line[4][256]);

int vprintk(const char *fmt, va_list args)
{
int c, n, i;
char *buf;

preempt_disable();
i = min(3, this_cpu_inc_return(printk_nest) - 1);
buf = this_cpu_ptr(printk_line[i]);
n = vscnprintf(buf, 256, fmt, args);

c = cpu_lock();
printk_buffer_store(buf, n);
if (early_console)
early_console->write(early_console, buf, n);
cpu_unlock(c);

this_cpu_dec(printk_nest);
preempt_enable();

return n;
}

Again, simple and straightforward (and I'm sure it's been mentioned
before too).

We really should not be making this stuff harder than it needs to be
(and anybody whining about lines longer than 256 characters can just go
away, those are unreadable anyway).


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-05 Thread John Ogness
On 2019-09-05, Steven Rostedt  wrote:
>>> But per the above argument of needing the CPU serialization
>>> _anyway_, I don't see a compelling reason not to use it.
>>> 
>>> It is simple, it works. Let's use it.
>>> 
>>> If you really fancy a multi-writer buffer, you can always switch to
>>> one later, if you can convince someone it actually brings benefits
>>> and not just head-aches.
>> 
>> Can we please grab one of the TBD slots at kernel summit next week,
>> sit down in a room and hash that out?
>>
>
> We should definitely be able to find a room that will be available
> next week.

FWIW, on Monday at 12:45 I am giving a talk[0] on the printk
rework. I'll be dedicating a few slides to presenting the lockless
multi-writer design, but will also talk about the serialized CPU
approach from RFCv1.

John Ogness

[0] https://www.linuxplumbersconf.org/event/4/contributions/290/


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-05 Thread Steven Rostedt


[ Added Ted and Linux Plumbers ]

On Thu, 5 Sep 2019 17:38:21 +0200 (CEST)
Thomas Gleixner  wrote:

> On Thu, 5 Sep 2019, Peter Zijlstra wrote:
> > On Thu, Sep 05, 2019 at 03:05:13PM +0200, Petr Mladek wrote:  
> > > The alternative lockless approach is still more complicated than
> > > the serialized one. But I think that it is manageable thanks to
> > > the simplified state tracking. And it might save us some pain
> > > in the long term.  
> > 
> > I've not looked at it yet, sorry. But per the above argument of needing
> > the CPU serialization _anyway_, I don't see a compelling reason not to
> > use it.
> > 
> > It is simple, it works. Let's use it.
> > 
> > If you really fancy a multi-writer buffer, you can always switch to one
> > later, if you can convince someone it actually brings benefits and not
> > just head-aches.  
> 
> Can we please grab one of the TBD slots at kernel summit next week, sit
> down in a room and hash that out?
>

We should definitely be able to find a room that will be available next
week.

-- Steve


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-05 Thread Thomas Gleixner
On Thu, 5 Sep 2019, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 03:05:13PM +0200, Petr Mladek wrote:
> > The alternative lockless approach is still more complicated than
> > the serialized one. But I think that it is manageable thanks to
> > the simplified state tracking. And it might save us some pain
> > in the long term.
> 
> I've not looked at it yet, sorry. But per the above argument of needing
> the CPU serialization _anyway_, I don't see a compelling reason not to
> use it.
> 
> It is simple, it works. Let's use it.
> 
> If you really fancy a multi-writer buffer, you can always switch to one
> later, if you can convince someone it actually brings benefits and not
> just head-aches.

Can we please grab one of the TBD slots at kernel summit next week, sit
down in a room and hash that out?

Thanks,

tglx


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 03:05:13PM +0200, Petr Mladek wrote:

> The serialized approach used a lock. It was re-entrant and thus less
> error-prone but still a lock.
> 
> The lock was planned to be used not only to access the buffer but also
> for eventual locking inside lockless consoles. It might allow to
> have some synchronization even in lockless consoles. But it
> would be big-kernel-lock-like style. It might create yet
> another maze of problems.

I really don't see your point. All it does is limit buffer writers to a
single CPU, and does the same for the atomic/early console output.

But it must very much be a leaf lock -- that is, there must not be any
locking inside it -- and that is fine: if a console cannot do lockless
output, it simply cannot be marked as having an atomic/early console.

You've seen the force_earlyprintk patches I use [*]; that stuff works
and is infinitely better than the current printk trainwreck -- and it
uses exactly such serialization -- although I only added it to make the
output actually readable. And _that_ is exactly why I propose adding it:
you need it _anyway_.

So the argument goes like:

 - synchronous output to lockless consoles (early serial) is mandatory
 - such output needs to be CPU serialized, otherwise it becomes
   unreadable garbage.
 - since we need that serialization anyway, might as well lift it up one
   layer and put it around the buffer.

Since a single-cpu buffer writer can be wait-free (and relatively
simple), the only possible waiting is on the lockless console (polling
until the UART is ready for its next byte). There is nothing else. It
will make progress.

> If we remove per-CPU buffers in NMI. We would need to synchronize
> again printing backtraces from all CPUs. Otherwise they would get
> mixed and hard to read. It might be solved by some prefix and
> sorting in userspace but...

It must have cpu prefixes anyway; the multi-writer thing will equally
mix them together. This is a complete non sequitur.

That current printk stuff is just pure and utter crap. Those NMI buffers
are a trainwreck and need to die a horrible death.

> I agree that this lockless variant is really complicated. I am not
> able to prove that it is race free as it is now. I understand
> the algorithm. But there are too many synchronization points.
> 
> Peter, have you seen my alternative approach, please. See
> https://lore.kernel.org/lkml/20190704103321.10022-1-pmla...@suse.com/
> 
> It uses two tricks:
> 
>1. Two bits in the sequence number are used to track the state
>   of the related data. It allows to implement the entire
>   life cycle of each entry using atomic operation on a single
>   variable.
> 
>2. There is a helper function to read valid data for each entry,
>   see prb_read_desc(). It checks the state before and after
>   reading the data to make sure that they are valid. And
>   it includes the needed read barriers. As a result there
>   are only three explicit barriers in the code. All other
>   are implicitly done by cmpxchg() atomic operations.
> 
> The alternative lockless approach is still more complicated than
> the serialized one. But I think that it is manageable thanks to
> the simplified state tracking. And it might save us some pain
> in the long term.

I've not looked at it yet, sorry. But per the above argument of needing
the CPU serialization _anyway_, I don't see a compelling reason not to
use it.

It is simple, it works. Let's use it.

If you really fancy a multi-writer buffer, you can always switch to one
later, if you can convince someone it actually brings benefits and not
just head-aches.

So I have something roughly like the below; I'm suggesting you add the
line with + on:

  int early_vprintk(const char *fmt, va_list args)
  {
char buf[256]; // teh suck!
int old, n = vscnprintf(buf, sizeof(buf), fmt, args);

old = cpu_lock();
+   printk_buffer_store(buf, n);
early_console->write(early_console, buf, n);
cpu_unlock(old);

return n;
  }

(yes, yes, we can get rid of the on-stack @buf thing with a
reserve+commit API, but who cares :-))
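
For completeness, a sketch of what that reserve+commit variant could look
like (printk_buffer_reserve()/printk_buffer_commit() are invented names,
and this is exactly the format-twice pattern argued against elsewhere in
the thread -- shown only to make the alternative concrete):

  int early_vprintk(const char *fmt, va_list args)
  {
        va_list args2;
        char *text;
        int old, n;

        va_copy(args2, args);
        n = vsnprintf(NULL, 0, fmt, args);              /* first pass: length only */

        old = cpu_lock();
        text = printk_buffer_reserve(n + 1);            /* hypothetical reservation */
        if (text) {
                vscnprintf(text, n + 1, fmt, args2);    /* second pass: straight into the buffer */
                printk_buffer_commit(text);             /* hypothetical commit */
                early_console->write(early_console, text, n);
        }
        cpu_unlock(old);

        va_end(args2);
        return n;
  }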



[*] git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git 
debug/experimental


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-05 Thread Petr Mladek
On Wed 2019-09-04 14:35:31, Peter Zijlstra wrote:
> On Thu, Aug 08, 2019 at 12:32:25AM +0206, John Ogness wrote:
> > Hello,
> > 
> > This is a follow-up RFC on the work to re-implement much of
> > the core of printk. The threads for the previous RFC versions
> > are here: v1[0], v2[1], v3[2].
> > 
> > This series only builds upon v3 (i.e. the first part of this
> > series is exactly v3). The main purpose of this series is to
> > replace the current printk ringbuffer with the new
> > ringbuffer. As was discussed[3], this is a conservative
> > first step to rework printk. For example, all logbuf_lock
> > usage is kept even though the new ringbuffer does not
> > require it. This avoids any side-effect bugs in case the
> > logbuf_lock is (unintentionally) synchronizing more than
> > just the ringbuffer. However, this also means that the
> > series does not bring any improvements, just swapping out
> > implementations. A future patch will remove the logbuf_lock.
> 
> So after reading most of the first patch (and it look _much_ better than
> previous times), I'm left wondering *why* ?!
> 
> That is, why do we need this complexity, as compared to that
> CPU serialized approach?

The serialized approach used a lock. It was re-entrant and thus less
error-prone but still a lock.

The lock was planned to be used not only to access the buffer but also
for eventual locking inside lockless consoles. It might allow
some synchronization even in lockless consoles. But it
would be big-kernel-lock-like in style. It might create yet
another maze of problems.

If we remove per-CPU buffers in NMI, we would need to again synchronize
printing backtraces from all CPUs. Otherwise they would get
mixed and hard to read. It might be solved by some prefix and
sorting in userspace, but...

This is why I asked to see fully lockless code, to see how much
more complicated it was. John told me that he had an early
version of it around.


I agree that this lockless variant is really complicated. I am not
able to prove that it is race free as it is now. I understand
the algorithm. But there are too many synchronization points.

Peter, have you seen my alternative approach, please? See
https://lore.kernel.org/lkml/20190704103321.10022-1-pmla...@suse.com/

It uses two tricks:

   1. Two bits in the sequence number are used to track the state
  of the related data. It allows implementing the entire
  life cycle of each entry using atomic operations on a single
  variable.

   2. There is a helper function to read valid data for each entry,
  see prb_read_desc(). It checks the state before and after
  reading the data to make sure that it is valid. And
  it includes the needed read barriers. As a result there
  are only three explicit barriers in the code. All others
  are implicitly done by cmpxchg() atomic operations.
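
As a purely illustrative sketch of those two tricks (the encoding and all
names are made up here, not the code from the link above):

enum desc_state { DESC_FREE, DESC_RESERVED, DESC_COMMITTED, DESC_REUSABLE };

#define DESC_STATE_BITS 2
#define DESC_STATE_MASK ((1UL << DESC_STATE_BITS) - 1)

/* The low two bits carry the state, the rest carries the sequence number. */
static unsigned long desc_val(unsigned long seq, enum desc_state st)
{
        return (seq << DESC_STATE_BITS) | st;
}

static enum desc_state desc_state(unsigned long val)
{
        return val & DESC_STATE_MASK;
}

/* Writer side: every state transition is a single cmpxchg() on the variable. */
static bool desc_try_commit(atomic_long_t *desc, unsigned long seq)
{
        unsigned long old = desc_val(seq, DESC_RESERVED);

        return atomic_long_cmpxchg(desc, old, desc_val(seq, DESC_COMMITTED)) == old;
}

/* Reader side: the copied data is valid only if the descriptor did not
 * change underneath us (checked before and after, with read barriers).
 */
static bool desc_read_data(atomic_long_t *desc, void *dst, const void *src, size_t n)
{
        unsigned long before = atomic_long_read(desc);

        if (desc_state(before) != DESC_COMMITTED)
                return false;

        smp_rmb();              /* read the state before reading the data */
        memcpy(dst, src, n);    /* copy the (possibly racing) data */
        smp_rmb();              /* read the data before re-reading the state */

        return atomic_long_read(desc) == before;
}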

The alternative lockless approach is still more complicated than
the serialized one. But I think that it is manageable thanks to
the simplified state tracking. And it might save us some pain
in the long term.


> In my book simpler is better here. printk() is an absolute utter slow
> path anyway, nobody cares about the performance much, and I'm thinking
> that it should be plenty fast enough as long as you don't run a
> synchronous serial output (which is exactly what I do do/require
> anyway).

I fully agree.

Best Regards,
Petr


Re: [RFC PATCH v4 0/9] printk: new ringbuffer implementation

2019-09-04 Thread Peter Zijlstra
On Thu, Aug 08, 2019 at 12:32:25AM +0206, John Ogness wrote:
> Hello,
> 
> This is a follow-up RFC on the work to re-implement much of
> the core of printk. The threads for the previous RFC versions
> are here: v1[0], v2[1], v3[2].
> 
> This series only builds upon v3 (i.e. the first part of this
> series is exactly v3). The main purpose of this series is to
> replace the current printk ringbuffer with the new
> ringbuffer. As was discussed[3], this is a conservative
> first step to rework printk. For example, all logbuf_lock
> usage is kept even though the new ringbuffer does not
> require it. This avoids any side-effect bugs in case the
> logbuf_lock is (unintentionally) synchronizing more than
> just the ringbuffer. However, this also means that the
> series does not bring any improvements, just swapping out
> implementations. A future patch will remove the logbuf_lock.

So after reading most of the first patch (and it looks _much_ better than
previous times), I'm left wondering *why* ?!

That is, why do we need this complexity, as compared to that
CPU serialized approach?

What do we hope to gain by doing a multi-writer buffer? Yes, it is
awesome, but from where I'm sitting it is also completely silly, because
we'll want to CPU serialize the serial console anyway (otherwise it gets
to be a completely unreadable mess).

By having the whole thing CPU serialized we lose multi-writer and
consequently the buffer gets to be significantly simpler (as you know;
because ISTR you've actually done this before -- but I cannot find here
why that didn't live).

In my book simpler is better here. printk() is an absolute utter slow
path anyway, nobody cares about the performance much, and I'm thinking
that it should be plenty fast enough as long as you don't run a
synchronous serial output (which is exactly what I do do/require
anyway).

So can we have a few words to explain why we need multi-writer and all
this complexity?