Re: numlist_pop(): Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-09-04 Thread Peter Zijlstra
On Tue, Aug 20, 2019 at 10:15:18AM +0200, Petr Mladek wrote:

>   do {
>   tail_id = atomic_long_read(&nl->tail_id);
> 
>   /*
>* Read might fail when the tail node has been removed
>* and reused in parallel.
>*/
>   if (!numlist_read(nl, tail_id, NULL, &next_id))
>   continue;
> 
>   /* Make sure the node is not the only node on the list. */
>   if (next_id == tail_id)
>   return NULL;
> 
>   /* cC: Make sure the node is not busy. */
>   if (nl->busy(tail_id, nl->busy_arg))
>   return NULL;
> 
>   } while (atomic_long_cmpxchg_relaxed(&nl->tail_id, tail_id, next_id) !=
>   tail_id);

Both you and John should have a look at atomic*_try_cmpxchg*(); with
that you can write the above as:

tail_id = atomic_long_read(&nl->tail_id);
do {
...
} while (!atomic_long_try_cmpxchg_relaxed(&nl->tail_id, &tail_id,
  next_id));

And get better code-gen to boot.


Re: numlist API Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-09-03 Thread Sergey Senozhatsky
On (08/27/19 15:03), Petr Mladek wrote:
[..]
> > IMHO the API is sane. The only bizarre rule is that the numlist must
> > always have at least 1 node. But since the readers are non-consuming,
> > there is no real tragedy here.
> >
> > My goal is not to create some fabulous abstract data structure that
> > everyone should use. But I did try to minimize numlist (and dataring) to
> > only be concerned with clearly defined and minimal responsibilities
> > without imposing unnecessary restrictions on the user.
>
> The API is complicated because of the callbacks. It depends on a logic
> that is implemented externally. It makes it abstract to some extent.
>
> My view is that the API would be much cleaner and easier to review
> when the ID handling is "hardcoded" (helper functions). It could be
> made abstract anytime later when there is another user.

Makes sense.

> There should always be a reason why to make a code more complicated
> than necessary. It seems that the only reason is some theoretical
> future user and its theoretical requirements.

Agreed.

> Symmetry is really important. It is often sign of a good design.
>
> Simple and straightforward code is another important thing at
> this stage. The code is complicated and we need to make sure
> that it works. Any optimizations and generalization might
> be done later when needed.

Agreed.

-ss


dataring API Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-30 Thread Petr Mladek
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
> new file mode 100644
> index ..911bac593ec1
> --- /dev/null
> +++ b/kernel/printk/dataring.c
> + * DOC: dataring overview

I had to spend a lot of time thinking about the dataring API. I started
with some small comments. I ended up with a description of how I see
the API consistency.

I am still not sure that I see it from the right angle. Let's see.

> + * A dataring is a lockless ringbuffer consisting of variable length data
> + * blocks, each of which are assigned an ID. The IDs map to descriptors,
> + * which contain metadata about the data block. The lookup function
> + * mapping IDs to descriptors is implemented by the user.
> + *
> + * Descriptors
> + * ---
> + * A descriptor is a handle to a data block. How descriptors are structured
> + * and mapped to IDs is implemented by the user.

We should add a note that logical IDs are used to avoid ABA problems.

> + * Descriptors contain the begin (begin_lpos) and end (next_lpos) logical
> + * positions of the data block they represent. The end logical position
> + * matches the begin logical position of the adjacent data block.
> + *
> + * Why Descriptors?
> + * 
> + * The data ringbuffer supports variable length entities, which means that
> + * data blocks will not always begin at a predictable offset of the byte
> + * array. This is a major problem for lockless writers that, for example,
> + * will compete to expire and reuse old data blocks when the ringbuffer is
> + * full.
> + * Without a predictable begin for the data blocks, a writer has no reliable
> + * information about the status of the "free" area. Are any flags or state
> + * variables already set or is it just garbage left over from previous usage?
> + *
> + * Descriptors allow safe and controlled access to data block metadata by
> + * providing predictable offsets for such metadata. This is key to supporting
> + * multiple concurrent lockless writers.
> + *
> + * Behavior
> + * 
> + * The data ringbuffer allows writers to commit data without regard for
> + * readers. Readers must pre- and post-validate the data blocks they are
> + * processing to be sure the processed data is consistent. A function
> + * dataring_datablock_isvalid() is available for that. Readers can only
> + * iterate data blocks by utilizing an external implementation using
> + * descriptor lookups based on IDs.
> + *
> + * Writers commit data in two steps:
> + *
> + * (1) Reserve a new data block (dataring_push()).
> + * (2) Commit the data block (dataring_datablock_setid()).
> + *
> + * Once a data block is committed, it is available for recycling by another
> + * writer. Therefore, once committed, a writer must no longer access the data
> + * block.
> + *
> + * If data block reservation fails, it means the oldest reserved data block
> + * has not yet been committed by its writer. This acts as a blocker for any
> + * future data block reservation.

Let's start with something easier. I am not sure with the FIFO interface:

> +bool dataring_pop(struct dataring *dr);
> +char *dataring_push(struct dataring *dr, unsigned int size,
> + struct dr_desc *desc, unsigned long id);

This is the same FIFO interface as in numlist. But the semantic
is different, especially the push() part:

  + numlist_push():
    + adds a new (external) node
    + node is valid when added
    + always succeeds

  + dataring_push():
    + just reserves space in its own data structure;
    + data are written later and not valid until anyone calls
      setid() (commits)
    + might fail when it is unable to remove old (non-committed)
      data with wrong id

The pop() part is similar but the wording is slightly different:

+ numlist_pop()
  + succeeds only when the oldest node is not blocked().
  + blocked state is defined by external callback

+ dataring_pop()
  + succeeds only when the oldest node has correct id (committed)
  + the id is validated using external callback


I am somehow confused by the same names and the differences.
Also dataring_push() does not push any data. It is only making
space.

I believe that the following interface will be much easier
to understand. It is inspired by ringbuffer API. Both APIs
actually have similar usage:

  + dataring_reserve() instead of push()
  + dataring_commit() instead of setid()
  + dataring_remove_oldest() instead of pop()


> +struct dr_datablock *dataring_getdatablock(struct dataring *dr,
> +struct dr_desc *desc, int *size);

please: getdatablock -> get_datablock

On the same note, please, rename getdesc() callback to get_desc().

> +bool dataring_datablock_isvalid(struct dataring *dr, struct dr_desc *desc);

I am always in doubt about what it exactly means, especially whether
the data are valid. I suggest something like:

   dataring_datablock_is_used()  

Re: numlist API Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-29 Thread Petr Mladek
On Wed 2019-08-28 16:03:38, John Ogness wrote:
> On 2019-08-28, Petr Mladek  wrote:
> > I only think that, especially, numlist API is too generic in v4.
> > It is not self-contained. The consistency depends on external barriers.
> >
> > I believe that it might become fully self-contained and consistent
> > if we reduce possibilities of the generic usage. In particular,
> > the numlist should allow only linking of reusable structures
> > stored in an array.
> 
> OK. I will make the numlist the master of the ID-to-node mapping. To
> implement the getdesc() callback of the dataring, the printk_ringbuffer
> can call a numlist mapping function. Also, numlist will need to provide
> a function to bump the descriptor version (as your previous idea already
> showed).

Sounds good.

> I plan to change the array to be numlist nodes. The ID would move into
> the numlist node structure and a void-pointer private would be added so
> that the numlist user can add private data (for printk_ringbuffer that
> would just be a pointer to the dataring structure). When the
> printk_ringbuffer gets a never-used numlist node, it can set the private
> field.

I am not sure that I get the full picture. It would help to see some
snippet of the code (struct declaration).

Anyway, adding void-pointer into struct numlist looks like classic
(userspace?) implementation of dynamically linked structures.

I do not have strong opinion. But I would prefer to stay with
the kernel-style. I mean that the numlist structure is part
of the linked structures. And container_of() is eventually
used to get pointer to the upper structure. Also passing values
via a pointer with generic name (data) slightly complicates the code.

I know that numlist is special because id is used to get the pointer
of the numlist and also the upper structure. Anyway, the kernel
style looks more familiar to me in the kernel context.
But I'll leave it up to you.


> This has the added benefit of making it easy to detect accidental
> never-used descriptor usage when reading dataring garbage. This was
> non-trivial and I'm still not sure I solved it correctly. (I've already
> spent a week working on a definitive answer to your email[0] asking
> about this.)

Would a check for NULL data pointer help here? Well, it might be
motivation for using the pointer.

I wonder if begin_lpos == end_lpos check might be used to detect
never used descriptors. It is already used in another situation
in dataring_desc_init(). IMHO, the values even do not need to be
unaligned. But I might miss something.

Best Regards,
Petr


Re: numlist API Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-28 Thread John Ogness
On 2019-08-28, Petr Mladek  wrote:
> I only think that, especially, numlist API is too generic in v4.
> It is not self-contained. The consistency depends on external barriers.
>
> I believe that it might become fully self-contained and consistent
> if we reduce possibilities of the generic usage. In particular,
> the numlist should allow only linking of reusable structures
> stored in an array.

OK. I will make the numlist the master of the ID-to-node mapping. To
implement the getdesc() callback of the dataring, the printk_ringbuffer
can call a numlist mapping function. Also, numlist will need to provide
a function to bump the descriptor version (as your previous idea already
showed).

I plan to change the array to be numlist nodes. The ID would move into
the numlist node structure and a void-pointer private would be added so
that the numlist user can add private data (for printk_ringbuffer that
would just be a pointer to the dataring structure). When the
printk_ringbuffer gets a never-used numlist node, it can set the private
field.

This has the added benefit of making it easy to detect accidental
never-used descriptor usage when reading dataring garbage. This was
non-trivial and I'm still not sure I solved it correctly. (I've already
spent a week working on a definitive answer to your email[0] asking
about this.)

John Ogness

[0] https://lkml.kernel.org/r/20190820151239.yzdqz56yeldlk...@pathway.suse.cz


Re: dataring_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-28 Thread John Ogness
On 2019-08-27, Petr Mladek  wrote:
 +/**
 + * dataring_push() - Reserve a data block in the data array.
 + *
 + * @dr:   The data ringbuffer to reserve data in.
 + *
 + * @size: The size to reserve.
 + *
 + * @desc: A pointer to a descriptor to store the data block information.
 + *
 + * @id:   The ID of the descriptor to be associated.
 + *The data block will not be set with @id, but rather initialized
 + *with a value that is explicitly different than @id. This is to
 + *handle the case when newly available garbage by chance matches
 + *the descriptor ID.
 + *
 + * This function expects to move the head pointer forward. If this would
 + * result in overtaking the data array index of the tail, the tail data
 + * block will be invalidated.
 + *
 + * Return: A pointer to the reserved writer data, otherwise NULL.
 + *
 + * This will only fail if it was not possible to invalidate the tail data
 + * block.
 + */
 +char *dataring_push(struct dataring *dr, unsigned int size,
 +  struct dr_desc *desc, unsigned long id)
 +{
 +  unsigned long begin_lpos;
 +  unsigned long next_lpos;
 +  struct dr_datablock *db;
 +  bool ret;
 +
 +  to_db_size(&size);
 +
 +  do {
 +  /* fA: */
 +  ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
 +
 +  /*
 +   * fB:
 +   *
 +   * The data ringbuffer tail may have been pushed (by this or
 +   * any other task). The updated @tail_lpos must be visible to
 +   * all observers before changes to @begin_lpos, @next_lpos, or
 +   * @head_lpos by this task are visible in order to allow other
 +   * tasks to recognize the invalidation of the data
 +   * blocks.
>>>
>>> This sounds strange. The write barrier should be done only on CPU
>>> that really modified tail_lpos. I.e. it should be in _dataring_pop()
>>> after successful dr->tail_lpos modification.
>> 
>> The problem is that there are no data dependencies between the different
>> variables. When a new datablock is being reserved, it is critical that
>> all other observers see that the tail_lpos moved forward _before_ any
>> other changes. _dataring_pop() uses an smp_rmb() to synchronize for
>> tail_lpos movement.
>
> It should be symmetric. It makes sense that _dataring_pop() uses an
> smp_rmb(). Then there should be wmb() in dataring_push().

dataring_pop() is adjusting the tail. dataring_push() is adjusting the
head. These operations are handled (ordered) separately. They do not
need to be happening in lockstep. They don't need to be happening on the
same CPU.

> The wmb() should be done only by the CPU that actually did the write.
> And it should be done after the write. This is why I suggested to
> do it after cmpxchg(dr->head_lpos).

If CPU0 issues an smp_wmb() after moving the tail and (after seeing the
moved tail) CPU1 issues an smp_wmb() after updating the head, it is
still possible for CPU2 to see the head move (and possibly even overtake
the tail) before seeing the tail move.

If a CPU didn't move the tail but _will_ move the head, only a full
memory barrier will allow _all_ observers to see the tail move before
seeing the head move.

>> This CPU is about to make some changes and may have seen an updated
>> tail_lpos. An smp_wmb() is useless if this is not the CPU that
>> performed that update. The full memory barrier ensures that all other
>> observers will see what this CPU sees before any of its future
>> changes are seen.
>
> I do not understand it. Full memory barrier will not cause that all
> CPUs will see the same.

I did not write that. I wrote (emphasis added):

The full memory barrier ensures that all other observers will see
what _this_ CPU sees before any of _its_ future changes are seen.

> These barriers need to be symmetric.

They are. The comments for fB list the pairs (all being
smp_mb()/smp_rmb() pairings).

> Back to our situation:
>
> + rmb() should not be needed here because get_new_lpos() provided
>   a valid lpos.
>
> + wmb() is not needed because we have not written anything yet
>
> If there was a race with another CPU than cmpxchg(dr->head_lpos)
> would fail and we will need to repeat everything again.

It's not about racing to update the head. It's about making sure that
_all_ CPUs observe that a datablock was invalidated _before_ observing
that _this_ CPU started modifying other shared variables. And again,
this CPU might _not_ be the one that invalidated the datablock
(i.e. moved the tail).

John Ogness


Re: numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-28 Thread John Ogness
On 2019-08-27, Petr Mladek  wrote:
>> --- /dev/null
>> +++ b/kernel/printk/numlist.c
>> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
>> +{
>> > [...]
>> +
>> +/* bB: #1 */
>> +head_id = atomic_long_read(&nl->head_id);
>> +
>> +for (;;) {
>> +/* bC: */
>> +while (!numlist_read(nl, head_id, &seq, NULL)) {
>> +/*
>> + * @head_id is invalid. Try again with an
>> + * updated value.
>> + */
 
 But there is still an issue that this could spin hard if the head
 was recycled and this CPU does not yet see the new head value.
>>>
>>> I do not understand this. The head could get reused only after
>>> head_id was replaced with the following valid node.
>>> The next cycle is done with a new id that should be valid.
>>>
>>> Of course, the new ID might get reused as well. But then we just
>>> repeat the cycle. We have to be able to find a valid head after
>>> few cycles. The last valid ID could not get reused because nodes
>>> can be removed only if it was not the last valid node.
>> 
>> nl->head_id is read using a relaxed read.
>
> I wonder if the "relaxed read" causes the confusion. Could it read
> the old id even when numlist_read() for this id failed?

Yes. It is possible that the new head ID is not yet visible. That is
what the litmus test shows.

> If this is true then it should not be relaxed read.

That is essentially what the new change facilitates. By using an ACQUIRE
to load the descriptor ID, this CPU sees what the CPU that changed the
descriptor ID saw. And that CPU must have seen a different head ID
because a successful numlist_pop() means the popped node's @next_id was
verified that it isn't a terminator, and seeing @next_id means the new
head ID from the CPU that set @next_id is also visible.

Now this still isn't a guarantee that the head ID hasn't changed since
then. So the CPU still may loop. But the CPU is guaranteed to make
forward progress with each new head ID it sees.

>> A second CPU may have added new nodes and removed/recycled
>> the node with the ID that the first CPU read as the head.
>
> This sounds like ABA problem. My understanding is that we
> use ID to prevent these problems and could ignore them.

Yes, it is an ABA problem. And yes, because of IDs, it is prevented via
detection and numlist_read() failing. The ABA problem is not ignored, it
is handled.

[snipped vague litmus test]

> I think that the Litmus test does not describe the code.

OK, at the end is a new litmus test with extensive annotation so that
you see exactly how it is modelling the code. I have also increased the
concurrency, splitting the push/pop to 2 different CPUs. And to be fair,
I left in general memory barriers (smp_rmb/smp_wmb) that, although
unrelated, do exist in the code paths as well.

All annotation/code are based on RFCv4.

> If it does then we need to fix the algorithm or barriers.

Indeed, I've fixed the barriers for the next version.

> My understanding is that only valid nodes are added to the list.

Correct.

> If a node read via head_id is not valid then head_id already
> points to another valid node. Am I wrong, please?

You are correct.

John Ogness


C numlist_push_loop

(*
 * Result: Never
 *
 * Check numlist_push() loop to re-read the head,
 * if it is possible that a new head ID is not visible.
 *)

{
int node1 = 1;
int node1_next = 1;
int node2 = 2;
int *numlist_head = &node1;
int *numlist_tail = &node1;
}

/*
 * This CPU wants to push a new node (not shown) to the numlist but another
 * CPU pushed a node after this CPU had read the head ID, and yet another
 * CPU pops/recycles the node that this CPU first saw as the head.
 */
P0(int **numlist_head)
{
int *head;
int id;

/*
 * Read the head ID.
 *
 * numlist_push() numlist.c:215
 *
 * head_id = atomic_long_read(&nl->head_id);
 */
head = READ_ONCE(*numlist_head);

/*
 * Read/validate the descriptor ID.
 *
 * NOTE: To guarantee seeing a new head ID, this should be:
 *   id = smp_load_acquire(head);
 *
 * numlist_push() numlist.c:219
 *   numlist_read() numlist.c:116
 * prb_desc_node() ringbuffer.c:220-223
 *
 * struct prb_desc *d = to_desc(arg, id);
 * if (id != atomic_long_read(&d->id))
 * return NULL;
 */
id = READ_ONCE(*head);

/*
 * Re-read the head ID when validation failed.
 *
 * numlist_push() numlist.c:228
 *
 * head_id = atomic_long_read(&nl->head_id);
 */
head = READ_ONCE(*numlist_head);
}

/* This CPU pushes a new node (node2) to the numlist. */
P1(int **numlist_head, int *node1, int *node2, int 

Re: numlist API Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-28 Thread Petr Mladek
On Wed 2019-08-28 09:13:39, John Ogness wrote:
> On 2019-08-27, Petr Mladek  wrote:
> > The API is complicated because of the callbacks. It depends on a logic
> > that is implemented externally. It makes it abstract to some extent.
> >
> > My view is that the API would be much cleaner and easier to review
> > when the ID handling is "hardcoded" (helper functions). It could be
> > made abstract anytime later when there is another user.
> >
> > There should always be a reason why to make a code more complicated
> > than necessary. It seems that the only reason is some theoretical
> > future user and its theoretical requirements.
> 
> FWIW, I did _not_ create the numlist and dataring structures in order to
> support some theoretical future user. PeterZ helped[0] me realize that
> RFCv2 was actually using multiple internal data structures. Each of
> these internal data structures has their own set of memory barriers and
> semantics. By explicitly refactoring them behind strong APIs, the memory
> barriers could be clearly visible and the semantics clearly defined.
> 
> For me this was a great help in _simplifying_ the design. For me it also
> greatly simplified debugging, testing, and verifying because I could
> write tests for numlist and datalist that explicitly targeted those data
> structures. Once I believed they were bullet-proof, I could move on to
> higher-level tests of the printk_ringbuffer. And once I believed the
> printk_ringbuffer was bullet-proof, I could move on to the higher-level
> printk tests. When a problem was found, I could effectively isolate
> which component failed their job.
> 
> I understand that we disagree about the abstractions being a
> simplification.

This is a misunderstanding. I probably was not clear enough. It makes
perfect sense to have separate APIs for numlist and dataring. I agree
that they allow to split the problem into smaller pieces.

I only think that, especially, numlist API is too generic in v4.
It is not self-contained. The consistency depends on external barriers.

I believe that it might become fully self-contained and consistent
if we reduce possibilities of the generic usage. In particular,
the numlist should allow only linking of reusable structures
stored in an array.

I explained in the previous mail that other use cases are
questionable. If anyone really finds another usecase,
the API might be made more generic. But we should start
with something simple.


> And I'm not sure how to proceed in this regard. (Maybe
> once we get everything bullet-proof, we can put everything back together
> into a monolith like RFCv2.)

I would actually go the other way. It would be nice to add numlist
and dataring API in separate patches.

> Either way, please understand that the
> abstractions were done for the benefit of printk_ringbuffer, not for any
> theoretical future user.

I understand. I hope that it is more clear now.

Best Regards,
Petr


Re: numlist API Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-28 Thread John Ogness
On 2019-08-27, Petr Mladek  wrote:
> The API is complicated because of the callbacks. It depends on a logic
> that is implemented externally. It makes it abstract to some extent.
>
> My view is that the API would be much cleaner and easier to review
> when the ID handling is "hardcoded" (helper functions). It could be
> made abstract anytime later when there is another user.
>
> There should always be a reason why to make a code more complicated
> than necessary. It seems that the only reason is some theoretical
> future user and its theoretical requirements.

FWIW, I did _not_ create the numlist and dataring structures in order to
support some theoretical future user. PeterZ helped[0] me realize that
RFCv2 was actually using multiple internal data structures. Each of
these internal data structures has their own set of memory barriers and
semantics. By explicitly refactoring them behind strong APIs, the memory
barriers could be clearly visible and the semantics clearly defined.

For me this was a great help in _simplifying_ the design. For me it also
greatly simplified debugging, testing, and verifying because I could
write tests for numlist and datalist that explicitly targeted those data
structures. Once I believed they were bullet-proof, I could move on to
higher-level tests of the printk_ringbuffer. And once I believed the
printk_ringbuffer was bullet-proof, I could move on to the higher-level
printk tests. When a problem was found, I could effectively isolate
which component failed their job.

I understand that we disagree about the abstractions being a
simplification. And I'm not sure how to proceed in this regard. (Maybe
once we get everything bullet-proof, we can put everything back together
into a monolith like RFCv2.) Either way, please understand that the
abstractions were done for the benefit of printk_ringbuffer, not for any
theoretical future user.

John Ogness

[0] https://lkml.kernel.org/r/20190628154435.gz7...@worktop.programming.kicks-ass.net


Re: numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-27 Thread Petr Mladek
On Tue 2019-08-27 16:28:55, John Ogness wrote:
> On 2019-08-27, Petr Mladek  wrote:
> >> On 2019-08-23, Petr Mladek  wrote:
>  --- /dev/null
>  +++ b/kernel/printk/numlist.c
>  +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
>  +{
> > [...]
>  +
>  +/* bB: #1 */
>  +head_id = atomic_long_read(&nl->head_id);
>  +
>  +for (;;) {
>  +/* bC: */
>  +while (!numlist_read(nl, head_id, &seq, NULL)) {
>  +/*
>  + * @head_id is invalid. Try again with an
>  + * updated value.
>  + */
>  +
>  +cpu_relax();
> >>>
> >>> I have got very confused by this. cpu_relax() suggests that this
> >>> cycle is busy waiting until a particular node becomes valid.
> >>> My first though was that it must cause deadlock in NMI when
> >>> the interrupted code is supposed to make the node valid.
> >>>
> >>> But it is the other way. The head is always valid when it is
> >>> added to the list. It might become invalid when another CPU
> >>> moves the head and the old one gets reused.
> >>>
> >>> Anyway, I do not see any reason for cpu_relax() here.
> >> 
> >> You are correct. The cpu_relax() should not be there. But there is
> >> still an issue that this could spin hard if the head was recycled and
> >> this CPU does not yet see the new head value.
> >
> > I do not understand this. The head could get reused only after
> > head_id was replaced with the following valid node.
> > The next cycle is done with a new id that should be valid.
> >
> > Of course, the new ID might get reused as well. But then we just
> > repeat the cycle. We have to be able to find a valid head after
> > few cycles. The last valid ID could not get reused because nodes
> > can be removed only if it was not the last valid node.
> 
> Sorry, I was not very precise with my language. I will try again...
> 
> nl->head_id is read using a relaxed read.

I wonder if the "relaxed read" causes the confusion. Could it read
the old id even when numlist_read() for this id failed?

If this is true then it should not be relaxed read.


> A second CPU may have added new nodes and removed/recycled
> the node with the ID that the first CPU read as the head.

This sounds like ABA problem. My understanding is that we
use ID to prevent these problems and could ignore them.


> As a result, the first CPU's numlist_read() will (correctly) fail. If
> numlist_read() failed in the first node() callback within numlist_read()
> (i.e. it sees that the node already has a new ID), there is no guarantee
> that rereading the head ID will provide a new ID. At some point the
> memory system would make the new head ID visible, but there could be
> some heavy spinning until that happens.
>
> Here is a litmus test showing the problem (using comments and verbose
> variable names):
> 
> C numlist_push_loop
> 
> {
>   int node1 = 1;
>   int node2 = 2;
>   int *numlist_head = &node1;
> }
> 
> P0(int **numlist_head)
> {
>   int *head;
>   int id;
> 
>   // read head ID
>   head = READ_ONCE(*numlist_head);
> 
>   // read head node ID
>   id = READ_ONCE(*head);
> 
>   // re-read head ID when node ID is unexpected
>   head = READ_ONCE(*numlist_head);
> }
> 
> P1(int **numlist_head, int *node1, int *node2)
> {
>   int *r0;
> 
>   // push node2
>   r0 = cmpxchg_release(numlist_head, node1, node2);
> 
>   // pop node1, reassigning a new ID
>   smp_store_release(node1, 3);
> }

I think that the Litmus test does not describe the code.
If it does then we need to fix the algorithm or barriers.

> The results show that P0 sees the head is node1 but also sees that
> node1's ID has changed. (And if node1's ID changed, it means P1 had
> previously replaced the head.) If P0 ran in a while-loop, at some point
> it _would_ see that node2 is now the head. But that is wasteful spinning
> and may possibly have negative influence on the memory system.

My understanding is that only valid nodes are added to the list.

If a node read via head_id is not valid then head_id already
points to another valid node. Am I wrong, please?

Best Regards,
Petr


Re: dataring_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-27 Thread Petr Mladek
On Sun 2019-08-25 04:42:37, John Ogness wrote:
> On 2019-08-20, Petr Mladek  wrote:
> >> +/**
> >> + * dataring_push() - Reserve a data block in the data array.
> >> + *
> >> + * @dr:   The data ringbuffer to reserve data in.
> >> + *
> >> + * @size: The size to reserve.
> >> + *
> >> + * @desc: A pointer to a descriptor to store the data block information.
> >> + *
> >> + * @id:   The ID of the descriptor to be associated.
> >> + *The data block will not be set with @id, but rather initialized
> >> + *with a value that is explicitly different than @id. This is to
> >> + *handle the case when newly available garbage by chance matches
> >> + *the descriptor ID.
> >> + *
> >> + * This function expects to move the head pointer forward. If this would
> >> + * result in overtaking the data array index of the tail, the tail data
> >> + * block will be invalidated.
> >> + *
> >> + * Return: A pointer to the reserved writer data, otherwise NULL.
> >> + *
> >> + * This will only fail if it was not possible to invalidate the tail data
> >> + * block.
> >> + */
> >> +char *dataring_push(struct dataring *dr, unsigned int size,
> >> +  struct dr_desc *desc, unsigned long id)
> >> +{
> >> +  unsigned long begin_lpos;
> >> +  unsigned long next_lpos;
> >> +  struct dr_datablock *db;
> >> +  bool ret;
> >> +
> >> +  to_db_size(&size);
> >> +
> >> +  do {
> >> +  /* fA: */
> >> +  ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
> >> +
> >> +  /*
> >> +   * fB:
> >> +   *
> >> +   * The data ringbuffer tail may have been pushed (by this or
> >> +   * any other task). The updated @tail_lpos must be visible to
> >> +   * all observers before changes to @begin_lpos, @next_lpos, or
> >> +   * @head_lpos by this task are visible in order to allow other
> >> +   * tasks to recognize the invalidation of the data
> >> +   * blocks.
> >
> > This sounds strange. The write barrier should be done only on the CPU
> > that really modified tail_lpos. I.e. it should be in _dataring_pop()
> > after a successful dr->tail_lpos modification.
> 
> The problem is that there are no data dependencies between the different
> variables. When a new datablock is being reserved, it is critical that
> all other observers see that the tail_lpos moved forward _before_ any
> other changes. _dataring_pop() uses an smp_rmb() to synchronize for
> tail_lpos movement.

It should be symmetric. It makes sense that _dataring_pop() uses an
smp_rmb(). Then there should be a wmb() in dataring_push().

The wmb() should be done only by the CPU that actually did the write.
And it should be done after the write. This is why I suggested to
do it after cmpxchg(dr->head_lpos).

> This CPU is about to make some changes and may have
> seen an updated tail_lpos. An smp_wmb() is useless if this is not the
> CPU that performed that update. The full memory barrier ensures that all
> other observers will see what this CPU sees before any of its future
> changes are seen.

I do not understand it. A full memory barrier does not guarantee that
all CPUs will see the same values.

My understanding of barriers is:

   + wmb() is needed after some value is modified so that any
 following modifications become visible only later.

   + rmb() is needed when a value has to be read before the other
 values are read.

These barriers need to be symmetric. The reader will see the values
in the right order only when both the writer and the reader use
the right barriers.

   + mb() (a full barrier) is needed around some critical section
 to make sure that all operations happen inside the section


Back to our situation:

+ rmb() should not be needed here because get_new_lpos() provided
  a valid lpos.

  It is possible that get_new_lpos() used rmb() to make sure
  that there was enough space. But such rmb() would be
  between reading dr->tail_lpos and dr->head_lpos. No
  other rmb() is needed once the check passed.

+ wmb() is not needed because we have not written anything yet

If there was a race with another CPU, then cmpxchg(dr->head_lpos)
would fail and we would need to repeat everything again.

> >> +  /* fE: */
> >> +  } while (atomic_long_cmpxchg_relaxed(&dr->head_lpos, begin_lpos,
> >> +   next_lpos) != begin_lpos);
> >> +
> >
> > We need a write barrier here to make sure that dr->head_lpos
> > is updated before we start updating other values, e.g.
> > db->id below.
> 
> My RFCv2 implemented it that way. The function was called data_reserve()
> and it moved the head using cmpxchg_release(). For RFCv3 I changed to a
> full memory barrier instead because using acquire/release here is a bit
> messy. There are 2 different places where the acquire needed to be:
> 
> - In _dataring_pop() a load_acquire() of head_lpos would need to be
>   _before_ loading of begin_lpos and next_lpos.
> 
> - In 

Re: numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-27 Thread John Ogness
On 2019-08-27, Petr Mladek  wrote:
>> On 2019-08-23, Petr Mladek  wrote:
 --- /dev/null
 +++ b/kernel/printk/numlist.c
 +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
 +{
> [...]
 +
 +  /* bB: #1 */
 +  head_id = atomic_long_read(&nl->head_id);
 +
 +  for (;;) {
 +  /* bC: */
 +  while (!numlist_read(nl, head_id, &seq, NULL)) {
 +  /*
 +   * @head_id is invalid. Try again with an
 +   * updated value.
 +   */
 +
 +  cpu_relax();
>>>
>>> I have got very confused by this. cpu_relax() suggests that this
>>> cycle is busy waiting until a particular node becomes valid.
>>> My first though was that it must cause deadlock in NMI when
>>> the interrupted code is supposed to make the node valid.
>>>
>>> But it is the other way. The head is always valid when it is
>>> added to the list. It might become invalid when another CPU
>>> moves the head and the old one gets reused.
>>>
>>> Anyway, I do not see any reason for cpu_relax() here.
>> 
>> You are correct. The cpu_relax() should not be there. But there is
>> still an issue that this could spin hard if the head was recycled and
>> this CPU does not yet see the new head value.
>
> I do not understand this. The head could get reused only after
> head_id was replaced with the following valid node.
> The next cycle is done with a new id that should be valid.
>
> Of course, the new ID might get reused as well. But then we just
> repeat the cycle. We have to be able to find a valid head after a
> few cycles. The last valid ID could not get reused because nodes
> can be removed only if it was not the last valid node.

Sorry, I was not very precise with my language. I will try again...

nl->head_id is read using a relaxed read. A second CPU may have added
new nodes and removed/recycled the node with the ID that the first CPU
read as the head.

As a result, the first CPU's numlist_read() will (correctly) fail. If
numlist_read() failed in the first node() callback within numlist_read()
(i.e. it sees that the node already has a new ID), there is no guarantee
that rereading the head ID will provide a new ID. At some point the
memory system would make the new head ID visible, but there could be
some heavy spinning until that happens.

Here is a litmus test showing the problem (using comments and verbose
variable names):

C numlist_push_loop

{
int node1 = 1;
int node2 = 2;
int *numlist_head = &node1;
}

P0(int **numlist_head)
{
int *head;
int id;

// read head ID
head = READ_ONCE(*numlist_head);

// read head node ID
id = READ_ONCE(*head);

// re-read head ID when node ID is unexpected
head = READ_ONCE(*numlist_head);
}

P1(int **numlist_head, int *node1, int *node2)
{
int *r0;

// push node2
r0 = cmpxchg_release(numlist_head, node1, node2);

// pop node1, reassigning a new ID
smp_store_release(node1, 3);
}

exists (0:head=node1 /\ 0:id=3)

$ herd7 -conf linux-kernel.cfg numlist_push_loop.litmus 
Test numlist_push_loop Allowed
States 5
0:head=node1; 0:id=1;
0:head=node1; 0:id=3;
0:head=node2; 0:id=1;
0:head=node2; 0:id=2;
0:head=node2; 0:id=3;
Ok
Witnesses
Positive: 1 Negative: 4
Condition exists (0:head=node1 /\ 0:id=3)
Observation numlist_push_loop Sometimes 1 4
Time numlist_push_loop 0.01
Hash=27b10efb171ab4cf390bd612a9e79bf0

The results show that P0 sees the head is node1 but also sees that
node1's ID has changed. (And if node1's ID changed, it means P1 had
previously replaced the head.) If P0 ran in a while-loop, at some point
it _would_ see that node2 is now the head. But that is wasteful spinning
and may possibly have negative influence on the memory system.

>> To handle that, and in preparation for my next version, I'm now using
>> a read_acquire() to load the ID in the node() callback (matching the
>> set_release() in assign_desc()). This ensures that if numlist_read()
>> fails, the new head will be visible.
>
> I do not understand this either. The above paragraph seems to
> describe a race. I do not see how it could cause an infinite loop.

It isn't an infinite loop. It is burning some/many CPU cycles.

By changing P0's ID read to:

id = smp_load_acquire(head);

the results change to:

$ herd7 -conf linux-kernel.cfg numlist_push_loop.litmus 
Test numlist_push_loop Allowed
States 4
0:head=node1; 0:id=1;
0:head=node2; 0:id=1;
0:head=node2; 0:id=2;
0:head=node2; 0:id=3;
No
Witnesses
Positive: 0 Negative: 4
Condition exists (0:head=node1 /\ 0:id=3)
Observation numlist_push_loop Never 0 4
Time numlist_push_loop 0.01
Hash=3eb63ea3bec59f8941f61faddb5499da

Meaning that if a new ID is seen, a new head ID is also visible.

Loading the ID is what the node() callback does, and the ACQUIRE pairs
with the set_release() in assign_desc(). Both in ringbuffer.c.

John Ogness


Re: numlist API Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-27 Thread Petr Mladek
On Tue 2019-08-27 01:57:39, John Ogness wrote:
> On 2019-08-23, Petr Mladek  wrote:
> >> --- /dev/null
> >> +++ b/kernel/printk/numlist.c
> >> @@ -0,0 +1,375 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +
> >> +#include 
> >> +#include "numlist.h"
> >
> > struct numlist is really a special variant of a list. Let me
> > do a short summary:
> >
> >+ FIFO queue interface
> >
> >+ nodes sequentially numbered
> >
> >+ nodes referenced by ID instead of pointers to avoid ABA problems
> >  + requires custom node() callback to get pointer for given ID
> >
> >+ lockless access:
> >  + pushed nodes must no longer get modified by the push() caller
> >  + pop() caller gets exclusive write access, except that they
> >must modify ID first and do smp_wmb() later
> 
> Only if the "numlist user" decides to recycle descriptors (which the
> printk_ringbuffer does) is ID modification of descriptors necessary. How
> that is synchronized with readers is up to the user (for example,
> whether a RELEASE or an smp_wmb() is used).

IMHO, the most tricky part of numlist API is handling of IDs.
The IDs are there to avoid ABA races when reusing the nodes.

I want to say that this API is useful only when the nodes are reused.
All other users would want something easier.

> >+ pop() does not work:
> >  + tail node is "busy"
> > + needs a custom callback that defines when a node is busy
> 
> Note that busy() could always return false if the user has no concept of
> nodes that should not be popped.

This would be an append-only list. Again there is no need for IDs
and other complexities.


> >  + tail is the last node
> > + needed for lockless sequential numbering
> >
> > I will start with one inevitable question ;-) Is it realistic to find
> > another user for this API, please?
> 
> If someone needs a FIFO queue that supports:
> 
> 1. multiple concurrent writers and multiple concurrent non-consuming
>readers
> 
> 2. where readers are allowed to miss nodes but are able to detect how
>many were missed
> 
> 3. from any context (including NMI)
>
> then I know of no other data structure available. (Otherwise I would
> have used it!)

It might also be because nobody else needed a structure with
our numlist semantics. I guess that lockless read/write
structures are usually implemented using RCU.


> > I am not sure that all the indirections, caused by the generic API,
> > are worth the gain.
>
> IMHO the API is sane. The only bizarre rule is that the numlist must
> always have at least 1 node. But since the readers are non-consuming,
> there is no real tragedy here.
>
> My goal is not to create some fabulous abstract data structure that
> everyone should use. But I did try to minimize numlist (and dataring) to
> only be concerned with clearly defined and minimal responsibilities
> without imposing unnecessary restrictions on the user.

The API is complicated because of the callbacks. It depends on logic
that is implemented externally. That makes it abstract to some extent.

My view is that the API would be much cleaner and easier to review
when the ID handling is "hardcoded" (helper functions). It could be
made abstract anytime later when there is another user.

There should always be a reason why to make a code more complicated
than necessary. It seems that the only reason is some theoretical
future user and its theoretical requirements.


> > Well, the separate API makes sense anyway. I have some ideas that
> > might make it cleaner.
> 
> [snipped the nice refactoring of the ID into the nl_node]
> 
> Your idea (along with previous discussions) convinced me of the
> importance of moving the ID-related barriers into the same
> file. However, rather than pushing the ID parts into the numlist, I will
> be moving them all into the "numlist user"
> (i.e. printk_ringbuffer). Your use of the ACQUIRE to load the ID made me
> realize that I need to be doing that as well! (but in the node()
> callback)
> 
> The reasons why I do not want the ID in nl_node is:
> 
> - The numlist would need to implement the ID-to-node mapping. For the
>   printk_ringbuffer that mapping is simply masking to an index within an
>   array. But why should a numlist user be forced to do it that way? I
>   see no advantage to restricting numlists to being arrays of nodes.

It might be done in a generic way when there is a user with another need.

Honestly, I have a hard time imagining another reasonable mapping
between an ID and a pointer other than masking. We are talking about lockless
code. Anything more complicated might become a nightmare.


> - The dataring structure also uses IDs and requires an ID-to-node
>   mapping. I do not want to bind the dataring and numlist data
>   structures together at this level because they really have nothing to
>   do with each other. Having the dataring and numlist ID-to-node
>   mappings (and their barriers) in the same place (in the
>   numlist/dataring _user_) simplifies the big picture.

The ID 

Re: numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-27 Thread Petr Mladek
On Tue 2019-08-27 00:36:18, John Ogness wrote:
> Hi Petr,
> 
> AndreaP responded with some explanation (and great links!) on the topic
> of READ_ONCE. But I feel like your comments about the WRITE_ONCE were
> not addressed. I address that (and your other comments) below...
> 
> On 2019-08-23, Petr Mladek  wrote:
> >> --- /dev/null
> >> +++ b/kernel/printk/numlist.c
> >> +/**
> >> + * numlist_push() - Add a node to the list and assign it a sequence 
> >> number.
> >> + *
> >> + * @nl: The numbered list to push to.
> >> + *
> >> + * @n:  A node to push to the numbered list.
> >> + *  The node must not already be part of a list.
> >> + *
> >> + * @id: The ID of the node.
> >> + *
> >> + * A node is added in two steps: The first step is to make this node the
> >> + * head, which causes a following push to add to this node. The second 
> >> step is
> >> + * to update @next_id of the former head node to point to this one, which
> >> + * makes this node visible to any task that sees the former head node.
> >> + */
> >> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
> >> +{
[...]
> >> +
> >> +  /* bB: #1 */
> >> +  head_id = atomic_long_read(&nl->head_id);
> >> +
> >> +  for (;;) {
> >> +  /* bC: */
> >> +  while (!numlist_read(nl, head_id, &seq, NULL)) {
> >> +  /*
> >> +   * @head_id is invalid. Try again with an
> >> +   * updated value.
> >> +   */
> >> +
> >> +  cpu_relax();
> >
> > I have got very confused by this. cpu_relax() suggests that this
> > cycle is busy waiting until a particular node becomes valid.
> > My first though was that it must cause deadlock in NMI when
> > the interrupted code is supposed to make the node valid.
> >
> > But it is the other way. The head is always valid when it is
> > added to the list. It might become invalid when another CPU
> > moves the head and the old one gets reused.
> >
> > Anyway, I do not see any reason for cpu_relax() here.
> 
> You are correct. The cpu_relax() should not be there. But there is still
> an issue that this could spin hard if the head was recycled and this CPU
> does not yet see the new head value.

I do not understand this. The head could get reused only after
head_id was replaced with the following valid node.
The next cycle is done with a new id that should be valid.

Of course, the new ID might get reused as well. But then we just
repeat the cycle. We have to be able to find a valid head after a
few cycles. The last valid ID could not get reused because nodes
can be removed only if it was not the last valid node.


> To handle that, and in preparation for my next version, I'm now using a
> read_acquire() to load the ID in the node() callback (matching the
> set_release() in assign_desc()). This ensures that if numlist_read()
> fails, the new head will be visible.

I do not understand this either. The above paragraph seems to
describe a race. I do not see how it could cause an infinite loop.

Best Regards,
Petr


Re: numlist API Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-26 Thread John Ogness
On 2019-08-23, Petr Mladek  wrote:
>> --- /dev/null
>> +++ b/kernel/printk/numlist.c
>> @@ -0,0 +1,375 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include 
>> +#include "numlist.h"
>
> struct numlist is really a special variant of a list. Let me
> do a short summary:
>
>+ FIFO queue interface
>
>+ nodes sequentially numbered
>
>+ nodes referenced by ID instead of pointers to avoid ABA problems
>  + requires custom node() callback to get pointer for given ID
>
>+ lockless access:
>  + pushed nodes must no longer get modified by the push() caller
>  + pop() caller gets exclusive write access, except that they
>must modify ID first and do smp_wmb() later

Only if the "numlist user" decides to recycle descriptors (which the
printk_ringbuffer does) is ID modification of descriptors necessary. How
that is synchronized with readers is up to the user (for example,
whether a RELEASE or an smp_wmb() is used).

>+ pop() does not work:
>  + tail node is "busy"
>   + needs a custom callback that defines when a node is busy

Note that busy() could always return false if the user has no concept of
nodes that should not be popped.

>  + tail is the last node
>   + needed for lockless sequential numbering
>
> I will start with one inevitable question ;-) Is it realistic to find
> another user for this API, please?

If someone needs a FIFO queue that supports:

1. multiple concurrent writers and multiple concurrent non-consuming
   readers

2. where readers are allowed to miss nodes but are able to detect how
   many were missed

3. from any context (including NMI)

then I know of no other data structure available. (Otherwise I would
have used it!)

> I am not sure that all the indirections, caused by the generic API,
> are worth the gain.

IMHO the API is sane. The only bizarre rule is that the numlist must
always have at least 1 node. But since the readers are non-consuming,
there is no real tragedy here.

My goal is not to create some fabulous abstract data structure that
everyone should use. But I did try to minimize numlist (and dataring) to
only be concerned with clearly defined and minimal responsibilities
without imposing unnecessary restrictions on the user.

> Well, the separate API makes sense anyway. I have some ideas that
> might make it cleaner.

[snipped the nice refactoring of the ID into the nl_node]

Your idea (along with previous discussions) convinced me of the
importance of moving the ID-related barriers into the same
file. However, rather than pushing the ID parts into the numlist, I will
be moving them all into the "numlist user"
(i.e. printk_ringbuffer). Your use of the ACQUIRE to load the ID made me
realize that I need to be doing that as well! (but in the node()
callback)

The reasons why I do not want the ID in nl_node is:

- The numlist would need to implement the ID-to-node mapping. For the
  printk_ringbuffer that mapping is simply masking to an index within an
  array. But why should a numlist user be forced to do it that way? I
  see no advantage to restricting numlists to being arrays of nodes.

- The dataring structure also uses IDs and requires an ID-to-node
  mapping. I do not want to bind the dataring and numlist data
  structures together at this level because they really have nothing to
  do with each other. Having the dataring and numlist ID-to-node
  mappings (and their barriers) in the same place (in the
  numlist/dataring _user_) simplifies the big picture.

- ID-related barriers are only needed if node recycling is involved. The
  numlist user decides if recycling is used and if yes, then the numlist
  user is responsible for correctly implementing that.

- By moving all the ID-related barriers to the callbacks, the numlist
  code remains clean and (with the exception of the one smp_rmb()) does
  not expect anything from the numlist user.

I believe your main concern was having easily visible symmetric
barriers. We can achieve that if the read-barriers are in the callbacks
(for both numlist and dataring). I think it makes more sense to put them
there. dataring and numlist should not care about the ID-to-node
mapping.

John Ogness


Re: numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-26 Thread John Ogness
Hi Petr,

AndreaP responded with some explanation (and great links!) on the topic
of READ_ONCE. But I feel like your comments about the WRITE_ONCE were
not addressed. I address that (and your other comments) below...

On 2019-08-23, Petr Mladek  wrote:
>> --- /dev/null
>> +++ b/kernel/printk/numlist.c
>> +/**
>> + * numlist_push() - Add a node to the list and assign it a sequence number.
>> + *
>> + * @nl: The numbered list to push to.
>> + *
>> + * @n:  A node to push to the numbered list.
>> + *  The node must not already be part of a list.
>> + *
>> + * @id: The ID of the node.
>> + *
>> + * A node is added in two steps: The first step is to make this node the
>> + * head, which causes a following push to add to this node. The second step 
>> is
>> + * to update @next_id of the former head node to point to this one, which
>> + * makes this node visible to any task that sees the former head node.
>> + */
>> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
>> +{
>> +unsigned long head_id;
>> +unsigned long seq;
>> +unsigned long r;
>> +
>> +/*
>> + * bA:
>> + *
>> + * Setup the node to be a list terminator: next_id == id.
>> + */
>> +WRITE_ONCE(n->next_id, id);
>
> Do we need WRITE_ONCE() here?
> Both "n" and "id" are given as parameters and do not change.
> The assignment must be done before "id" is set as nl->head_id.
> The ordering is enforced by cmpxchg_release().

The cmpxchg_release() ensures that if the node is visible to writers,
then the finalized assignment is also visible. And the store_release()
ensures that if the previous node is visible to any readers, then the
finalized assignment is also visible. In the reader case, if any readers
happen to be sitting on the node, numlist_read() will fail because the
ID was updated when the node was popped. So for all these cases any
compiler optimizations leading to that assignment (tearing, speculation,
etc) should be irrelevant. Therefore, IMO the WRITE_ONCE() is not
needed.

Since all of this is lockless, I used WRITE_ONCE() whenever touching
shared variables. I must admit the decision may be motivated primarily
by fear of compiler optimizations. Although "documenting lockless shared
variable access" did play a role as well.

I will replace the WRITE_ONCE with an assignment.

>> +
>> +/* bB: #1 */
>> +head_id = atomic_long_read(&nl->head_id);
>> +
>> +for (;;) {
>> +/* bC: */
>> +while (!numlist_read(nl, head_id, &seq, NULL)) {
>> +/*
>> + * @head_id is invalid. Try again with an
>> + * updated value.
>> + */
>> +
>> +cpu_relax();
>
> I have got very confused by this. cpu_relax() suggests that this
> cycle is busy waiting until a particular node becomes valid.
> My first though was that it must cause deadlock in NMI when
> the interrupted code is supposed to make the node valid.
>
> But it is the other way. The head is always valid when it is
> added to the list. It might become invalid when another CPU
> moves the head and the old one gets reused.
>
> Anyway, I do not see any reason for cpu_relax() here.

You are correct. The cpu_relax() should not be there. But there is still
an issue that this could spin hard if the head was recycled and this CPU
does not yet see the new head value.

To handle that, and in preparation for my next version, I'm now using a
read_acquire() to load the ID in the node() callback (matching the
set_release() in assign_desc()). This ensures that if numlist_read()
fails, the new head will be visible.

> Also the entire cycle would deserve a comment to avoid this mistake.
> For example:
>
>   /*
>* bC: Read seq from current head. Repeat with new
>* head when it has changed and the old one got reused.
>*/

Agreed.

>> +
>> +/* bB: #2 */
>> +head_id = atomic_long_read(>head_id);
>> +}
>> +
>> +/*
>> + * bD:
>> + *
>> + * Set @seq to +1 of @seq from the previous head.
>> + *
>> + * Memory barrier involvement:
>> + *
>> + * If bB reads from bE, then bC->aA reads from bD.
>> + *
>> + * Relies on:
>> + *
>> + * RELEASE from bD to bE
>> + *matching
>> + * ADDRESS DEP. from bB to bC->aA
>> + */
>> +WRITE_ONCE(n->seq, seq + 1);
>
> Do we really need WRITE_ONCE() here? 
> It is the same problem as with setting n->next_id above.

For the same reasons as the other WRITE_ONCE, I will replace the
WRITE_ONCE with an assignment.

>> +
>> +/*
>> + * bE:
>> + *
>> + * This store_release() guarantees that @seq and @next are
>> + * stored before the node with @id is visible to any popping
>> +  

Re: numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-26 Thread Andrea Parri
> > C S+ponarelease+addroncena
> > 
> > {
> > int *y = &a;
> > }
> > 
> > P0(int *x, int **y, int *a)
> > {
> > int *r0;
> > 
> > *x = 2;
> > r0 = cmpxchg_release(y, a, x);
> > }
> > 
> > P1(int *x, int **y)
> > {
> > int *r0;
> >
> > r0 = READ_ONCE(*y);
> > *r0 = 1;
> > }
> > 
> > exists (1:r0=x /\ x=2)
> 
> Which r0 does the above exists clause refer to, please?
> Do both P0 and P1 define r0 on purpose?

"1:r0" is the value returned by the above READ_ONCE(*y), following the
convention [thread number]:[local variable]; but yes, I could probably
have saved you this question by picking a different name,  ;-)  sorry.


> 
> > Then
> > 
> >   $ herd7 -conf linux-kernel.cfg S+ponarelease+addroncena
> >   Test S+ponarelease+addroncena Allowed
> >   States 2
> >   1:r0=a; x=2;
> >   1:r0=x; x=1;
> >   No
> >   Witnesses
> >   Positive: 0 Negative: 2
> >   Condition exists (1:r0=x /\ x=2)
> >   Observation S+ponarelease+addroncena Never 0 2
> >   Time S+ponarelease+addroncena 0.01
> >   Hash=7eaf7b5e95419a3c352d7fd50b9cd0d5
> > 
> > that is, the test is not racy and the "exists" clause is not satisfiable
> > in the LKMM.  Notice that _if the READ_ONCE(*y) in P1 were replaced by a
> > plain read, then we would obtain:
> > 
> >   Test S+ponarelease+addrnana Allowed
> >   States 2
> >   1:r0=x; x=1;
> >   1:r0=x; x=2;
> 
> Do you have any explanation how r0=x; x=2; could happen, please?

I should have remarked: the states listed here lose their significance
when there is a data race: "data race" is LKMM's way of saying "I give
up, I'm unable to list all the reachable states; your call...".  ;-)

This example is "complicated", e.g., by the tearing of the plain read,
tearing which is envisaged/modelled by the LKMM: however, this tearing
doesn't explain the "1:r0=x; x=2;" state by itself, AFAICT.

Said this, I'm not sure how I copied this output...  For completeness,
I report the full/intended test at the bottom of my email.


> 
> Does the omitted READ_ONCE allow doing r0 = (*y) twice,
> before and after *r0 = 1?
> Or can the two operations in P1 happen in any order?
> 
> I am sorry if it is obvious. Feel free to ask me to re-read Paul's
> articles on LWN more times or point me to another resources.
> 
> 
> 
> >   Ok
> >   Witnesses
> >   Positive: 1 Negative: 1
> >   Flag data-race[ <-- the LKMM warns about a data-race ]
> >   Condition exists (1:r0=x /\ x=2)
> >   Observation S+ponarelease+addrnana Sometimes 1 1
> >   Time S+ponarelease+addrnana 0.00
> >   Hash=a61acf2e8e51c2129d33ddf5e4c76a49
> > 
> > N.B. This analysis generally depends on the assumption that every marked
> > access (e.g., the cmpxchg_release() called out above and the READ_ONCE()
> > heading the address dependencies) are _single-copy atomic, an assumption
> > which has been recently shown to _not be valid in such generality:
> > 
> >   https://lkml.kernel.org/r/20190821103200.kpufwtviqhpbuv2n@willie-the-truck
> 
> So, it might be even worse. Do I get it correctly?

Worse than I was hoping..., definitely!  ;-)

  Andrea

---
C S+ponarelease+addrnana

{
int *y = &a;
}

P0(int *x, int **y, int *a)
{
int *r0;

*x = 2;
r0 = cmpxchg_release(y, a, x);
}

P1(int *x, int **y)
{
int *r0;

r0 = *y;
*r0 = 1;
}

exists (1:r0=x /\ x=2)


Re: numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-26 Thread Petr Mladek
On Mon 2019-08-26 10:34:36, Andrea Parri wrote:
> > > + /*
> > > +  * bA:
> > > +  *
> > > +  * Setup the node to be a list terminator: next_id == id.
> > > +  */
> > > + WRITE_ONCE(n->next_id, id);
> > 
> > Do we need WRITE_ONCE() here?
> > Both "n" and "id" are given as parameters and do not change.
> > The assignment must be done before "id" is set as nl->head_id.
> > The ordering is enforced by cmpxchg_release().
> 
> (Disclaimer: this is still a very much debated issue...)
> 
> According to the LKMM, this question boils down to the question:
> 
>   Is there "ordering"/synchronization between the above access and
>   the "matching accesses" bF and aA' to the same location?
> 
> Again according to the LKMM's analysis, such synchronization is provided
> by the RELEASE -> "reads-from" -> ADDR relation.  (Encoding address dep.
> in litmus tests is kind of tricky but possible, e.g., for the pattern in
> question, we could write/model as follows:
> 
> C S+ponarelease+addroncena
> 
> {
>   int *y = &a;
> }
> 
> P0(int *x, int **y, int *a)
> {
>   int *r0;
> 
>   *x = 2;
>   r0 = cmpxchg_release(y, a, x);
> }
> 
> P1(int *x, int **y)
> {
>   int *r0;
>
>   r0 = READ_ONCE(*y);
>   *r0 = 1;
> }
> 
> exists (1:r0=x /\ x=2)

Which r0 does the above exists clause refer to, please?
Do both P0 and P1 define r0 on purpose?

> Then
> 
>   $ herd7 -conf linux-kernel.cfg S+ponarelease+addroncena
>   Test S+ponarelease+addroncena Allowed
>   States 2
>   1:r0=a; x=2;
>   1:r0=x; x=1;
>   No
>   Witnesses
>   Positive: 0 Negative: 2
>   Condition exists (1:r0=x /\ x=2)
>   Observation S+ponarelease+addroncena Never 0 2
>   Time S+ponarelease+addroncena 0.01
>   Hash=7eaf7b5e95419a3c352d7fd50b9cd0d5
> 
> that is, the test is not racy and the "exists" clause is not satisfiable
> in the LKMM.  Notice that _if the READ_ONCE(*y) in P1 were replaced by a
> plain read, then we would obtain:
> 
>   Test S+ponarelease+addrnana Allowed
>   States 2
>   1:r0=x; x=1;
>   1:r0=x; x=2;

Do you have any explanation how r0=x; x=2; could happen, please?

Does the omitted READ_ONCE allow doing r0 = (*y) twice,
before and after *r0 = 1?
Or can the two operations in P1 happen in any order?

I am sorry if it is obvious. Feel free to ask me to re-read Paul's
articles on LWN more times or point me to another resources.



>   Ok
>   Witnesses
>   Positive: 1 Negative: 1
>   Flag data-race  [ <-- the LKMM warns about a data-race ]
>   Condition exists (1:r0=x /\ x=2)
>   Observation S+ponarelease+addrnana Sometimes 1 1
>   Time S+ponarelease+addrnana 0.00
>   Hash=a61acf2e8e51c2129d33ddf5e4c76a49
> 
> N.B. This analysis generally depends on the assumption that every marked
> access (e.g., the cmpxchg_release() called out above and the READ_ONCE()
> heading the address dependencies) are _single-copy atomic, an assumption
> which has been recently shown to _not be valid in such generality:
> 
>   https://lkml.kernel.org/r/20190821103200.kpufwtviqhpbuv2n@willie-the-truck

So, it might be even worse. Do I get it correctly?

Best Regards,
Petr



Re: numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-26 Thread Andrea Parri
Sorry for top posting, but I forgot to mention: as you might have
noticed, my @amarulasolutions address is not active anymore; FWIW,
you should still be able to reach me at this @gmail address.

Thanks,
  Andrea




Re: numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-26 Thread Andrea Parri
> > +   /*
> > +* bA:
> > +*
> > +* Setup the node to be a list terminator: next_id == id.
> > +*/
> > +   WRITE_ONCE(n->next_id, id);
> 
> Do we need WRITE_ONCE() here?
> Both "n" and "id" are given as parameters and do not change.
> The assignment must be done before "id" is set as nl->head_id.
> The ordering is enforced by cmpxchg_release().

(Disclaimer: this is still a very much debated issue...)

According to the LKMM, this question boils down to the question:

  Is there "ordering"/synchronization between the above access and
  the "matching accesses" bF and aA' to the same location?

Again according to the LKMM's analysis, such synchronization is provided
by the RELEASE -> "reads-from" -> ADDR relation.  (Encoding address dep.
in litmus tests is kind of tricky but possible, e.g., for the pattern in
question, we could write/model as follows:

C S+ponarelease+addroncena

{
int *y = &a;
}

P0(int *x, int **y, int *a)
{
int *r0;

*x = 2;
r0 = cmpxchg_release(y, a, x);
}

P1(int *x, int **y)
{
int *r0;

r0 = READ_ONCE(*y);
*r0 = 1;
}

exists (1:r0=x /\ x=2)

Then

  $ herd7 -conf linux-kernel.cfg S+ponarelease+addroncena
  Test S+ponarelease+addroncena Allowed
  States 2
  1:r0=a; x=2;
  1:r0=x; x=1;
  No
  Witnesses
  Positive: 0 Negative: 2
  Condition exists (1:r0=x /\ x=2)
  Observation S+ponarelease+addroncena Never 0 2
  Time S+ponarelease+addroncena 0.01
  Hash=7eaf7b5e95419a3c352d7fd50b9cd0d5

that is, the test is not racy and the "exists" clause is not satisfiable
in the LKMM.  Notice that _if the READ_ONCE(*y) in P1 were replaced by a
plain read, then we would obtain:

  Test S+ponarelease+addrnana Allowed
  States 2
  1:r0=x; x=1;
  1:r0=x; x=2;
  Ok
  Witnesses
  Positive: 1 Negative: 1
  Flag data-race[ <-- the LKMM warns about a data-race ]
  Condition exists (1:r0=x /\ x=2)
  Observation S+ponarelease+addrnana Sometimes 1 1
  Time S+ponarelease+addrnana 0.00
  Hash=a61acf2e8e51c2129d33ddf5e4c76a49

N.B. This analysis generally depends on the assumption that every marked
access (e.g., the cmpxchg_release() called out above and the READ_ONCE()
heading the address dependencies) is _single-copy atomic_, an assumption
which has been recently shown to _not_ be valid in such generality:

  https://lkml.kernel.org/r/20190821103200.kpufwtviqhpbuv2n@willie-the-truck

(Bug in the LKMM? or in the Linux implementation of these primitives? or
in the compiler? your blame here...)


[...]

> > +   /*
> > +* bD:
> > +*
> > +* Set @seq to +1 of @seq from the previous head.
> > +*
> > +* Memory barrier involvement:
> > +*
> > +* If bB reads from bE, then bC->aA reads from bD.
> > +*
> > +* Relies on:
> > +*
> > +* RELEASE from bD to bE
> > +*matching
> > +* ADDRESS DEP. from bB to bC->aA
> > +*/
> > +   WRITE_ONCE(n->seq, seq + 1);
> 
> Do we really need WRITE_ONCE() here? 
> It is the same problem as with setting n->next_id above.

Same considerations as above would apply here.

  Andrea


Re: assign_desc() barriers: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-26 Thread John Ogness
On 2019-08-25, John Ogness  wrote:
>> --- /dev/null
>> +++ b/kernel/printk/ringbuffer.c
>> +static bool assign_desc(struct prb_reserved_entry *e)
>> +{
[...]
>> +atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
>> +DESCS_COUNT(rb));
> 
> atomic_long_set_release() might be a bit confusing here.
> There is no related acquire.
>>> 
>>> As the comment states, this release is for prb_getdesc() users. The
>>> only prb_getdesc() user is _dataring_pop(). If prb_getdesc() fails (i.e. the descriptor's
>>> ID is not what _dataring_pop() was expecting), then the tail must
>>> have moved and _dataring_pop() needs to see that. Since there are no
>>> data dependencies between descriptor ID and tail_pos, an explicit
>>> memory barrier is used. More on this below.
>
>>   + The two related barriers are in different source files
>> and APIs:
>>
>>+ assign_desc() in ringbuffer.c; ringbuffer API
>>+ _dataring_pop in dataring.c; dataring API
>
> Agreed. This is a consequence of the ID management being within the
> high-level ringbuffer code. I could have added an smp_rmb() to the
> NULL case in prb_getdesc(). Then both barriers would be in the same
> file. However, this would mean smp_rmb() is called many times
> (particularly by readers) when it is not necessary.

What I wrote here is wrong. prb_getdesc() is not called "many times
(particularly by readers)". It is only called once within the writer
function _dataring_pop().

Looking at this again, I think it would be better to move the smp_rmb()
into the NULL case of prb_getdesc(). Then both barrier pairs are located
(and documented) in the same file. This also simplifies the
documentation by not saying "the caller's smp_rmb() everywhere".

I would also change _dataring_pop() so that the smp_rmb() is located
within the handling of the other two failed checks (begin_lpos !=
tail_lpos and !_datablock_valid()). Then the out: at the end is just
return atomic_long_read(&dr->tail_lpos).

After modifying the code in this way, I think it looks more
straightforward and would have avoided your confusion: The RMB in
dataring.c:_dataring_pop() matches the MB in dataring.c:dataring_push()
and the RMB in ringbuffer.c:prb_getdesc() matches the SET_RELEASE in
ringbuffer.c:assign_desc().
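
For illustration, the reshuffled failure path could look roughly like
this (a sketch only: the surrounding logic is elided and the names and
signatures are taken loosely from the v4 patch, so details may differ):

```c
/* Sketch, not the patch: both failed checks share one smp_rmb(). */
if (begin_lpos != tail_lpos || !_datablock_valid(...)) {
	/*
	 * The tail may have been pushed by another task. Pair with
	 * the MB in dataring_push() (and, via prb_getdesc(), the
	 * SET_RELEASE in assign_desc()) before re-reading it.
	 */
	smp_rmb();
	return atomic_long_read(&dr->tail_lpos);
}
```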

John Ogness


Re: dataring_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-24 Thread John Ogness
On 2019-08-20, Petr Mladek  wrote:
>> +/**
>> + * dataring_push() - Reserve a data block in the data array.
>> + *
>> + * @dr:   The data ringbuffer to reserve data in.
>> + *
>> + * @size: The size to reserve.
>> + *
>> + * @desc: A pointer to a descriptor to store the data block information.
>> + *
>> + * @id:   The ID of the descriptor to be associated.
>> + *The data block will not be set with @id, but rather initialized 
>> with
>> + *a value that is explicitly different than @id. This is to handle 
>> the
>> + *case when newly available garbage by chance matches the descriptor
>> + *ID.
>> + *
>> + * This function expects to move the head pointer forward. If this would
>> + * result in overtaking the data array index of the tail, the tail data 
>> block
>> + * will be invalidated.
>> + *
>> + * Return: A pointer to the reserved writer data, otherwise NULL.
>> + *
>> + * This will only fail if it was not possible to invalidate the tail data
>> + * block.
>> + */
>> +char *dataring_push(struct dataring *dr, unsigned int size,
>> +struct dr_desc *desc, unsigned long id)
>> +{
>> +unsigned long begin_lpos;
>> +unsigned long next_lpos;
>> +struct dr_datablock *db;
>> +bool ret;
>> +
>> +to_db_size(&size);
>> +
>> +do {
>> +/* fA: */
>> +ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
>> +
>> +/*
>> + * fB:
>> + *
>> + * The data ringbuffer tail may have been pushed (by this or
>> + * any other task). The updated @tail_lpos must be visible to
>> + * all observers before changes to @begin_lpos, @next_lpos, or
>> + * @head_lpos by this task are visible in order to allow other
>> + * tasks to recognize the invalidation of the data
>> + * blocks.
>
> This sounds strange. The write barrier should be done only on CPU
> that really modified tail_lpos. I.e. it should be in _dataring_pop()
> after successful dr->tail_lpos modification.

The problem is that there are no data dependencies between the different
variables. When a new datablock is being reserved, it is critical that
all other observers see that the tail_lpos moved forward _before_ any
other changes. _dataring_pop() uses an smp_rmb() to synchronize for
tail_lpos movement. This CPU is about to make some changes and may have
seen an updated tail_lpos. An smp_wmb() is useless if this is not the
CPU that performed that update. The full memory barrier ensures that all
other observers will see what this CPU sees before any of its future
changes are seen.
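
This propagation argument is the classic WRC ("write-to-read causality")
pattern, which can be sketched as a litmus test in the style used
elsewhere in this thread (names are illustrative: P0 stands for the CPU
that moved the tail, P1 for the reserving CPU executing the smp_mb()
above, P2 for a later observer):

```
C dataring-WRC-sketch

{}

P0(int *tail)
{
	WRITE_ONCE(*tail, 1);
}

P1(int *tail, int *head)
{
	int r0;

	r0 = READ_ONCE(*tail);
	smp_mb();	/* an smp_wmb() here would not order the read above */
	WRITE_ONCE(*head, 1);
}

P2(int *head, int *tail)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*head);
	smp_rmb();
	r1 = READ_ONCE(*tail);
}

exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0)
```

If I am feeding herd7 correctly, the LKMM forbids the exists clause
thanks to the cumulativity of the full barrier: once P1 has observed the
moved tail, its later store to the head cannot become visible to P2
before the tail update does. Replacing the smp_mb() with smp_wmb()
should make the clause reachable, since smp_wmb() does not order P1's
read against its subsequent store.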

You suggest an alternative implementation below. I will address that
there.

>> + * This pairs with the smp_rmb() in _dataring_pop() as well as
>> + * any reader task using smp_rmb() to post-validate data that
>> + * has been read from a data block.
>> +
>> + * Memory barrier involvement:
>> + *
>> + * If dE reads from fE, then dI reads from fA->eA.
>> + * If dC reads from fG, then dI reads from fA->eA.
>> + * If dD reads from fH, then dI reads from fA->eA.
>> + * If mC reads from fH, then mF reads from fA->eA.
>> + *
>> + * Relies on:
>> + *
>> + * FULL MB between fA->eA and fE
>> + *matching
>> + * RMB between dE and dI
>> + *
>> + * FULL MB between fA->eA and fG
>> + *matching
>> + * RMB between dC and dI
>> + *
>> + * FULL MB between fA->eA and fH
>> + *matching
>> + * RMB between dD and dI
>> + *
>> + * FULL MB between fA->eA and fH
>> + *matching
>> + * RMB between mC and mF
>> + */
>> +smp_mb();
>
> All these comments talk about synchronization against read barriers.
> It means that we would need a write barrier here. But it does
> not make much sense to do write barrier before actually
> writing dr->head_lpos.

I think my comments above address this.

> After all I think that we do not need any barrier here.
> The write barrier for dr->tail_lpos should be in
> _dataring_pop(). The read barrier is not needed because
> we are not reading anything here.
>
> Instead we should put a barrier after modifying dr->head_lpos,
> see below.

Comments below.

>> +if (!ret) {
>> +/*
>> + * Force @desc permanently invalid to minimize risk
>> + * of the descriptor later unexpectedly being
>> + * determined as valid due to overflowing/wrapping of
>> + * @head_lpos. An unaligned @begin_lpos can never
>> + * point to a data block and having the same value
>> + * for @begin_lpos and @next_lpos is also invalid.
>> +  

Re: assign_desc() barriers: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-24 Thread John Ogness
On 2019-08-22, Petr Mladek  wrote:
> --- /dev/null
> +++ b/kernel/printk/ringbuffer.c
> +/**
> + * assign_desc() - Assign a descriptor to the caller.
> + *
> + * @e: The entry structure to store the assigned descriptor to.
> + *
> + * Find an available descriptor to assign to the caller. First it is
> + * checked if the tail descriptor from the committed list can be
> + * recycled. If not, perhaps a never-used descriptor is available.
> + * Otherwise, data blocks will be invalidated until the tail descriptor
> + * from the committed list can be recycled.
> + *
> + * Assigned descriptors are invalid until data has been reserved for
> + * them.
> + *
> + * Return: true if a descriptor was assigned, otherwise false.
> + *
> + * This will only fail if it was not possible to invalidate data blocks
> + * in order to recycle a descriptor. This can happen if a writer has
> + * reserved but not yet committed data and that reserved data is
> + * currently the oldest data.
> + */
> +static bool assign_desc(struct prb_reserved_entry *e)
> +{
> + struct printk_ringbuffer *rb = e->rb;
> + struct prb_desc *d;
> + struct nl_node *n;
> + unsigned long i;
> +
> + for (;;) {
> + /*
> +  * jA:
> +  *
> +  * Try to recycle a descriptor on the committed list.
> +  */
> + n = numlist_pop(&rb->nl);
> + if (n) {
> + d = container_of(n, struct prb_desc, list);
> + break;
> + }
> +
> + /* Fallback to static never-used descriptors. */
> + if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
> + i = atomic_fetch_inc(&rb->desc_next_unused);
> + if (i < DESCS_COUNT(rb)) {
> + d = &rb->descs[i];
> + atomic_long_set(&d->id, i);
> + break;
> + }
> + }
> +
> + /*
> +  * No descriptor available. Make one available for recycling
> +  * by invalidating data (which some descriptor will be
> +  * referencing).
> +  */
> + if (!dataring_pop(&rb->dr))
> + return false;
> + }
> +
> + /*
> +  * jB:
> +  *
> +  * Modify the descriptor ID so that users of the descriptor see that
> +  * it has been recycled. A _release() is used so that prb_getdesc()
> +  * callers can see all data ringbuffer updates after issuing a
> +  * pairing smb_rmb(). See iA for details.
> +  *
> +  * Memory barrier involvement:
> +  *
> +  * If dB->iA reads from jB, then dI reads the same value as
> +  * jA->cD->hA.
> +  *
> +  * Relies on:
> +  *
> +  * RELEASE from jA->cD->hA to jB
> +  *matching
> +  * RMB between dB->iA and dI
> +  */
> + atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
> + DESCS_COUNT(rb));
 
>>> atomic_long_set_release() might be a bit confusing here.
>>> There is no related acquire.
>> 
>> As the comment states, this release is for prb_getdesc() users. The
>> only prb_getdesc() user is _dataring_pop(). If prb_getdesc() fails (i.e. the descriptor's
>> ID is not what _dataring_pop() was expecting), then the tail must
>> have moved and _dataring_pop() needs to see that. Since there are no
>> data dependencies between descriptor ID and tail_pos, an explicit
>> memory barrier is used. More on this below.

After reading through your post, I think you are pairing the wrong
barriers together. jB pairs with dH (i.e. the set_release() in
assign_desc() pairs with the smp_rmb() in _dataring_pop()).

(The comment for jB wrongly says dI instead of dH! Argh!)

> OK, let me show how complicated and confusing this looks for me:

I want to address all your points here. _Not_ because I want to justify
or defend my insanity, but because it may help to provide some clarity.

>   + The two related barriers are in different source files
> and APIs:
>
>+ assign_desc() in ringbuffer.c; ringbuffer API
>+ _dataring_pop in dataring.c; dataring API

Agreed. This is a consequence of the ID management being within the
high-level ringbuffer code. I could have added an smp_rmb() to the NULL
case in prb_getdesc(). Then both barriers would be in the same
file. However, this would mean smp_rmb() is called many times
(particularly by readers) when it is not necessary.

>   + Both the related barriers are around "id" manipulation.
> But one is in dataring, other is in descriptors array.
> One is about an old released "id". One is about a newly
> assigned "id".

dB is not the pairing barrier of jB. As dB's comment says, it pairs with
gA. (The load_acquire(id) in _dataring_pop() pairs with the

numlist API Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-23 Thread Petr Mladek
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/numlist.c
> @@ -0,0 +1,375 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include "numlist.h"

struct numlist is really special variant of a list. Let me to
do a short summary:

   + FIFO queue interface

   + nodes sequentially numbered

   + nodes referenced by ID instead of pointers to avoid ABA problems
 + requires a custom node() callback to get the pointer for a given ID

   + lockless access:
 + pushed nodes must no longer be modified by the push() caller
 + the pop() caller gets exclusive write access, except that they
   must modify the ID first and do smp_wmb() later

   + pop() does not work when:
 + the tail node is "busy"
+ needs a custom callback that defines when a node is busy
 + the tail is the last node
+ needed for lockless sequential numbering
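
Condensed, the surface being summarized might look like this (a sketch
assembled from snippets quoted in this thread, not the patch itself;
field and callback names may differ in detail):

```c
/* Sketch of the numlist types as summarized above (not the patch). */
struct nl_node {
	unsigned long seq;      /* sequence number, assigned on push */
	unsigned long next_id;  /* ID of next node; equals own ID at the head */
};

struct numlist {
	atomic_long_t head_id;
	atomic_long_t tail_id;
	/* resolve an ID to its node, or NULL if the ID is stale */
	struct nl_node *(*node)(unsigned long id, void *node_arg);
	void *node_arg;
	/* report whether the tail node may not be popped yet */
	bool (*busy)(unsigned long id, void *busy_arg);
	void *busy_arg;
};
```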

I will start with one inevitable question ;-) Is it realistic to find
another user for this API, please?

I am not sure that all the indirections, caused by the generic API,
are worth the gain.


Well, the separate API makes sense anyway. I have some ideas that
might make it cleaner.

The barriers are because of validating the ID. Now we have:

struct nl_node {
unsigned long   seq;
unsigned long   next_id;
};

that is used in:

struct prb_desc {
/* private */
atomic_long_t   id;
struct dr_desc  desc;
struct nl_node  list;
};

What will happen when we move id from struct prb_desc into struct nl_node?

struct nl_node {
unsigned long   seq;
atomic_long_t   id;
unsigned long   next_id;
};

struct prb_desc {
struct dr_desc  desc;
struct nl_node  list;
};


Then the "node" callback might just return the structure. It makes
perfect sense. struct nl_node is always static for a given id.

For the printk ringbuffer it would look like:

struct nl_node *prb_nl_get_node(unsigned long id, void *nl_user)
{
struct printk_ringbuffer *rb = (struct printk_ringbuffer *)nl_user;
struct prb_desc *d = to_desc(rb, id);

return &d->list;
}

I would also hide the callback behind a generic wrapper:

struct nl_node *numlist_get_node(struct numlist *nl, unsigned long id)
{
return nl->get_node(id, nl->user_data);
}


Then we could have nicely symmetric and self-contained barriers
in numlist_read():

bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
  unsigned long *next_id)
{
struct nl_node *n;
unsigned long cur_id;

n = numlist_get_node(nl, id);
if (!n)
return false;

/*
 * Make sure that seq and next_id values will be read
 * for the expected id.
 */
cur_id = atomic_long_read_acquire(&n->id);
if (cur_id != id)
return false;

if (seq) {
*seq = n->seq;

if (next_id)
*next_id = n->next_id;
}

/*
 * Make sure that seq and next_id values were read for
 * the expected ID.
 */
cur_id = atomic_long_read_release(&n->id);

return cur_id == id;
}

numlist_push() might be the same, except that I would
remove several WRITE_ONCE() calls, as discussed in another mail:

void numlist_push(struct numlist *nl, struct nl_node *n)
{
unsigned long head_id;
unsigned long seq;
unsigned long r;

/* Setup the node to be a list terminator: next_id == id. */
n->next_id = n->id;

do {
do {
head_id = atomic_long_read(&nl->head_id);
} while (!numlist_read(nl, head_id, &seq, NULL));

n->seq = seq + 1;

/*
 * This store_release() guarantees that @seq and @next are
 * stored before the node with @id is visible to any popping
 * writers.
 * 
 * It pairs with the acquire() when tail_id gets updated
 * in numlist_pop();
 */
} while (atomic_long_cmpxchg_release(&nl->head_id, head_id, id) !=
head_id);

n = numlist_get_node(nl, head_id);

/*
 * This barrier makes sure that nl->head_id already points to
 * the newly pushed node.
 *
 * It pairs with the acquire when a new ID is written in numlist_pop().
 * It allows this node to be popped and reused. It can no longer
 * be the last one.
 */
smp_store_release(&n->next_id, id);
}

Then I would add a symmetric callback that would generate an ID for
a newly popped struct. It would allow setting the new ID in the numlist
API and keep the barriers symmetric. Something like:

unsigned long 

Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-23 Thread Andrea Parri
> I am not suggesting to remove all comments. Some human readable
> explanation is important as long as the code is developed by humans.
> 
> I think that I'll have to accept also the extra comments if you are
> really going to use them to check the consistency by a tool. Or
> if they are really used for review by some people.

Glad to hear this.  Thank you, Petr.


> Do all this manuals, tools, people use any common syntax, please?
> Would it be usable in our case as well?
> 
> I would like to avoid reinventing the wheel. Also I do not want
> to create a dialect for few people that other potentially interested
> parties will not understand.

Right; I think that terms such as "(barrier) matching", "reads-from"
and "overwrites" are commonly used to refer to litmus tests.  (The
various primitives/instructions are of course specific to the given
context: the language, the memory model, etc. )

IOW, I'd say that that wheel _and a common denominator here can be
represented by the notion of "litmus test".  I'm not suggesting to
reinvent this wheel of course; my point was more along the lines of
"let's use the wheel, it'll be helpful..."  ;-)

  Andrea


Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-23 Thread Petr Mladek
On Thu 2019-08-22 19:38:01, Andrea Parri wrote:
> On Thu, Aug 22, 2019 at 03:50:52PM +0200, Petr Mladek wrote:
> > On Wed 2019-08-21 07:46:28, John Ogness wrote:
> > > On 2019-08-20, Sergey Senozhatsky  
> > > wrote:
> > > > [..]
> > > >> > + *
> > > >> > + * Memory barrier involvement:
> > > >> > + *
> > > >> > + * If dB reads from gA, then dC reads from fG.
> > > >> > + * If dB reads from gA, then dD reads from fH.
> > > >> > + * If dB reads from gA, then dE reads from fE.
> > > >> > + *
> > > >> > + * Note that if dB reads from gA, then dC cannot read from fC.
> > > >> > + * Note that if dB reads from gA, then dD cannot read from fD.
> > > >> > + *
> > > >> > + * Relies on:
> > > >> > + *
> > > >> > + * RELEASE from fG to gA
> > > >> > + *matching
> > > >> > + * ADDRESS DEP. from dB to dC
> > > >> > + *
> > > >> > + * RELEASE from fH to gA
> > > >> > + *matching
> > > >> > + * ADDRESS DEP. from dB to dD
> > > >> > + *
> > > >> > + * RELEASE from fE to gA
> > > >> > + *matching
> > > >> > + * ACQUIRE from dB to dE
> > > >> > + */
> > > >> 
> > > >> But I am not sure how much this is useful. It would take ages to decrypt
> > > >> all these shortcuts (signs) and translate them into something
> > > >> human readable. Also it might get outdated easily.
> > > >> 
> > > The labels are necessary for the technical documentation of the
> > > barriers. And, after spending much time in this, I find them very
> > > useful. But I agree that there needs to be a better way to assign label
> > > names.
> > 
> > I could understand that you spend a lot of time on creating the
> > labels and that they are somehow useful for you.
> > 
> > But I am not using them and I hope that I will not have to:
> > 
> >   + Grepping takes a lot of time, especially over several files.
> > 
> >   + Grepping is actually not enough. It is required to read
> > the following comment or code to realize what the label is for.
> > 
> >   + Several barriers have multiple dependencies. Grepping one
> > label helps to check that one connection makes sense.
> > But it is hard to keep all relations in head to confirm
> > that they are complete and make sense overall.
> > 
> >   + There are about 50 labels in the code. "Entry Lifecycle"
> > section in dataring.c talks about 8 steps. One would
> > expect that it would require 8 read and 8 write barriers.
> > 
> > Even coordination of 16 barriers might be complicated to check.
> > Where 50 is just scary.
> > 
> > 
> >   + It seems to be a newly invented format and it is not documented.
> > I personally do not understand it completely, for example,
> > the meaning of "RELEASE from jA->cD->hA to jB".
> 
> IIUC, something like "hA is the interested access, happening within
> cD (should have been cC?), which in turn happens within jA".  But I
> should defer to John (FWIW, I found that notation quite helpful).
> 
> 
> > 
> > 
> > I hope that we could do better. I believe that human readable
> > comments are less error prone because they describe the intention.
> > Pseudo code based on labels just describes the code but it
> > does not explain why it was done this way.
> > 
> > From my POV, the labels do more harm than good. The code gets
> > too scattered and is harder to follow.
> > 
> > 
> > > I hope that we can agree that the labels are important.
> > 
> > It would be great to hear from others.
> 
> I agree with you that reviewing these comments might be "scary" and
> not suitable for bed-reading ;-) (I didn't have time to complete
> such review yet).  OTOH, from my POV, removing such comments/labels
> could only make such (and future) reviews scarier, because then the
> (memory-ordering) "intention" would be _hidden_ in the code.

I am not suggesting to remove all comments. Some human readable
explanation is important as long as the code is developed by humans.

I think that I'll have to accept also the extra comments if you are
really going to use them to check the consistency by a tool. Or
if they are really used for review by some people.


> > > And that a formal documentation of the barriers is also important.
> > 
> > It might be helpful if it can somehow be fed to a tool that would
> > prove correctness. Is this the case?
> 
> From what I've read so far, it _should_ be relatively straightforward
> to write down a litmus test from any such comment (and give this to
> the LKMM simulator).

Sounds good.

> > In each case, it should follow some "widely" used format.
> > We should not invent a new one that nobody else would use
> > and understand.
> 
> Agreed.  Well, litmus tests (or the comments here in question, that
> are intended to convey the same information) have been successfully
> adopted by memory model and concurrency people for as long as I can
> remember, current architecture reference manuals use these tools to
> describe the 

Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-23 Thread Petr Mladek
On Fri 2019-08-23 14:54:45, Sergey Senozhatsky wrote:
> On (08/21/19 07:46), John Ogness wrote:
> [..]
> > The labels are necessary for the technical documentation of the
> > barriers. And, after spending much time in this, I find them very
> > useful. But I agree that there needs to be a better way to assign label
> > names.
> [..]
> > > Where dp stands for descriptor push. For dataring we can add a 'dr'
> > > prefix, to avoid confusion with desc barriers, which have 'd' prefix.
> > > And so on. Dunno.
> > 
> > Yeah, I spent a lot of time going in circles on this one.
> [..]
> > I hope that we can agree that the labels are important. And that a
> > formal documentation of the barriers is also important. Yes, they are a
> > lot of work, but I find it makes it a lot easier to go back to the code
> > after I've been away for a while. Even now, as I go through your
> > feedback on code that I wrote over a month ago, I find the formal
> > comments critical to quickly understand _exactly_ why the memory
> > barriers exist.
> 
> Yeah. I like those tags/labels, and appreciate your efforts.
> 
> Speaking about it in general, not necessarily related to printk patch set.
> With or without labels/tags we still have to grep. But grep-ing is much
> easier when we have labels/tags. Otherwise it's sometimes hard to understand
> what to grep for - _acquire, _relaxed, smp barrier, write_once, or
> anything else.

Grepping is not needed when function names are used in the comments
and cscope can be used. Each function should be short and simple
enough that any nested label can be found by eye.

A custom script is an alternative but it would be better to use
existing tools.

In each case, two-letter labels would become ambiguous sooner or later
when the semantics get used widely. And I hope that we use semantics
that are going to be used widely.

> > Perhaps we should choose labels that are more clear, like:
> > 
> > dataring_push:A
> > dataring_push:B
> > 
> > Then we would see comments like:
> > 
> > Memory barrier involvement:
> > 
> > If _dataring_pop:B reads from dataring_datablock_setid:A, then
> > _dataring_pop:C reads from dataring_push:G.
> [..]
> > RELEASE from dataring_push:E to dataring_datablock_setid:A
> >matching
> > ACQUIRE from _dataring_pop:B to _dataring_pop:E
> 
> I thought about it. That's very informative, albeit pretty hard to maintain.
> The same applies to drA or prA and any other context dependent prefix.

The maintenance is my concern as well. The labels should be primarily
for an automated consistency checker. They make sense only when
they are in sync with the code.

Best Regards,
Petr


Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-23 Thread Sergey Senozhatsky
On (08/22/19 15:50), Petr Mladek wrote:
[..]
> I could understand that you spend a lot of time on creating the
> labels and that they are somehow useful for you.
>
> But I am not using them and I hope that I will not have to:
>
>   + Grepping takes a lot of time, especially over several files.

But without labels one still has to grep. A label at least points
to one exact location.

>   + Grepping is actually not enough. It is required to read
> the following comment or code to realize what the label is for.
>
>   + Several barriers have multiple dependencies. Grepping one
> label helps to check that one connection makes sense.
> But it is hard to keep all relations in head to confirm
> that they are complete and make sense overall.

Hmm. Labels don't add dependencies per se. Those tricky and hard to
follow dependencies will still be there, even if we'd remove
labels from comments. Labels just attempt to document them and
to show the intent.

The most important label, which should be added, is John's cell
phone number. So people can call/text him when something is not
working ;)

>   + There are about 50 labels in the code. "Entry Lifecycle"
> section in dataring.c talks about 8 steps. One would
> expect that it would require 8 read and 8 write barriers.
> 
> Even coordination of 16 barriers might be complicated to check.
> Where 50 is just scary.
> 
>   + It seems to be a newly invented format and it is not documented.
> I personally do not understand it completely, for example,
> the meaning of "RELEASE from jA->cD->hA to jB".

I was under the impression that this is the lingo used by LMM, but
can't find it in Documentation.

I agree, things can be improved and, may be, standardized.
It feels that tooling is a big part of the problem here.

-ss


numlist_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-23 Thread Petr Mladek
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/numlist.c
> +/**
> + * numlist_push() - Add a node to the list and assign it a sequence number.
> + *
> + * @nl: The numbered list to push to.
> + *
> + * @n:  A node to push to the numbered list.
> + *  The node must not already be part of a list.
> + *
> + * @id: The ID of the node.
> + *
> + * A node is added in two steps: The first step is to make this node the
> + * head, which causes a following push to add to this node. The second step is
> + * to update @next_id of the former head node to point to this one, which
> + * makes this node visible to any task that sees the former head node.
> + */
> +void numlist_push(struct numlist *nl, struct nl_node *n, unsigned long id)
> +{
> + unsigned long head_id;
> + unsigned long seq;
> + unsigned long r;
> +
> + /*
> +  * bA:
> +  *
> +  * Setup the node to be a list terminator: next_id == id.
> +  */
> + WRITE_ONCE(n->next_id, id);

Do we need WRITE_ONCE() here?
Both "n" and "id" are given as parameters and do not change.
The assignment must be done before "id" is set as nl->head_id.
The ordering is enforced by cmpxchg_release().

> +
> + /* bB: #1 */
> + head_id = atomic_long_read(&nl->head_id);
> +
> + for (;;) {
> + /* bC: */
> + while (!numlist_read(nl, head_id, &seq, NULL)) {
> + /*
> +  * @head_id is invalid. Try again with an
> +  * updated value.
> +  */
> +
> + cpu_relax();

I have got very confused by this. cpu_relax() suggests that this
cycle is busy waiting until a particular node becomes valid.
My first though was that it must cause deadlock in NMI when
the interrupted code is supposed to make the node valid.

But it is the other way. The head is always valid when it is
added to the list. It might become invalid when another CPU
moves the head and the old one gets reused.

Anyway, I do not see any reason for cpu_relax() here.

Also the entire cycle would deserve a comment to avoid this mistake.
For example:

/*
 * bC: Read seq from current head. Repeat with new
 * head when it has changed and the old one got reused.
 */

> +
> + /* bB: #2 */
> + head_id = atomic_long_read(&nl->head_id);
> + }
> +
> + /*
> +  * bD:
> +  *
> +  * Set @seq to +1 of @seq from the previous head.
> +  *
> +  * Memory barrier involvement:
> +  *
> +  * If bB reads from bE, then bC->aA reads from bD.
> +  *
> +  * Relies on:
> +  *
> +  * RELEASE from bD to bE
> +  *matching
> +  * ADDRESS DEP. from bB to bC->aA
> +  */
> + WRITE_ONCE(n->seq, seq + 1);

Do we really need WRITE_ONCE() here? 
It is the same problem as with setting n->next_id above.

> +
> + /*
> +  * bE:
> +  *
> +  * This store_release() guarantees that @seq and @next are
> +  * stored before the node with @id is visible to any popping
> +  * writers. It pairs with the address dependency between @id
> +  * and @seq/@next provided by numlist_read(). See bD and bF
> +  * for details.
> +  */
> + r = atomic_long_cmpxchg_release(&nl->head_id, head_id, id);
> + if (r == head_id)
> + break;
> +
> + /* bB: #3 */
> + head_id = r;
> + }
> +
> + n = nl->node(head_id, nl->node_arg);
> +
> + /*
> +  * The old head (which is still the list terminator), cannot be
> +  * removed because the list will always have at least one node.
> +  * Therefore @n must be non-NULL.
> +  */

Please, move this comment above the nl->node() call. Both locations
makes sense. I just see it as an important note for the call and thus
is should be above. Also it will be better separated from the below
comments for the _release() barrier.

> + /*
> +  * bF: the STORE part for @next_id
> +  *
> +  * Set @next_id of the previous head to @id.
> +  *
> +  * Memory barrier involvement:
> +  *
> +  * If bB reads from bE, then bF overwrites bA.
> +  *
> +  * Relies on:
> +  *
> +  * RELEASE from bA to bE
> +  *matching
> +  * ADDRESS DEP. from bB to bF
> +  */
> + /*
> +  * bG: the RELEASE part for @next_id
> +  *
> +  * This _release() guarantees that a reader will see the updates to
> +  * this node's @seq/@next_id if the reader saw the @next_id of the
> +  * previous node in the list. It pairs with the address dependency
> +  * between @id and @seq/@next provided by numlist_read().
> +  *
> +  * 

Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-22 Thread Sergey Senozhatsky
On (08/21/19 07:46), John Ogness wrote:
[..]
> The labels are necessary for the technical documentation of the
> barriers. And, after spending much time in this, I find them very
> useful. But I agree that there needs to be a better way to assign label
> names.
[..]
> > Where dp stands for descriptor push. For dataring we can add a 'dr'
> > prefix, to avoid confusion with desc barriers, which have 'd' prefix.
> > And so on. Dunno.
> 
> Yeah, I spent a lot of time going in circles on this one.
[..]
> I hope that we can agree that the labels are important. And that a
> formal documentation of the barriers is also important. Yes, they are a
> lot of work, but I find it makes it a lot easier to go back to the code
> after I've been away for a while. Even now, as I go through your
> feedback on code that I wrote over a month ago, I find the formal
> comments critical to quickly understand _exactly_ why the memory
> barriers exist.

Yeah. I like those tagsi/labels, and appreciate your efforts.

Speaking about it in general, not necessarily related to printk patch set.
With or without labels/tags we still have to grep. But grep-ing is much
easier when we have labels/tags. Otherwise it's sometimes hard to understand
what to grep for - _acquire, _relaxed, smp barrier, write_once, or
anything else.

> Perhaps we should choose labels that are more clear, like:
> 
> dataring_push:A
> dataring_push:B
> 
> Then we would see comments like:
> 
> Memory barrier involvement:
> 
> If _dataring_pop:B reads from dataring_datablock_setid:A, then
> _dataring_pop:C reads from dataring_push:G.
[..]
> RELEASE from dataring_push:E to dataring_datablock_setid:A
>matching
> ACQUIRE from _dataring_pop:B to _dataring_pop:E

I thought about it. That's very informative, albeit pretty hard to maintain.
The same applies to drA or prA and any other context dependent prefix.

> But then how should the labels in the code look? Just the letter looks
> simple in code, but cannot be grepped.

Yes, good point.

> dataring_push()
> {
> ...
> /* E */
> ...
> }

If only there was something as cool as grep-ing, but cooler. Something
that "just sucks less". Something that even folks like myself could use.

Bear with me.

Apologies. This email is rather long; but it's pretty easy to read.
Let's see if this can fly.

So what I did.

I changed several LMM tags/labels definitions, so they have common format:

LMM_TAG(name)

I don't insist on this particular naming scheme, it can be improved.

==
diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
index e48069dc27bc..54eb28d47d30 100644
--- a/kernel/printk/dataring.c
+++ b/kernel/printk/dataring.c
@@ -577,11 +577,11 @@ char *dataring_push(struct dataring *dr, unsigned long size,
to_db_size();
 
do {
-   /* fA: */
+   /* LMM_TAG(fA) */
ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
 
/*
-* fB:
+* LMM_TAG(fB)
 *
 * The data ringbuffer tail may have been pushed (by this or
 * any other task). The updated @tail_lpos must be visible to
@@ -621,7 +621,7 @@ char *dataring_push(struct dataring *dr, unsigned long size,
 
if (!ret) {
/*
-* fC:
+* LMM_TAG(fC)
 *
 * Force @desc permanently invalid to minimize risk
 * of the descriptor later unexpectedly being
@@ -631,14 +631,14 @@ char *dataring_push(struct dataring *dr, unsigned long size,
dataring_desc_init(desc);
return NULL;
}
-   /* fE: */
+   /* LMM_TAG(fE) */
} while (atomic_long_cmpxchg_relaxed(&dr->head_lpos, begin_lpos,
 next_lpos) != begin_lpos);
 
db = to_datablock(dr, begin_lpos);
 
/*
-* fF:
+* LMM_TAG(fF)
 *
 * @db->id is a garbage value and could possibly match the @id. This
 * would be a problem because the data block would be considered
@@ -648,7 +648,7 @@ char *dataring_push(struct dataring *dr, unsigned long size,
WRITE_ONCE(db->id, id - 1);
 
/*
-* fG:
+* LMM_TAG(fG)
 *
 * Ensure that @db->id is initialized to a wrong ID value before
 * setting @begin_lpos so that there is no risk of accidentally
@@ -668,7 +668,7 @@ char *dataring_push(struct dataring *dr, unsigned long size,
 */
smp_store_release(&desc->begin_lpos, begin_lpos);
 
-   /* fH: */
+   /* LMM_TAG(fH) */
WRITE_ONCE(desc->next_lpos, next_lpos);
 
/* If this data block wraps, use @data from the content data block. */
diff --git 

Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-22 Thread Andrea Parri
On Thu, Aug 22, 2019 at 03:50:52PM +0200, Petr Mladek wrote:
> On Wed 2019-08-21 07:46:28, John Ogness wrote:
> > On 2019-08-20, Sergey Senozhatsky  wrote:
> > > [..]
> > >> > +   *
> > >> > +   * Memory barrier involvement:
> > >> > +   *
> > >> > +   * If dB reads from gA, then dC reads from fG.
> > >> > +   * If dB reads from gA, then dD reads from fH.
> > >> > +   * If dB reads from gA, then dE reads from fE.
> > >> > +   *
> > >> > +   * Note that if dB reads from gA, then dC cannot read from fC.
> > >> > +   * Note that if dB reads from gA, then dD cannot read from fD.
> > >> > +   *
> > >> > +   * Relies on:
> > >> > +   *
> > >> > +   * RELEASE from fG to gA
> > >> > +   *matching
> > >> > +   * ADDRESS DEP. from dB to dC
> > >> > +   *
> > >> > +   * RELEASE from fH to gA
> > >> > +   *matching
> > >> > +   * ADDRESS DEP. from dB to dD
> > >> > +   *
> > >> > +   * RELEASE from fE to gA
> > >> > +   *matching
> > >> > +   * ACQUIRE from dB to dE
> > >> > +   */
> > >> 
> > >> But I am not sure how much this is useful. It would take ages to decrypt
> > >> all these shortcuts (signs) and translate them into something
> > >> human readable. Also it might get outdated easily.
> > >> 
> > The labels are necessary for the technical documentation of the
> > barriers. And, after spending much time in this, I find them very
> > useful. But I agree that there needs to be a better way to assign label
> > names.
> 
> I could understand that you spend a lot of time on creating the
> labels and that they are somehow useful for you.
> 
> But I am not using them and I hope that I will not have to:
> 
>   + Grepping takes a lot of time, especially over several files.
> 
>   + Grepping is actually not enough. It is required to read
> the following comment or code to realize what the label is for.
> 
>   + Several barriers have multiple dependencies. Grepping one
> label helps to check that one connection makes sense.
> But it is hard to keep all relations in head to confirm
> that they are complete and make sense overall.
> 
>   + There are about 50 labels in the code. "Entry Lifecycle"
> section in dataring.c talks about 8 steps. One would
> expect that it would require 8 read and 8 write barriers.
> 
> Even coordination of 16 barriers might be complicated to check.
> Where 50 is just scary.
> 
> 
>   + It seems to be a newly invented format and it is not documented.
> I personally do not understand it completely, for example,
> the meaning of "RELEASE from jA->cD->hA to jB".

IIUC, something like "hA is the interested access, happening within
cD (should have been cC?), which in turn happens within jA".  But I
should defer to John (FWIW, I found that notation quite helpful).


> 
> 
> I hope that we could do better. I believe that human readable
> comments are less error prone because they describe the intention.
> Pseudo code based on labels just describes the code but it
> does not explain why it was done this way.
> 
> From my POV, the labels do more harm than good. The code gets
> too scattered and is harder to follow.
> 
> 
> > I hope that we can agree that the labels are important.
> 
> It would be great to hear from others.

I agree with you that reviewing these comments might be "scary" and
not suitable for a bed-reading  ;-) (I didn't have time to complete
such review yet).  OTOH, from my POV, removing such comments/labels
could only make such (and future) reviews scarier, because then the
(memory-ordering) "intention" would be _hidden_ in the code.


> 
> > And that a formal documentation of the barriers is also important.
> 
> It might be helpful if it can be somehow fed to a tool that would
> prove correctness. Is this the case?

From what I've read so far, it _should_ be relatively straightforward
to write down a litmus test from any such comment (and give this to
the LKMM simulator).
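For illustration, one of the "RELEASE ... matching ACQUIRE ..." comments reduces to the classic message-passing shape. A sketch in herd7/LKMM litmus syntax (the variable names here are invented for the example, not taken from the patch):

```
C printk-rb-mp-release-acquire

(* If P1's acquire load observes the released store to id, P1 must
   also observe the earlier write to seq; herd7 with the LKMM should
   report the "exists" state below as never reached. *)

{}

P0(int *seq, int *id)
{
	WRITE_ONCE(*seq, 1);
	smp_store_release(id, 1);
}

P1(int *seq, int *id)
{
	int r0 = smp_load_acquire(id);
	int r1 = READ_ONCE(*seq);
}

exists (1:r0=1 /\ 1:r1=0)
```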


> 
> In each case, it should follow some "widely" used format.
> We should not invent a new one that nobody else would use
> and understand.

Agreed.  Well, litmus tests (or the comments here in question, that
are intended to convey the same information) have been successfully
adopted by memory model and concurrency people for as long as I can
remember, current architecture reference manuals use these tools to
describe the semantics of fence or atomic instructions, discussions
about memory barriers on LKML, gcc MLs often reduce to a discussion
around one or more litmus tests...

[trimming]


> > Andrea suggested that the documentation should be within the code, which
> > I think is a good idea. Even if it means we have more comments than
> > code.
> 
> It depends on the type of the information. I would describe:
> 
>   + The overall design on top of the source file or in
> Documentation/...
> 
>   + The behavior of externally used API and non-obvious functions
> 

Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-22 Thread Petr Mladek
On Wed 2019-08-21 07:46:28, John Ogness wrote:
> On 2019-08-20, Sergey Senozhatsky  wrote:
> > [..]
> >> > + *
> >> > + * Memory barrier involvement:
> >> > + *
> >> > + * If dB reads from gA, then dC reads from fG.
> >> > + * If dB reads from gA, then dD reads from fH.
> >> > + * If dB reads from gA, then dE reads from fE.
> >> > + *
> >> > + * Note that if dB reads from gA, then dC cannot read from fC.
> >> > + * Note that if dB reads from gA, then dD cannot read from fD.
> >> > + *
> >> > + * Relies on:
> >> > + *
> >> > + * RELEASE from fG to gA
> >> > + *matching
> >> > + * ADDRESS DEP. from dB to dC
> >> > + *
> >> > + * RELEASE from fH to gA
> >> > + *matching
> >> > + * ADDRESS DEP. from dB to dD
> >> > + *
> >> > + * RELEASE from fE to gA
> >> > + *matching
> >> > + * ACQUIRE from dB to dE
> >> > + */
> >> 
> >> But I am not sure how much this is useful. It would take ages to decrypt
> >> all these shortcuts (signs) and translate them into something
> >> human readable. Also it might get outdated easily.
> >> 
> The labels are necessary for the technical documentation of the
> barriers. And, after spending much time in this, I find them very
> useful. But I agree that there needs to be a better way to assign label
> names.

I could understand that you spend a lot of time on creating the
labels and that they are somehow useful for you.

But I am not using them and I hope that I will not have to:

  + Grepping takes a lot of time, especially over several files.

  + Grepping is actually not enough. It is required to read
the following comment or code to realize what the label is for.

  + Several barriers have multiple dependencies. Grepping one
label helps to check that one connection makes sense.
But it is hard to keep all relations in head to confirm
that they are complete and make sense overall.

  + There are about 50 labels in the code. "Entry Lifecycle"
section in dataring.c talks about 8 steps. One would
expect that it would require 8 read and 8 write barriers.

Even coordination of 16 barriers might be complicated to check.
Where 50 is just scary.


  + It seems to be a newly invented format and it is not documented.
I personally do not understand it completely, for example,
the meaning of "RELEASE from jA->cD->hA to jB".


I hope that we could do better. I believe that human readable
comments are less error prone because they describe the intention.
Pseudo code based on labels just describes the code but it
does not explain why it was done this way.

From my POV, the labels do more harm than good. The code gets
too scattered and is harder to follow.


> I hope that we can agree that the labels are important.

It would be great to hear from others.

> And that a formal documentation of the barriers is also important.

It might be helpful if it can be somehow fed to a tool that would
prove correctness. Is this the case?

In each case, it should follow some "widely" used format.
We should not invent a new one that nobody else would use
and understand.

> Perhaps we should choose labels that are more clear, like:
> 
> dataring_push:A
> dataring_push:B

The dataring_push is clear. The A or B codes have no meaning
without searching.

It might look better if we replace A or B with variable names.

> Then we would see comments like:
> 
> Memory barrier involvement:
> 
> If _dataring_pop:B reads from dataring_datablock_setid:A, then
> _dataring_pop:C reads from dataring_push:G.

Is this some known syntax, please? I do not understand it.

> 
> Andrea suggested that the documentation should be within the code, which
> I think is a good idea. Even if it means we have more comments than
> code.

It depends on the type of the information. I would describe:

  + The overall design on top of the source file or in
Documentation/...

  + The behavior of externally used API and non-obvious functions
above the function definition.

  + Implementation details, non-obvious effects, side effects,
relations, meaning of tricky calculation, meaning of
a block of code inside the code. But each function should
ideally fit on the screen.

I personally tend to write more documentation but it is sometimes
too much. I am trying to become more effective and to the point.

Best Regards,
Petr


Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-22 Thread Petr Mladek
On Wed 2019-08-21 07:42:57, John Ogness wrote:
> On 2019-08-20, Petr Mladek  wrote:
> >> --- /dev/null
> >> +++ b/kernel/printk/dataring.c
> >> +/**
> >> + * _datablock_valid() - Check if given positions yield a valid data block.
> >> + *
> >> + * @dr: The associated data ringbuffer.
> >> + *
> >> + * @head_lpos:  The newest data logical position.
> >> + *
> >> + * @tail_lpos:  The oldest data logical position.
> >> + *
> >> + * @begin_lpos: The beginning logical position of the data block to check.
> >> + *
> >> + * @next_lpos:  The logical position of the next adjacent data block.
> >> + *  This value is used to identify the end of the data block.
> >> + *
> >
> > Please remove the empty lines between arguments description. They make
> > the comments too scattered.
> 
> Your feedback is contradicting what PeterZ requested[0]. Particularly
> when multiple lines are involved with a description, I find the spacing
> helpful. I've grown to like the spacing, but I won't fight for it.

I do not want to fight over it. Just note that >90% of argument
descriptors seem to be one liners.

Best Regards,
Petr


Re: assign_desc() barriers: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-22 Thread Petr Mladek
On Wed 2019-08-21 07:52:26, John Ogness wrote:
> On 2019-08-20, Petr Mladek  wrote:
> >> > --- /dev/null
> >> > +++ b/kernel/printk/ringbuffer.c
> >> > +/**
> >> > + * assign_desc() - Assign a descriptor to the caller.
> >> > + *
> >> > + * @e: The entry structure to store the assigned descriptor to.
> >> > + *
> >> > + * Find an available descriptor to assign to the caller. First it is checked
> >> > + * if the tail descriptor from the committed list can be recycled. If not,
> >> > + * perhaps a never-used descriptor is available. Otherwise, data blocks will
> >> > + * be invalidated until the tail descriptor from the committed list can be
> >> > + * recycled.
> >> > + *
> >> > + * Assigned descriptors are invalid until data has been reserved for them.
> >> > + *
> >> > + * Return: true if a descriptor was assigned, otherwise false.
> >> > + *
> >> > + * This will only fail if it was not possible to invalidate data blocks in
> >> > + * order to recycle a descriptor. This can happen if a writer has reserved but
> >> > + * not yet committed data and that reserved data is currently the oldest data.
> >> > + */
> >> > +static bool assign_desc(struct prb_reserved_entry *e)
> >> > +{
> >> > +struct printk_ringbuffer *rb = e->rb;
> >> > +struct prb_desc *d;
> >> > +struct nl_node *n;
> >> > +unsigned long i;
> >> > +
> >> > +for (;;) {
> >> > +/*
> >> > + * jA:
> >> > + *
> >> > + * Try to recycle a descriptor on the committed list.
> >> > + */
> >> > +n = numlist_pop(&rb->nl);
> >> > +if (n) {
> >> > +d = container_of(n, struct prb_desc, list);
> >> > +break;
> >> > +}
> >> > +
> >> > +/* Fallback to static never-used descriptors. */
> >> > +if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
> >> > +i = atomic_fetch_inc(&rb->desc_next_unused);
> >> > +if (i < DESCS_COUNT(rb)) {
> >> > +d = &rb->descs[i];
> >> > +atomic_long_set(&d->id, i);
> >> > +break;
> >> > +}
> >> > +}
> >> > +
> >> > +/*
> >> > + * No descriptor available. Make one available for 
> >> > recycling
> >> > + * by invalidating data (which some descriptor will be
> >> > + * referencing).
> >> > + */
> >> > +if (!dataring_pop(&rb->dr))
> >> > +return false;
> >> > +}
> >> > +
> >> > +/*
> >> > + * jB:
> >> > + *
> >> > + * Modify the descriptor ID so that users of the descriptor see 
> >> > that
> >> > + * it has been recycled. A _release() is used so that 
> >> > prb_getdesc()
> >> > + * callers can see all data ringbuffer updates after issuing a
> >> > + * pairing smb_rmb(). See iA for details.
> >> > + *
> >> > + * Memory barrier involvement:
> >> > + *
> >> > + * If dB->iA reads from jB, then dI reads the same value as
> >> > + * jA->cD->hA.
> >> > + *
> >> > + * Relies on:
> >> > + *
> >> > + * RELEASE from jA->cD->hA to jB
> >> > + *matching
> >> > + * RMB between dB->iA and dI
> >> > + */
> >> > +atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
> >> > +DESCS_COUNT(rb));
> >> 
> >> atomic_long_set_release() might be a bit confusing here.
> >> There is no related acquire.
> 
> As the comment states, this release is for prb_getdesc() users. The only
> prb_getdesc() user is _dataring_pop(). If getdesc() returns NULL
> (i.e. the descriptor's ID is not what _dataring_pop() was expecting),
> then the tail must have moved and _dataring_pop() needs to see
> that. Since there are no data dependencies between descriptor ID and
> tail_pos, an explicit memory barrier is used. More on this below.

OK, let me show how complicated and confusing this looks to me:

  + The two related barriers are in different source files
and APIs:

   + assign_desc() in ringbuffer.c; ringbuffer API
   + _dataring_pop in dataring.c; dataring API

  + Both the related barriers are around "id" manipulation.
But one is in the dataring, the other is in the descriptors array.
One is about an old released "id". One is about a newly
assigned "id".

  + The release() barrier is called once for each assigned
descriptor. The acquire() barrier is called more times
or not at all depending on the amount of free space
in dataring.

  + prb_getdesc() is mentioned in the comment but the barrier
is in _dataring_pop()

  + prb_getdesc() is called via dr->getdesc() callback and thus
 

Re: assign_desc() barriers: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread John Ogness
On 2019-08-20, Petr Mladek  wrote:
>> > --- /dev/null
>> > +++ b/kernel/printk/ringbuffer.c
>> > +/**
>> > + * assign_desc() - Assign a descriptor to the caller.
>> > + *
>> > + * @e: The entry structure to store the assigned descriptor to.
>> > + *
>> > + * Find an available descriptor to assign to the caller. First it is checked
>> > + * if the tail descriptor from the committed list can be recycled. If not,
>> > + * perhaps a never-used descriptor is available. Otherwise, data blocks will
>> > + * be invalidated until the tail descriptor from the committed list can be
>> > + * recycled.
>> > + *
>> > + * Assigned descriptors are invalid until data has been reserved for them.
>> > + *
>> > + * Return: true if a descriptor was assigned, otherwise false.
>> > + *
>> > + * This will only fail if it was not possible to invalidate data blocks in
>> > + * order to recycle a descriptor. This can happen if a writer has reserved but
>> > + * not yet committed data and that reserved data is currently the oldest data.
>> > + */
>> > +static bool assign_desc(struct prb_reserved_entry *e)
>> > +{
>> > +  struct printk_ringbuffer *rb = e->rb;
>> > +  struct prb_desc *d;
>> > +  struct nl_node *n;
>> > +  unsigned long i;
>> > +
>> > +  for (;;) {
>> > +  /*
>> > +   * jA:
>> > +   *
>> > +   * Try to recycle a descriptor on the committed list.
>> > +   */
>> > +  n = numlist_pop(&rb->nl);
>> > +  if (n) {
>> > +  d = container_of(n, struct prb_desc, list);
>> > +  break;
>> > +  }
>> > +
>> > +  /* Fallback to static never-used descriptors. */
>> > +  if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
>> > +  i = atomic_fetch_inc(&rb->desc_next_unused);
>> > +  if (i < DESCS_COUNT(rb)) {
>> > +  d = &rb->descs[i];
>> > +  atomic_long_set(&d->id, i);
>> > +  break;
>> > +  }
>> > +  }
>> > +
>> > +  /*
>> > +   * No descriptor available. Make one available for recycling
>> > +   * by invalidating data (which some descriptor will be
>> > +   * referencing).
>> > +   */
>> > +  if (!dataring_pop(&rb->dr))
>> > +  return false;
>> > +  }
>> > +
>> > +  /*
>> > +   * jB:
>> > +   *
>> > +   * Modify the descriptor ID so that users of the descriptor see that
>> > +   * it has been recycled. A _release() is used so that prb_getdesc()
>> > +   * callers can see all data ringbuffer updates after issuing a
>> > +   * pairing smb_rmb(). See iA for details.
>> > +   *
>> > +   * Memory barrier involvement:
>> > +   *
>> > +   * If dB->iA reads from jB, then dI reads the same value as
>> > +   * jA->cD->hA.
>> > +   *
>> > +   * Relies on:
>> > +   *
>> > +   * RELEASE from jA->cD->hA to jB
>> > +   *matching
>> > +   * RMB between dB->iA and dI
>> > +   */
>> > +  atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
>> > +  DESCS_COUNT(rb));
>> 
>> atomic_long_set_release() might be a bit confusing here.
>> There is no related acquire.

As the comment states, this release is for prb_getdesc() users. The only
prb_getdesc() user is _dataring_pop(). If getdesc() returns NULL
(i.e. the descriptor's ID is not what _dataring_pop() was expecting),
then the tail must have moved and _dataring_pop() needs to see
that. Since there are no data dependencies between descriptor ID and
tail_pos, an explicit memory barrier is used. More on this below.

>> In fact, d->id manipulation has barriers from both sides:
>> 
>>   + smp_rmb() before so that all reads are finished before
>> the id is updated (release)
>
> Uh, this statement does not make sense. The read barrier is not
> needed here. Instead the readers need it.
>
> Well, we might need a write barrier before d->id manipulation.
> It should be in numlist_pop() after successfully updating nl->tail_id.
> It will allow readers to detect that the descriptor is being reused
> (not in valid tail_id..head_id range) before we start manipulating it.
>
>>   + smp_wmb() after so that the new ID is written before other
>> related values are modified (acquire).
>> 
>> The smp_wmb() barrier is in prb_reserve(). I would move it here.
>
> This still makes sense. I would move the write barrier from
> prb_reserve() here.

The issue is that _dataring_pop() needs to see a moved dataring tail if
prb_getdesc() fails. Just because numlist_pop() succeeded, doesn't mean
that this was the task that changed the dataring tail. I.e. another CPU
could observe that this task changed the ID but _not_ yet see that
another task changed the dataring tail.

Issuing an smp_mb() before setting the new ID would also suffice,
but that is a pretty big hammer for something that a set_release can
take care of.

> Sigh, I have to admit that I am not familiar with the _acquire(),
> 

Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread John Ogness
On 2019-08-20, Sergey Senozhatsky  wrote:
> [..]
>> > +   *
>> > +   * Memory barrier involvement:
>> > +   *
>> > +   * If dB reads from gA, then dC reads from fG.
>> > +   * If dB reads from gA, then dD reads from fH.
>> > +   * If dB reads from gA, then dE reads from fE.
>> > +   *
>> > +   * Note that if dB reads from gA, then dC cannot read from fC.
>> > +   * Note that if dB reads from gA, then dD cannot read from fD.
>> > +   *
>> > +   * Relies on:
>> > +   *
>> > +   * RELEASE from fG to gA
>> > +   *matching
>> > +   * ADDRESS DEP. from dB to dC
>> > +   *
>> > +   * RELEASE from fH to gA
>> > +   *matching
>> > +   * ADDRESS DEP. from dB to dD
>> > +   *
>> > +   * RELEASE from fE to gA
>> > +   *matching
>> > +   * ACQUIRE from dB to dE
>> > +   */
>> 
>> But I am not sure how much this is useful. It would take ages to decrypt
>> all these shortcuts (signs) and translate them into something
>> human readable. Also it might get outdated easily.
>> 
>> That said, I haven't found yet if there was a system in all
>> the shortcuts. I mean if they can be decrypted easily
>> out of head. Also I am not familiar with the notation
>> of the dependencies.
>
> Does not appear to be systematic to me, but maybe I'm missing something
> obvious. For chains like
>
>   jA->cD->hA to jB
>
> I haven't found anything better than just git grep jA kernel/printk/
> so far.

I really struggled to find a way to label the code in order to document
the memory barriers. By grepping on "jA:" you will land at the exact
location.

> But once you'll grep for label cD, for instance, you'd see
> that it's not defined. It's mentioned but not defined
>
>   kernel/printk/ringbuffer.c:  * jA->cD->hA.
>   kernel/printk/ringbuffer.c:  * RELEASE from jA->cD->hA to jB

I tried to be very careful about the labeling, but you just found an
error. cD is supposed to be cC. (I probably refactored the labels and
missed this one.) Particularly with referencing labels from other files
I was not happy (which is the case with cC). This is one area that I
think it would be really helpful if the kernel guidelines had some
format.

The labels are necessary for the technical documentation of the
barriers. And, after spending much time in this, I find them very
useful. But I agree that there needs to be a better way to assign label
names.

FWIW, I chose a lowercase letter for each function and an uppercase
letter for each label within that function. The camel case (followed by
the colon) created a pair that was unique for grepping.

Petr, in case you missed it, this comment language came from my
discussion[0] with AndreaP.

> I was thinking about renaming labels. E.g.
>
>   dataring_desc_init()
>   {
>   /* di1 */
>   WRITE_ONCE(desc->begin_lpos, 1);
>   /* di2 */
>   WRITE_ONCE(desc->next_lpos, 1);
>   }
>
> Where di stands for descriptor init.
>
>   dataring_push()
>   {
>   /* dp1 */
>   ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
>   ...
>   /* dp2 */
>   smp_mb();
>   ...
>   }
>
> Where dp stands for descriptor push. For dataring we can add a 'dr'
> prefix, to avoid confusion with desc barriers, which have 'd' prefix.
> And so on. Dunno.

Yeah, I spent a lot of time going in circles on this one.

I hope that we can agree that the labels are important. And that a
formal documentation of the barriers is also important. Yes, they are a
lot of work, but I find it makes it a lot easier to go back to the code
after I've been away for a while. Even now, as I go through your
feedback on code that I wrote over a month ago, I find the formal
comments critical to quickly understand _exactly_ why the memory
barriers exist.

Perhaps we should choose labels that are more clear, like:

dataring_push:A
dataring_push:B

Then we would see comments like:

Memory barrier involvement:

If _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:C reads from dataring_push:G.

If _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:D reads from dataring_push:H.

If _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:E reads from dataring_push:E.

Note that if _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:C cannot read from dataring_push:C->dataring_desc_init:A.

Note that if _dataring_pop:B reads from dataring_datablock_setid:A, then
_dataring_pop:D cannot read from dataring_push:C->dataring_desc_init:B.

Relies on:

RELEASE from dataring_push:G to dataring_datablock_setid:A
   matching
ADDRESS DEP. from _dataring_pop:B to _dataring_pop:C

RELEASE from dataring_push:H to dataring_datablock_setid:A
   matching
ADDRESS DEP. from _dataring_pop:B to _dataring_pop:D

RELEASE from dataring_push:E to 

Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread John Ogness
On 2019-08-20, Petr Mladek  wrote:
>> --- /dev/null
>> +++ b/kernel/printk/dataring.c
>> +/**
>> + * _datablock_valid() - Check if given positions yield a valid data block.
>> + *
>> + * @dr: The associated data ringbuffer.
>> + *
>> + * @head_lpos:  The newest data logical position.
>> + *
>> + * @tail_lpos:  The oldest data logical position.
>> + *
>> + * @begin_lpos: The beginning logical position of the data block to check.
>> + *
>> + * @next_lpos:  The logical position of the next adjacent data block.
>> + *  This value is used to identify the end of the data block.
>> + *
>
> Please remove the empty lines between arguments description. They make
> the comments too scattered.

Your feedback is contradicting what PeterZ requested[0]. Particularly
when multiple lines are involved with a description, I find the spacing
helpful. I've grown to like the spacing, but I won't fight for it.

>> +/*
>> + * dB:
>> + *
>> + * When a writer has completed accessing its data block, it sets the
>> + * @id thus making the data block available for invalidation. This
>> + * _acquire() ensures that this task sees all data ringbuffer and
>> + * descriptor values seen by the writer as @id was set. This is
>> + * necessary to ensure that the data block can be correctly identified
>> + * as valid (i.e. @begin_lpos, @next_lpos, @head_lpos are at least the
>> + * values seen by that writer, which yielded a valid data block at
>> + * that time). It is not enough to rely on the address dependency of
>> + * @desc to @id because @head_lpos is not dependent on @id. This pairs
>> + * with the _release() in dataring_datablock_setid().
>
> This human readable description is really useful.
>
>> + *
>> + * Memory barrier involvement:
>> + *
>> + * If dB reads from gA, then dC reads from fG.
>> + * If dB reads from gA, then dD reads from fH.
>> + * If dB reads from gA, then dE reads from fE.
>> + *
>> + * Note that if dB reads from gA, then dC cannot read from fC.
>> + * Note that if dB reads from gA, then dD cannot read from fD.
>> + *
>> + * Relies on:
>> + *
>> + * RELEASE from fG to gA
>> + *matching
>> + * ADDRESS DEP. from dB to dC
>> + *
>> + * RELEASE from fH to gA
>> + *matching
>> + * ADDRESS DEP. from dB to dD
>> + *
>> + * RELEASE from fE to gA
>> + *matching
>> + * ACQUIRE from dB to dE
>> + */
>
> But I am not sure how much this is useful.

When I was first implementing RFCv3, the "human-readable" text version
was very useful for me. However, now it is the formal descriptions that
I find more useful. They provide the proof and a far more detailed
description.

> It would take ages to decrypt all these shortcuts (signs) and
> translate them into something human readable. Also it might get
> outdated easily.
>
> That said, I haven't found yet if there was a system in all
> the shortcuts. I mean if they can be decrypted easily
> out of head. Also I am not familiar with the notation
> of the dependencies.

I'll respond to this part in Sergey's followup post.

> If this is really needed then I am really scared of some barriers
> that guard too many things. This one is a good example.
>
>> +desc = dr->getdesc(smp_load_acquire(&db->id), dr->getdesc_arg);

The variable's value (in this case db->id) is doing the guarding. The
barriers ensure that db->id is read first (and set last).

>> +
>> +/* dD: */
>
> It would be great if all these shortcuts (signs) are followed with
> something human readable. Few words might be enough.

I'll respond to this part in Sergey's followup post.

>> +next_lpos = READ_ONCE(desc->next_lpos);
>> +
>> +if (!_datablock_valid(dr,
>> +  /* dE: */
>> +  atomic_long_read(&dr->head_lpos),
>> +  tail_lpos, begin_lpos, next_lpos)) {
>> +/* Another task has already invalidated the data block. */
>> +goto out;
>> +}
>> +
>> +
>> +++ b/kernel/printk/numlist.c
>> +bool numlist_read(struct numlist *nl, unsigned long id, unsigned long *seq,
>> +  unsigned long *next_id)
>> +
>> +struct nl_node *n;
>> +
>> +n = nl->node(id, nl->node_arg);
>> +if (!n)
>> +return false;
>> +
>> +if (seq) {
>> +/*
>> + * aA:
>> + *
>> + * Address dependency on @id.
>> + */
>
> This is too scattered. If we really need so many shortcuts (signs)
> then we should find a better style. The following looks perfectly
> fine to me:
>
>   /* aA: Address dependency on @id. */

I'll respond to this part in Sergey's followup post.

>> +*seq = READ_ONCE(n->seq);
>> +}
>> +
>> +if (next_id) {
>> +/*
>> + * aB:
>> + *
>> + * Address dependency on @id.
>> + */
>> +

Re: numlist_pop(): Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread John Ogness
On 2019-08-20, Petr Mladek  wrote:
>> --- /dev/null
>> +++ b/kernel/printk/numlist.c
>> +/**
>> + * numlist_pop() - Remove the oldest node from the list.
>> + *
>> + * @nl: The numbered list from which to remove the tail node.
>> + *
>> + * The tail node can only be removed if two conditions are satisfied:
>> + *
>> + * * The node is not the only node on the list.
>> + * * The node is not busy.
>> + *
>> + * If, during this function, another task removes the tail, this function
>> + * will try again with the new tail.
>> + *
>> + * Return: The removed node or NULL if the tail node cannot be removed.
>> + */
>> +struct nl_node *numlist_pop(struct numlist *nl)
>> +{
>> +unsigned long tail_id;
>> +unsigned long next_id;
>> +unsigned long r;
>> +
>> +/* cA: #1 */
>> +tail_id = atomic_long_read(&nl->tail_id);
>> +
>> +for (;;) {
>> +/* cB */
>> +while (!numlist_read(nl, tail_id, NULL, &next_id)) {
>> +/*
>> + * @tail_id is invalid. Try again with an
>> + * updated value.
>> + */
>> +
>> +cpu_relax();
>> +
>> +/* cA: #2 */
>> +tail_id = atomic_long_read(&nl->tail_id);
>> +}
>
> The above while-cycle basically does the same as the upper for-cycle.
> It tries again with freshly loaded nl->tail_id. The following code
> looks easier to follow:
>
>   do {
>   tail_id = atomic_long_read(&nl->tail_id);
>
>   /*
>* Read might fail when the tail node has been removed
>* and reused in parallel.
>*/
>   if (!numlist_read(nl, tail_id, NULL, &next_id))
>   continue;
>
>   /* Make sure the node is not the only node on the list. */
>   if (next_id == tail_id)
>   return NULL;
>
>   /* cC: Make sure the node is not busy. */
>   if (nl->busy(tail_id, nl->busy_arg))
>   return NULL;
>
>   while (atomic_long_cmpxchg_relaxed(&nl->tail_id, tail_id, next_id) !=
>   tail_id);
>
>   /* This should never fail. The node is ours. */
>   return nl->node(tail_id, nl->node_arg);

You will see that pattern in several cmpxchg() loops. The reason I chose
to do it that way was so that I could make use of the return value of
the failed cmpxchg(). This avoids an unnecessary LOAD and establishes a
data dependency between the failed cmpxchg() and the following
numlist_read(). I suppose none of that matters since we only care about
the case where cmpxchg() is successful.

I agree that your variation is easier to read.

>> +/* Make sure the node is not the only node on the list. */
>> +if (next_id == tail_id)
>> +return NULL;
>> +
>> +/*
>> + * cC:
>> + *
>> + * Make sure the node is not busy.
>> + */
>> +if (nl->busy(tail_id, nl->busy_arg))
>> +return NULL;
>> +
>> +r = atomic_long_cmpxchg_relaxed(&nl->tail_id,
>> +tail_id, next_id);
>> +if (r == tail_id)
>> +break;
>> +
>> +/* cA: #3 */
>> +tail_id = r;
>> +}
>> +
>> +return nl->node(tail_id, nl->node_arg);
>
> If I get it correctly, the above nl->node() call should never fail.
> The node has been removed from the list and nobody else could
> touch it. It is pretty useful information and it might be worth
> mention it in a comment.

You are correct and I will add a comment.

> PS: I am scratching my head around the patchset. I'll try Peter's
> approach and comment independent things is separate mails.

I think it is an excellent approach. Especially when discussing the
memory barriers.

John Ogness


datablock reuse races Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread Petr Mladek
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> +/**
> + * _dataring_pop() - Move tail forward, invalidating the oldest data block.
> + *
> + * @dr:The data ringbuffer containing the data block.
> + *
> + * @tail_lpos: The logical position of the oldest data block.
> + *
> + * This function expects to move the pointer to the oldest data block 
> forward,
> + * thus invalidating the oldest data block. Before attempting to move the
> + * tail, it is verified that the data block is valid. An invalid data block
> + * means that another task has already moved the tail pointer forward.
> + *
> + * Return: The new/current value (logical position) of the tail.
> + *
> + * From the return value the caller can identify if the tail was moved
> + * forward. However, the caller does not know if it was the task that
> + * performed the move.
> + *
> + * If, after seeing a moved tail, the caller will be modifying @begin_lpos or
> + * @next_lpos of a descriptor or will be modifying the head, a full memory
> + * barrier is required before doing so. This ensures that if any update to a
> + * descriptor's @begin_lpos or @next_lpos or the data ringbuffer's head is
> + * visible, that the previous update to the tail is also visible. This avoids
> + * the possibility of failure to notice when another task has moved the tail.
> + *
> + * If the tail has not moved forward it means the @id for the data block was
> + * not set yet. In this case the tail cannot move forward.
> + */
> +static unsigned long _dataring_pop(struct dataring *dr,
> +unsigned long tail_lpos)
> +{
> + unsigned long new_tail_lpos;
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> + struct dr_datablock *db;
> + struct dr_desc *desc;
> +
> + /*
> +  * dA:
> +  *
> +  * @db has an address dependency on @tail_lpos. Therefore @tail_lpos
> +  * must be loaded before dB, which accesses @db.
> +  */
> + db = to_datablock(dr, tail_lpos);
> +
> + /*
> +  * dB:
> +  *
> +  * When a writer has completed accessing its data block, it sets the
> +  * @id thus making the data block available for invalidation. This
> +  * _acquire() ensures that this task sees all data ringbuffer and
> +  * descriptor values seen by the writer as @id was set. This is
> +  * necessary to ensure that the data block can be correctly identified
> +  * as valid (i.e. @begin_lpos, @next_lpos, @head_lpos are at least the
> +  * values seen by that writer, which yielded a valid data block at
> +  * that time). It is not enough to rely on the address dependency of
> +  * @desc to @id because @head_lpos is not dependent on @id. This pairs
> +  * with the _release() in dataring_datablock_setid().
> +  *
> + desc = dr->getdesc(smp_load_acquire(&db->id), dr->getdesc_arg);

I guess that we might read garbage here before dataring_push()
writes an invalid ID after it shuffled dr->head_lpos.

I am not completely sure how we detect the garbage. prb_getdesc()
might reach a valid descriptor just by chance. My understanding
is below.

> + if (!desc) {
> + /*
> +  * The data block @id is invalid. The data block is either in
> +  * use by the writer (@id not yet set) or has already been
> +  * invalidated by another task and the data array area or
> +  * descriptor have already been recycled. The latter case
> +  * (descriptor already recycled) relies on the implementation
> +  * of getdesc(), which, when using an smp_rmb(), must allow
> +  * this task to see @tail_lpos as it was visible to the task
> +  * that changed the ID-to-descriptor mapping. See the
> +  * implementation of getdesc() for details.
> +  */
> + goto out;
> + }
> +
> + /*
> +  * dC:
> +  *
> +  * Even though the data block @id was determined to be valid, it is
> +  * possible that it is a data block recently made available and @id
> +  * has not yet been initialized. The @id needs to be re-validated (dF)
> +  * after checking if the descriptor points to the data block. Use
> +  * _acquire() to ensure that the re-loading of @id occurs after
> +  * loading @begin_lpos. This pairs with the _release() in
> +  * dataring_push(). See fG for details.
> +  */
> + begin_lpos = smp_load_acquire(&desc->begin_lpos);
> +
> + if (begin_lpos != tail_lpos) {
> + /*
> +  * @desc is not describing the data block at @tail_lpos. Since
> +  * a data block and its descriptor always become valid before
> +  * @id is set (see dB for details) the data block at
> +  * @tail_lpos has already been invalidated.
> +  */
> + goto out;

I believe that this check should be enough to detect the garbage
or non-initialized db->id.

All descriptors 

Re: assign_desc() barriers: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread Petr Mladek
On Tue 2019-08-20 10:22:53, Petr Mladek wrote:
> On Thu 2019-08-08 00:32:26, John Ogness wrote:
> > --- /dev/null
> > +++ b/kernel/printk/ringbuffer.c
> > +/**
> > + * assign_desc() - Assign a descriptor to the caller.
> > + *
> > + * @e: The entry structure to store the assigned descriptor to.
> > + *
> > + * Find an available descriptor to assign to the caller. First it is 
> > checked
> > + * if the tail descriptor from the committed list can be recycled. If not,
> > + * perhaps a never-used descriptor is available. Otherwise, data blocks 
> > will
> > + * be invalidated until the tail descriptor from the committed list can be
> > + * recycled.
> > + *
> > + * Assigned descriptors are invalid until data has been reserved for them.
> > + *
> > + * Return: true if a descriptor was assigned, otherwise false.
> > + *
> > + * This will only fail if it was not possible to invalidate data blocks in
> > + * order to recycle a descriptor. This can happen if a writer has reserved 
> > but
> > + * not yet committed data and that reserved data is currently the oldest 
> > data.
> > + */
> > +static bool assign_desc(struct prb_reserved_entry *e)
> > +{
> > +   struct printk_ringbuffer *rb = e->rb;
> > +   struct prb_desc *d;
> > +   struct nl_node *n;
> > +   unsigned long i;
> > +
> > +   for (;;) {
> > +   /*
> > +* jA:
> > +*
> > +* Try to recycle a descriptor on the committed list.
> > +*/
> > +   n = numlist_pop(&rb->nl);
> > +   if (n) {
> > +   d = container_of(n, struct prb_desc, list);
> > +   break;
> > +   }
> > +
> > +   /* Fallback to static never-used descriptors. */
> > +   if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
> > +   i = atomic_fetch_inc(&rb->desc_next_unused);
> > +   if (i < DESCS_COUNT(rb)) {
> > +   d = &rb->descs[i];
> > +   atomic_long_set(&d->id, i);
> > +   break;
> > +   }
> > +   }
> > +
> > +   /*
> > +* No descriptor available. Make one available for recycling
> > +* by invalidating data (which some descriptor will be
> > +* referencing).
> > +*/
> > +   if (!dataring_pop(&rb->dr))
> > +   return false;
> > +   }
> > +
> > +   /*
> > +* jB:
> > +*
> > +* Modify the descriptor ID so that users of the descriptor see that
> > +* it has been recycled. A _release() is used so that prb_getdesc()
> > +* callers can see all data ringbuffer updates after issuing a
> > +* pairing smp_rmb(). See iA for details.
> > +*
> > +* Memory barrier involvement:
> > +*
> > +* If dB->iA reads from jB, then dI reads the same value as
> > +* jA->cD->hA.
> > +*
> > +* Relies on:
> > +*
> > +* RELEASE from jA->cD->hA to jB
> > +*matching
> > +* RMB between dB->iA and dI
> > +*/
> > +   atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
> > +   DESCS_COUNT(rb));
> 
> atomic_long_set_release() might be a bit confusing here.
> There is no related acquire.
> 
> In fact, d->id manipulation has barriers from both sides:
> 
>   + smp_rmb() before so that all reads are finished before
> the id is updated (release)

Uh, this statement does not make sense. The read barrier is not
needed here; instead the readers need it.

Well, we might need a write barrier before the d->id manipulation.
It should be in numlist_pop(), after successfully updating nl->tail_id.
It would allow readers to detect that the descriptor is being reused
(no longer in the valid tail_id..head_id range) before we start
manipulating it.


>   + smp_wmb() after so that the new ID is written before other
> related values are modified (acquire).
> 
> The smp_wmb() barrier is in prb_reserve(). I would move it here.

This still makes sense. I would move the write barrier from
prb_reserve() here.


Sigh, I have to admit that I am not familiar with the _acquire(),
_release(), and _relaxed() variants of the atomic operations.

They probably make it easier to implement some locking APIs.
I am not sure how to use them here. This code implements a complex
interlock between several variables. I mean that several variables
lock each other in a cycle, like a state machine? In any case,
it is not simple locking where we check the state of a single
variable.

Best Regards,
Petr


dataring_push() barriers Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread Petr Mladek
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> +/**
> + * dataring_push() - Reserve a data block in the data array.
> + *
> + * @dr:   The data ringbuffer to reserve data in.
> + *
> + * @size: The size to reserve.
> + *
> + * @desc: A pointer to a descriptor to store the data block information.
> + *
> + * @id:   The ID of the descriptor to be associated.
> + *The data block will not be set with @id, but rather initialized 
> with
> + *a value that is explicitly different than @id. This is to handle 
> the
> + *case when newly available garbage by chance matches the descriptor
> + *ID.
> + *
> + * This function expects to move the head pointer forward. If this would
> + * result in overtaking the data array index of the tail, the tail data block
> + * will be invalidated.
> + *
> + * Return: A pointer to the reserved writer data, otherwise NULL.
> + *
> + * This will only fail if it was not possible to invalidate the tail data
> + * block.
> + */
> +char *dataring_push(struct dataring *dr, unsigned int size,
> + struct dr_desc *desc, unsigned long id)
> +{
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> + struct dr_datablock *db;
> + bool ret;
> +
> + to_db_size(&size);
> +
> + do {
> + /* fA: */
> + ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
> +
> + /*
> +  * fB:
> +  *
> +  * The data ringbuffer tail may have been pushed (by this or
> +  * any other task). The updated @tail_lpos must be visible to
> +  * all observers before changes to @begin_lpos, @next_lpos, or
> +  * @head_lpos by this task are visible in order to allow other
> +  * tasks to recognize the invalidation of the data
> +  * blocks.

This sounds strange. The write barrier should be done only on the CPU
that actually modified tail_lpos. I.e. it should be in _dataring_pop(),
after a successful dr->tail_lpos modification.

> +  * This pairs with the smp_rmb() in _dataring_pop() as well as
> +  * any reader task using smp_rmb() to post-validate data that
> +  * has been read from a data block.
> +
> +  * Memory barrier involvement:
> +  *
> +  * If dE reads from fE, then dI reads from fA->eA.
> +  * If dC reads from fG, then dI reads from fA->eA.
> +  * If dD reads from fH, then dI reads from fA->eA.
> +  * If mC reads from fH, then mF reads from fA->eA.
> +  *
> +  * Relies on:
> +  *
> +  * FULL MB between fA->eA and fE
> +  *matching
> +  * RMB between dE and dI
> +  *
> +  * FULL MB between fA->eA and fG
> +  *matching
> +  * RMB between dC and dI
> +  *
> +  * FULL MB between fA->eA and fH
> +  *matching
> +  * RMB between dD and dI
> +  *
> +  * FULL MB between fA->eA and fH
> +  *matching
> +  * RMB between mC and mF
> +  */
> + smp_mb();

All these comments talk about synchronization against read barriers.
That would require a write barrier here. But it does not make much
sense to do a write barrier before actually writing dr->head_lpos.

After all I think that we do not need any barrier here.
The write barrier for dr->tail_lpos should be in
_dataring_pop(). The read barrier is not needed because
we are not reading anything here.

Instead we should put a barrier after modifying dr->head_lpos,
see below.

> + if (!ret) {
> + /*
> +  * Force @desc permanently invalid to minimize risk
> +  * of the descriptor later unexpectedly being
> +  * determined as valid due to overflowing/wrapping of
> +  * @head_lpos. An unaligned @begin_lpos can never
> +  * point to a data block and having the same value
> +  * for @begin_lpos and @next_lpos is also invalid.
> +  */
> +
> + /* fC: */
> + WRITE_ONCE(desc->begin_lpos, 1);
> +
> + /* fD: */
> + WRITE_ONCE(desc->next_lpos, 1);
> +
> + return NULL;
> + }
> + /* fE: */
> + } while (atomic_long_cmpxchg_relaxed(&dr->head_lpos, begin_lpos,
> +  next_lpos) != begin_lpos);
> +

We need a write barrier here to make sure that dr->head_lpos
is updated before we start updating other values, e.g.
db->id below.

Best Regards,
Petr

> + db = to_datablock(dr, begin_lpos);
> +
> + /*
> +  * fF:
> +  *
> +  * @db->id is a garbage value and could possibly match the @id. This
> +  * would be a problem 

Re: comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread Sergey Senozhatsky
On (08/20/19 10:55), Petr Mladek wrote:
[..]
> > +*
> > +* Memory barrier involvement:
> > +*
> > +* If dB reads from gA, then dC reads from fG.
> > +* If dB reads from gA, then dD reads from fH.
> > +* If dB reads from gA, then dE reads from fE.
> > +*
> > +* Note that if dB reads from gA, then dC cannot read from fC.
> > +* Note that if dB reads from gA, then dD cannot read from fD.
> > +*
> > +* Relies on:
> > +*
> > +* RELEASE from fG to gA
> > +*matching
> > +* ADDRESS DEP. from dB to dC
> > +*
> > +* RELEASE from fH to gA
> > +*matching
> > +* ADDRESS DEP. from dB to dD
> > +*
> > +* RELEASE from fE to gA
> > +*matching
> > +* ACQUIRE from dB to dE
> > +*/
> 
> But I am not sure how much this is useful. It would take ages to decrypt
> all these shortcuts (signs) and translate them into something
> human readable. Also it might get outdated easily.
> 
> That said, I haven't found yet if there was a system in all
> the shortcuts. I mean if they can be decrypted easily
> out of head. Also I am not familiar with the notation
> of the dependencies.

Does not appear to be systematic to me, but maybe I'm missing something
obvious. For chains like

jA->cD->hA to jB

I haven't found anything better than just git grep jA kernel/printk/
so far.

But once you'll grep for label cD, for instance, you'd see
that it's not defined. It's mentioned but not defined

kernel/printk/ringbuffer.c:  * jA->cD->hA.
kernel/printk/ringbuffer.c:  * RELEASE from jA->cD->hA to jB

I was thinking about renaming labels. E.g.

dataring_desc_init()
{
/* di1 */
WRITE_ONCE(desc->begin_lpos, 1);
/* di2 */
WRITE_ONCE(desc->next_lpos, 1);
}

Where di stands for descriptor init.

dataring_push()
{
/* dp1 */
ret = get_new_lpos(dr, size, &begin_lpos, &next_lpos);
...
/* dp2 */
smp_mb();
...
}

Where dp stands for descriptor push. For dataring we can add a 'dr'
prefix, to avoid confusion with desc barriers, which have 'd' prefix.
And so on. Dunno.

-ss


comments style: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread Petr Mladek
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/dataring.c
> +/**
> + * _datablock_valid() - Check if given positions yield a valid data block.
> + *
> + * @dr: The associated data ringbuffer.
> + *
> + * @head_lpos:  The newest data logical position.
> + *
> + * @tail_lpos:  The oldest data logical position.
> + *
> + * @begin_lpos: The beginning logical position of the data block to check.
> + *
> + * @next_lpos:  The logical position of the next adjacent data block.
> + *  This value is used to identify the end of the data block.
> + *

Please remove the empty lines between arguments description. They make
the comments too scattered.

> + * A data block is considered valid if it satisfies the two conditions:
> + *
> + * * tail_lpos <= begin_lpos < next_lpos <= head_lpos
> + * * tail_lpos is at most exactly 1 wrap behind head_lpos
> + *
> + * Return: true if the specified data block is valid.
> + */

To be sure, the empty lines between paragraphs are useful.
The following is still well readable:

/**
 * _datablock_valid() - Check if given positions yield a valid data block.
 * @dr: The associated data ringbuffer.
 * @head_lpos:  The newest data logical position.
 * @tail_lpos:  The oldest data logical position.
 * @begin_lpos: The beginning logical position of the data block to check.
 * @next_lpos:  The logical position of the next adjacent data block.
 *  This value is used to identify the end of the data block.

 * A data block is considered valid if it satisfies the two conditions:
 *
 * * tail_lpos <= begin_lpos < next_lpos <= head_lpos
 * * tail_lpos is at most exactly 1 wrap behind head_lpos
 *
 * Return: true if the specified data block is valid.
 */

> +static unsigned long _dataring_pop(struct dataring *dr,
> +unsigned long tail_lpos)
> +{
> + unsigned long new_tail_lpos;
> + unsigned long begin_lpos;
> + unsigned long next_lpos;
> + struct dr_datablock *db;
> + struct dr_desc *desc;
> +
> + /*
> +  * dA:
> +  *
> +  * @db has an address dependency on @tail_lpos. Therefore @tail_lpos
> +  * must be loaded before dB, which accesses @db.
> +  */
> + db = to_datablock(dr, tail_lpos);
> +
> + /*
> +  * dB:
> +  *
> +  * When a writer has completed accessing its data block, it sets the
> +  * @id thus making the data block available for invalidation. This
> +  * _acquire() ensures that this task sees all data ringbuffer and
> +  * descriptor values seen by the writer as @id was set. This is
> +  * necessary to ensure that the data block can be correctly identified
> +  * as valid (i.e. @begin_lpos, @next_lpos, @head_lpos are at least the
> +  * values seen by that writer, which yielded a valid data block at
> +  * that time). It is not enough to rely on the address dependency of
> +  * @desc to @id because @head_lpos is not dependent on @id. This pairs
> +  * with the _release() in dataring_datablock_setid().

This human readable description is really useful.

> +  *
> +  * Memory barrier involvement:
> +  *
> +  * If dB reads from gA, then dC reads from fG.
> +  * If dB reads from gA, then dD reads from fH.
> +  * If dB reads from gA, then dE reads from fE.
> +  *
> +  * Note that if dB reads from gA, then dC cannot read from fC.
> +  * Note that if dB reads from gA, then dD cannot read from fD.
> +  *
> +  * Relies on:
> +  *
> +  * RELEASE from fG to gA
> +  *matching
> +  * ADDRESS DEP. from dB to dC
> +  *
> +  * RELEASE from fH to gA
> +  *matching
> +  * ADDRESS DEP. from dB to dD
> +  *
> +  * RELEASE from fE to gA
> +  *matching
> +  * ACQUIRE from dB to dE
> +  */

But I am not sure how much this is useful. It would take ages to decrypt
all these shortcuts (signs) and translate them into something
human readable. Also it might get outdated easily.

That said, I haven't found yet if there was a system in all
the shortcuts. I mean if they can be decrypted easily
out of head. Also I am not familiar with the notation
of the dependencies.

If this is really needed then I am really scared of some barriers
that guard too many things. This one is a good example.

> + desc = dr->getdesc(smp_load_acquire(&db->id), dr->getdesc_arg);
> +
> + /* dD: */

It would be great if all these shortcuts (signs) are followed with
something human readable. Few words might be enough.

> + next_lpos = READ_ONCE(desc->next_lpos);
> +
> + if (!_datablock_valid(dr,
> +   /* dE: */
> +   atomic_long_read(&dr->head_lpos),
> +   tail_lpos, begin_lpos, next_lpos)) {
> + /* Another task has already invalidated the data block. */
> + goto out;
> + }
> +
> +
> +++ b/kernel/printk/numlist.c
> +bool 

assign_desc() barriers: Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread Petr Mladek
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/ringbuffer.c
> +/**
> + * assign_desc() - Assign a descriptor to the caller.
> + *
> + * @e: The entry structure to store the assigned descriptor to.
> + *
> + * Find an available descriptor to assign to the caller. First it is checked
> + * if the tail descriptor from the committed list can be recycled. If not,
> + * perhaps a never-used descriptor is available. Otherwise, data blocks will
> + * be invalidated until the tail descriptor from the committed list can be
> + * recycled.
> + *
> + * Assigned descriptors are invalid until data has been reserved for them.
> + *
> + * Return: true if a descriptor was assigned, otherwise false.
> + *
> + * This will only fail if it was not possible to invalidate data blocks in
> + * order to recycle a descriptor. This can happen if a writer has reserved 
> but
> + * not yet committed data and that reserved data is currently the oldest 
> data.
> + */
> +static bool assign_desc(struct prb_reserved_entry *e)
> +{
> + struct printk_ringbuffer *rb = e->rb;
> + struct prb_desc *d;
> + struct nl_node *n;
> + unsigned long i;
> +
> + for (;;) {
> + /*
> +  * jA:
> +  *
> +  * Try to recycle a descriptor on the committed list.
> +  */
> + n = numlist_pop(&rb->nl);
> + if (n) {
> + d = container_of(n, struct prb_desc, list);
> + break;
> + }
> +
> + /* Fallback to static never-used descriptors. */
> + if (atomic_read(&rb->desc_next_unused) < DESCS_COUNT(rb)) {
> + i = atomic_fetch_inc(&rb->desc_next_unused);
> + if (i < DESCS_COUNT(rb)) {
> + d = &rb->descs[i];
> + atomic_long_set(&d->id, i);
> + break;
> + }
> + }
> +
> + /*
> +  * No descriptor available. Make one available for recycling
> +  * by invalidating data (which some descriptor will be
> +  * referencing).
> +  */
> + if (!dataring_pop(&rb->dr))
> + return false;
> + }
> +
> + /*
> +  * jB:
> +  *
> +  * Modify the descriptor ID so that users of the descriptor see that
> +  * it has been recycled. A _release() is used so that prb_getdesc()
> +  * callers can see all data ringbuffer updates after issuing a
> +  * pairing smp_rmb(). See iA for details.
> +  *
> +  * Memory barrier involvement:
> +  *
> +  * If dB->iA reads from jB, then dI reads the same value as
> +  * jA->cD->hA.
> +  *
> +  * Relies on:
> +  *
> +  * RELEASE from jA->cD->hA to jB
> +  *matching
> +  * RMB between dB->iA and dI
> +  */
> + atomic_long_set_release(&d->id, atomic_long_read(&d->id) +
> + DESCS_COUNT(rb));

atomic_long_set_release() might be a bit confusing here.
There is no related acquire.

In fact, d->id manipulation has barriers from both sides:

  + smp_rmb() before so that all reads are finished before
the id is updated (release)

  + smp_wmb() after so that the new ID is written before other
related values are modified (acquire).

The smp_wmb() barrier is in prb_reserve(). I would move it here.

Best Regards,
Petr

> +
> + e->desc = d;
> + return true;
> +}
> +
> +/**
> + * prb_reserve() - Reserve data in the ringbuffer.
> + *
> + * @e:The entry structure to setup.
> + *
> + * @rb:   The ringbuffer to reserve data in.
> + *
> + * @size: The size of the data to reserve.
> + *
> + * This is the public function available to writers to reserve data.
> + *
> + * Context: Any context. Disables local interrupts on success.
> + * Return: A pointer to the reserved data or an ERR_PTR if data could not be
> + * reserved.
> + *
> + * If the provided size is legal, this will only fail if it was not possible
> + * to invalidate the oldest data block. This can happen if a writer has
> + * reserved but not yet committed data and that reserved data is currently
> + * the oldest data.
> + *
> + * The ERR_PTR values and their meaning:
> + *
> + * * -EINVAL: illegal @size value
> + * * -EBUSY:  failed to reserve a descriptor (@fail count incremented)
> + * * -ENOMEM: failed to reserve data (invalid descriptor committed)
> + */
> +char *prb_reserve(struct prb_reserved_entry *e, struct printk_ringbuffer *rb,
> +   unsigned int size)
> +{
> + struct prb_desc *d;
> + unsigned long id;
> + char *buf;
> +
> + if (!dataring_checksize(&rb->dr, size))
> + return ERR_PTR(-EINVAL);
> +
> + e->rb = rb;
> +
> + /*
> +  * Disable interrupts during the reserve/commit window in order to
> +  * minimize the number of reserved but not yet committed data blocks
> +  * in the data ringbuffer. 

numlist_pop(): Re: [RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-20 Thread Petr Mladek
On Thu 2019-08-08 00:32:26, John Ogness wrote:
> --- /dev/null
> +++ b/kernel/printk/numlist.c
> +/**
> + * numlist_pop() - Remove the oldest node from the list.
> + *
> + * @nl: The numbered list from which to remove the tail node.
> + *
> + * The tail node can only be removed if two conditions are satisfied:
> + *
> + * * The node is not the only node on the list.
> + * * The node is not busy.
> + *
> + * If, during this function, another task removes the tail, this function
> + * will try again with the new tail.
> + *
> + * Return: The removed node or NULL if the tail node cannot be removed.
> + */
> +struct nl_node *numlist_pop(struct numlist *nl)
> +{
> + unsigned long tail_id;
> + unsigned long next_id;
> + unsigned long r;
> +
> + /* cA: #1 */
> + tail_id = atomic_long_read(&nl->tail_id);
> +
> + for (;;) {
> + /* cB */
> + while (!numlist_read(nl, tail_id, NULL, &next_id)) {
> + /*
> +  * @tail_id is invalid. Try again with an
> +  * updated value.
> +  */
> +
> + cpu_relax();
> +
> + /* cA: #2 */
> + tail_id = atomic_long_read(&nl->tail_id);
> + }

The above while loop basically does the same as the outer for loop:
it tries again with a freshly loaded nl->tail_id. The following code
looks easier to follow:

do {
tail_id = atomic_long_read(&nl->tail_id);

/*
 * Read might fail when the tail node has been removed
 * and reused in parallel.
 */
if (!numlist_read(nl, tail_id, NULL, &next_id))
continue;

/* Make sure the node is not the only node on the list. */
if (next_id == tail_id)
return NULL;

/* cC: Make sure the node is not busy. */
if (nl->busy(tail_id, nl->busy_arg))
return NULL;

} while (atomic_long_cmpxchg_relaxed(&nl->tail_id, tail_id, next_id) !=
tail_id);

/* This should never fail. The node is ours. */
return nl->node(tail_id, nl->node_arg);


> + /* Make sure the node is not the only node on the list. */
> + if (next_id == tail_id)
> + return NULL;
> +
> + /*
> +  * cC:
> +  *
> +  * Make sure the node is not busy.
> +  */
> + if (nl->busy(tail_id, nl->busy_arg))
> + return NULL;
> +
> + r = atomic_long_cmpxchg_relaxed(&nl->tail_id,
> + tail_id, next_id);
> + if (r == tail_id)
> + break;
> +
> + /* cA: #3 */
> + tail_id = r;
> + }
> +
> + return nl->node(tail_id, nl->node_arg);

If I get it correctly, the above nl->node() call should never fail.
The node has been removed from the list and nobody else could
touch it. That is pretty useful information and might be worth
mentioning in a comment.

Best Regards,
Petr

PS: I am scratching my head around the patchset. I'll try Peter's
approach and comment on independent things in separate mails.


[RFC PATCH v4 1/9] printk-rb: add a new printk ringbuffer implementation

2019-08-07 Thread John Ogness
See documentation for details.

For the real patch the "prb overview" documentation section in
kernel/printk/ringbuffer.c will be included in the commit message.

Signed-off-by: John Ogness 
---
 kernel/printk/Makefile |   3 +
 kernel/printk/dataring.c   | 761 +++
 kernel/printk/dataring.h   |  95 +
 kernel/printk/numlist.c| 375 +
 kernel/printk/numlist.h|  72 
 kernel/printk/ringbuffer.c | 800 +
 kernel/printk/ringbuffer.h | 288 +
 7 files changed, 2394 insertions(+)
 create mode 100644 kernel/printk/dataring.c
 create mode 100644 kernel/printk/dataring.h
 create mode 100644 kernel/printk/numlist.c
 create mode 100644 kernel/printk/numlist.h
 create mode 100644 kernel/printk/ringbuffer.c
 create mode 100644 kernel/printk/ringbuffer.h

diff --git a/kernel/printk/Makefile b/kernel/printk/Makefile
index 4d052fc6bcde..567999aa93af 100644
--- a/kernel/printk/Makefile
+++ b/kernel/printk/Makefile
@@ -1,4 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-y  = printk.o
 obj-$(CONFIG_PRINTK)   += printk_safe.o
+obj-$(CONFIG_PRINTK)   += ringbuffer.o
+obj-$(CONFIG_PRINTK)   += numlist.o
+obj-$(CONFIG_PRINTK)   += dataring.o
 obj-$(CONFIG_A11Y_BRAILLE_CONSOLE) += braille.o
diff --git a/kernel/printk/dataring.c b/kernel/printk/dataring.c
new file mode 100644
index ..911bac593ec1
--- /dev/null
+++ b/kernel/printk/dataring.c
@@ -0,0 +1,761 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "dataring.h"
+
+/**
+ * DOC: dataring overview
+ *
+ * A dataring is a lockless ringbuffer consisting of variable length data
+ * blocks, each of which are assigned an ID. The IDs map to descriptors, which
+ * contain metadata about the data block. The lookup function mapping IDs to
+ * descriptors is implemented by the user.
+ *
+ * Data Blocks
+ * ---
+ * All ringbuffer data is stored within a single static byte array. This is
+ * to ensure that any pointers to the data (past and present) will always
+ * point to valid memory. This is important because the lockless readers
+ * and writers may preempt for long periods of time and when they resume may
+ * be working with expired pointers.
+ *
+ * Data blocks are specified by begin and end logical positions (lpos) that
+ * map directly to byte array offsets. Using logical positions indirectly
+ * provides tagged state references for the data blocks to avoid ABA issues
+ * when the ringbuffer wraps. The number of tagged states per byte array
+ * index is::
+ *
+ * (ULONG_MAX + 1) / size of byte array
+ *
+ * If a data block starts near the end of the byte array but would extend
+ * beyond it, that data block is handled differently: a special "wrapping data
+ * block" is inserted in the space available at the end of the byte array and
+ * a "content data block" is placed at the beginning of the byte array. This
+ * can waste space at the end of the byte array, but simplifies the
+ * implementation by allowing writers to always work with contiguous buffers.
+ * For example, for a 1000 byte array, a descriptor may show a start lpos of
+ * 1950 and an end lpos of 2100. The data block associated with this
+ * descriptor is 100 bytes in size. Its ID is located in the "wrapping" data
+ * block (located at offset 950 of the byte array) and its data is found in
+ * the "content" data block (located at offset 0 of the byte array).
+ *
+ * Descriptors
+ * ---
+ * A descriptor is a handle to a data block. How descriptors are structured
+ * and mapped to IDs is implemented by the user.
+ *
+ * Descriptors contain the begin (begin_lpos) and end (next_lpos) logical
+ * positions of the data block they represent. The end logical position
+ * matches the begin logical position of the adjacent data block.
+ *
+ * Why Descriptors?
+ * ---
+ * The data ringbuffer supports variable length entities, which means that
+ * data blocks will not always begin at a predictable offset of the byte
+ * array. This is a major problem for lockless writers that, for example, will
+ * compete to expire and reuse old data blocks when the ringbuffer is full.
+ * Without a predictable begin for the data blocks, a writer has no reliable
+ * information about the status of the "free" area. Are any flags or state
+ * variables already set or is it just garbage left over from previous usage?
+ *
+ * Descriptors allow safe and controlled access to data block metadata by
+ * providing predictable offsets for such metadata. This is key to supporting
+ * multiple concurrent lockless writers.
+ *
+ * Behavior
+ * ---
+ * The data ringbuffer allows writers to commit data without regard for
+ * readers. Readers must pre- and post-validate the data blocks they are
+ * processing to be sure the processed data is consistent. A function
+ * dataring_datablock_isvalid() is available for that. Readers can only
+ * iterate data blocks by utilizing an external implementation using
+ *