date:20240313

Re: [External] Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

2024-03-13 Thread Ho-Ren (Jack) Chuang

On Tue, Mar 12, 2024 at 2:21 AM Huang, Ying  wrote:
>
> "Ho-Ren (Jack) Chuang"  writes:
>
> > The current implementation treats emulated memory devices, such as
> > CXL1.1 type3 memory, as normal DRAM when they are emulated as normal memory
> > (E820_TYPE_RAM). However, these emulated devices have different
> > characteristics than traditional DRAM, making it important to
> > distinguish them. Thus, we modify the tiered memory initialization process
> > to introduce a delay specifically for CPUless NUMA nodes. This delay
> > ensures that the memory tier initialization for these nodes is deferred
> > until HMAT information is obtained during the boot process. Finally,
> > demotion tables are recalculated at the end.
> >
> > * Abstract common functions into `find_alloc_memory_type()`
>
> We should move kmem_put_memory_types() (renamed to
> mt_put_memory_types()?) too.  This can be put in a separate patch.
>

Will do! Thanks,


>
> > Since different memory devices require finding or allocating a memory type,
> > these common steps are abstracted into a single function,
> > `find_alloc_memory_type()`, enhancing code scalability and conciseness.
> >
> > * Handle cases where there is no HMAT when creating memory tiers
> > There is a scenario where a CPUless node does not provide HMAT information.
> > If no HMAT is specified, it falls back to using the default DRAM tier.
> >
> > * Change adist calculation code to use another new lock, mt_perf_lock.
> > In the current implementation, iterating through CPUlist nodes requires
> > holding the `memory_tier_lock`. However, `mt_calc_adistance()` will end up
> > trying to acquire the same lock, leading to a potential deadlock.
> > Therefore, we propose introducing a standalone `mt_perf_lock` to protect
> > `default_dram_perf`. This approach not only avoids deadlock but also
> > prevents holding a large lock simultaneously.
> >
> > Signed-off-by: Ho-Ren (Jack) Chuang 
> > Signed-off-by: Hao Xiang 
> > ---
> >  drivers/acpi/numa/hmat.c | 11 ++
> >  drivers/dax/kmem.c   | 13 +--
> >  include/linux/acpi.h |  6 
> >  include/linux/memory-tiers.h |  8 +
> >  mm/memory-tiers.c| 70 +---
> >  5 files changed, 92 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > index d6b85f0f6082..28812ec2c793 100644
> > --- a/drivers/acpi/numa/hmat.c
> > +++ b/drivers/acpi/numa/hmat.c
> > @@ -38,6 +38,8 @@ static LIST_HEAD(targets);
> >  static LIST_HEAD(initiators);
> >  static LIST_HEAD(localities);
> >
> > +static LIST_HEAD(hmat_memory_types);
> > +
>
> HMAT isn't a device driver for some memory devices.  So I don't think we
> should manage memory types in HMAT.

I can put it back in memory-tier.c. How about the list? Do we still
need to keep a separate list for storing late_inited memory nodes?
And how about the list name if we need to remove the prefix "hmat_"?


> Instead, if the memory_type of a
> node isn't set by the driver, we should manage it in memory-tier.c as
> fallback.
>

Do you mean some device drivers may init memory tiers between
memory_tier_init() and late_initcall(memory_tier_late_init);?
And this is the reason why you mention to exclude
"node_memory_types[nid].memtype != NULL" in memory_tier_late_init().
Is my understanding correct?

> >  static DEFINE_MUTEX(target_lock);
> >
> >  /*
> > @@ -149,6 +151,12 @@ int acpi_get_genport_coordinates(u32 uid,
> >  }
> >  EXPORT_SYMBOL_NS_GPL(acpi_get_genport_coordinates, CXL);
> >
> > +struct memory_dev_type *hmat_find_alloc_memory_type(int adist)
> > +{
> > + return find_alloc_memory_type(adist, &hmat_memory_types);
> > +}
> > +EXPORT_SYMBOL_GPL(hmat_find_alloc_memory_type);
> > +
> >  static __init void alloc_memory_initiator(unsigned int cpu_pxm)
> >  {
> >   struct memory_initiator *initiator;
> > @@ -1038,6 +1046,9 @@ static __init int hmat_init(void)
> >   if (!hmat_set_default_dram_perf())
> >   register_mt_adistance_algorithm(&hmat_adist_nb);
> >
> > + /* Post-create CPUless memory tiers after getting HMAT info */
> > + memory_tier_late_init();
> > +
>
> This should be called in memory-tier.c via
>
> late_initcall(memory_tier_late_init);
>
> Then, we don't need hmat to call it.
>

Thanks. Learned!


> >   return 0;
> >  out_put:
> >   hmat_free_structures();
> > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > index 42ee360cf4e3..aee17ab59f4f 100644
> > --- a/drivers/dax/kmem.c
> > +++ b/drivers/dax/kmem.c
> > @@ -55,21 +55,10 @@ static LIST_HEAD(kmem_memory_types);
> >
> >  static struct memory_dev_type *kmem_find_alloc_memory_type(int adist)
> >  {
> > - bool found = false;
> >   struct memory_dev_type *mtype;
> >
> >   mutex_lock(&kmem_memory_type_lock);
> > - list_for_each_entry(mtype, &kmem_memory_types, list) {
> > - if (mtype->adistance == adist) {
> > - found = true;
> > -

Re: [RFC PATCH v3 7/7] virtio_rtc: Add RTC class driver

2024-03-13 Thread Peter Hilber

On 11.03.24 20:46, Alexandre Belloni wrote:
> On 11/03/2024 19:28:50+0100, Peter Hilber wrote:
 Perhaps this might be handled by the driver also setting a virtio-rtc
 monotonic alarm (which uses a clock similar to CLOCK_BOOTTIME_ALARM).
 The
 virtio-rtc monotonic alarm would just be used to wake up in case it was
 a
 CLOCK_BOOTTIME_ALARM alarm.

 Otherwise, the behavior should not differ from other RTC class drivers.

>>>
>>> What I don't quite get is how this is actually related to RTCs. This
>>> would be a super imprecise mechanism to get the current time and date
>>> from the host to the guest which is what I think your are trying to do,
>>> especially since this is not supporting UIE.
>>> The host system clock may come from reading the RTC at some point in
>>> time but more likely from another source so is it really the best
>>> synchronization mechanism?
>>
>> Hello,
>>
>> thank you for your comments.
>>
>> The main motivation to have the RTC class driver is the RTC alarm
>> (discussed below).
>>
>> As for synchronization, virtio_rtc also offers a PTP clock [1] which will
>> be more precise, but which needs a user space daemon. As for RTC-based
>> initial synchronization, my idea was to propose, in a second step, an
>> optional op for rtc_class_ops, which would read the clock with nanosecond
>> precision. This optional op could then be used in rtc_hctosys(), so there
>> would be no need for UIE waiting.
> 
> This would be a clear NAK, rtc_hctosys should use UIE to have proper
> synchronisation. It currently does a very bad job reading the RTC and it
> is a pity it has been mandated by systemd as useerspace is definitively
> better placed to set the system time. I'm still very tempted delaying
> everyone's boot by one second and make rtc_hctosys precise for all the
> supported HW and not just a single driver.
> 

OK. I plan to add a PPS feature to virtio_rtc so that it can support UIE.
AFAIU this is not required for the initial driver version.

Thanks for the comments,

Peter

[...]

Re: [PATCH bpf-next 0/3] uprobes: two common case speed ups

2024-03-13 Thread Jiri Olsa

On Tue, Mar 12, 2024 at 02:02:30PM -0700, Andrii Nakryiko wrote:
> This patch set implements two speed ups for uprobe/uretprobe runtime execution
> path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
> system-wide (non-PID-specific) uprobes (patch #3). Please see individual
> patches for details.
> 
> Given I haven't worked with uprobe code before, I'm unfamiliar with
> conventions in this subsystem, including which kernel tree patches should be
> sent to. For now I based all the changes on top of bpf-next/master, which is
> where I tested and benchmarked everything anyways. Please advise what should
> I use as a base for subsequent revision. Thanks.
> 
> Andrii Nakryiko (3):
>   uprobes: encapsulate preparation of uprobe args buffer
>   uprobes: prepare uprobe args buffer lazily
>   uprobes: add speculative lockless system-wide uprobe filter check

nice cleanup and speed up, lgtm

Reviewed-by: Jiri Olsa 

jirka

> 
>  kernel/trace/trace_uprobe.c | 103 ++--
>  1 file changed, 63 insertions(+), 40 deletions(-)
> 
> -- 
> 2.43.0
> 
>

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread Peter Hilber

On 12.03.24 18:15, David Woodhouse wrote:
> On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
>> On 08.03.24 13:33, David Woodhouse wrote:
>>> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
 On 07.03.24 15:02, David Woodhouse wrote:
> Hm, should we allow UTC? If you tell me the time in UTC, then
> (sometimes) I still don't actually know what the time is, because some
> UTC seconds occur twice. UTC only makes sense if you provide the TAI
> offset, surely? Should the virtio_rtc specification make it mandatory
> to provide such?
>
> Otherwise you're just designing it to allow crappy hypervisors to
> expose incomplete information.
>

 Hi David,

 (adding virtio-comm...@lists.oasis-open.org for spec discussion),

 thank you for your insightful comments. I think I take a broadly similar
 view. The reason why the current spec and driver is like this is that I
 took a pragmatic approach at first and only included features which work
 out-of-the-box for the current Linux ecosystem.

 The current virtio_rtc features work similar to ptp_kvm, and therefore
 can work out-of-the-box with time sync daemons such as chrony.

 As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
 as well, I am afraid that

 - in some (embedded) scenarios, the TAI clock may not be available

 - crappy hypervisors will pass off the UTC clock as the TAI clock.

 For the same reasons, I am also not sure about adding a *mandatory* TAI
 offset to each readout. I don't know user-space software which would
 leverage this already (at least not through the PTP clock interface).
 And why would such software not go straight for the TAI clock instead?

 How about adding a requirement to the spec that the virtio-rtc device
 SHOULD expose the TAI clock whenever it is available - would this
 address your concerns?
>>>
>>> I think that would be too easy for implementors to miss, or decide not
>>> to obey. Or to get *wrong*, by exposing a TAI clock but actually
>>> putting UTC in it.
>>>
>>> I think I prefer to mandate the tai_offset field with the UTC clock.
>>> Crappy implementations will just set it to zero, but at least that
>>> gives a clear signal to the guests that it's *their* problem to
>>> resolve.
>>
>> To me there are some open questions regarding how this would work. Is there
>> a use case for this with the v3 clock reading methods, or would it be
>> enough to address this with the Virtio timekeeper?
>>
>> Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
>> best alongside some additional information about leap seconds. I am not
>> aware about any user-space user. In addition, leap second smearing should
>> also be addressed.
>>
> 
> Is there even a standard yet for leap-smearing? Will it be linear over
> 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
> think is what Google does? Meta does something different again, don't
> they?
> 
> Exposing UTC as the only clock reference is bad enough; when leap
> seconds happen there's a whole second during which you don't *know*
> which second it is. It seems odd to me, for a precision clock to be
> deliberately ambiguous about what the time is!

Just to be clear, the device can perfectly expose only a TAI reference
clock (or both UTC and TAI), the spec is just completely open about this,
as it tries to work for diverse use cases.

> 
> But if the virtio-rtc clock is defined as UTC and then expose something
> *different* in it, that's even worse. You potentially end up providing
> inaccurate time for a whole *day* leading up to the leap second.
> 
> I think you're right that leap second smearing should be addressed. At
> the very least, by making it clear that the virtio-rtc clock which
> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
> yet-to-be-defined variant.
> 

Agreed.

> Please make it explicit that any hypervisor which wants to advertise a
> smeared clock shall define a new type which specifies the precise
> smearing algorithm and cannot be conflated with the one you're defining
> here.
> 

I will add a requirement that the UTC clock can never have smeared/smoothed
leap seconds.

I think that not every vendor would bother to first add a definition of a
smearing algorithm. Also, I think in some cases knowing the precise
smearing algorithm might not be important (when having the same time as the
hypervisor is enough and accuracy w.r.t. actual time is less important).

So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
now could catch every UTC-like clock which smears/smoothes leap seconds,
where the vendor cannot be bothered to add the smearing algorithm to spec
and implementations.

As for UTC-SLS, this *could* also be added, although [1] says

It is inappropriate to use Internet-Drafts as reference material or

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread Alexandre Belloni

On 13/03/2024 10:45:54+0100, Peter Hilber wrote:
> > Exposing UTC as the only clock reference is bad enough; when leap
> > seconds happen there's a whole second during which you don't *know*
> > which second it is. It seems odd to me, for a precision clock to be
> > deliberately ambiguous about what the time is!
> 
> Just to be clear, the device can perfectly expose only a TAI reference
> clock (or both UTC and TAI), the spec is just completely open about this,
> as it tries to work for diverse use cases.
> 
> > 
> > But if the virtio-rtc clock is defined as UTC and then expose something
> > *different* in it, that's even worse. You potentially end up providing
> > inaccurate time for a whole *day* leading up to the leap second.
> > 
> > I think you're right that leap second smearing should be addressed. At
> > the very least, by making it clear that the virtio-rtc clock which
> > advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
> > yet-to-be-defined variant.
> > 
> 
> Agreed.
> 
> > Please make it explicit that any hypervisor which wants to advertise a
> > smeared clock shall define a new type which specifies the precise
> > smearing algorithm and cannot be conflated with the one you're defining
> > here.
> > 
> 
> I will add a requirement that the UTC clock can never have smeared/smoothed
> leap seconds.
> 
> I think that not every vendor would bother to first add a definition of a
> smearing algorithm. Also, I think in some cases knowing the precise
> smearing algorithm might not be important (when having the same time as the
> hypervisor is enough and accuracy w.r.t. actual time is less important).
> 
> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
> now could catch every UTC-like clock which smears/smoothes leap seconds,
> where the vendor cannot be bothered to add the smearing algorithm to spec
> and implementations.
> 

I still don't know anything about virtio but under Linux, an RTC is
always UTC (or localtime when dual booting but let's not care) and never
accounts for leap seconds. Having an RTC and RTC driver behaving
differently would be super inconvenient. Why don't you leave this to
userspace?

I guess I'm still questioning whether this is the correct interface to
expose the host system time instead of an actual RTC.

-- 
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

2024-03-13 Thread Björn Töpel

Mark Rutland  writes:

> Hi Bjorn
>
> (apologies, my corporate mail server has butchered your name here).

Ha! That's the price I have to pay for carrying double-umlauts
everywhere. Thanks for getting back with a really useful answer!

>> On Arm64, CALL_OPS makes it possible to implement direct calls, while
>> only patching one BL instruction -- nice!
>
> The key thing here isn't that we patch a single instruction (since we have ot
> patch the ops pointer too!); it's that we can safely patch either of the ops
> pointer or BL/NOP at any time while threads are concurrently executing.

...which indeed is a very nice property!

> If you have a multi-instruction sequence, then threads can be preempted
> mid-sequence, and it's very painful/complex to handle all of the races that
> entails.

Oh yes. RISC-V is currently using auipc/jalr with stop_machine(), and
also requires that preemtion is off. Unusable to put it blunt.

> For example, if your callsites use a sequence:
>
>   AUIPC , 
>   JALR , ()
>
> Using stop_machine() won't allow you to patch that safely as some threads
> could be stuck mid-sequence, e.g.
>
>   AUIPC , 
>   [ preempted here ]
>   JALR , ()
>
> ... and you can't update the JALR to use a new funcptr immediate until those
> have completed the sequence.
>
> There are ways around that, but they're complicated and/or expensive, e.g.
>
> * Use a sequence of multiple patches, starting with replacing the JALR with an
>   exception-generating instruction with a fixup handler, which is sort-of what
>   x86 does with UD2. This may require multiple passes with
>   synchronize_rcu_tasks() to make sure all threads have seen the latest
>   instructions, and that cannot be done under stop_machine(), so if you need
>   stop_machine() for CMODx reasons, you may need to use that several times 
> with
>   intervening calls to synchronize_rcu_tasks().
>
> * Have the patching logic manually go over each thread and fix up the pt_regs
>   for the interrupted thread. This is pretty horrid since you could have 
> nested
>   exceptions and a task could have several pt_regs which might require
>   updating.

Yup, and both of these have rather unplesant overhead.

> The CALL_OPS approach is a bit easier to deal with as we can patch the
> per-callsite pointer atomically, then we can (possibly) enable/disable the
> callsite's branch, then wait for threads to drain once. 
>
> As a heads-up, there are some latent/generic issues with DYNAMIC_FTRACE
> generally in this area (CALL_OPs happens to side-step those, but trampoline
> usage is currently affected):
>
>   https://lore.kernel.org/lkml/Zenx_Q0UiwMbSAdP@FVFF77S0Q05N/
>
> ... I'm looking into fixing that at the moment, and it looks like that's 
> likely
> to require some per-architecture changes.
>
>> On RISC-V we cannot use use the same ideas as Arm64 straight off,
>> because the range of jal (compare to BL) is simply too short (+/-1M).
>> So, on RISC-V we need to use a full auipc/jal pair (the text patching
>> story is another chapter, but let's leave that aside for now). Since we
>> have to patch multiple instructions, the cmodx situation doesn't really
>> improve with CALL_OPS.
>
> The branch range thing is annoying, but I think this boils down to the same
> problem as arm64 has with needing a "MOV , LR" instruction that we have 
> to
> patch in once at boot time. You could do the same and patch in the AUIPC once,
> e.g. have
>
> | NOP
> | NOP 
> | func:
> | AUIPC , 
> | JALR , () // patched with NOP
>
> ... which'd look very similar to arm64's sequence:
>
> | NOP
> | NOP
> | func:
> | MOV X9, LR
> | BL ftrace_caller // patched with NOP
>
> ... which I think means it *might* be better from a cmodx perspective?
>
>> Let's say that we continue building on your patch and implement direct
>> calls on CALL_OPS for RISC-V as well.
>> 
>> From Florent's commit message for direct calls:
>> 
>>   |There are a few cases to distinguish:
>>   |- If a direct call ops is the only one tracing a function:
>>   |  - If the direct called trampoline is within the reach of a BL
>>   |instruction
>>   | -> the ftrace patchsite jumps to the trampoline
>>   |  - Else
>>   | -> the ftrace patchsite jumps to the ftrace_caller trampoline 
>> which
>>   |reads the ops pointer in the patchsite and jumps to the direct
>>   |call address stored in the ops
>>   |- Else
>>   |  -> the ftrace patchsite jumps to the ftrace_caller trampoline and 
>> its
>>   | ops literal points to ftrace_list_ops so it iterates over all
>>   | registered ftrace ops, including the direct call ops and calls 
>> its
>>   | call_direct_funcs handler which stores the direct called
>>   | trampoline's address in the ftrace_regs and the ftrace_caller
>>   | trampoline will return to that address instead of returning to 
>> the
>>   | traced function
>> 
>>

Re: [PATCH v2 4/6] virtiofs: support bounce buffer backed by scattered pages

2024-03-13 Thread Brian Foster

On Sat, Mar 09, 2024 at 12:14:23PM +0800, Hou Tao wrote:
> Hi,
> 
> On 2/29/2024 11:01 PM, Brian Foster wrote:
> > On Wed, Feb 28, 2024 at 10:41:24PM +0800, Hou Tao wrote:
> >> From: Hou Tao 
> >>
> >> When reading a file kept in virtiofs from kernel (e.g., insmod a kernel
> >> module), if the cache of virtiofs is disabled, the read buffer will be
> >> passed to virtiofs through out_args[0].value instead of pages. Because
> >> virtiofs can't get the pages for the read buffer, virtio_fs_argbuf_new()
> >> will create a bounce buffer for the read buffer by using kmalloc() and
> >> copy the read buffer into bounce buffer. If the read buffer is large
> >> (e.g., 1MB), the allocation will incur significant stress on the memory
> >> subsystem.
> >>
> >> So instead of allocating bounce buffer by using kmalloc(), allocate a
> >> bounce buffer which is backed by scattered pages. The original idea is
> >> to use vmap(), but the use of GFP_ATOMIC is no possible for vmap(). To
> >> simplify the copy operations in the bounce buffer, use a bio_vec flex
> >> array to represent the argbuf. Also add an is_flat field in struct
> >> virtio_fs_argbuf to distinguish between kmalloc-ed and scattered bounce
> >> buffer.
> >>
> >> Signed-off-by: Hou Tao 
> >> ---
> >>  fs/fuse/virtio_fs.c | 163 
> >>  1 file changed, 149 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> >> index f10fff7f23a0f..ffea684bd100d 100644
> >> --- a/fs/fuse/virtio_fs.c
> >> +++ b/fs/fuse/virtio_fs.c
> > ...
> >> @@ -408,42 +425,143 @@ static void virtio_fs_request_dispatch_work(struct 
> >> work_struct *work)
> >>}
> >>  }
> >>  
> > ...  
> >>  static void virtio_fs_argbuf_copy_from_in_arg(struct virtio_fs_argbuf 
> >> *argbuf,
> >>  unsigned int offset,
> >>  const void *src, unsigned int len)
> >>  {
> >> -  memcpy(argbuf->buf + offset, src, len);
> >> +  struct iov_iter iter;
> >> +  unsigned int copied;
> >> +
> >> +  if (argbuf->is_flat) {
> >> +  memcpy(argbuf->f.buf + offset, src, len);
> >> +  return;
> >> +  }
> >> +
> >> +  iov_iter_bvec(&iter, ITER_DEST, argbuf->s.bvec,
> >> +argbuf->s.nr, argbuf->s.size);
> >> +  iov_iter_advance(&iter, offset);
> > Hi Hou,
> >
> > Just a random comment, but it seems a little inefficient to reinit and
> > readvance the iter like this on every copy/call. It looks like offset is
> > already incremented in the callers of the argbuf copy helpers. Perhaps
> > iov_iter could be lifted into the callers and passed down, or even just
> > include it in the argbuf structure and init it at alloc time?
> 
> Sorry for the late reply. Being busy with off-site workshop these days.
> 

No worries.

> I have tried a similar idea before in which iov_iter was saved directly
> in argbuf struct, but it didn't work out well. The reason is that for
> copy both in_args and out_args, an iov_iter is needed, but the direction
> is different. Currently the bi-directional io_vec is not supported, so
> the code have to initialize the iov_iter twice: one for copy from
> in_args and another one for copy to out_args.
> 

Ok, seems reasonable enough.

> For dio read initiated from kernel, both of its in_numargs and
> out_numargs is 1, so there will be only one iov_iter_advance() in
> virtio_fs_argbuf_copy_to_out_arg() and the offset is 64, so I think the
> overhead will be fine. For dio write initiated from kernel, its
> in_numargs is 2 and out_numargs is 1, so there will be two invocations
> of iov_iter_advance(). The first one with offset=64, and the another one
> with offset=round_up_page_size(64 + write_size), so the later one may
> introduce extra overhead. But compared with the overhead of data copy, I
> still think the overhead of calling iov_iter_advance() is fine.
> 

I'm not claiming the overhead is some practical problem here, but rather
that we shouldn't need to be concerned with it in the first place with
some cleaner code. It's been a bit since I first looked at this. I was
originally wondering about defining the iov_iter in the caller and pass
as a param, but on another look, do the lowest level helpers really need
to exist?

If you don't anticipate further users, IMO something like the diff below
is a bit more clean (compile tested only, but no reinits and less code
overall). But that's just my .02, feel free to use it or not..

Brian

--- 8< ---

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 9ee71051c89f..9de477e9ccd5 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -544,26 +544,6 @@ static unsigned int virtio_fs_argbuf_setup_sg(struct 
virtio_fs_argbuf *argbuf,
return cur - sg;
 }
 
-static void virtio_fs_argbuf_copy_from_in_arg(struct virtio_fs_argbuf *argbuf,
- unsigned int offset,
- const void *sr

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse

On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote:
> 
> I still don't know anything about virtio but under Linux, an RTC is
> always UTC (or localtime when dual booting but let's not care) and never
> accounts for leap seconds. Having an RTC and RTC driver behaving
> differently would be super inconvenient. Why don't you leave this to
> userspace?

Well yes, we don't need to expose *anything* from the hypervisor and we
can leave it all to guest userspace. We can run NTP on every single one
of *hundreds* of guests, leaving them all to duplicate the work of
calibrating the *same* underlying oscillator.

I thought we were trying to avoid that, by having the hypervisor tell
them what the time was. If we're going to do that, we need it to be
sufficiently precise (and some clients want to *know* the precision),
and above all we need it to be *unambiguous*.

If the hypervisor says that the time is 3692217600.001, then the guest
doesn't actually know *which* 3692217600.001 it is, and thus it still
doesn't know the time to an accuracy better than 1 second.

And if we start allowing the hypervisor to smear clocks in some other
underspecified ways, then we end up with errors of up to 1 second in
the clock for long periods of time *around* the leap second.

We need to avoid that ambiguity.

> I guess I'm still questioning whether this is the correct interface to
> expose the host system time instead of an actual RTC.

If an RTC device is able to report '23:59:60' as the time of day, I
suppose that *could* resolve the ambiguity. But talking to a device is
slow; we want guests to be able to know the time — accurately — with a
simple counter/tsc read and some arithmetic. Which means *paired* reads
of 'RTC' and the counter, and a precise indication of the counter
frequency.

smime.p7s
Description: S/MIME cryptographic signature

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse

On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote:
> On 12.03.24 18:15, David Woodhouse wrote:
> > On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
> > > On 08.03.24 13:33, David Woodhouse wrote:
> > > > On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
> > > > > On 07.03.24 15:02, David Woodhouse wrote:
> > > > > > Hm, should we allow UTC? If you tell me the time in UTC, then
> > > > > > (sometimes) I still don't actually know what the time is, because 
> > > > > > some
> > > > > > UTC seconds occur twice. UTC only makes sense if you provide the TAI
> > > > > > offset, surely? Should the virtio_rtc specification make it 
> > > > > > mandatory
> > > > > > to provide such?
> > > > > > 
> > > > > > Otherwise you're just designing it to allow crappy hypervisors to
> > > > > > expose incomplete information.
> > > > > > 
> > > > > 
> > > > > Hi David,
> > > > > 
> > > > > (adding virtio-comm...@lists.oasis-open.org for spec discussion),
> > > > > 
> > > > > thank you for your insightful comments. I think I take a broadly 
> > > > > similar
> > > > > view. The reason why the current spec and driver is like this is that 
> > > > > I
> > > > > took a pragmatic approach at first and only included features which 
> > > > > work
> > > > > out-of-the-box for the current Linux ecosystem.
> > > > > 
> > > > > The current virtio_rtc features work similar to ptp_kvm, and therefore
> > > > > can work out-of-the-box with time sync daemons such as chrony.
> > > > > 
> > > > > As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
> > > > > as well, I am afraid that
> > > > > 
> > > > > - in some (embedded) scenarios, the TAI clock may not be available
> > > > > 
> > > > > - crappy hypervisors will pass off the UTC clock as the TAI clock.
> > > > > 
> > > > > For the same reasons, I am also not sure about adding a *mandatory* 
> > > > > TAI
> > > > > offset to each readout. I don't know user-space software which would
> > > > > leverage this already (at least not through the PTP clock interface).
> > > > > And why would such software not go straight for the TAI clock instead?
> > > > > 
> > > > > How about adding a requirement to the spec that the virtio-rtc device
> > > > > SHOULD expose the TAI clock whenever it is available - would this
> > > > > address your concerns?
> > > > 
> > > > I think that would be too easy for implementors to miss, or decide not
> > > > to obey. Or to get *wrong*, by exposing a TAI clock but actually
> > > > putting UTC in it.
> > > > 
> > > > I think I prefer to mandate the tai_offset field with the UTC clock.
> > > > Crappy implementations will just set it to zero, but at least that
> > > > gives a clear signal to the guests that it's *their* problem to
> > > > resolve.
> > > 
> > > To me there are some open questions regarding how this would work. Is 
> > > there
> > > a use case for this with the v3 clock reading methods, or would it be
> > > enough to address this with the Virtio timekeeper?
> > > 
> > > Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
> > > best alongside some additional information about leap seconds. I am not
> > > aware about any user-space user. In addition, leap second smearing should
> > > also be addressed.
> > > 
> > 
> > Is there even a standard yet for leap-smearing? Will it be linear over
> > 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
> > think is what Google does? Meta does something different again, don't
> > they?
> > 
> > Exposing UTC as the only clock reference is bad enough; when leap
> > seconds happen there's a whole second during which you don't *know*
> > which second it is. It seems odd to me, for a precision clock to be
> > deliberately ambiguous about what the time is!
> 
> Just to be clear, the device can perfectly expose only a TAI reference
> clock (or both UTC and TAI), the spec is just completely open about this,
> as it tries to work for diverse use cases.

As long as the guest *knows* what it's getting, sure.

> > 
> > But if the virtio-rtc clock is defined as UTC and then expose something
> > *different* in it, that's even worse. You potentially end up providing
> > inaccurate time for a whole *day* leading up to the leap second.
> > 
> > I think you're right that leap second smearing should be addressed. At
> > the very least, by making it clear that the virtio-rtc clock which
> > advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
> > yet-to-be-defined variant.
> > 
> 
> Agreed.
> 
> > Please make it explicit that any hypervisor which wants to advertise a
> > smeared clock shall define a new type which specifies the precise
> > smearing algorithm and cannot be conflated with the one you're defining
> > here.
> > 
> 
> I will add a requirement that the UTC clock can never have smeared/smoothed
> leap seconds.

Thanks.

> I think that not every vendor would bother to first add a definition of a
> smearing algorithm. Also, I think in some

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread Alexandre Belloni

On 13/03/2024 12:29:38+, David Woodhouse wrote:
> On Wed, 2024-03-13 at 12:18 +0100, Alexandre Belloni wrote:
> > 
> > I still don't know anything about virtio but under Linux, an RTC is
> > always UTC (or localtime when dual booting but let's not care) and never
> > accounts for leap seconds. Having an RTC and RTC driver behaving
> > differently would be super inconvenient. Why don't you leave this to
> > userspace?
> 
> Well yes, we don't need to expose *anything* from the hypervisor and we
> can leave it all to guest userspace. We can run NTP on every single one
> of *hundreds* of guests, leaving them all to duplicate the work of
> calibrating the *same* underlying oscillator.
> 

Really, I see the point of sharing the time accurately between the host
and the guest and not duplicating the effort. This is not what I am
objecting to.

> I thought we were trying to avoid that, by having the hypervisor tell
> them what the time was. If we're going to do that, we need it to be
> sufficiently precise (and some clients want to *know* the precision),
> and above all we need it to be *unambiguous*.
> 
> If the hypervisor says that the time is 3692217600.001, then the guest
> doesn't actually know *which* 3692217600.001 it is, and thus it still
> doesn't know the time to an accuracy better than 1 second.
> 

The RTC subsystem has a 1 second resolution and this is not going to
change because there is no point doing so for the hardware, to get the
time precisely, UIE MUST be used there is no vendor that will just
support reading the time/date or timestamp as this is way too imprecise.

> And if we start allowing the hypervisor to smear clocks in some other
> underspecified ways, then we end up with errors of up to 1 second in
> the clock for long periods of time *around* the leap second.
> 
> We need to avoid that ambiguity.

Exactly my point and I said, reading an RTC is always UTC and never
handles leap seconds so if userspace doesn't handle the leap second and
updates the RTC, too bad. There are obviously no RTC that will smear
clock unless instructed to, so the hypervisor must not smear the clock.

Note that this is not an issue for the subsystem because if you are not
capable to track an accurate clock, the RTC itself will have drifted by
much more than a second in between leap second inclusions.

> 
> > I guess I'm still questioning whether this is the correct interface to
> > expose the host system time instead of an actual RTC.
> 
> If an RTC device is able to report '23:59:60' as the time of day, I
> suppose that *could* resolve the ambiguity. But talking to a device is
> slow; we want guests to be able to know the time — accurately — with a
> simple counter/tsc read and some arithmetic. Which means *paired* reads
> of 'RTC' and the counter, and a precise indication of the counter
> frequency.

23:59:60 is not and will never be allowed in the RTC subsystem as this
is an invalid value for the hardware.

The TSC or whatever CPU counter/clock that is used to keep the system
time is not an RTC, I don't get why it has to be exposed as such to the
guests. PTP is fine and precise, RTC is not.

-- 
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

Re: [PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check

2024-03-13 Thread Oleg Nesterov

I forgot everything about this code, plus it has changed a lot since
I looked at it many years ago, but ...

I think this change is fine but the changelog looks a bit confusing
(overcomplicated) to me.

On 03/12, Andrii Nakryiko wrote:
>
> This patch adds a speculative check before grabbing that rwlock. If
> nr_systemwide is non-zero, lock is skipped and event is passed through.
> From examining existing logic it looks correct and safe to do. If
> nr_systemwide is being modified under rwlock in parallel, we have to
> consider basically just one important race condition: the case when
> nr_systemwide is dropped from one to zero (from
> trace_uprobe_filter_remove()) under filter->rwlock, but
> uprobe_perf_filter() raced and saw it as >0.

Unless I am totally confused, there is nothing new. Even without
this change trace_uprobe_filter_remove() can clear nr_systemwide
right after uprobe_perf_filter() drops filter->rwlock.

And of course, trace_uprobe_filter_add() can change nr_systemwide
from 0 to 1. In this case uprobe_perf_func() can "wrongly" return
UPROBE_HANDLER_REMOVE but we can't avoid this and afaics this is
fine even if handler_chain() does unapply_uprobe(), uprobe_perf_open()
will do uprobe_apply() after that, we can rely on ->register_rwsem.

> In case we speculatively read nr_systemwide as zero, while it was
> incremented in parallel, we'll proceed to grabbing filter->rwlock and
> re-doing the check, this time in lock-protected and non-racy way.

See above...

So I think uprobe_perf_filter() needs filter->rwlock only to iterate
the list, it can check nr_systemwide lockless and this means that you
can also remove the same check in __uprobe_perf_filter(), other callers
trace_uprobe_filter_add/remove check it themselves.

> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct uprobe_consumer 
> *uc,
>   tu = container_of(uc, struct trace_uprobe, consumer);
>   filter = tu->tp.event->filter;
>
> + /* speculative check */
> + if (READ_ONCE(filter->nr_systemwide))
> + return true;
> +
>   read_lock(&filter->rwlock);
>   ret = __uprobe_perf_filter(filter, mm);
>   read_unlock(&filter->rwlock);

ACK,

but see above. I think the changelog should be simplified and the
filter->nr_systemwide check in __uprobe_perf_filter() should be
removed. But I won't insist and perhaps I missed something...

Oleg.

[PATCH] net: hns3: tracing: fix hclgevf trace event strings

2024-03-13 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

[
   Note, I need to take this patch through my tree, so I'm looking for acks.
   This causes the build to fail when I add the __assign_str() check, which
   I was about to push to Linus, but it breaks allmodconfig due to this error.
]

The __string() and __assign_str() helper macros of the TRACE_EVENT() macro
are going through some optimizations where only the source string of
__string() will be used and the __assign_str() source will be ignored and
later removed.

To make sure that there's no issues, a new check is added between the
__string() src argument and the __assign_str() src argument that does a
strcmp() to make sure they are the same string.

The hclgevf trace events have:

  __assign_str(devname, &hdev->nic.kinfo.netdev->name);

Which triggers the warning:

hclgevf_trace.h:34:39: error: passing argument 1 of ‘strcmp’ from incompatible 
pointer type [-Werror=incompatible-pointer-types]
   34 | __assign_str(devname, &hdev->nic.kinfo.netdev->name);
 [..]
arch/x86/include/asm/string_64.h:75:24: note: expected ‘const char *’ but 
argument is of type ‘char (*)[16]’
   75 | int strcmp(const char *cs, const char *ct);
  |^~


Because __assign_str() now has:

WARN_ON_ONCE(__builtin_constant_p(src) ?\
 strcmp((src), __data_offsets.dst##_ptr_) : \
 (src) != __data_offsets.dst##_ptr_);   \

The problem is the '&' on hdev->nic.kinfo.netdev->name. That's because
that name is:

charname[IFNAMSIZ]

Where passing an address '&' of a char array is not compatible with strcmp().

The '&' is not necessary, remove it.

Fixes: d8355240cf8fb ("net: hns3: add trace event support for PF/VF mailbox")
Signed-off-by: Steven Rostedt (Google) 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h  | 8 
 .../net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h| 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h
index 8510b88d4982..f3cd5a376eca 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_trace.h
@@ -24,7 +24,7 @@ TRACE_EVENT(hclge_pf_mbx_get,
__field(u8, code)
__field(u8, subcode)
__string(pciname, pci_name(hdev->pdev))
-   __string(devname, &hdev->vport[0].nic.kinfo.netdev->name)
+   __string(devname, hdev->vport[0].nic.kinfo.netdev->name)
__array(u32, mbx_data, PF_GET_MBX_LEN)
),
 
@@ -33,7 +33,7 @@ TRACE_EVENT(hclge_pf_mbx_get,
__entry->code = req->msg.code;
__entry->subcode = req->msg.subcode;
__assign_str(pciname, pci_name(hdev->pdev));
-   __assign_str(devname, &hdev->vport[0].nic.kinfo.netdev->name);
+   __assign_str(devname, hdev->vport[0].nic.kinfo.netdev->name);
memcpy(__entry->mbx_data, req,
   sizeof(struct hclge_mbx_vf_to_pf_cmd));
),
@@ -56,7 +56,7 @@ TRACE_EVENT(hclge_pf_mbx_send,
__field(u8, vfid)
__field(u16, code)
__string(pciname, pci_name(hdev->pdev))
-   __string(devname, &hdev->vport[0].nic.kinfo.netdev->name)
+   __string(devname, hdev->vport[0].nic.kinfo.netdev->name)
__array(u32, mbx_data, PF_SEND_MBX_LEN)
),
 
@@ -64,7 +64,7 @@ TRACE_EVENT(hclge_pf_mbx_send,
__entry->vfid = req->dest_vfid;
__entry->code = le16_to_cpu(req->msg.code);
__assign_str(pciname, pci_name(hdev->pdev));
-   __assign_str(devname, &hdev->vport[0].nic.kinfo.netdev->name);
+   __assign_str(devname, hdev->vport[0].nic.kinfo.netdev->name);
memcpy(__entry->mbx_data, req,
   sizeof(struct hclge_mbx_pf_to_vf_cmd));
),
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h
index 5d4895bb57a1..b259e95dd53c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3vf/hclgevf_trace.h
@@ -23,7 +23,7 @@ TRACE_EVENT(hclge_vf_mbx_get,
__field(u8, vfid)
__field(u16, code)
__string(pciname, pci_name(hdev->pdev))
-   __string(devname, &hdev->nic.kinfo.netdev->name)
+   __string(devname, hdev->nic.kinfo.netdev->name)
__array(u32, mbx_data, VF_GET_MBX_LEN)
),
 
@@ -31,7 +31,7 @@ TRACE_EVENT(hclge_vf_mbx_get,
__entry->vfid = req->dest_vfid;
__entry->code = le16_to_cpu(req->msg.code);
__assign_str(pciname, pci_name(hdev->pdev));
-   __assig

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse

On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote:
> The TSC or whatever CPU counter/clock that is used to keep the system
> time is not an RTC, I don't get why it has to be exposed as such to the
> guests. PTP is fine and precise, RTC is not.

Ah, I see. But the point of the virtio_rtc is not really to expose that
CPU counter. The point is to report the wallclock time, just like an
actual RTC. The real difference is the *precision*.

The virtio_rtc device has a facility to *also* expose the counter,
because that's what we actually need to gain that precision...

Applications don't read the RTC every time they want to know what the
time is. These days, they don't even make a system call; it's done
entirely in userspace mode. The kernel exposes some shared memory,
essentially saying "the counter was X at time Y, and runs at Z Hz".
Then applications just read the CPU counter and do some arithmetic.

As we require more and more precision in the calibration, it becomes
important to get *paired* readings of the CPU counter and the wallclock
time at precisely the same moment. If the guest has to read one and
then the other, potentially taking interrupts, getting preempted and
suffering steal/SMI time in the middle, that introduces an error which
is increasingly significant as we increasingly care about precision.

Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest
kernels having to repeat readings over time and perform the calibration
as the underlying hardware oscillator frequency (Z) drifts with
temperature. I'm trying to get him to let the hypervisor expose the
calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ.
Which aside from reducing the duplication of effort, will *also* fix
the problem of live migration where *all* those things suffer a step
change and leave the guest with an inaccurate clock but not knowing it.

But that part isn't relevant to the RTC, as you say. RTC doesn't care
about that level of precision; it's just what the system uses to know
roughly what year it is, when it powers up. (And isn't even really
trusted for that, which is a large part of why I added the
X509_V_FLAG_NO_CHECK_TIME flag to OpenSSL, so Secure Boot doesn't break
when the RTC is catastrophically wrong :)

If you're asking why patch 7/7 in Peter's series exists to expose the
virtio clock through RTC, and you're not particularly interested in the
first six, I suppose that's a fair question. As is the question of "why
is it called virtio_rtc not virtio_ptp?". 

But let me turn it around: if the kernel has access to this virtio
device and *not* any other RTC, why *wouldn't* the kernel use the time
from it? The fact that it can optionally *also* provide paired readings
with the CPU counter doesn't actually *hurt* for the RTC use case, does
it?

smime.p7s
Description: S/MIME cryptographic signature

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread Peter Hilber

On 13.03.24 12:18, Alexandre Belloni wrote:
> On 13/03/2024 10:45:54+0100, Peter Hilber wrote:
>>> Exposing UTC as the only clock reference is bad enough; when leap
>>> seconds happen there's a whole second during which you don't *know*
>>> which second it is. It seems odd to me, for a precision clock to be
>>> deliberately ambiguous about what the time is!
>>
>> Just to be clear, the device can perfectly expose only a TAI reference
>> clock (or both UTC and TAI), the spec is just completely open about this,
>> as it tries to work for diverse use cases.
>>
>>>
>>> But if the virtio-rtc clock is defined as UTC and then expose something
>>> *different* in it, that's even worse. You potentially end up providing
>>> inaccurate time for a whole *day* leading up to the leap second.
>>>
>>> I think you're right that leap second smearing should be addressed. At
>>> the very least, by making it clear that the virtio-rtc clock which
>>> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
>>> yet-to-be-defined variant.
>>>
>>
>> Agreed.
>>
>>> Please make it explicit that any hypervisor which wants to advertise a
>>> smeared clock shall define a new type which specifies the precise
>>> smearing algorithm and cannot be conflated with the one you're defining
>>> here.
>>>
>>
>> I will add a requirement that the UTC clock can never have smeared/smoothed
>> leap seconds.
>>
>> I think that not every vendor would bother to first add a definition of a
>> smearing algorithm. Also, I think in some cases knowing the precise
>> smearing algorithm might not be important (when having the same time as the
>> hypervisor is enough and accuracy w.r.t. actual time is less important).
>>
>> So maybe I should add a VIRTIO_RTC_CLOCK_UTC_SMEARED clock type, which for
>> now could catch every UTC-like clock which smears/smoothes leap seconds,
>> where the vendor cannot be bothered to add the smearing algorithm to spec
>> and implementations.
>>
> 
> I still don't know anything about virtio but under Linux, an RTC is
> always UTC (or localtime when dual booting but let's not care) and never
> accounts for leap seconds. Having an RTC and RTC driver behaving
> differently would be super inconvenient. Why don't you leave this to
> userspace?
> 
> I guess I'm still questioning whether this is the correct interface to
> expose the host system time instead of an actual RTC.
> 

virtio_rtc only registers RTC class devices for virtio_rtc clock type UTC,
so adding an UTC_SMEARED clock type would avoid leap seconds for the RTC
class.

But I understand that there are more concerns and I will re-consider. From
my POV CLOCK_{REALTIME,MONOTONIC}_ALARM support is very important however.

So the only alternative to me seems to be adding an alternative alarmtimer
backend (and time synchronization through user space).

Thanks for the comment,

Peter

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread Alexandre Belloni

On 13/03/2024 14:06:42+, David Woodhouse wrote:
> If you're asking why patch 7/7 in Peter's series exists to expose the
> virtio clock through RTC, and you're not particularly interested in the
> first six, I suppose that's a fair question. As is the question of "why
> is it called virtio_rtc not virtio_ptp?". 
> 

Exactly my question, thanks :)

> But let me turn it around: if the kernel has access to this virtio
> device and *not* any other RTC, why *wouldn't* the kernel use the time
> from it? The fact that it can optionally *also* provide paired readings
> with the CPU counter doesn't actually *hurt* for the RTC use case, does
> it?

As long as it doesn't behave differently from the other RTC, I'm fine
with this. This is important because I don't want to carry any special
infrastructure for this driver or to have to special case this driver
later on because it is incompatible with some evolution of the
subsystem.

-- 
Alexandre Belloni, co-owner and COO, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

Re: [PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer

2024-03-13 Thread Oleg Nesterov

LGTM, one nit below.

On 03/12, Andrii Nakryiko wrote:
>
> +static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe 
> *tu,
> +struct pt_regs *regs)
> +{
> + struct uprobe_cpu_buffer *ucb;
> + int dsize, esize;
> +
> + esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
> + dsize = __get_data_size(&tu->tp, regs);
> +
> + ucb = uprobe_buffer_get();
> + ucb->dsize = dsize;
> +
> + store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
> +
> + return ucb;
> +}

OK, but note that every user of ->dsize adds tp.size. So I think you can
simplify this code a bit more if you change prepare_uprobe_buffer() to do

ucb->dsize = tu->tp.size + dsize;

and update the users.

Oleg.

Re: [PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily

2024-03-13 Thread Oleg Nesterov

Again, looks good to me, but I have a minor nit. Feel free to ignore.

On 03/12, Andrii Nakryiko wrote:
>
>  static void __uprobe_trace_func(struct trace_uprobe *tu,
>   unsigned long func, struct pt_regs *regs,
> - struct uprobe_cpu_buffer *ucb,
> + struct uprobe_cpu_buffer **ucbp,
>   struct trace_event_file *trace_file)
>  {
>   struct uprobe_trace_entry_head *entry;
>   struct trace_event_buffer fbuffer;
> + struct uprobe_cpu_buffer *ucb;
>   void *data;
>   int size, esize;
>   struct trace_event_call *call = trace_probe_event_call(&tu->tp);
>  
> + ucb = *ucbp;
> + if (!ucb) {
> + ucb = prepare_uprobe_buffer(tu, regs);
> + *ucbp = ucb;
> + }

perhaps it would be more clean to pass ucbp to prepare_uprobe_buffer()
and change it to do

if (*ucbp)
return *ucbp;

at the start. Then __uprobe_trace_func() and __uprobe_perf_func() can
simply do

ucb = prepare_uprobe_buffer(tu, regs, ucbp);

> - uprobe_buffer_put(ucb);
> + if (ucb)
> + uprobe_buffer_put(ucb);

Similarly, I think the "ucb != NULL" check should be shifted into
uprobe_buffer_put().

Oleg.

Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-13 Thread Mathieu Poirier

On Tue, Mar 12, 2024 at 05:32:52PM +, Abdellatif El Khlifi wrote:
> Hi Mathieu,
> 
> On Tue, Mar 12, 2024 at 10:29:52AM -0600, Mathieu Poirier wrote:
> > > This is an initial patchset for allowing to turn on and off the remote 
> > > processor.
> > > The FW is already loaded before the Corstone-1000 SoC is powered on and 
> > > this
> > > is done through the FPGA board bootloader in case of the FPGA target. Or 
> > > by the Corstone-1000 FVP model
> > > (emulator).
> > >
> > >From the above I take it that booting with a preloaded firmware is a
> > scenario that needs to be supported and not just a temporary stage.
> 
> The current status of the Corstone-1000 SoC requires that there is
> a preloaded firmware for the external core. Preloading is done externally
> either through the FPGA bootloader or the emulator (FVP) before powering
> on the SoC.
> 

Ok

> Corstone-1000 will be upgraded in a way that the A core running Linux is able
> to share memory with the remote core and also being able to access the remote
> core memory so Linux can copy the firmware to. This HW changes are still
> This is why this patchset is relying on a preloaded firmware. And it's the 
> step 1
> of adding remoteproc support for Corstone.
>

Ok, so there is a HW problem where A core and M core can't see each other's
memory, preventing the A core from copying the firmware image to the proper
location.

When the HW is fixed, will there be a need to support scenarios where the
firmware image has been preloaded into memory?

> When the HW is ready, we will be able to avoid preloading the firmware
> and the user can do the following:
> 
> 1) Use a default firmware filename stated in the DT (firmware-name property),
> that's the one remoteproc subsystem will use initially, load the firmware file
> and start the remote core.
> 
> 2) Then, the user can choose to use another firmware file:
> 
> echo stop >/sys/class/remoteproc/remoteproc0/state
> echo -n new_firmware.elf > /sys/class/remoteproc/remoteproc0/firmware
> echo start >/sys/class/remoteproc/remoteproc0/state
> 
> > > The plan for the driver is as follows:
> > >
> > > Step 1: provide a foundation driver capable of turning the core on/off
> > >
> > > Step 2: provide mailbox support for comms
> > >
> > > Step 3: provide FW reload capability
> > >
> > What happens when a user wants to boot the remote processor with the
> > firmware provided on the file system rather than the one preloaded
> > into memory?
> 
> We will support this scenario when the HW is upgraded and copying the firmware
> to the remote core memory becomes possible.
> 
> > Furthermore, how do we account for scenarios where the
> > remote processor goes from running a firmware image on the file system
> > to the firmware image loaded by an external entity?  Is this a valid
> > scenario?
> 
> No, this scenario won't apply when we get the HW upgrade. No need for an
> external entity anymore. The firmware(s) will all be files in the linux 
> filesystem.
> 
> > > Steps 2 & 3 are waiting for a HW update so the Cortex-A35 (running Linux) 
> > > can share memory with
> > > the remote core.
> > >
> > > I'm happy to provide more explanation in the commit log to reflect this 
> > > status.
> > >
> > > Is it OK that we go with step 1 as a foundation please ?
> > >
> > 
> > First let's clarify all the scenarios that need to be supported.  From
> > there I will advise on how to proceed and what modifications to the
> > subsystem's core should be made, if need be.
> 
> Thanks, I hope the answers above provide the information needed.
> 
> Cheers
> Abdellatif

[GIT PULL] OpenRISC updates for 6.9

2024-03-13 Thread Stafford Horne

Hello Linus,

Please consider for pull,

The following changes since commit b401b621758e46812da61fa58a67c3fd8d91de0d:

  Linux 6.8-rc5 (2024-02-18 12:56:25 -0800)

are available in the Git repository at:

  https://github.com/openrisc/linux.git tags/for-linus

for you to fetch changes up to 7f1e2fc493480086fbb375f4f6d33cb93fc069d6:

  openrisc: Use asm-generic's version of fix_to_virt() & virt_to_fix() 
(2024-03-10 08:55:46 +)


OpenRISC updates for 6.9

Just a few cleanups and updates that were sent in:

 - Replace asm/fixmap.h with asm-generic version
 - Fix to move memblock setup up before it's used during init


Dawei Li (1):
  openrisc: Use asm-generic's version of fix_to_virt() & virt_to_fix()

Oreoluwa Babatunde (1):
  openrisc: Call setup_memory() earlier in the init sequence

 arch/openrisc/include/asm/fixmap.h | 31 +--
 arch/openrisc/kernel/setup.c   |  6 +++---
 2 files changed, 4 insertions(+), 33 deletions(-)

Re: [PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer

2024-03-13 Thread Andrii Nakryiko

On Wed, Mar 13, 2024 at 8:16 AM Oleg Nesterov  wrote:
>
> LGTM, one nit below.
>
> On 03/12, Andrii Nakryiko wrote:
> >
> > +static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe 
> > *tu,
> > +struct pt_regs *regs)
> > +{
> > + struct uprobe_cpu_buffer *ucb;
> > + int dsize, esize;
> > +
> > + esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
> > + dsize = __get_data_size(&tu->tp, regs);
> > +
> > + ucb = uprobe_buffer_get();
> > + ucb->dsize = dsize;
> > +
> > + store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
> > +
> > + return ucb;
> > +}
>
> OK, but note that every user of ->dsize adds tp.size. So I think you can
> simplify this code a bit more if you change prepare_uprobe_buffer() to do
>
> ucb->dsize = tu->tp.size + dsize;
>
> and update the users.
>

makes sense, done

> Oleg.
>

Re: [PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily

2024-03-13 Thread Andrii Nakryiko

On Wed, Mar 13, 2024 at 8:48 AM Oleg Nesterov  wrote:
>
> Again, looks good to me, but I have a minor nit. Feel free to ignore.
>
> On 03/12, Andrii Nakryiko wrote:
> >
> >  static void __uprobe_trace_func(struct trace_uprobe *tu,
> >   unsigned long func, struct pt_regs *regs,
> > - struct uprobe_cpu_buffer *ucb,
> > + struct uprobe_cpu_buffer **ucbp,
> >   struct trace_event_file *trace_file)
> >  {
> >   struct uprobe_trace_entry_head *entry;
> >   struct trace_event_buffer fbuffer;
> > + struct uprobe_cpu_buffer *ucb;
> >   void *data;
> >   int size, esize;
> >   struct trace_event_call *call = trace_probe_event_call(&tu->tp);
> >
> > + ucb = *ucbp;
> > + if (!ucb) {
> > + ucb = prepare_uprobe_buffer(tu, regs);
> > + *ucbp = ucb;
> > + }
>
> perhaps it would be more clean to pass ucbp to prepare_uprobe_buffer()
> and change it to do
>
> if (*ucbp)
> return *ucbp;
>
> at the start. Then __uprobe_trace_func() and __uprobe_perf_func() can
> simply do
>
> ucb = prepare_uprobe_buffer(tu, regs, ucbp);

ok, will do

>
> > - uprobe_buffer_put(ucb);
> > + if (ucb)
> > + uprobe_buffer_put(ucb);
>
> Similarly, I think the "ucb != NULL" check should be shifted into
> uprobe_buffer_put().

sure, will hide it inside uprobe_buffer_put()

>
> Oleg.
>

Re: [PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

2024-03-13 Thread Nathan Chancellor

On Tue, Mar 12, 2024 at 11:30:02AM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" 
> 
> The WARN_ON() check in __assign_str() to catch where the source variable
> to the macro doesn't match the source variable to __string() gives an
> error in clang:
> 
> >> include/trace/events/sunrpc.h:703:4: warning: result of comparison against 
> >> a string literal is unspecified (use an explicit string comparison 
> >> function instead) [-Wstring-compare]
>  670 | __assign_str(progname, "unknown");
> 
> That's because the __assign_str() macro has:
> 
>WARN_ON_ONCE((src) != __data_offsets.dst##_ptr_);
> 
> Where "src" is a string literal. Clang warns when comparing a string
> literal directly as it is undefined to what the value of the literal is.
> 
> Since this is still to make sure the same string that goes to __string()
> is the same as __assign_str(), for string literals do a test for that and
> then use strcmp() in those cases
> 
> Note that this depends on commit 51270d573a8d ("tracing/net_sched: Fix
> tracepoints that save qdisc_dev() as a string") being applied, as this was
> what found that bug.
> 
> Reported-by: kernel test robot 
> Closes: 
> https://lore.kernel.org/oe-kbuild-all/202402292111.kidexylu-...@intel.com/
> Fixes: 433e1d88a3be ("tracing: Add warning if string in __assign_str() does 
> not match __string()")

Is this change destined for 6.9 or 6.10? I applied it to current
trace/core (eb1533d156d3) along with 51270d573a8d but the warning is
still present. I even tried

__builtin_choose_expr(__is_constexpr((src)),
 strcmp((src), __data_offsets.dst##_ptr_),
 (src) != __data_offsets.dst##_ptr_));

but not even that silenced the warning. I think we will still need a
diag directive to fully silence this warning.

> Signed-off-by: Steven Rostedt (Google) 
> ---
>  include/trace/stages/stage6_event_callback.h | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/stages/stage6_event_callback.h 
> b/include/trace/stages/stage6_event_callback.h
> index a0c15f67eabe..83da83a0c14f 100644
> --- a/include/trace/stages/stage6_event_callback.h
> +++ b/include/trace/stages/stage6_event_callback.h
> @@ -35,7 +35,9 @@
>   do {\
>   char *__str__ = __get_str(dst); \
>   int __len__ = __get_dynamic_array_len(dst) - 1; \
> - WARN_ON_ONCE((src) != __data_offsets.dst##_ptr_);   \
> + WARN_ON_ONCE(__builtin_constant_p(src) ?\
> +  strcmp((src), __data_offsets.dst##_ptr_) : \
> +  (src) != __data_offsets.dst##_ptr_);   \
>   memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
>  EVENT_NULL_STR, __len__);\
>   __str__[__len__] = '\0';\
> -- 
> 2.43.0
>

Re: [PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check

2024-03-13 Thread Andrii Nakryiko

On Wed, Mar 13, 2024 at 6:20 AM Oleg Nesterov  wrote:
>
> I forgot everything about this code, plus it has changed a lot since
> I looked at it many years ago, but ...
>
> I think this change is fine but the changelog looks a bit confusing
> (overcomplicated) to me.

It's a new piece of code and logic, so I tried to do my due diligence
and argue why I think it's fine. I'll drop the overcomplicated
explanation, as I agree with you that it's inherently racy even
without my changes (and use-after-free safety is provided with
uprobe->register_rwsem independent from all this).

>
> On 03/12, Andrii Nakryiko wrote:
> >
> > This patch adds a speculative check before grabbing that rwlock. If
> > nr_systemwide is non-zero, lock is skipped and event is passed through.
> > From examining existing logic it looks correct and safe to do. If
> > nr_systemwide is being modified under rwlock in parallel, we have to
> > consider basically just one important race condition: the case when
> > nr_systemwide is dropped from one to zero (from
> > trace_uprobe_filter_remove()) under filter->rwlock, but
> > uprobe_perf_filter() raced and saw it as >0.
>
> Unless I am totally confused, there is nothing new. Even without
> this change trace_uprobe_filter_remove() can clear nr_systemwide
> right after uprobe_perf_filter() drops filter->rwlock.
>
> And of course, trace_uprobe_filter_add() can change nr_systemwide
> from 0 to 1. In this case uprobe_perf_func() can "wrongly" return
> UPROBE_HANDLER_REMOVE but we can't avoid this and afaics this is
> fine even if handler_chain() does unapply_uprobe(), uprobe_perf_open()
> will do uprobe_apply() after that, we can rely on ->register_rwsem.
>

yep, agreed

> > In case we speculatively read nr_systemwide as zero, while it was
> > incremented in parallel, we'll proceed to grabbing filter->rwlock and
> > re-doing the check, this time in lock-protected and non-racy way.
>
> See above...
>
>
> So I think uprobe_perf_filter() needs filter->rwlock only to iterate
> the list, it can check nr_systemwide lockless and this means that you
> can also remove the same check in __uprobe_perf_filter(), other callers
> trace_uprobe_filter_add/remove check it themselves.
>

makes sense, will do

>
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct 
> > uprobe_consumer *uc,
> >   tu = container_of(uc, struct trace_uprobe, consumer);
> >   filter = tu->tp.event->filter;
> >
> > + /* speculative check */
> > + if (READ_ONCE(filter->nr_systemwide))
> > + return true;
> > +
> >   read_lock(&filter->rwlock);
> >   ret = __uprobe_perf_filter(filter, mm);
> >   read_unlock(&filter->rwlock);
>
> ACK,
>
> but see above. I think the changelog should be simplified and the
> filter->nr_systemwide check in __uprobe_perf_filter() should be
> removed. But I won't insist and perhaps I missed something...
>

I think you are right, I'll move the check

> Oleg.
>

Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

2024-03-13 Thread Abdellatif El Khlifi

Hi Mathieu,

On Wed, Mar 13, 2024 at 10:25:32AM -0600, Mathieu Poirier wrote:
> On Tue, Mar 12, 2024 at 05:32:52PM +, Abdellatif El Khlifi wrote:
> > Hi Mathieu,
> > 
> > On Tue, Mar 12, 2024 at 10:29:52AM -0600, Mathieu Poirier wrote:
> > > > This is an initial patchset for allowing to turn on and off the remote 
> > > > processor.
> > > > The FW is already loaded before the Corstone-1000 SoC is powered on and 
> > > > this
> > > > is done through the FPGA board bootloader in case of the FPGA target. 
> > > > Or by the Corstone-1000 FVP model
> > > > (emulator).
> > > >
> > > >From the above I take it that booting with a preloaded firmware is a
> > > scenario that needs to be supported and not just a temporary stage.
> > 
> > The current status of the Corstone-1000 SoC requires that there is
> > a preloaded firmware for the external core. Preloading is done externally
> > either through the FPGA bootloader or the emulator (FVP) before powering
> > on the SoC.
> > 
> 
> Ok
> 
> > Corstone-1000 will be upgraded in a way that the A core running Linux is 
> > able
> > to share memory with the remote core and also being able to access the 
> > remote
> > core memory so Linux can copy the firmware to. This HW changes are still
> > This is why this patchset is relying on a preloaded firmware. And it's the 
> > step 1
> > of adding remoteproc support for Corstone.
> >
> 
> Ok, so there is a HW problem where A core and M core can't see each other's
> memory, preventing the A core from copying the firmware image to the proper
> location.
> 
> When the HW is fixed, will there be a need to support scenarios where the
> firmware image has been preloaded into memory?

No, this scenario won't apply when we get the HW upgrade. No need for an
external entity anymore. The firmware(s) will all be files in the linux 
filesystem.

Cheers
Abdellatif

Re: [PATCH bpf-next 0/3] uprobes: two common case speed ups

2024-03-13 Thread Andrii Nakryiko

On Wed, Mar 13, 2024 at 2:41 AM Jiri Olsa  wrote:
>
> On Tue, Mar 12, 2024 at 02:02:30PM -0700, Andrii Nakryiko wrote:
> > This patch set implements two speed ups for uprobe/uretprobe runtime 
> > execution
> > path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
> > system-wide (non-PID-specific) uprobes (patch #3). Please see individual
> > patches for details.
> >
> > Given I haven't worked with uprobe code before, I'm unfamiliar with
> > conventions in this subsystem, including which kernel tree patches should be
> > sent to. For now I based all the changes on top of bpf-next/master, which is
> > where I tested and benchmarked everything anyways. Please advise what should
> > I use as a base for subsequent revision. Thanks.

Steven, Masami,

Is this the kind of patches that should go through your tree(s)? Or
you'd be fine with this going through bpf-next? I'd appreciate the
link to the specific GIT repo I should use as a base in the former
case, thank you!

> >
> > Andrii Nakryiko (3):
> >   uprobes: encapsulate preparation of uprobe args buffer
> >   uprobes: prepare uprobe args buffer lazily
> >   uprobes: add speculative lockless system-wide uprobe filter check
>
> nice cleanup and speed up, lgtm
>
> Reviewed-by: Jiri Olsa 
>
> jirka
>
> >
> >  kernel/trace/trace_uprobe.c | 103 ++--
> >  1 file changed, 63 insertions(+), 40 deletions(-)
> >
> > --
> > 2.43.0
> >
> >

Re: [PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

2024-03-13 Thread Steven Rostedt

On Wed, 13 Mar 2024 09:59:03 -0700
Nathan Chancellor  wrote:

> > Reported-by: kernel test robot 
> > Closes: 
> > https://lore.kernel.org/oe-kbuild-all/202402292111.kidexylu-...@intel.com/
> > Fixes: 433e1d88a3be ("tracing: Add warning if string in __assign_str() does 
> > not match __string()")  
> 
> Is this change destined for 6.9 or 6.10? I applied it to current
> trace/core (eb1533d156d3) along with 51270d573a8d but the warning is
> still present. I even tried
> 
> __builtin_choose_expr(__is_constexpr((src)),
>  strcmp((src), __data_offsets.dst##_ptr_),
>  (src) != __data_offsets.dst##_ptr_));
> 
> but not even that silenced the warning. I think we will still need a
> diag directive to fully silence this warning.

Yes, you said that the warning is still there, but the bug it shows should
not be.

I believe that's because clang still evaluates the (src) != ... even when
the source is a contast and it warns about it. But if src is a constant, we
do not want to do the !=, we want to do the slower strcmp().

Let me test to make sure that when src is a string "like this" that it does
the strcmp(). Otherwise, we may have to always do the strcmp(), which I
really would like to avoid.

BTW, I triggered another bug with strcmp():

  https://lore.kernel.org/all/20240313093454.3909a...@gandalf.local.home/

-- Steve

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread Peter Hilber

On 13.03.24 13:45, David Woodhouse wrote:
> On Wed, 2024-03-13 at 10:45 +0100, Peter Hilber wrote:
>> On 12.03.24 18:15, David Woodhouse wrote:
>>> On Mon, 2024-03-11 at 19:24 +0100, Peter Hilber wrote:
 On 08.03.24 13:33, David Woodhouse wrote:
> On Fri, 2024-03-08 at 11:32 +0100, Peter Hilber wrote:
>> On 07.03.24 15:02, David Woodhouse wrote:
>>> Hm, should we allow UTC? If you tell me the time in UTC, then
>>> (sometimes) I still don't actually know what the time is, because some
>>> UTC seconds occur twice. UTC only makes sense if you provide the TAI
>>> offset, surely? Should the virtio_rtc specification make it mandatory
>>> to provide such?
>>>
>>> Otherwise you're just designing it to allow crappy hypervisors to
>>> expose incomplete information.
>>>
>>
>> Hi David,
>>
>> (adding virtio-comm...@lists.oasis-open.org for spec discussion),
>>
>> thank you for your insightful comments. I think I take a broadly similar
>> view. The reason why the current spec and driver is like this is that I
>> took a pragmatic approach at first and only included features which work
>> out-of-the-box for the current Linux ecosystem.
>>
>> The current virtio_rtc features work similar to ptp_kvm, and therefore
>> can work out-of-the-box with time sync daemons such as chrony.
>>
>> As of RFC spec v3, UTC clock only is allowed. If mandating a TAI clock
>> as well, I am afraid that
>>
>> - in some (embedded) scenarios, the TAI clock may not be available
>>
>> - crappy hypervisors will pass off the UTC clock as the TAI clock.
>>
>> For the same reasons, I am also not sure about adding a *mandatory* TAI
>> offset to each readout. I don't know user-space software which would
>> leverage this already (at least not through the PTP clock interface).
>> And why would such software not go straight for the TAI clock instead?
>>
>> How about adding a requirement to the spec that the virtio-rtc device
>> SHOULD expose the TAI clock whenever it is available - would this
>> address your concerns?
>
> I think that would be too easy for implementors to miss, or decide not
> to obey. Or to get *wrong*, by exposing a TAI clock but actually
> putting UTC in it.
>
> I think I prefer to mandate the tai_offset field with the UTC clock.
> Crappy implementations will just set it to zero, but at least that
> gives a clear signal to the guests that it's *their* problem to
> resolve.

 To me there are some open questions regarding how this would work. Is there
 a use case for this with the v3 clock reading methods, or would it be
 enough to address this with the Virtio timekeeper?

 Looking at clock_adjtime(2), the tai_offset could be exposed, but probably
 best alongside some additional information about leap seconds. I am not
 aware about any user-space user. In addition, leap second smearing should
 also be addressed.

>>>
>>> Is there even a standard yet for leap-smearing? Will it be linear over
>>> 1000 seconds like UTC-SLS? Or semi-raised-cosine over 24 hours, which I
>>> think is what Google does? Meta does something different again, don't
>>> they?
>>>
>>> Exposing UTC as the only clock reference is bad enough; when leap
>>> seconds happen there's a whole second during which you don't *know*
>>> which second it is. It seems odd to me, for a precision clock to be
>>> deliberately ambiguous about what the time is!
>>
>> Just to be clear, the device can perfectly expose only a TAI reference
>> clock (or both UTC and TAI), the spec is just completely open about this,
>> as it tries to work for diverse use cases.
> 
> As long as the guest *knows* what it's getting, sure.
> 
>>>
>>> But if the virtio-rtc clock is defined as UTC and then expose something
>>> *different* in it, that's even worse. You potentially end up providing
>>> inaccurate time for a whole *day* leading up to the leap second.
>>>
>>> I think you're right that leap second smearing should be addressed. At
>>> the very least, by making it clear that the virtio-rtc clock which
>>> advertises UTC shall be used *only* for UTC, never UTC-SLS or any other
>>> yet-to-be-defined variant.
>>>
>>
>> Agreed.
>>
>>> Please make it explicit that any hypervisor which wants to advertise a
>>> smeared clock shall define a new type which specifies the precise
>>> smearing algorithm and cannot be conflated with the one you're defining
>>> here.
>>>
>>
>> I will add a requirement that the UTC clock can never have smeared/smoothed
>> leap seconds.
> 
> Thanks.
> 
>> I think that not every vendor would bother to first add a definition of a
>> smearing algorithm. Also, I think in some cases knowing the precise
>> smearing algorithm might not be important (when having the same time as the
>> hypervisor is enough and accuracy w.r.t. actual time is less important).
>>
>> S

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread Peter Hilber

On 13.03.24 15:06, David Woodhouse wrote:
> On Wed, 2024-03-13 at 13:58 +0100, Alexandre Belloni wrote:
>> The TSC or whatever CPU counter/clock that is used to keep the system
>> time is not an RTC, I don't get why it has to be exposed as such to the
>> guests. PTP is fine and precise, RTC is not.
> 
> Ah, I see. But the point of the virtio_rtc is not really to expose that
> CPU counter. The point is to report the wallclock time, just like an
> actual RTC. The real difference is the *precision*.
> 
> The virtio_rtc device has a facility to *also* expose the counter,
> because that's what we actually need to gain that precision...
> 
> Applications don't read the RTC every time they want to know what the
> time is. These days, they don't even make a system call; it's done
> entirely in userspace mode. The kernel exposes some shared memory,
> essentially saying "the counter was X at time Y, and runs at Z Hz".
> Then applications just read the CPU counter and do some arithmetic.
> 
> As we require more and more precision in the calibration, it becomes
> important to get *paired* readings of the CPU counter and the wallclock
> time at precisely the same moment. If the guest has to read one and
> then the other, potentially taking interrupts, getting preempted and
> suffering steal/SMI time in the middle, that introduces an error which
> is increasingly significant as we increasingly care about precision.
> 
> Peter's proposal exposes the pairs of {X,Y} and leaves *all* the guest
> kernels having to repeat readings over time and perform the calibration
> as the underlying hardware oscillator frequency (Z) drifts with
> temperature. I'm trying to get him to let the hypervisor expose the
> calibrated frequency Z too. Along with *error* bounds for ±δX and ±δZ.
> Which aside from reducing the duplication of effort, will *also* fix
> the problem of live migration where *all* those things suffer a step
> change and leave the guest with an inaccurate clock but not knowing it.

I am already convinced that this would work significantly better than the
{X,Y} pair (but would be a bit more effort to implement):

1. when accessed by user space, obviously

2. when backing the PTP clock, it saves CPU time and makes non-paired
   reads more precise.

I would just prefer to try upstreaming the {X,Y} pairing first. I think the
{X,Y,Z...} pairing could be discussed and developed in parallel.

Thanks for the comments,

Peter

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread David Woodhouse

On 13 March 2024 17:50:48 GMT, Peter Hilber  
wrote:
>On 13.03.24 13:45, David Woodhouse wrote:
>> Surely the whole point of this effort is to provide guests with precise
>> and *unambiguous* knowledge of what the time is? 
>
>I would say, a fundamental point of this effort is to enable such
>implementations, and to detect if a device is promising to support this.
>
>Where we might differ is as to whether the Virtio clock *for every
>implementation* has to be *continuously* accurate w.r.t. a time standard,
>or whether *for some implementations* it could be enough that all guests in
>the local system have the same, precise local notion of time, which might
>be off from the actual time standard.

That makes sense, but remember I don't just want {X, Y, Z} but *also* the error 
bounds of ±deltaY and ±deltaZ too.

So your example just boils down to "I'm calling it UTC, and it's really 
precise, but we make no promises about its *accuracy*". And that's fine.

>Also, cf. ptp_kvm, which AFAIU doesn't address leap seconds at all...

KVM is not an exemplar of good time practices. 
Not in *any* respect :)

>With your described use case the UTC_SMEARED clock should of course not be
>used. The UTC_SMEARED clock would get a distinct name through udev, like
>/dev/ptp_virtio_utc_smeared, so the incompatibility could at least be
>detected.

As long as it's clear to all concerned that this is fundamentally not usable as 
an accurate time source, and is only for the local-sync case you described, 
sure.

>> Using UTC is bad enough, because for a UTC timestamp in the middle of a
>> leap second the guest can't know know *which* occurrence of that leap
>> second it is, so it might be wrong by a second. To resolve that
>> ambiguity needs a leap indicator and/or tai_offset field.
>
>I agree that virtio-rtc should communicate this. The question is, what
>exactly, and for which clock read request?

Are we now conflating software architecture (and Linux in particular) with 
"hardware" design?

To a certain extent, as long as the virtio-rtc device is designed to expose 
time precisely and unambiguously, it's less important if the Linux kernel 
*today* can use that. Although of course we should strive for that. Let's 
be...well, *unambiguous*, I suppose... that we've changed topics to discuss 
that though.

>As for PTP clocks:
>
>- It doesn't fit into the ioctl PTP_SYS_OFFSET_PRECISE2.
>
>- The clock_adjtime(2) tai_offset and return value could be set (if
>  upstream will accept this). Would this help? As discussed, user space
>  would need to interpret this (and currently no dynamic POSIX clock sets
>  this).

Hm, maybe?

>>> I think I can add a SHOULD requirement which vaguely refers to vCPU 0, or
>>> boot vCPU. But the Virtio device is not necessarily hosted by a hypervisor,
>>> so the device might not even know which vCPUs there are. E.g. there is even
>>> interest to make virtio-rtc work as part of the virtio-net device (which
>>> might be implemented in hardware).
>> 
>> Sure, but those implementations aren't going to offer the TSC pairing
>> at all, are they?
>> 
>
>They could offer an Intel ART pairing (some physical PTP NICs are already
>doing this, look for the convert_art_to_tsc() users).

Right, but isn't that software's problem? The time pairing is defined against 
the ART in that case.

Re: [PATCH 3/3] dt-bindings: remoteproc: Add Arm remoteproc

2024-03-13 Thread Robin Murphy


On 2024-03-01 4:42 pm, abdellatif.elkhl...@arm.com wrote:

From: Abdellatif El Khlifi 

introduce the bindings for Arm remoteproc support.

Signed-off-by: Abdellatif El Khlifi 
---
  .../bindings/remoteproc/arm,rproc.yaml| 69 +++
  MAINTAINERS   |  1 +
  2 files changed, 70 insertions(+)
  create mode 100644 Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml

diff --git a/Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml 
b/Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml
new file mode 100644
index ..322197158059
--- /dev/null
+++ b/Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml
@@ -0,0 +1,69 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/remoteproc/arm,rproc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm Remoteproc Devices
+
+maintainers:
+  - Abdellatif El Khlifi 
+
+description: |
+  Some Arm heterogeneous System-On-Chips feature remote processors that can
+  be controlled with a reset control register and a reset status register to
+  start or stop the processor.
+
+  This document defines the bindings for these remote processors.
+
+properties:
+  compatible:
+enum:
+  - arm,corstone1000-extsys
+
+  reg:
+minItems: 2
+maxItems: 2
+description: |
+  Address and size in bytes of the reset control register
+  and the reset status register.
+  Expects the registers to be in the order as above.
+  Should contain an entry for each value in 'reg-names'.
+
+  reg-names:
+description: |
+  Required names for each of the reset registers defined in
+  the 'reg' property. Expects the names from the following
+  list, in the specified order, each representing the corresponding
+  reset register.
+items:
+  - const: reset-control
+  - const: reset-status
+
+  firmware-name:
+description: |
+  Default name of the firmware to load to the remote processor.


So... is loading the firmware image achieved by somehow bitbanging it 
through the one reset register, maybe? I find it hard to believe this is 
a complete and functional binding.


Frankly at the moment I'd be inclined to say it isn't even a remoteproc 
binding (or driver) at all, it's a reset controller. Bindings are a 
contract for describing the hardware, not the current state of Linux 
driver support - if this thing still needs mailboxes, shared memory, a 
reset vector register, or whatever else to actually be useful, those 
should be in the binding from day 1 so that a) people can write and 
deploy correct DTs now, such that functionality becomes available on 
their systems as soon as driver support catches up, and b) the community 
has any hope of being able to review whether the binding is 
appropriately designed and specified for the purpose it intends to serve.


For instance right now it seems somewhat tenuous to describe two 
consecutive 32-bit registers as separate "reg" entries, but *maybe* it's 
OK if that's all there ever is. However if it's actually going to end up 
needing several more additional MMIO and/or memory regions for other 
functionality, then describing each register and location individually 
is liable to get unmanageable really fast, and a higher-level functional 
grouping (e.g. these reset-related registers together as a single 8-byte 
region) would likely be a better design.


Thanks,
Robin.


+
+required:
+  - compatible
+  - reg
+  - reg-names
+  - firmware-name
+
+additionalProperties: false
+
+examples:
+  - |
+extsys0: remoteproc@1a010310 {
+compatible = "arm,corstone1000-extsys";
+reg = <0x1a010310 0x4>, <0x1a010314 0x4>;
+reg-names = "reset-control", "reset-status";
+firmware-name = "es0_flashfw.elf";
+};
+
+extsys1: remoteproc@1a010318 {
+compatible = "arm,corstone1000-extsys";
+reg = <0x1a010318 0x4>, <0x1a01031c 0x4>;
+reg-names = "reset-control", "reset-status";
+firmware-name = "es1_flashfw.elf";
+};
diff --git a/MAINTAINERS b/MAINTAINERS
index 54d6a40feea5..eddaa3841a65 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1768,6 +1768,7 @@ ARM REMOTEPROC DRIVER
  M:Abdellatif El Khlifi 
  L:linux-remotep...@vger.kernel.org
  S:Maintained
+F: Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml
  F:drivers/remoteproc/arm_rproc.c
  
  ARM SMC WATCHDOG DRIVER

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

2024-03-13 Thread Andrew Lunn

> As long as it doesn't behave differently from the other RTC, I'm fine
> with this. This is important because I don't want to carry any special
> infrastructure for this driver or to have to special case this driver
> later on because it is incompatible with some evolution of the
> subsystem.

Maybe deliberately throw away all the sub-second accuracy when
accessing the time via the RTC API? That helps to make it look like an
RTC. And when doing the rounding, add a constant offset of 10ms to
emulate the virtual i2c bus it is hanging off, like most RTCs?

  Andrew

Re: [PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

2024-03-13 Thread Steven Rostedt

On Wed, 13 Mar 2024 13:45:50 -0400
Steven Rostedt  wrote:

> Let me test to make sure that when src is a string "like this" that it does
> the strcmp(). Otherwise, we may have to always do the strcmp(), which I
> really would like to avoid.

I added the below patch and enabled sched_switch and it triggered the
warning (expected if it went the strcmp() path). I then changed it to be:

#define __assign_str(dst, src)  \
do {\
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
WARN_ON_ONCE(__builtin_constant_p(src) ?\
 strcmp((src), __data_offsets.dst##_ptr_) : \
-(src) != __data_offsets.dst##_ptr_);   \
+(src) == __data_offsets.dst##_ptr_);   \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\
__str__[__len__] = '\0';\
} while (0)

And the sched_switch did not trigger (expected). So it seems that it should
not be a problem.

Note, I only tested this with gcc and not clang.

But I guess there's also the case where we have:

__assign_str(str, field ? field : "NULL")

But hopefully that's not an issue.

-- Steve

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..eaacd0c4e899 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -236,6 +236,7 @@ TRACE_EVENT(sched_switch,
__array(char,   next_comm,  TASK_COMM_LEN   )
__field(pid_t,  next_pid)
__field(int,next_prio   )
+   __string( test, "this")
),
 
TP_fast_assign(
@@ -246,6 +247,7 @@ TRACE_EVENT(sched_switch,
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid   = next->pid;
__entry->next_prio  = next->prio;
+   __assign_str(test, "this");
/* XXX SCHED_DEADLINE */
),
 
diff --git a/include/trace/stages/stage6_event_callback.h 
b/include/trace/stages/stage6_event_callback.h
index 83da83a0c14f..cf301c723fd0 100644
--- a/include/trace/stages/stage6_event_callback.h
+++ b/include/trace/stages/stage6_event_callback.h
@@ -36,7 +36,7 @@
char *__str__ = __get_str(dst); \
int __len__ = __get_dynamic_array_len(dst) - 1; \
WARN_ON_ONCE(__builtin_constant_p(src) ?\
-strcmp((src), __data_offsets.dst##_ptr_) : \
+!strcmp((src), __data_offsets.dst##_ptr_) :
\
 (src) != __data_offsets.dst##_ptr_);   \
memcpy(__str__, __data_offsets.dst##_ptr_ ? :   \
   EVENT_NULL_STR, __len__);\

Re: [GIT PULL] Modules changes for v6.9-rc1

2024-03-13 Thread pr-tracker-bot

The pull request you sent on Tue, 12 Mar 2024 18:38:32 -0700:

> git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ 
> tags/modules-6.9-rc1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/ce0c1c92656e3ea3840c4a5c748aa352285cae9c

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

Re: [PATCH v2 1/6] fuse: limit the length of ITER_KVEC dio by max_pages

2024-03-13 Thread Bernd Schubert




On 3/9/24 05:26, Hou Tao wrote:
> Hi,
> 
> On 3/1/2024 9:42 PM, Miklos Szeredi wrote:
>> On Wed, 28 Feb 2024 at 15:40, Hou Tao  wrote:
>>
>>> So instead of limiting both the values of max_read and max_write in
>>> kernel, capping the maximal length of kvec iter IO by using max_pages in
>>> fuse_direct_io() just like it does for ubuf/iovec iter IO. Now the max
>>> value for max_pages is 256, so on host with 4KB page size, the maximal
>>> size passed to kmalloc() in copy_args_to_argbuf() is about 1MB+40B. The
>>> allocation of 2MB of physically contiguous memory will still incur
>>> significant stress on the memory subsystem, but the warning is fixed.
>>> Additionally, the requirement for huge physically contiguous memory will
>>> be removed in the following patch.
>> So the issue will be fixed properly by following patches?
>>
>> In that case this patch could be omitted, right?
> 
> Sorry for the late reply. Being busy with off-site workshop these days.
> 
> No, this patch is still necessary and it is used to limit the number of
> scatterlist used for fuse request and reply in virtio-fs. If the length
> of out_args[0].size is not limited, the number of scatterlist used to
> map the fuse request may be greater than the queue size of virtio-queue
> and the fuse request may hang forever.

I'm currently also totally busy and didn't carefully check, but isn't
there something missing that limits fc->max_write/fc->max_read?


Thanks,
Bernd

Re: [External] Re: [PATCH v2 1/1] memory tier: acpi/hmat: create CPUless memory tiers after obtaining HMAT info

Re: [RFC PATCH v3 7/7] virtio_rtc: Add RTC class driver

Re: [PATCH bpf-next 0/3] uprobes: two common case speed ups

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS

Re: [PATCH v2 4/6] virtiofs: support bounce buffer backed by scattered pages

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check

[PATCH] net: hns3: tracing: fix hclgevf trace event strings

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer

Re: [PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily

Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

[GIT PULL] OpenRISC updates for 6.9

Re: [PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer

Re: [PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily

Re: [PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

Re: [PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check

Re: [PATCH 1/3] remoteproc: Add Arm remoteproc driver

Re: [PATCH bpf-next 0/3] uprobes: two common case speed ups

Re: [PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [PATCH 3/3] dt-bindings: remoteproc: Add Arm remoteproc

Re: [RFC PATCH v3 0/7] Add virtio_rtc module and related changes

Re: [PATCH] tracing: Use strcmp() in __assign_str() WARN_ON() check

Re: [GIT PULL] Modules changes for v6.9-rc1

Re: [PATCH v2 1/6] fuse: limit the length of ITER_KVEC dio by max_pages

34 matches

Site Navigation

Mail list logo

Footer information