RE: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-06 Thread David Laight
From: Peter Zijlstra
> Sent: 06 March 2017 11:22
> To: Madhavan Srinivasan
> Cc: Wang Nan; Alexander Shishkin; linux-ker...@vger.kernel.org; Arnaldo 
> Carvalho de Melo; Alexei
> Starovoitov; Ingo Molnar; Stephane Eranian; Sukadev Bhattiprolu; 
> linuxppc-dev@lists.ozlabs.org
> Subject: Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of 
> perf_mem_data_src
> 
> On Mon, Mar 06, 2017 at 04:13:08PM +0530, Madhavan Srinivasan wrote:
> > From: Sukadev Bhattiprolu 
> >
> > perf_mem_data_src is a union that is initialized via the ->val field
> > and accessed via the bitmap fields. For this to work on big endian
> > platforms, we also need a big-endian representation of perf_mem_data_src.
> 
> Doesn't this break interpreting the data on a different endian machine?

Best to avoid bitfields if you ever care about the bit order.

David



RE: [PATCH] powerpc: Avoid panic during boot due to divide by zero in init_cache_info()

2017-03-06 Thread David Laight
From: Segher Boessenkool
> Sent: 06 March 2017 14:18
> On Mon, Mar 06, 2017 at 01:03:19PM +0100, Gabriel Paubert wrote:
> > > > > The PowerPC divw etc. instructions do not trap by themselves, but 
> > > > > recent
> > > > > GCC inserts trap instructions on code paths that are always undefined
> > > > > behaviour (like, dividing by zero).
> > > >
> > > > Is it systematic or does it depend on, e.g., optimization levels?
> > >
> > > In this case it needs -fisolate-erroneous-paths-dereference which is
> > > default at -O2 and higher.
> >
> > Great, another optimization-dependent behaviour. :-(
> 
> It makes the "behaviour" for undefined behaviour *less* surprising.
> It does not change anything else: malformed programs stay malformed,
> correct programs do exactly what they did before, too.

Yep, 'undefined behaviour' is exactly that.
It doesn't mean 'undefined result', or 'maybe a signal'.
Wiping the disk and targeting the user with an ICBM are both valid.

David



RE: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'

2017-03-14 Thread David Laight
From: Linuxppc-dev Till Smejkal
> Sent: 13 March 2017 22:14
> The only way until now to create a new memory map was via the exported
> function 'mm_alloc'. Unfortunately, this function not only allocates a new
> memory map, but also completely initializes it. However, with the
> introduction of first class virtual address spaces, some initialization
> steps done in 'mm_alloc' are not applicable to the memory maps needed for
> this feature and hence would lead to errors in the kernel code.
> 
> Instead of introducing a new function that can allocate and initialize
> memory maps for first class virtual address spaces and potentially
> duplicate some code, I decided to split the mm_alloc function as well as
> the 'mm_init' function that it uses.
> 
> Now there are four functions exported instead of only one. The new
> 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one
> want to have the old behavior of mm_alloc one can use the newly introduced
> function 'mm_alloc_and_setup' which not only allocates a new mm_struct but
> also fully initializes it.
...

That looks like bugs waiting to happen.
You need unchanged code to fail to compile.

David




RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread David Laight
From: Linuxppc-dev Daniel Axtens
> Sent: 15 March 2017 12:38
> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
> 
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
...

While not part of this change, the unrolled loops look as though
they just destroy the cpu cache.
I'd like to be convinced that anything does CRC over long enough buffers
to make it a gain at all.

With modern (not that modern now) superscalar cpus you can often
get the loop instructions 'for free'.
Sometimes pipelining the loop is needed to get full throughput.
Unlike the IP checksum, you don't even have to 'loop carry' the
cpu carry flag.

David



RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread David Laight
From: Daniel Axtens
> Sent: 15 March 2017 22:30
> Hi David,
> 
> > While not part of this change, the unrolled loops look as though
> > they just destroy the cpu cache.
> > I'd like to be convinced that anything does CRC over long enough buffers
> > to make it a gain at all.
> >
> > With modern (not that modern now) superscalar cpus you can often
> > get the loop instructions 'for free'.
> > Sometimes pipelining the loop is needed to get full throughput.
> > Unlike the IP checksum, you don't even have to 'loop carry' the
> > cpu carry flag.
> 
> Internal testing on a NVMe device with T10DIF enabled on 4k blocks
> shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic
> uses over 60% of CPU time - with these patches CRC drops to single
> digits.
> 
> I should probably have led with that, sorry.

I'm not doubting that using the cpu instruction for crcs gives a
massive performance boost.

Just that the heavily unrolled loop is unlikely to help overall.
Some 'cold cache' tests on shorter buffers might be illuminating.
 
> FWIW, the original patch showed a 3.7x gain on btrfs as well -
> 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")
> 
> When Anton wrote the original code he had access to IBM's internal
> tooling for looking at how instructions flow through the various stages
> of the CPU, so I trust it's pretty much optimal from that point of view.

Doesn't mean he used it :-)

David




RE: RFC on writel and writel_relaxed

2018-03-21 Thread David Laight
> x86 has compiler barrier inside the relaxed() API so that code does not
> get reordered. ARM64 architecturally guarantees device writes to be observed
> in order.

There are places where you don't even need a compiler barrier between
every write.

I had horrid problems getting some ppc code (for a specific embedded SoC)
optimised to have no extra barriers.
I ended up just writing through 'pointer to volatile' and adding an
explicit 'eieio' between the block of writes and status read.

No less painful was doing a byteswapping write to normal memory.

David



RE: RFC on writel and writel_relaxed

2018-03-22 Thread David Laight
From: Oliver
> Sent: 22 March 2018 05:24
...
> > No less painful was doing a byteswapping write to normal memory.
> 
> What was the problem? The reverse indexed load/store instructions are
> a little awkward to use, but they work...

Finding something that would generate the right instruction without any
barriers.
ISTR writing my own asm pattern.

David



RE: RFC on writel and writel_relaxed

2018-03-26 Thread David Laight
> > This is a super performance critical operation for most drivers and
> > directly impacts network performance.

Perhaps there ought to be writel_nobarrier() (etc) that never contain
any barriers at all.
This might mean that they are always just the memory operation,
but it would make it more obvious what the driver was doing.

The driver would then be explicitly responsible for all the rmb(), wmb()
and mmiowb() (etc).
Performance critical paths could then avoid all the extra barriers.

David




RE: RFC on writel and writel_relaxed

2018-03-27 Thread David Laight
> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong...

Certainly requiring that driver writers use explicit barriers should make
them understand when and why they are needed - and then put in the correct ones.

The problem I've had is that I have a good idea which barriers are needed
but find that readl/writel seem to contain a lot of extra ones.
Maybe they are required in some places, but the extra synchronising
instructions could easily have measurable performance effects on
hot paths.

Drivers are likely to contain sequences like:
read_io
if (...) return
write_mem
...
write_mem
barrier
write_mem
barrier
write_io
for things like ring updates.
Where the 'mem' might actually be in io space.
In such sequences not all the synchronising instructions are needed.
I'm not at all sure it is easy to get the right set.

David






RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Will Deacon
> Sent: 28 March 2018 09:54
...
> > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > expand to outX, and outX is not ordered with respect to other types of
> > > memory.
> >
> > Ugh ?
> >
> > My understanding of HW at least is the exact opposite. outX is *more*
> > ordered if anything, than any other accessors. IO space is completely
> > synchronous, non posted and ordered afaik.
> 
> I'm just going by memory-barriers.txt:
> 
> 
>  (*) inX(), outX():
> 
>  [...]
> 
>  They are guaranteed to be fully ordered with respect to each other.
> 
>  They are not guaranteed to be fully ordered with respect to other types 
> of
>  memory and I/O operation.

A long time ago there was a document from Intel that said that inb/outb weren't
necessarily synchronised wrt memory accesses.
(Might be P-pro era).
However no processors actually behaved that way and more recent docs
say that inb/outb are fully ordered.

David



RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 10:56
...
> For example, let's say I have a device with a reset bit and the spec
> says the reset bit needs to be set for at least 10us.
> 
> This is wrong:
> 
>   writel(1, RESET_REG);
>   usleep(10);
>   writel(0, RESET_REG);
> 
> Because of write posting, the first write might arrive to the device
> right before the second one.
> 
> The typical "fix" is to turn that into:
> 
>   writel(1, RESET_REG);
>   readl(RESET_REG); /* Flush posted writes */

Would a writel(1, RESET_REG) here provide enough synchronisation?

>   usleep(10);
>   writel(0, RESET_REG);
> 
> *However* the issue here, at least on power, is that the CPU can issue
> that readl but doesn't necessarily wait for it to complete (ie, the
> data to return), before proceeding to the usleep. Now a usleep contains
> a bunch of loads and stores and is probably fine, but a udelay which
> just loops on the timebase may not be.
> 
> Thus we may still violate the timing requirement.

I've seen that sort of code (with udelay() and no read back) quite often.
How many were in linux I don't know.

For small delays I usually fix it by repeated writes (of the same value)
to the device register. That can guarantee very short intervals.

The only time I've actually seen buffered writes break timing was
between a 286 and an 8259 interrupt controller.
If you wrote to the mask and then enabled interrupts, the first IACK cycle
could be too close to the write and break the cycle recovery time.
That clobbered most of the interrupt controller registers.
That probably affected every 286 board ever built!
Not sure how much software added the required extra bus cycle.

> What we did inside readl, with the twi;isync sequence (which basically
> means, trap on return value with "trap never" as a condition, followed
> by isync that ensures all exception conditions are resolved), is force
> the CPU to "consume" the data from the read before moving on.
> 
> This effectively makes readl fully synchronous (we would probably avoid
> that if we were to implement a readl_relaxed).

I've always wondered exactly what the twi;isync were for - always seemed
very heavy handed for most mmio reads.
Particularly if you are doing mmio reads from a data fifo.

Perhaps there should be a writel_wait() that is allowed to do a read back
for such code paths?

David



RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 16:13
...
> > I've always wondered exactly what the twi;isync were for - always seemed
> > very heavy handed for most mmio reads.
> > Particularly if you are doing mmio reads from a data fifo.
> 
> If you do that you should use the "s" version of the accessors. Those
> will only do the above trick at the end of the access series. Also a
> FIFO needs special care about endianness anyway, so you should use
> those accessors regardless. (Hint: you never endian swap a FIFO even on
> BE on a LE device, unless something's been wired very badly in HW).

That was actually a 64 bit wide fifo connected to a 16bit wide PIO interface.
Reading the high address 'clocked' the fifo.
So the first 3 reads could happen in any order, but the 4th had to be last.
This is a small ppc and we shovel a lot of data through that fifo.

Whether it needed byteswapping depended completely on how our hardware people
had built the pcb (not made easy by some docs using the ibm bit numbering).
In fact it didn't.

While that driver only had to run on a very specific small ppc, generic drivers
might have similar issues.

I suspect that writel() is always (or should always be):
barrier_before_writel()
writel_relaxed()
barrier_after_writel()
So if a driver needs to do multiple writes (without strong ordering)
it should be able to repeat the writel_relaxed() with only one set
of barriers.
Similarly for readl().
In addition a lesser barrier is probably enough between a readl_relaxed()
and a writel_relaxed() that is conditional on the read value.

David



RE: RFC on writel and writel_relaxed

2018-03-29 Thread David Laight
From: Jason Gunthorpe
> Sent: 29 March 2018 15:45
...
> > > When talking about ordering between the devices, the relevant question
> > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > from the DEVICE_FOO. 'ordered' means that in this case
> > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > by BAR.
> >
> > Yes, and that isn't the case for arm because the writes can still be
> > buffered.
> 
> The statement is not about buffering, or temporal completion order, or
> the order of acks returning to the CPU. It is about pure transaction
> ordering inside the interconnect.
> 
> Can write BAR -> FOO pass write CPU -> FOO?

Almost certainly.
The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge.
The second cpu write then completes (to a different target).
That target then issues a peer to peer transfer that reaches the shared bridge.
I doubt the order of the transactions is guaranteed when it becomes 
'un-stalled'.

Of course, these are peer to peer transfers, and strange ones at that.
Normally you'd not be doing peer to peer transfers that access 'memory'
the cpu has just written to.

Requiring extra barriers in this case, or different functions for WC accesses
shouldn't really be an issue.

Even requiring a barrier between a write to dma coherent memory and a write
that starts dma isn't really onerous.
Even if it is a nop on all current architectures it is a good comment in the 
code.
It could even have a 'dev' argument.

David



RE: [PATCH v2 1/5] kprobes: convert kprobe_lookup_name() to a function

2017-04-18 Thread David Laight
From: Naveen N. Rao
> Sent: 12 April 2017 11:58
...
> +kprobe_opcode_t *kprobe_lookup_name(const char *name)
> +{
...
> + char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
> + const char *modsym;
> + bool dot_appended = false;
> + if ((modsym = strchr(name, ':')) != NULL) {
> + modsym++;
> + if (*modsym != '\0' && *modsym != '.') {
> + /* Convert to  */
> + strncpy(dot_name, name, modsym - name);
> + dot_name[modsym - name] = '.';
> + dot_name[modsym - name + 1] = '\0';
> + strncat(dot_name, modsym,
> + sizeof(dot_name) - (modsym - name) - 2);
> + dot_appended = true;

If the ':' is 'a way down' name[] then although the strncpy() won't
overrun dot_name[] the rest of the code can.

The strncat() call is particularly borked.

David



RE: [PATCH v2 1/5] kprobes: convert kprobe_lookup_name() to a function

2017-04-19 Thread David Laight
From: Naveen N. Rao
> Sent: 19 April 2017 09:09
> To: David Laight; Michael Ellerman
> Cc: linux-ker...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Masami 
> Hiramatsu; Ingo Molnar
> Subject: RE: [PATCH v2 1/5] kprobes: convert kprobe_lookup_name() to a 
> function
> 
> Excerpts from David Laight's message of April 18, 2017 18:22:
> > From: Naveen N. Rao
> >> Sent: 12 April 2017 11:58
> > ...
> >> +kprobe_opcode_t *kprobe_lookup_name(const char *name)
> >> +{
> > ...
> >> +  char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
> >> +  const char *modsym;
> >> +  bool dot_appended = false;
> >> +  if ((modsym = strchr(name, ':')) != NULL) {
> >> +  modsym++;
> >> +  if (*modsym != '\0' && *modsym != '.') {
> >> +  /* Convert to  */
> >> +  strncpy(dot_name, name, modsym - name);
> >> +  dot_name[modsym - name] = '.';
> >> +  dot_name[modsym - name + 1] = '\0';
> >> +  strncat(dot_name, modsym,
> >> +  sizeof(dot_name) - (modsym - name) - 2);
> >> +  dot_appended = true;
> >
> > If the ':' is 'a way down' name[] then although the strncpy() won't
> > overrun dot_name[] the rest of the code can.
> 
> Nice catch, thanks David!
> We need to be validating the length of 'name'. I'll put out a patch for
> that.

Silent truncation is almost certainly wrong here.

> As an aside, I'm not sure I follow what you mean when you say that the
> strncpy() won't overrun dot_name[]. If we have a name[] longer than
> sizeof(dot_name) with the ':' after that, the strncpy() can also overrun
> dot_name[].

Yes, that should just be a memcpy(), as should the strncat().

Using strncpy() where the length is other than the size of the target buffer
should be banned. Not that it ever does what people expect.
strncat() is even worse.

David






RE: [PATCH] powerpc/32: Move entry_32 functions just after HEAD functions.

2017-04-19 Thread David Laight
From: Christophe Leroy
> By default, PPC8xx PINs an ITLB on the first 8M of memory in order
> to avoid any ITLB miss on kernel code.
> However, with some debug functions like DEBUG_PAGEALLOC and
> (soon to come) DEBUG_RODATA, the PINned TLB is invalidated soon
> after startup so ITLB missed start to happen also on the kernel code.
> 
> In order to avoid any ITLB miss in a critical section, we have to
> ensure that their is no page boundary crossed between the setup of
> a new value in SRR0/SRR1 and the associated RFI. This cannot be done
> easily if entry_32 functions sits in the middle of other .text
> functions. By placing entry_32 just after the .head section (as already
> done for entry_64 on PPC64), we can more easily ensure the issue
> doesn't happen.

Shouldn't this be done by putting all the functions that 'matter'
into a named section instead of relying on the order of the input files?
(Which is what I think this is doing.)

David



RE: [PATCH v3 4/7] powerpc: kprobes: use safer string functions in kprobe_lookup_name()

2017-04-21 Thread David Laight
From: Naveen N. Rao
> Sent: 19 April 2017 13:51
...
>   dot_name[0] = '\0';
> - strncat(dot_name, name, sizeof(dot_name) - 1);
> + strlcat(dot_name, name, sizeof(dot_name));
...

Is that really zeroing the first byte just so it can append to it?

David



RE: [PATCH 2/8] selftests: lib.mk: define CLEAN macro to allow Makefiles to override clean

2017-04-24 Thread David Laight
From: Shuah Khan
> Sent: 22 April 2017 00:15
> Define CLEAN macro to allow Makefiles to override common clean target
> in lib.mk. This will help fix the following failures:
> 
> warning: overriding recipe for target 'clean'
> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
> 
> Signed-off-by: Shuah Khan 
> ---
>  tools/testing/selftests/lib.mk | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
> index 775c589..959273c 100644
> --- a/tools/testing/selftests/lib.mk
> +++ b/tools/testing/selftests/lib.mk
> @@ -51,8 +51,12 @@ endef
>  emit_tests:
>   $(EMIT_TESTS)
> 
> -clean:
> +define CLEAN
>   $(RM) -r $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES) 
> $(EXTRA_CLEAN)
> +endef
> +
> +clean:
> + $(CLEAN)

It might be easier to do something like:

ifneq ($(NO_CLEAN),y)
clean:
$(RM) -r $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES) 
$(EXTRA_CLEAN)
endif

David



RE: [PATCH 8/8] selftests: x86: override clean in lib.mk to fix warnings

2017-04-24 Thread David Laight
From: Linuxppc-dev Michael Ellerman
> Shuah Khan  writes:
> 
> > Add override for lib.mk clean to fix the following warnings from clean
> > target run.
> >
> > Makefile:44: warning: overriding recipe for target 'clean'
> > ../lib.mk:55: warning: ignoring old recipe for target 'clean'
> >
> > Signed-off-by: Shuah Khan 
> > ---
> >  tools/testing/selftests/x86/Makefile | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/x86/Makefile 
> > b/tools/testing/selftests/x86/Makefile
> > index 38e0a9c..4d27550 100644
> > --- a/tools/testing/selftests/x86/Makefile
> > +++ b/tools/testing/selftests/x86/Makefile
> > @@ -40,8 +40,9 @@ all_32: $(BINARIES_32)
> >
> >  all_64: $(BINARIES_64)
> >
> > -clean:
> > +override define CLEAN
> > $(RM) $(BINARIES_32) $(BINARIES_64)
> > +endef
> 
> Simpler as:
> 
> EXTRA_CLEAN := $(BINARIES_32) $(BINARIES_64)

Actually for builds that insist on crapping all over the source tree I've used:

clean:
rm -rf `cat .cvsignore 2>/dev/null`

David



RE: [PATCH] kallsyms: optimize kallsyms_lookup_name() for a few cases

2017-04-25 Thread David Laight
From: Naveen N. Rao
> Sent: 25 April 2017 17:18
> 1. Fail early for invalid/zero length symbols.
> 2. Detect names of the form  and skip checking for kernel
> symbols in that case.
> 
> Signed-off-by: Naveen N. Rao 
> ---
> Masami, Michael,
> I have added two very simple checks here, which I felt is good to have,
> rather than the elaborate checks in the previous version. Given the
> change in module code to use strnchr(), the checks are now safe and
> further tests are probably not that useful.
...
> + if (strnchr(name, MODULE_NAME_LEN, ':'))
> + return module_kallsyms_lookup_name(name);

Should that be MODULE_NAME_LEN - 1 ?

David



RE: [PATCH] powerpc/xive: Fix/improve verbose debug output

2017-04-28 Thread David Laight
From: Michael Ellerman
> Sent: 28 April 2017 07:34
...
> Not sure what you mean. You can enable pr_debug()s individually, by
> function, by module, by file, or for the whole kernel.

It is sort of a shame that you can't turn pr_info() off in the same way.

David



RE: [PATCH v2] powerpc/kprobes: refactor kprobe_lookup_name for safer string operations

2017-05-04 Thread David Laight
From: Naveen N. Rao [mailto:naveen.n@linux.vnet.ibm.com]
> Sent: 04 May 2017 11:25
> Use safer string manipulation functions when dealing with a
> user-provided string in kprobe_lookup_name().
> 
> Reported-by: David Laight 
> Signed-off-by: Naveen N. Rao 
> ---
> Changed to ignore return value of 0 from strscpy(), as suggested by
> Masami.

let's see what this code looks like:

>   char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
>   bool dot_appended = false;
> + const char *c;
> + ssize_t ret = 0;
> + int len = 0;
> +
> + if ((c = strnchr(name, MODULE_NAME_LEN, ':')) != NULL) {

I don't like unnecessary assignments in conditionals.

> + c++;
> + len = c - name;
> + memcpy(dot_name, name, len);
> + } else
> + c = name;
> +
> + if (*c != '\0' && *c != '.') {
> + dot_name[len++] = '.';
>   dot_appended = true;

If you don't append a dot, then you can always just lookup
the original string.

>   }
> + ret = strscpy(dot_name + len, c, KSYM_NAME_LEN);
> + if (ret > 0)
> + addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name);

I'm not sure you need 'ret' here at all.

> + /* Fallback to the original non-dot symbol lookup */
> + if (!addr && dot_appended)
>   addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);

We can bikeshed this function to death:

/*
 * The function name must start with a '.'.
 * If it doesn't then we insert one.
 */
c = strnchr(name, MODULE_NAME_LEN, ':');
if (c && c[1] && c[1] != '.') {
	/* Insert a '.' after the ':' */
	c++;
	len = c - name;
	memcpy(dot_name, name, len);
} else {
	if (name[0] == '.')
		goto check_name;
	/* Insert a '.' before name */
	c = name;
	len = 0;
}

dot_name[len++] = '.';
if (strscpy(dot_name + len, c, KSYM_NAME_LEN) > 0) {
	addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name);
	if (addr)
		return addr;
}
/* Symbol with extra '.' not found, fall back to the original name */

check_name:
	return (kprobe_opcode_t *)kallsyms_lookup_name(name);

David



RE: [PATCH v2] powerpc/kprobes: refactor kprobe_lookup_name for safer string operations

2017-05-04 Thread David Laight
From: Paul Clarke
> Sent: 04 May 2017 16:07
...
> > +   if ((c = strnchr(name, MODULE_NAME_LEN, ':')) != NULL) {
> 
> Shouldn't this be MODULE_NAME_LEN + 1, since the ':' can come after a module 
> name of length
> MODULE_NAME_LEN?

No, because MODULE_NAME_LEN includes the terminating '\0'.

David



RE: [PATCH] spin loop primitives for busy waiting

2017-05-12 Thread David Laight
From: Linus Torvalds
> Sent: 11 May 2017 19:48
...
> The one question I have is about "spin_on_cond()": since you
> explicitly document that the "no spinning" case is expected to be the
> default, I really think that the default implementation should be
> along the lines if
> 
>   #define spin_on_cond(cond) do { \
> if (unlikely(!(cond))) { \
> spin_begin(); do spin_cpu_relax(); while (!(cond)); spin_end(); \
> } \
>   } while (0)
> 
> which will actually result in better code generation even if
> spin_begin/end() are no-ops, and just generally optimizes for the
> right behavior (ie do the spinning out-of-line, since by definition it
> cannot be performance-critical after the first iteration).

At least some versions of gcc convert while (cond) {body}
into if (cond) do {body} while (cond) even when 'cond'
is a non-trivial expression and 'body' is trivial.
The code-bloat is silly.
No point enforcing the 'optimisation' here.

David


RE: [PATCH 2/9] timers: provide a "modern" variant of timers

2017-05-19 Thread David Laight
From: Christoph Hellwig
> Sent: 16 May 2017 12:48
>
> The new callback gets a pointer to the timer_list itself, which can
> then be used to get the containing structure using container_of
> instead of casting from and to unsigned long all the time.

What about sensible drivers that put some other value in the 'data'
field?

Perhaps it ought to have been 'void *data'.

Seems retrograde to be passing the address of the timer structure
(which, in principle, the callers know nothing about).

So I wouldn't call it 'modern', just different.

David



RE: [PATCH 3/3] EDAC: mv64x60: replace in_le32/out_le32 with ioread32/iowrite32

2017-05-19 Thread David Laight
From: Arnd Bergmann
> Sent: 17 May 2017 22:40
> 
> On Wed, May 17, 2017 at 11:16 PM, Chris Packham
>  wrote:
> > On 18/05/17 06:18, Borislav Petkov wrote:
> > One thing I would like confirmation on is is in_le32 -> ioread32 the
> > correct change? I tossed up between ioread32 and readl. Looking at
> > mv643xx_eth.c which supports both the MV643XX and Orion it's using readl
> > so perhaps I should be using that.
> 
> There is no easy answer: on powerpc, readl is used for PCI,
> while in_le32 is used for on-chip devices, so in_le32 is the
> right one in principle. The main difference is that readl can
> work with CONFIG_EEH on pseries, but in_le32 is cheaper.
> 
> On ARM and most other architectures, readl is used for both
> PCI and on-chip devices, so that's what portable code tends
> to use.
> 
> ioread32 is required to behave the same way as readl
> on all __iomem pointers returned from ioremap(), but
> is an extern function on powerpc and can be more
> expensive when CONFIG_GENERIC_IOMAP is set.

What about x86?
Isn't ioread32() an extern function that checks for 'io' addresses
that need 'inb' (etc) instructions rather than memory ones?
If we know a PCI slave isn't 'io' should we be using ioread32() or readl()?
Don't some architectures have different enforced barriers in both these?

David





RE: RFC: better timer interface

2017-05-23 Thread David Laight
From: Thomas Gleixner
> Sent: 21 May 2017 19:15
...
> > timer_start(timer, ms, abs)
> 
> I'm not even sure, whether we need absolute timer wheel timers at
> all, because most use cases are relative to now.

Posix requires absolute timers for some userspace calls
(annoying because the code often wants relative).

OTOH how much conditional code is there for the 'abs' argument.
And is there any code that doesn't pass a constant?

Certainly worth a separate timer_start_abs(timer, wall_time)
function since you can't correctly map a wall_time timer
to a jiffies one.

David



RE: RFC: better timer interface

2017-05-23 Thread David Laight
From: Thomas Gleixner
> Sent: 23 May 2017 12:59
> On Tue, 23 May 2017, David Laight wrote:
> 
> > From: Thomas Gleixner
> > > Sent: 21 May 2017 19:15
> > ...
> > > > timer_start(timer, ms, abs)
> > >
> > > I'm not even sure, whether we need absolute timer wheel timers at
> > > all, because most use cases are relative to now.
> >
> > Posix requires absolute timers for some userspace calls
> > (annoying because the code often wants relative).
> 
> Posix is completely irrelevant here. These timers are purely kernel
> internal.

Somehow pthread_cond_timedwait() has to be implemented.
Doing so without kernel timers that use absolute 'wall clock' time is tricky.

David



RE: [PATCH 3/3] powerpc/8xx: xmon compile fix

2017-05-26 Thread David Laight
From:  Michael Ellerman
> Sent: 26 May 2017 08:24
> Nicholas Piggin  writes:
> > diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> > index f11f65634aab..438fdb0fb142 100644
> > --- a/arch/powerpc/xmon/xmon.c
> > +++ b/arch/powerpc/xmon/xmon.c
> > @@ -1242,14 +1242,16 @@ bpt_cmds(void)
> >  {
> > int cmd;
> > unsigned long a;
> > -   int mode, i;
> > +   int i;
> > struct bpt *bp;
> > -   const char badaddr[] = "Only kernel addresses are permitted "
> > -   "for breakpoints\n";
> >
> > cmd = inchar();
> > switch (cmd) {
> > -#ifndef CONFIG_8xx
> > +#ifndef CONFIG_PPC_8xx
> > +   int mode;
> > +   const char badaddr[] = "Only kernel addresses are permitted "
> > +   "for breakpoints\n";
> > +
> > case 'd':   /* bd - hardware data breakpoint */
> > mode = 7;
> > cmd = inchar();
> 
> GCC 7 rejects this:
> 
>   arch/powerpc/xmon/xmon.c: In function bpt_cmds:
>   arch/powerpc/xmon/xmon.c:1252:13: error: statement will never be executed 
> [-Werror=switch-
> unreachable]
> const char badaddr[] = "Only kernel addresses are permitted for 
> breakpoints\n";
>^~~

Try 'static' ?

David



RE: [PATCH] powerpc/powernv: Enable PCI peer-to-peer

2017-06-02 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 31 May 2017 03:26
...
> We might also need a way to provide the "initiator" with translated DMA
> addresses that allow to target the receiver.

Related to that is the ability to allocate memory that two (or more)
PCIe devices can DMA to/from.
That can be useful if peer-to-peer transfers are not available.

David



RE: [PATCH 1/2] powerpc/xmon: hdec is now 64bits

2017-08-30 Thread David Laight
From: Balbir Singh
> Sent: 30 August 2017 01:28
> ISA 300 defines hypervisor decrementer to be 64 bits in length.
> This patch extends the print format for all archs to be 64 bits
> 
> Signed-off-by: Balbir Singh 
> ---
>  arch/powerpc/xmon/xmon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index 9e68f1d..1b26d53 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -1749,7 +1749,7 @@ static void dump_206_sprs(void)
> 
>   printf("sdr1   = %.16lx  hdar  = %.16lx hdsisr = %.8x\n",
>   mfspr(SPRN_SDR1), mfspr(SPRN_HDAR), mfspr(SPRN_HDSISR));
> - printf("hsrr0  = %.16lx hsrr1  = %.16lx hdec = %.8x\n",
> + printf("hsrr0  = %.16lx hsrr1  = %.16lx hdec = %.16lx\n",
>   mfspr(SPRN_HSRR0), mfspr(SPRN_HSRR1), mfspr(SPRN_HDEC));
>   printf("lpcr   = %.16lx  pcr   = %.16lx lpidr = %.8x\n",
>   mfspr(SPRN_LPCR), mfspr(SPRN_PCR), mfspr(SPRN_LPID));

On the face of it the patch doesn't do what the commit message says.
Not only that it is really silly to print a 32bit value with 8 extra
leading zero digits.

Something more subtle was also wrong:
There were 3 mfspr() calls, 2 printed with %lx and one with %x.
That ought to generate a warning from gcc.

David



RE: [PATCH kernel] powerpc/powernv: Update comment about shifting IOV BAR

2017-09-14 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 14 September 2017 04:40
> On Thu, 2017-09-14 at 13:18 +1000, Alexey Kardashevskiy wrote:
> > On 14/09/17 13:07, Benjamin Herrenschmidt wrote:
> > > On Thu, 2017-09-14 at 12:45 +1000, Alexey Kardashevskiy wrote:
> > > > On 31/08/17 13:34, Alexey Kardashevskiy wrote:
> > > > > From: Benjamin Herrenschmidt 
> > > >
> > > > Oops, this was not right :)
> > > >
> > > > Anyway, Ben, please comment. Thanks.
> > >
> > > This is incorrect, we can do hotplug behind switches afaik.
> >
> > Do we have an actual system which allows this?
> 
> Tuleta no ?

You can logically 'hotplug' PCI(e) on any system [1].

The 'problem' is that whatever enumerates the PCI(e) at system
powerup doesn't normally assign extra resources to bridges to allow
for devices that aren't present at boot time.
So you can normally only replace cards with ones that use the same
(or less) resources, or that are not behind any bridges.
This is problematic if you have a docking station connected via
a bridge.

[1] Apart from some annoying x86 Dell servers we have which generate
an NMI when the PCIe link goes down (when we reprogram the fpga).
They also fail to boot if a link doesn't come up...

David



RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp

2017-09-19 Thread David Laight
From: wei.guo.si...@gmail.com
> Sent: 19 September 2017 11:04
> Currently memcmp() in powerpc will fall back to .Lshort (compare per byte
> mode) if either src or dst address is not 8 bytes aligned. It can be
> opmitized if both addresses are with the same offset with 8 bytes boundary.
> 
> memcmp() can align the src/dst address with 8 bytes firstly and then
> compare with .Llong mode.

Why not mask both addresses with ~7 and mask/shift the read value to ignore
the unwanted high (BE) or low (LE) bits.

The same can be done at the end of the compare with any final, partial word.
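A userspace sketch of the masking idea (names are illustrative, this is not the powerpc memcmp code): round the pointer down to an 8-byte boundary, load the whole word, and shift out the bytes that lie before the real start of the buffer. The shift direction shown assumes a little-endian CPU; big-endian would discard the unwanted high bytes instead. In the real routine the aligned load must also be known not to cross into an unmapped page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load the 8-byte word containing p and discard the bytes that
 * precede p within that word (little-endian convention). */
static uint64_t load_head(const unsigned char *p)
{
	uintptr_t a = (uintptr_t)p;
	unsigned int skip = a & 7;	/* bytes before p in the word */
	uint64_t word;

	/* memcpy stands in for the raw aligned load the asm would do */
	memcpy(&word, (const void *)(a & ~(uintptr_t)7), sizeof(word));
	return word >> (8 * skip);	/* LE: drop the earlier bytes */
}
```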

David
 


RE: [PATCH 0/5] [RFC] printk/ia64/ppc64/parisc64: let's deprecate %pF/%pf printk specifiers

2017-09-19 Thread David Laight
From: Sergey Senozhatsky
> Sent: 19 September 2017 03:06
...
> I'll simply convert everything to `unsigned long'. including the
> dereference_function_descriptor() function [I believe there are
> still some casts happening when we pass addr from kernel/module
> dereference functions to dereference_function_descriptor(), or
> when we return `void *' back to symbol resolution code, etc.)
> besides, it seems that everything that uses
> dereference_function_descriptor() wants `unsigned long' anyway:

Using 'unsigned long' for any kind of pointer is an accident
waiting to happen.
It also makes it difficult to typecheck the function calls.
Using 'void *' isn't any better.
Either a pointer to an undefined struct, or a struct containing
a single 'char' member, is likely to be safest.

David



RE: [PATCH 0/5] [RFC] printk/ia64/ppc64/parisc64: let's deprecate %pF/%pf printk specifiers

2017-09-20 Thread David Laight
From: Helge Deller
> Sent: 19 September 2017 21:08
...
> > Using 'unsigned long' for any kind of pointer is an accident
> > waiting to happen.
> > It also makes it difficult to typecheck the function calls.
> > Using 'void *' isn't any better.
> > Either a pointer to an undefined struct, or a struct containing
> > a single 'char' member, is likely to be safest.
> 
> David, you might be right in most cases, but in this case I'd prefer
> unsigned long too. I think this will create the least amount of
> typecasts here.

I've not looked at the specifics case...

Another option is using a struct with a single member and
passing it by value.
This could be used for things like user-space pointers or
even errno values.
The only problem is old ABI where even small structures are
always passed by reference.

David



RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp

2017-09-20 Thread David Laight
From: Simon Guo
> Sent: 20 September 2017 10:57
> On Tue, Sep 19, 2017 at 10:12:50AM +, David Laight wrote:
> > From: wei.guo.si...@gmail.com
> > > Sent: 19 September 2017 11:04
> > > Currently memcmp() in powerpc will fall back to .Lshort (compare per byte
> > > mode) if either src or dst address is not 8 bytes aligned. It can be
> > > opmitized if both addresses are with the same offset with 8 bytes 
> > > boundary.
> > >
> > > memcmp() can align the src/dst address with 8 bytes firstly and then
> > > compare with .Llong mode.
> >
> > Why not mask both addresses with ~7 and mask/shift the read value to ignore
> > the unwanted high (BE) or low (LE) bits.
> >
> > The same can be done at the end of the compare with any final, partial word.
> 
> Yes. That will be better. A prototyping shows ~5% improvement on 32 bytes
> size comparison with v1. I will rework on v2.

Clearly you have to be careful to return the correct +1/-1 on mismatch.

For systems that can do misaligned transfers you can compare the first
word, then compare aligned words and finally the last word.
Rather like a memcpy() function I wrote (for NetBSD) that copied
the last word first, then a whole number of words aligned at the start.
(Hope no one expected anything special for overlapping copies.)

David



RE: [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation

2017-09-25 Thread David Laight
From: wei.guo.si...@gmail.com
> Sent: 21 September 2017 00:35
> This patch adjust selftest memcmp_64 so that memcmp selftest can be
> compiled successfully.
...
>  #define ITERATIONS 1
> 
> +#define LARGE_SIZE (5 * 1024)
> +#define LARGE_ITERATIONS 1000
...

Measuring performance by doing a lot of iterations isn't ideal
and is pretty pointless.
Cold cache performance can be more useful.
Also you don't really want any dynamic branch prediction logic
tuned to the exact test you keep doing.

David




RE: [PATCH] mm: fix RODATA_TEST failure "rodata_test: test data was not read only"

2017-09-25 Thread David Laight
From: Segher Boessenkool
> Sent: 25 September 2017 08:37
> On Sun, Sep 24, 2017 at 12:17:51PM -0700, Kees Cook wrote:
> > On Thu, Sep 21, 2017 at 2:37 AM, Christophe Leroy
> >  wrote:
> > > On powerpc, RODATA_TEST fails with message the following messages:
> > >
> > > [6.199505] Freeing unused kernel memory: 528K
> > > [6.203935] rodata_test: test data was not read only
> > >
> > > This is because GCC allocates it to .data section:
> > >
> > > c0695034 g O .data  0004 rodata_test_data
> >
> > Uuuh... that seems like a compiler bug. It's marked "const" -- it
> > should never end up in .data. I would argue that this has done exactly
> > what it was supposed to do, and shows that something has gone wrong.
> > It should always be const. Adding "static" should just change
> > visibility. (I'm not opposed to the static change, but it seems to
> > paper over a problem with the compiler...)
> 
> The compiler puts this item in .sdata, for 32-bit.  There is no .srodata,
> so if it wants to use a small data section, it must use .sdata .
> 
> Non-external, non-referenced symbols are not put in .sdata, that is the
> difference you see with the "static".
> 
> I don't think there is a bug here.  If you think there is, please open
> a GCC bug.

The .sxxx sections are for 'small' data that can be accessed (typically)
using small offsets from a global register.
This means that all sections must be adjacent in the image.
So you can't really have readonly small data.

My guess is that the linker script is putting .srodata in with .sdata.

David



RE: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision

2017-09-27 Thread David Laight
From: Segher Boessenkool
> Sent: 27 September 2017 10:28
...
> You also need nasty code to deal with the start and end of strings, with
> conditional branches and whatnot, which quickly overwhelms the benefit
> of using vector registers at all.  This tradeoff also changes with newer
> ISA versions.

The goal posts keep moving.
For instance with modern intel x86 cpus 'rep movsb' is by far the fastest
way to copy data (from cached memory).

> Things have to become *really* cheap before it will be good to often use
> vector registers in the kernel though.

I've had thoughts about this in the past.
If the vector registers belong to the current process then you might
get away with just saving the ones you want to use.
If they belong to a different process then you also need to tell the
FPU save code where you've saved the registers.
Then the IPI code can recover all the correct values.

On X86 all the AVX registers are caller saved, the system call
entry could issue the instruction that invalidates them all.
Kernel code running in the context of a user process could then
use the registers without saving them.
It would only need to set a mark to ensure they are invalidated
again on return to user (might be cheap enough to do anyway).
Dunno about PPC though.

David



RE: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision

2017-09-28 Thread David Laight
From: Simon Guo
> Sent: 27 September 2017 19:34
...
> > On X86 all the AVX registers are caller saved, the system call
> > entry could issue the instruction that invalidates them all.
> > Kernel code running in the context of a user process could then
> > use the registers without saving them.
> > It would only need to set a mark to ensure they are invalidated
> > again on return to user (might be cheap enough to do anyway).
> > Dunno about PPC though.
> 
> I am not aware of any ppc instruction which can set a "mark" or provide
> any high granularity flag against single or subgroup of vec regs' validity.
> But ppc experts may want to correct me.

I was just thinking of a software flag.

David



RE: [PATCH 02/11] x86: make dma_cache_sync a no-op

2017-10-03 Thread David Laight
From: Christoph Hellwig
> Sent: 03 October 2017 11:43
> x86 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
> make any sense to do any work in dma_cache_sync given that it must be a
> no-op when dma_alloc_attrs returns coherent memory.

I believe it is just about possible to require an explicit
write flush on x86.
ISTR this can happen with something like write combining.

Whether this can actually happen is the kernel is another matter.

David



RE: [PATCH 04/11] ia64: make dma_cache_sync a no-op

2017-10-04 Thread David Laight
From: Christoph Hellwig
> Sent: 03 October 2017 11:43
>
> ia64 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
> make any sense to do any work in dma_cache_sync given that it must be a
> no-op when dma_alloc_attrs returns coherent memory.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/ia64/include/asm/dma-mapping.h | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/dma-mapping.h 
> b/arch/ia64/include/asm/dma-mapping.h
> index 3ce5ab4339f3..99dfc1aa9d3c 100644
> --- a/arch/ia64/include/asm/dma-mapping.h
> +++ b/arch/ia64/include/asm/dma-mapping.h
> @@ -48,11 +48,6 @@ static inline void
>  dma_cache_sync (struct device *dev, void *vaddr, size_t size,
>   enum dma_data_direction dir)
>  {
> - /*
> -  * IA-64 is cache-coherent, so this is mostly a no-op.  However, we do 
> need to
> -  * ensure that dma_cache_sync() enforces order, hence the mb().
> -  */
> - mb();
>  }

Are you sure about this one?
It looks as though you are doing a mechanical change for all architectures.
Some of them are probably stranger than you realise.

Even with cache coherent memory any cpu 'store/write buffer' may not
be snooped by dma reads.

Something needs to flush the store buffer between the last cpu write
to the dma buffer and the write (probably a device register) that
tells the device it can read the memory.

My guess from the comment is that dma_cache_sync() is expected to
include that barrier - and it might not be anywhere else.

David



RE: [PATCH v10 09/10] mm: stop zeroing memory during allocation in vmemmap

2017-10-06 Thread David Laight
From: Pavel Tatashin
> Sent: 05 October 2017 22:11
> vmemmap_alloc_block() will no longer zero the block, so zero memory
> at its call sites for everything except struct pages.  Struct page memory
> is zero'd by struct page initialization.

It seems dangerous to change an allocator to stop zeroing memory.
It is probably saver to add a new function that doesn't zero
the memory and use that is the places where you don't want it
to be zeroed.

David



RE: [PATCH v10 09/10] mm: stop zeroing memory during allocation in vmemmap

2017-10-06 Thread David Laight
From: Michal Hocko
> Sent: 06 October 2017 12:47
> On Fri 06-10-17 11:10:14, David Laight wrote:
> > From: Pavel Tatashin
> > > Sent: 05 October 2017 22:11
> > > vmemmap_alloc_block() will no longer zero the block, so zero memory
> > > at its call sites for everything except struct pages.  Struct page memory
> > > is zero'd by struct page initialization.
> >
> > It seems dangerous to change an allocator to stop zeroing memory.
> > It is probably safer to add a new function that doesn't zero
> > the memory and use that in the places where you don't want it
> > to be zeroed.
> 
> Not sure what you mean. memblock_virt_alloc_try_nid_raw is a new
> function which doesn't zero out...

You should probably leave vmemmap_alloc_block() zeroing the memory
so that existing calls don't have to be changed - apart from the
ones you are explicitly optimising.

David


RE: [PATCH] powerpc/lib/sstep: Fix count leading zeros instructions

2017-10-09 Thread David Laight
From: Sandipan Das
> Sent: 09 October 2017 12:07
> According to the GCC documentation, the behaviour of __builtin_clz()
> and __builtin_clzl() is undefined if the value of the input argument
> is zero. Without handling this special case, these builtins have been
> used for emulating the following instructions:
>   * Count Leading Zeros Word (cntlzw[.])
>   * Count Leading Zeros Doubleword (cntlzd[.])
> 
> This fixes the emulated behaviour of these instructions by adding an
> additional check for this special case.

Presumably the result is undefined because the underlying cpu
instruction is used - and its return value is implementation defined.

Here you are emulating the cpu instruction - so executing one will
give the same answer as if the 'real' one were executed.

Indeed it might be worth an asm statement that definitely executes
the cpu instruction?
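In C terms, the guarded emulation in the patch amounts to the following (userspace sketch, not the sstep.c code): __builtin_clz()/__builtin_clzll() are undefined for a zero argument, so the zero case is pinned to the operand width, matching the patch's choice.

```c
#include <assert.h>
#include <stdint.h>

/* Emulated cntlzw: count leading zeros of a 32-bit value,
 * with the zero input defined to return the operand width. */
static unsigned int emul_cntlzw(uint32_t v)
{
	return v ? (unsigned int)__builtin_clz(v) : 32;
}

/* Emulated cntlzd: same idea for a 64-bit value. */
static unsigned int emul_cntlzd(uint64_t v)
{
	return v ? (unsigned int)__builtin_clzll(v) : 64;
}
```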

David

> 
> Signed-off-by: Sandipan Das 
> ---
>  arch/powerpc/lib/sstep.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index 0f7e41bd7e88..ebbc0b92650c 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -1717,11 +1717,19 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   * Logical instructions
>   */
>   case 26:/* cntlzw */
> - op->val = __builtin_clz((unsigned int) regs->gpr[rd]);
> + val = (unsigned int) regs->gpr[rd];
> + if (val == 0)
> + op->val = 32;
> + else
> + op->val = __builtin_clz(val);
>   goto logical_done;
>  #ifdef __powerpc64__
>   case 58:/* cntlzd */
> - op->val = __builtin_clzl(regs->gpr[rd]);
> + val = regs->gpr[rd];
> + if (val == 0)
> + op->val = 64;
> + else
> + op->val = __builtin_clzl(val);
>   goto logical_done;
>  #endif
>   case 28:/* and */
> --
> 2.13.6



RE: [PATCH] powerpc/lib/sstep: Fix count leading zeros instructions

2017-10-09 Thread David Laight
From: Segher Boessenkool
> Sent: 09 October 2017 15:21
> On Mon, Oct 09, 2017 at 01:49:26PM +, David Laight wrote:
> > From: Sandipan Das
> > > Sent: 09 October 2017 12:07
> > > According to the GCC documentation, the behaviour of __builtin_clz()
> > > and __builtin_clzl() is undefined if the value of the input argument
> > > is zero. Without handling this special case, these builtins have been
> > > used for emulating the following instructions:
> > >   * Count Leading Zeros Word (cntlzw[.])
> > >   * Count Leading Zeros Doubleword (cntlzd[.])
> > >
> > > This fixes the emulated behaviour of these instructions by adding an
> > > additional check for this special case.
> >
> > Presumably the result is undefined because the underlying cpu
> > instruction is used - and it's return value is implementation defined.
> 
> It is undefined because the result is undefined, and the compiler
> optimises based on that.  The return value of the builtin is undefined,
> not implementation defined.
> 
> The patch is correct.

But the code you are emulating might be relying on the (un)defined value
the cpu instruction gives for zero input rather than the input width.

Or, put another way, if the return value for a clz instruction with zero
argument is undefined (as it is on x86 - intel and amd may differ) then the
emulation can return any value since the code can't care.
So the conditional is not needed.

David




RE: [PATCH] powerpc/lib/sstep: Fix count leading zeros instructions

2017-10-09 Thread David Laight
From: naveen.n@linux.vnet.ibm.com
> Sent: 09 October 2017 15:48
...
> This is about the behavior of the gcc builtin being undefined, rather
> than the actual cpu instruction itself.

I'd have hoped that the gcc builtin just generated the expected cpu instruction.
So it is only undefined because it is very cpu dependent.

More problematic here would be any cpu flag register settings.
eg: x86 would set the 'Z' flag for zero input.

David



RE: Adjusting further size determinations?

2017-10-18 Thread David Laight
From: SF Markus Elfring
>  Unpleasant consequences are possible in both cases.
> >> How much do you care to reduce the failure probability further?
> >
> > Zero.
> 
> I am interested to improve the software situation a bit more here.

There are probably better places to spend your time!

If you want 'security' for kmalloc() then:

#define KMALLOC_TYPE(type, flags) ((type *)kmalloc(sizeof(type), flags))
#define KMALLOC(ptr, flags) (*(ptr) = KMALLOC_TYPE(typeof(**(ptr)), flags))

and change:
ptr = kmalloc(sizeof *ptr, flags);
to:
KMALLOC(&ptr, flags);

But it is all churn for churn's sake.
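For what it's worth, a compilable userspace analogue of the macro pair (malloc() standing in for kmalloc(), GCC/Clang typeof assumed; the names are illustrative, not kernel API):

```c
#include <stdlib.h>

/* Allocate an object of the given type, returning a typed pointer. */
#define MALLOC_TYPE(type)  ((type *)malloc(sizeof(type)))
/* Given the address of a pointer, allocate what it points to. */
#define MALLOC(pp)         (*(pp) = MALLOC_TYPE(__typeof__(**(pp))))

struct point { int x, y; };

static struct point *make_point(int x, int y)
{
	struct point *p;

	MALLOC(&p);	/* allocates sizeof(struct point), correctly typed */
	if (p) {
		p->x = x;
		p->y = y;
	}
	return p;
}
```

The point of routing through the address of the pointer is that the size is derived from the pointee type, so the classic `sizeof(ptr)` vs `sizeof(*ptr)` mistake cannot be made at the call site.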

David



RE: [PATCH 1/4] powerpc/tm: Add commandline option to disable hardware transactional memory

2017-10-20 Thread David Laight
> > This patch adds a simple commandline option so that HTM can be
> > disabled at boot time.

ISTM that being able to disable it after boot would be more useful.
(ie in a startup script)

David



RE: [PATCH 1/4] powerpc/tm: Add commandline option to disable hardware transactional memory

2017-10-23 Thread David Laight
From: Michael Neuling
> Sent: 21 October 2017 02:00
> To: David Laight; 'Breno Leitao'; Michael Ellerman
> Cc: stew...@linux.vnet.ibm.com; linuxppc-...@ozlabs.org; cyril...@gmail.com
> Subject: Re: [PATCH 1/4] powerpc/tm: Add commandline option to disable 
> hardware transactional memory
> 
> On Fri, 2017-10-20 at 12:58 +, David Laight wrote:
> > > > This patch adds a simple commandline option so that HTM can be
> > > > disabled at boot time.
> >
> > ISTM that being able to disable it after boot would be more useful.
> > (ie in a startup script)
> 
> I agree but unfortunately that's impossible.
> 
> If a process is already running in tm suspend, there is no way to stop it 
> other
> than killing the process.  At that point you may as well kexec with a new
> cmdline option

Isn't that unlikely when the first rc scripts are being run?
Setting an early rc script is generally easier than adding a command line
parameter.

I don't know about ppc, but grub on x86 makes it painfully difficult
(almost impossible) to have two boot entries with different command line
options.

David



RE: [PATCH] [net-next,v2] ibmvnic: Feature implementation of Vital Product Data (VPD) for the ibmvnic driver

2017-11-06 Thread David Laight
From: David Miller
> Sent: 04 November 2017 13:21
> From: Desnes Augusto Nunes do Rosario 
> Date: Wed,  1 Nov 2017 19:03:32 -0200
> 
> > +   substr = strnstr(adapter->vpd->buff, "RM", adapter->vpd->len);
> > +   if (!substr) {
> > +   dev_info(dev, "No FW level provided by VPD\n");
> > +   complete(&adapter->fw_done);
> > +   return;
> > +   }
> > +
> > +   /* get length of firmware level ASCII substring */
> > +   fw_level_len = *(substr + 2);
> > +
> > +   /* copy firmware version string from vpd into adapter */
> > +   ptr = strncpy((char *)adapter->fw_version,
> > + substr + 3, fw_level_len);
> 
> You have to be more careful here, making sure first that
> (substr + 2) < (adapter->vpd->buff + adapter->vpd->len),
> and next that (substr + 2 + fw_level_len) is in range
> as well.

And that the copy isn't longer than the target buffer.
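A sketch of the checks being asked for (helper name and shape are hypothetical, not the ibmvnic code): clamp a device-supplied length to both the bytes remaining in the source buffer and the size of the destination, leaving room for a terminating NUL.

```c
#include <stddef.h>

/* Return a length that is safe to copy: no larger than what remains
 * in the source buffer, and small enough to fit the destination with
 * a NUL terminator. */
static size_t clamp_fw_len(size_t claimed, size_t remaining, size_t dstsize)
{
	size_t n = claimed;

	if (n > remaining)
		n = remaining;		/* don't read past the VPD buffer */
	if (dstsize == 0)
		return 0;
	if (n >= dstsize)
		n = dstsize - 1;	/* leave room for the NUL */
	return n;
}
```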

David



RE: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-10 Thread David Laight
From: Matthew Wilcox
> Sent: 09 November 2017 19:44
> 
> On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> > So these semantics are what we're going with? Anything that does mmap() is
> > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > itself? Is intended to be cross-platform or just x86 and power specific?
> 
> It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> across apps that expect to be able to use bit 48 for their own purposes.
> And their address space is 48 bit by default.  Oops.

(Do you mean 49bit?)

Aren't such apps just doomed to be broken?

ISTR there is something on (IIRC) sparc64 that does a 'match'
on the high address bits to make it much harder to overrun
one area into another.

David



RE: [PATCH] soc/qman: Sleep instead of stuck hacking jiffies.

2017-06-26 Thread David Laight
From: Karim Eshapa
> Sent: 25 June 2017 16:14
> Use msleep() instead of stucking with
> long delay will be more efficient.
...
> --- a/drivers/soc/fsl/qbman/qman.c
> +++ b/drivers/soc/fsl/qbman/qman.c
> @@ -1084,11 +1084,7 @@ static int drain_mr_fqrni(struct qm_portal *p)
>* entries well before the ring has been fully consumed, so
>* we're being *really* paranoid here.
>*/
> - u64 now, then = jiffies;
> -
> - do {
> - now = jiffies;
> - } while ((then + 10000) > now);
> + msleep(1);
...
How is that in any way equivalent?
If HZ is 1000 the old code loops for 10 seconds.
If HZ is 250 (common for some distros) it loops for 40 seconds.

Clearly both are horrid, but it isn't at all clear that a 1ms sleep
is performing the same job.

My guess is that this code is never called, and broken if actually called.
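The figures above follow from a fixed jiffy count translating to different wall-clock delays depending on CONFIG_HZ (illustrative arithmetic only, not kernel code):

```c
/* A spin of N jiffies lasts N/HZ seconds: a 10000-jiffy wait is
 * 10 seconds at HZ=1000 but 40 seconds at HZ=250. */
static unsigned int jiffy_delay_secs(unsigned int njiffies, unsigned int hz)
{
	return njiffies / hz;
}
```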

David



RE: [PATCH] powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user()

2017-06-26 Thread David Laight
From: Michael Ellerman
> Sent: 26 June 2017 14:34
..
> Al also pointed out that inlining copy_to/from_user() is probably of little or
> no benefit, which is correct
...

I was a bit horrified at the x86-64 versions of copy_to/from_user() as well.
With code that (tries to) reject kernel pointers that cross stack frame
boundaries I'm fairly sure they expand to a lot of code.

I also suspect the cost of that kernel address check is not insignificant
especially if it has to walk down several stack frames.
(Never mind what happens to code compiled without stack frames.)

David




RE: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer

2017-06-27 Thread David Laight
From: Frederic Barrat
> Sent: 26 June 2017 19:09
> P9 has support for PCI peer-to-peer, enabling a device to write in the
> mmio space of another device directly, without interrupting the CPU.
> 
> This patch adds support for it on powernv, by adding a new API to be
> called by drivers. The pnv_pci_set_p2p(...) call configures an
> 'initiator', i.e the device which will issue the mmio operation, and a
> 'target', i.e. the device on the receiving side.
...

Two questions:

1) How does the driver get the correct address to give to the 'initiator'
   in order to perform an access to the 'target'?

2) Surely the API call the driver makes should be architecture neutral,
   returning an error on other architectures.

At least some x86 cpus also support peer-to-peer writes,
I believe they can work between cpu chips.
PCIe bridges might support them (or be configurable to support them).

David



RE: [PATCH v5 2/7] powerpc/vmlinux.lds: Align __init_begin to 16M

2017-06-29 Thread David Laight
From: Balbir Singh
> Sent: 28 June 2017 18:04
> For CONFIG_STRICT_KERNEL_RWX align __init_begin to 16M.
> We use 16M since its the larger of 2M on radix and 16M
> on hash for our linear mapping. The plan is to have
> .text, .rodata and everything upto __init_begin marked
> as RX. Note we still have executable read only data.
> We could further align rodata to another 16M boundary.
> I've used keeping text plus rodata as read-only-executable
> as a trade-off to doing read-only-executable for text and
> read-only for rodata.
...

Doesn't this go against 'address space randomisation'?
(Yes I realise a PIC kernel is probably non-trivial to compile
and load.)

David




RE: [PATCH v2 08/10] powerpc/opal: Add opal_async_wait_response_interruptible() to opal-async

2017-07-10 Thread David Laight
From: Cyril Bur
> Sent: 10 July 2017 02:31
> This patch adds an _interruptible version of opal_async_wait_response().
> This is useful when a long running OPAL call is performed on behalf of a
> userspace thread, for example, the opal_flash_{read,write,erase}
> functions performed by the powernv-flash MTD driver.
> 
> It is foreseeable that these functions would take upwards of two minutes
> causing the wait_event() to block long enough to cause hung task
> warnings. Furthermore, wait_event_interruptible() is preferable as
> otherwise there is no way for signals to stop the process which is going
> to be confusing in userspace.

ISTM that if you are doing (something like) a flash full device erase
(that really can take minutes) it isn't actually an interruptible
operation - the flash chip will still be busy.
So allowing the user process be interrupted just leaves a big mess.

OTOH the 'hung task' warning isn't the only problem with uninterruptible
sleeps - the processes also count towards the 'load average'.
Some software believes the 'load average' is a meaningful value.

It would be more generally useful for tasks to be able to sleep
uninterruptibly without counting towards the 'load average' or triggering
the 'task stuck' warning.
(I've code somewhere that sleeps interruptibly unless a signal is
already pending, in which case it sleeps uninterruptibly.)

WRT flash erases, 'whole device' erases aren't significantly quicker
than sector by sector erases.
The latter can be interrupted between sectors.
I'm not sure you'd want to do writes that lock down enough kernel
memory to take even a second to complete.

David



RE: [PATCH] tty: Fix TIOCGPTPEER ioctl definition

2017-07-11 Thread David Laight
From: Linuxppc-dev  Gleb Fotengauer-Malinovskiy
> Sent: 11 July 2017 01:12
> This ioctl does nothing to justify an _IOC_READ or _IOC_WRITE flag
> because it doesn't copy anything from/to userspace to access the
> argument.
> 
> Fixes: 54ebbfb1 ("tty: add TIOCGPTPEER ioctl")
...
> -#define TIOCGPTPEER  _IOR('T', 0x41, int) /* Safely open the slave */
> +#define TIOCGPTPEER  _IO('T', 0x41) /* Safely open the slave */

This is a user API change. When was the ioctl added?

David



RE: [RFC Part1 PATCH v3 13/17] x86/io: Unroll string I/O when SEV is active

2017-07-25 Thread David Laight
From: Brijesh Singh
> Sent: 24 July 2017 20:08
> From: Tom Lendacky 
> 
> Secure Encrypted Virtualization (SEV) does not support string I/O, so
> unroll the string I/O operation into a loop operating on one element at
> a time.
> 
> Signed-off-by: Tom Lendacky 
> Signed-off-by: Brijesh Singh 
> ---
>  arch/x86/include/asm/io.h | 26 ++
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index e080a39..2f3c002 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -327,14 +327,32 @@ static inline unsigned type in##bwl##_p(int port)   
> \
>   \
>  static inline void outs##bwl(int port, const void *addr, unsigned long 
> count) \
>  {

Is it even worth leaving these as inline functions?
Given the speed of IO cycles it is unlikely that the cost of calling a real
function will be significant.
The code bloat reduction will be significant.

David



RE: [PATCH v3 2/5] powerpc/lib/sstep: Add popcnt instruction emulation

2017-07-25 Thread David Laight
From: Linuxppc-dev 
[mailto:linuxppc-dev-bounces+david.laight=aculab@lists.ozlabs.org] On 
Behalf Of
> Matt Brown
> Sent: 25 July 2017 04:33
> To: linuxppc-dev@lists.ozlabs.org
> Subject: [PATCH v3 2/5] powerpc/lib/sstep: Add popcnt instruction emulation
> 
> This adds emulations for the popcntb, popcntw, and popcntd instructions.
> Tested for correctness against the popcnt{b,w,d} instructions on ppc64le.
> 
> Signed-off-by: Matt Brown 
> ---
> v3:
>   - optimised using the Giles-Miller method of side-ways addition
> v2:
>   - fixed opcodes
>   - fixed typecasting
>   - fixed bitshifting error for both 32 and 64bit arch
> ---
>  arch/powerpc/lib/sstep.c | 40 +++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index 87d277f..c1f9cdb 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -612,6 +612,32 @@ static nokprobe_inline void do_cmpb(struct pt_regs 
> *regs, unsigned long v1,
>   regs->gpr[rd] = out_val;
>  }
> 
> +/*
> + * The size parameter is used to adjust the equivalent popcnt instruction.
> + * popcntb = 8, popcntw = 32, popcntd = 64
> + */
> +static nokprobe_inline void do_popcnt(struct pt_regs *regs, unsigned long v1,
> + int size, int ra)
> +{
> + unsigned long long out = v1;
> +
> + out = (0x5555555555555555 & out) + (0x5555555555555555 & (out >> 1));
> + out = (0x3333333333333333 & out) + (0x3333333333333333 & (out >> 2));
> + out = (0x0f0f0f0f0f0f0f0f & out) + (0x0f0f0f0f0f0f0f0f & (out >> 4));
> + if (size == 8) {/* popcntb */
> + regs->gpr[ra] = out;

I'm pretty sure you need to mask the result with 7.
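For reference, a userspace sketch of the Giles-Miller sideways-addition method the patch uses, folded all the way down to a full 64-bit population count (the popcntd case); the per-byte intermediate after the third step is what the byte-granular case works from.

```c
#include <stdint.h>

/* Sideways addition: pair up bit counts, then nibble counts, then
 * byte counts, then fold the per-byte counts into one total. */
static unsigned int popcnt64(uint64_t v)
{
	v = (v & 0x5555555555555555ULL) + ((v >> 1) & 0x5555555555555555ULL);
	v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
	v = (v & 0x0f0f0f0f0f0f0f0fULL) + ((v >> 4) & 0x0f0f0f0f0f0f0f0fULL);
	v += v >> 8;			/* each byte now holds its popcount */
	v += v >> 16;
	v += v >> 32;
	return (unsigned int)(v & 0x7f);	/* total is at most 64 */
}
```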

David



RE: [PATCH 3/4] powerpc: pseries: only store the device node basename in full_name

2017-07-26 Thread David Laight
From: Rob Herring
> Sent: 25 July 2017 22:44
> With dependencies on full_name containing the entire device node path
> removed, stop storing the full_name in nodes created by
> dlpar_configure_connector() and pSeries_reconfig_add_node().
...
>   dn = kzalloc(sizeof(*dn), GFP_KERNEL);
>   if (!dn)
>   return NULL;
> 
>   name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
> - dn->full_name = kasprintf(GFP_KERNEL, "%s/%s", path, name);
> + dn->full_name = kasprintf(GFP_KERNEL, "%s", name);

Isn't this just strdup()?
Perhaps you should rename full_name to name - since it is no longer 'full'?

Maybe you could do a single malloc() for both 'dn' and the name?

David



RE: [RFC Part1 PATCH v3 13/17] x86/io: Unroll string I/O when SEV is active

2017-07-27 Thread David Laight
From: Brijesh Singh
> Sent: 26 July 2017 21:07
...
> I am not sure if I understand your concern.
> 
> Are you commenting on amount of code duplication ? If so, I can certainly 
> improve
> and use the similar macro used into header file to generate the functions 
> body.

If you are careful the real functions could expand the inline functions
that get used when SEV is compiled out.

Oh, if you are looking at this, can you fix memcpy_to_io()
so that it is never 'rep movsb'?

David



RE: [RFC Part1 PATCH v3 07/17] x86/mm: Include SEV for encryption memory attribute changes

2017-07-28 Thread David Laight
From: Borislav Petkov
> Sent: 27 July 2017 15:59
> On Mon, Jul 24, 2017 at 02:07:47PM -0500, Brijesh Singh wrote:
> > From: Tom Lendacky 
> >
> > The current code checks only for sme_active() when determining whether
> > to perform the encryption attribute change.  Include sev_active() in this
> > check so that memory attribute changes can occur under SME and SEV.
> >
> > Signed-off-by: Tom Lendacky 
> > Signed-off-by: Brijesh Singh 
> > ---
> >  arch/x86/mm/pageattr.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> > index dfb7d65..b726b23 100644
> > --- a/arch/x86/mm/pageattr.c
> > +++ b/arch/x86/mm/pageattr.c
> > @@ -1781,8 +1781,8 @@ static int __set_memory_enc_dec(unsigned long addr, 
> > int numpages, bool enc)
> > unsigned long start;
> > int ret;
> >
> > -   /* Nothing to do if the SME is not active */
> > -   if (!sme_active())
> > +   /* Nothing to do if SME and SEV are not active */
> > +   if (!sme_active() && !sev_active())
> 
> This is the second place which does
> 
>   if (!SME && !SEV)
> 
> I wonder if, instead of sprinking those, we should have a
> 
>   if (mem_enc_active())
> 
> or so which unifies all those memory encryption logic tests and makes
> the code more straightforward for readers who don't have to pay
> attention to SME vs SEV ...

If any of the code paths are 'hot' it would make sense to be checking
a single memory location.

David



RE: [PATCH 1/5] Fix packed and aligned attribute warnings.

2017-07-31 Thread David Laight
From: SZ Lin
> Sent: 29 July 2017 08:24
...
> diff --git a/drivers/char/tpm/tpm_ibmvtpm.h b/drivers/char/tpm/tpm_ibmvtpm.h
> index 91dfe766d080..9f708ca3dc84 100644
> --- a/drivers/char/tpm/tpm_ibmvtpm.h
> +++ b/drivers/char/tpm/tpm_ibmvtpm.h
> @@ -25,7 +25,7 @@ struct ibmvtpm_crq {
>   __be16 len;
>   __be32 data;
>   __be64 reserved;
> -} __attribute__((packed, aligned(8)));
> +} __packed __aligned(8);

You can't need __packed and __aligned(8) on that structure.
There are no gaps and you are saying it is always aligned.

So just remove the pointless attributes.

David



RE: [RESEND][PATCH V10 0/3] powernv : Add support for OPAL-OCC command/response interface

2017-08-01 Thread David Laight
From: Shilpasri G Bhat
> Sent: 31 July 2017 08:43
> In P9, OCC (On-Chip-Controller) supports shared memory based
> commad-response interface. Within the shared memory there is an OPAL
  ^ typo
> command buffer and OCC response buffer that can be used to send
> inband commands to OCC. The following commands are supported:
...

David


RE: [PATCH v9 14/14] powerpc: rewrite local_t using soft_irq

2017-08-04 Thread David Laight
From: Nicholas Piggin
> Sent: 04 August 2017 10:04
> On Fri, 04 Aug 2017 11:40:43 +1000
> Benjamin Herrenschmidt  wrote:
> 
> > On Fri, 2017-08-04 at 03:50 +1000, Nicholas Piggin wrote:
> > > Hey, so... why are any of these implemented in asm? We should
> > > just do them all in C, right? I looked a bit harder at code gen
> > > and a couple of them are still emitting larx/stcx.
> >
> > As long as we can guarantee that the C compiler won't play games
> > moving stuff around. But yes, I tend to agree.
> 
> 
> I believe so. I mean we already depend on the same pattern for any
> other sequence of local_irq_disable(); c code; local_irq_enable();
> so we'd have other problems if we couldn't.

I'd guess that a "memory" clobber on the irq_disable/enable would be enough.
It could be restricted to the memory area being updated.

David



RE: [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory

2017-08-08 Thread David Laight
From: Pasha Tatashin
> Sent: 08 August 2017 12:49
> Thank you for looking at this change. What you described was in my
> previous iterations of this project.
> 
> See for example here: https://lkml.org/lkml/2017/5/5/369
> 
> I was asked to remove that flag, and only zero memory in place when
> needed. Overall the current approach is better everywhere else in the
> kernel, but it adds a little extra code to kasan initialization.

Perhaps you could #define the function prototype(s?) so that the flags
are not passed unless it is a kasan build?

David



RE: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-15 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 15 August 2017 02:34
> On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > Taking a step back, though, why does vfio-pci perform this check in the
> > > first place? If a malicious guest already has control of a device, any
> > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > message address/data it could simply do with a DMA write anyway, so the
> > > security argument doesn't stand up in general (sure, not all PCIe
> > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > tenuous security-by-obscurity angle to me).
> 
> I tried to make that point for years, thanks for re-iterating it :-)

Indeed, we have an FPGA based PCIe card where the MSI-X table is just a
piece of PCIe accessible memory.
The device driver has to read the MSI-X table and write the address+data
values to other registers which are then used to raise the interrupt.
(Ok, I've written a better interrupt generator so we don't do that
any more.)

David



RE: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-17 Thread David Laight
From: Alex Williamson
> Sent: 16 August 2017 17:56
...
> Firmware pissing match...  Processors running with 8k or less page size
> fall within the recommendations of the PCI spec for register alignment
> of MMIO regions of the device and this whole problem becomes less of an
> issue.

Actually if qemu is causing the MSI-X table accesses to fault, why doesn't
it just lie to the guest about the physical address of the MSI-X table?
Then mmio access to anything in the same physical page will just work.

It has already been pointed out that you can't actually police the
interrupts that are raised without host hardware support.

Actually, putting other vectors in the MSI-X table is boring, most
drivers will ignore unexpected interrupts.
Much more interesting are physical memory addresses and accessible IO
addresses.
Of course, a lot of boards have PCI master capability and can probably
be persuaded to do writes to specific location anyway.

David



RE: [PATCH 1/3] perf tests test_arm_coresight: Fix the shellcheck warning in latest test_arm_coresight.sh

2023-10-05 Thread David Laight
From: Athira Rajeev
> Sent: 29 September 2023 05:12
> 
> Running shellcheck on tests/shell/test_arm_coresight.sh
> throws below warnings:
> 
>   In tests/shell/test_arm_coresight.sh line 15:
>   cs_etm_path=$(find  /sys/bus/event_source/devices/cs_etm/ -name cpu* 
> -print -quit)
>   ^--^ SC2061: Quote the parameter to -name so the shell 
> won't interpret it.
> 
>   In tests/shell/test_arm_coresight.sh line 20:
>   if [ $archhver -eq 5 -a "$(printf "0x%X\n" $archpart)" = 
> "0xA13" ] ; then
>^-- SC2166: Prefer [ p ] && [ q ] as [ p 
> -a q ] is not well defined
> 
> This warning is observed after commit:
> "commit bb350847965d ("perf test: Update cs_etm testcase for Arm ETE")"
> 
> Fixed this issue by using quoting 'cpu*' for SC2061 and
> using "&&" in line number 20 for SC2166 warning
> 
> Fixes: bb350847965d ("perf test: Update cs_etm testcase for Arm ETE")
> Signed-off-by: Athira Rajeev 
> ---
>  tools/perf/tests/shell/test_arm_coresight.sh | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/perf/tests/shell/test_arm_coresight.sh
> b/tools/perf/tests/shell/test_arm_coresight.sh
> index fe78c4626e45..f2115dfa24a5 100755
> --- a/tools/perf/tests/shell/test_arm_coresight.sh
> +++ b/tools/perf/tests/shell/test_arm_coresight.sh
> @@ -12,12 +12,12 @@
>  glb_err=0
> 
>  cs_etm_dev_name() {
> - cs_etm_path=$(find  /sys/bus/event_source/devices/cs_etm/ -name cpu* 
> -print -quit)
> + cs_etm_path=$(find  /sys/bus/event_source/devices/cs_etm/ -name 'cpu*' 
> -print -quit)

Isn't the intention to get the shell to expand "cpu*"?
So quoting it completely breaks the script.

>   trcdevarch=$(cat ${cs_etm_path}/mgmt/trcdevarch)
>   archhver=$((($trcdevarch >> 12) & 0xf))
>   archpart=$(($trcdevarch & 0xfff))
> 
> - if [ $archhver -eq 5 -a "$(printf "0x%X\n" $archpart)" = "0xA13" ] ; 
> then
> + if [ $archhver -eq 5 ] && [ "$(printf "0x%X\n" $archpart)" = "0xA13" ] 
> ; then

The only issue I see there is that $archhver could be in "".
IIRC test is required to parse the 5 part expression "a op b -a c op d"
in the 'expected' manner even if any of a/b/c/d look like operators.
In any case '(' and ')' can be used to force the ordering.
Or, more usually, prepend an x as in:

if [ "x$archhver" = x5 -a "$(printf "0x%X\n" $archpart)" = "0xA13" ] ; then

David


>   echo "ete"
>   else
>   echo "etm"
> --
> 2.31.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



RE: [PATCH 1/3] perf tests test_arm_coresight: Fix the shellcheck warning in latest test_arm_coresight.sh

2023-10-05 Thread David Laight
From: David Laight
> Sent: 05 October 2023 11:16
...
> > -   cs_etm_path=$(find  /sys/bus/event_source/devices/cs_etm/ -name cpu* 
> > -print -quit)
> > +   cs_etm_path=$(find  /sys/bus/event_source/devices/cs_etm/ -name 'cpu*' 
> > -print -quit)
> 
> Isn't the intention to get the shell to expand "cpu* ?
> So quoting it completely breaks the script.

Complete brain-fade :-(




RE: [RFC PATCH 2/2] powerpc: nop trap instruction after WARN_ONCE fires

2022-09-26 Thread David Laight
From: Nicholas Piggin
> Sent: 23 September 2022 16:42
>
> WARN_ONCE and similar are often used in frequently executed code, and
> should not crash the system. The program check interrupt caused by
> WARN_ON_ONCE can be a significant overhead even when nothing is being
> printed. This can cause performance to become unacceptable, having the
> same effective impact to the user as a BUG_ON().
> 
> Avoid this overhead by patching the trap with a nop instruction after a
> "once" trap fires. Conditional warnings that return a result must have
> equivalent compare and branch instructions after the trap, so when it is
> nopped the statement will behave the same way. It's possible the asm
> goto should be removed entirely and this comparison just done in C now.
> 
> XXX: possibly this should schedule the patching to run in a different
> context than the program check.

I'm pretty sure WARN_ON_ONCE() is valid everywhere printk() is allowed.
In many cases this means you can't call mutex_enter().

So you need a different scheme.

David




RE: [PATCH v3 3/5] treewide: use get_random_u32() when possible

2022-10-07 Thread David Laight
From: Christophe Leroy
> Sent: 06 October 2022 18:43
...
> But taking into account that sp must remain 16 bytes aligned, would it
> be better to do something like ?
> 
>   sp -= prandom_u32_max(PAGE_SIZE >> 4) << 4;

That makes me think...
If prandom_u32_max() is passed a (constant) power of 2 it doesn't
need to do the multiply, it can just do a shift right.

Doesn't it also always get a 32bit random value?
So actually get_random_u32() & PAGE_MASK & ~0xf is faster!

When PAGE_SIZE is 4k, PAGE_SIZE >> 4 is 256 so it could use:
get_random_u8() << 4

You also seem to have removed prandom_u32() in favour of
get_random_u32() but have added more prandom_() functions.

David



RE: [PATCH v3 3/5] treewide: use get_random_u32() when possible

2022-10-08 Thread David Laight
From: Jason A. Donenfeld
> Sent: 07 October 2022 18:56
...
> > Given these kinds of less mechanical changes, it may make sense to split
> > these from the "trivial" conversions in a treewide patch. The chance of
> > needing a revert from the simple 1:1 conversions is much lower than the
> > need to revert by-hand changes.
> >
> > The Cocci script I suggested in my v1 review gets 80% of the first
> > patch, for example.
> 
> I'll split things up into a mechanical step and a non-mechanical step.
> Good idea.

I'd also do something about the 'get_random_int() & 3' cases.
(ie remainder by 2^n-1)
These can be converted to 'get_random_u8() & 3' (etc).
So they only need one random byte (not 4) and no multiply.

Possibly something based on (the quickly typed, and not C):
#define get_random_below(val) {
	if (builtin_constant(val)) {
		BUILD_BUG_ON(!val || val > 0x100000000ull);
		if (!(val & (val - 1))) {
			if (val <= 0x100)
				return get_random_u8() & (val - 1);
			if (val <= 0x10000)
				return get_random_u16() & (val - 1);
			return get_random_u32() & (val - 1);
		}
	}
	BUILD_BUG_ON(sizeof (val) > 4);
	return ((u64)get_random_u32() * val) >> 32;
}

get_random_below() is a much better name than prandom_u32_max().

David



RE: [PATCH v4 2/6] treewide: use prandom_u32_max() when possible

2022-10-08 Thread David Laight
From: Jason A. Donenfeld
> Sent: 07 October 2022 19:01
> 
> Rather than incurring a division or requesting too many random bytes for
> the given range, use the prandom_u32_max() function, which only takes
> the minimum required bytes from the RNG and avoids divisions.
> 
...
> --- a/lib/cmdline_kunit.c
> +++ b/lib/cmdline_kunit.c
> @@ -76,7 +76,7 @@ static void cmdline_test_lead_int(struct kunit *test)
>   int rc = cmdline_test_values[i];
>   int offset;
> 
> - sprintf(in, "%u%s", prandom_u32_max(256), str);
> + sprintf(in, "%u%s", get_random_int() % 256, str);
>   /* Only first '-' after the number will advance the pointer */
>   offset = strlen(in) - strlen(str) + !!(rc == 2);
>   cmdline_do_one_test(test, in, rc, offset);
> @@ -94,7 +94,7 @@ static void cmdline_test_tail_int(struct kunit *test)
>   int rc = strcmp(str, "") ? (strcmp(str, "-") ? 0 : 1) : 1;
>   int offset;
> 
> - sprintf(in, "%s%u", str, prandom_u32_max(256));
> + sprintf(in, "%s%u", str, get_random_int() % 256);
>   /*
>* Only first and leading '-' not followed by integer
>* will advance the pointer.

Something has gone backwards here
And get_random_u8() looks a better fit.

David



RE: [PATCH v4 4/6] treewide: use get_random_u32() when possible

2022-10-08 Thread David Laight
From: Jason A. Donenfeld
> Sent: 07 October 2022 19:01
> 
> The prandom_u32() function has been a deprecated inline wrapper around
> get_random_u32() for several releases now, and compiles down to the
> exact same code. Replace the deprecated wrapper with a direct call to
> the real function. The same also applies to get_random_int(), which is
> just a wrapper around get_random_u32().
> 
...
> diff --git a/net/802/garp.c b/net/802/garp.c
> index f6012f8e59f0..c1bb67e25430 100644
> --- a/net/802/garp.c
> +++ b/net/802/garp.c
> @@ -407,7 +407,7 @@ static void garp_join_timer_arm(struct garp_applicant 
> *app)
>  {
>   unsigned long delay;
> 
> - delay = (u64)msecs_to_jiffies(garp_join_time) * prandom_u32() >> 32;
> + delay = (u64)msecs_to_jiffies(garp_join_time) * get_random_u32() >> 32;
>   mod_timer(&app->join_timer, jiffies + delay);
>  }
> 
> diff --git a/net/802/mrp.c b/net/802/mrp.c
> index 35e04cc5390c..3e9fe9f5d9bf 100644
> --- a/net/802/mrp.c
> +++ b/net/802/mrp.c
> @@ -592,7 +592,7 @@ static void mrp_join_timer_arm(struct mrp_applicant *app)
>  {
>   unsigned long delay;
> 
> - delay = (u64)msecs_to_jiffies(mrp_join_time) * prandom_u32() >> 32;
> + delay = (u64)msecs_to_jiffies(mrp_join_time) * get_random_u32() >> 32;
>   mod_timer(&app->join_timer, jiffies + delay);
>  }
> 

Aren't those:
delay = prandom_u32_max(msecs_to_jiffies(xxx_join_time));

David



RE: [PATCH v1 3/5] treewide: use get_random_u32() when possible

2022-10-12 Thread David Laight
From: Joe Perches
> Sent: 12 October 2022 20:17
> 
> On Wed, 2022-10-05 at 23:48 +0200, Jason A. Donenfeld wrote:
> > The prandom_u32() function has been a deprecated inline wrapper around
> > get_random_u32() for several releases now, and compiles down to the
> > exact same code. Replace the deprecated wrapper with a direct call to
> > the real function.
> []
> > diff --git a/drivers/infiniband/hw/cxgb4/cm.c 
> > b/drivers/infiniband/hw/cxgb4/cm.c
> []
> > @@ -734,7 +734,7 @@ static int send_connect(struct c4iw_ep *ep)
> >&ep->com.remote_addr;
> > int ret;
> > enum chip_type adapter_type = ep->com.dev->rdev.lldi.adapter_type;
> > -   u32 isn = (prandom_u32() & ~7UL) - 1;
> > +   u32 isn = (get_random_u32() & ~7UL) - 1;
> 
> trivia:
> 
> There are somewhat odd size mismatches here.
> 
> I had to think a tiny bit if random() returned a value from 0 to 7
> and was promoted to a 64 bit value then truncated to 32 bit.
> 
> Perhaps these would be clearer as ~7U and not ~7UL

That makes no difference - the compiler will generate the same code.

The real question is WTF is the code doing?
The '& ~7u' clears the bottom 3 bits.
The '- 1' then sets the bottom 3 bits and decrements the
(random) high bits.

So it is the same as get_random_u32() | 7.
But I bet the coder had something else in mind.

David




RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread David Laight
From: Tony Luck
> Sent: 21 October 2022 05:08

> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.

What about an RT process on a busy system?
The worker threads are pretty low priority.

David



RE: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]

2024-04-16 Thread David Laight
From: Segher Boessenkool
> Sent: 16 April 2024 11:38
> 
> On Tue, Apr 16, 2024 at 03:02:52PM +0530, Naresh Kamboju wrote:
> > In file included from arch/powerpc/include/asm/io.h:672:
> > arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer
> > arithmetic on a null pointer has undefined behavior
> > [-Werror,-Wnull-pointer-arithmetic]
> 
> It is not UB, but just undefined: the program is meaningless.
> 
> It is not a null pointer but even a null pointer constant here.  It
> matters in places, including here.
> 
> It would help if the warnings were more correct :-(

Isn't it only a problem because the NULL pointer isn't required to
be the all-zero bit pattern?

So when do we get a warning from using memset() on a structure
that contains pointers? Since it is equally buggy.

Has anyone ever seen a system where NULL wasn't 'all zeros'?
I've used a system where the 'native' invalid pointer was 'all ones',
but even there the C code used 'all zeros'.

David




RE: [PATCH v4 02/24] powerpc/pseries: Fix alignment of PLPKS structures and buffers

2023-01-27 Thread David Laight
From: Andrew Donnellan
> Sent: 27 January 2023 03:21
> 
> On Thu, 2023-01-26 at 17:31 +0000, David Laight wrote:
> > Changing the size to kzalloc() doesn't help.
> > The alignment depends on the allocator and is only required to have
> > a relatively small alignment (ARCH_MINALIGN?) regardless of the size.
> >
> > IIRC one of the allocators adds a small header to every item.
> > It won't return 16 byte aligned items at all.
> 
> I'm relying on the behaviour described in Documentation/core-
> api/memory-allocation.rst:
> 
> The address of a chunk allocated with kmalloc is aligned to at
> least ARCH_KMALLOC_MINALIGN bytes. For sizes which are a power of
> two, the alignment is also guaranteed to be at least the respective
> size.
> 
> Is this wrong?

The alignment for power of two doesn't match what I've inferred
from reading comments on other patches.

It is true for dma_malloc_coherent() - that does guarantee that a
16k allocate will be aligned on a 16k physical address boundary.

David



RE: [PATCH RFC] PCI/AER: Enable internal AER errors by default

2023-02-13 Thread David Laight
From: Bjorn Helgaas
> Sent: 13 February 2023 21:38
> 
> On Fri, Feb 10, 2023 at 02:33:23PM -0800, Ira Weiny wrote:
> > The CXL driver expects internal error reporting to be enabled via
> > pci_enable_pcie_error_reporting().  It is likely other drivers expect the 
> > same.
> > Dave submitted a patch to enable the CXL side[1] but the PCI AER registers
> > still mask errors.
> >
> > PCIe v6.0 Uncorrectable Mask Register (7.8.4.3) and Correctable Mask
> > Register (7.8.4.6) default to masking internal errors.  The
> > Uncorrectable Error Severity Register (7.8.4.4) defaults internal errors
> > as fatal.
> >
> > Enable internal errors to be reported via the standard
> > pci_enable_pcie_error_reporting() call.  Ensure uncorrectable errors are set
> > non-fatal to limit any impact to other drivers.
> 
> Do you have any background on why the spec makes these errors masked
> by default?  I'm sympathetic to wanting to learn about all the errors
> we can, but I'm a little wary if the spec authors thought it was
> important to mask these by default.

I'd guess that it is for backwards compatibility with older hardware
and/or software that that didn't support error notifications.

Then there are the x86 systems that manage to take the AER
error into some 'board management hardware' which finally
interrupts the kernel with an NMI - and the obvious consequence.
These systems are NEBS? 'qualified' for telecoms use, but take
out a PCIe link and the system crashes.

It is pretty easy to generate a PCIe error.
Any endpoint with two (or more) different sized BARs leaves
a big chunk of PCIe address space that is forwarded by the upstream
bridge but is not responded to.
The requirement to put the MSI-X area in its own BAR pretty much
ensures that such addresses exist.

(Never mind reprogramming the fpga that is terminating the link.)

David




RE: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA

2023-03-31 Thread David Laight
From: Arnd Bergmann
> Sent: 31 March 2023 11:39
...
> Most architectures that have write-through caches (m68k,
> microblaze) or write-back caches but no speculation (all other
> armv4/armv5, hexagon, openrisc, sh, most mips, later xtensa)
> only invalidate before DMA but not after.
> 
> OTOH, most machines that are actually in use today (armv6+,
> powerpc, later mips, microblaze, riscv, nios2) also have to
> deal with speculative accesses, so they end up having to
> invalidate or flush both before and after a DMA_FROM_DEVICE
> and DMA_BIDIRECTIONAL.

nios2 is a simple in-order cpu with a short pipeline
(it is a soft-cpu made from normal fpga logic elements).
Definitely doesn't do speculative accesses.
OTOH any one trying to run Linux on it needs their head examined.

David




RE: [PATCH v4] Kconfig: introduce HAS_IOPORT option and select it as necessary

2023-04-05 Thread David Laight
From: Linuxppc-dev Arnd Bergmann
> Sent: 05 April 2023 21:32
> 
> On Wed, Apr 5, 2023, at 22:00, H. Peter Anvin wrote:
> > On April 5, 2023 8:12:38 AM PDT, Niklas Schnelle  
> > wrote:
> >>On Thu, 2023-03-23 at 17:33 +0100, Niklas Schnelle wrote:
> >>> We introduce a new HAS_IOPORT Kconfig option to indicate support for I/O
> >>> Port access. In a future patch HAS_IOPORT=n will disable compilation of
> >>> the I/O accessor functions inb()/outb() and friends on architectures
> >>> which can not meaningfully support legacy I/O spaces such as s390.
> >>> >>
> >>Gentle ping. As far as I can tell this hasn't been picked to any tree
> >>sp far but also hasn't seen complains so I'm wondering if I should send
> >>a new version of the combined series of this patch plus the added
> >>HAS_IOPORT dependencies per subsystem or wait until this is picked up.
> >
> > You need this on a system supporting not just ISA but also PCI.
> >
> > Typically on non-x86 architectures this is simply mapped into a memory 
> > window.
> 
> I'm pretty confident that the list is correct here, as the HAS_IOPORT
> symbol is enabled exactly for the architectures that have a way to
> map the I/O space. PCIe generally works fine without I/O space, the
> only exception are drivers for devices that were around as early PCI.

Isn't there a difference between cpu that have inb()/outb() (probably
only x86?) and architectures (well computer designs) that can generate
PCI 'I/O' cycles by some means.
It isn't even just PCI I/O cycles, I've used an ARM cpu (SA1100)
that mapped a chunk of physical address space onto PCMCIA I/O cycles.

If the hardware can map a PCI 'IO' bar into normal kernel address
space then the bar and accesses can be treated exactly like a memory bar.
This probably leaves x86 as the outlier where you need (IIRC) io_readl()
and friends that can generate in/out instructions for those accesses.

There are also all the x86 ISA devices which need in/out instructions.
But (with the likely exception of the UART) they are pretty much
platform specific.

So, to my mind at least, HAS_IOPORT is just the wrong question.

David




RE: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-06 Thread David Laight
From: Dave Hansen
> Sent: 05 April 2023 17:37
> 
> On 4/5/23 07:17, Uros Bizjak wrote:
> > Add generic and target specific support for local{,64}_try_cmpxchg
> > and wire up support for all targets that use local_t infrastructure.
> 
> I feel like I'm missing some context.
> 
> What are the actual end user visible effects of this series?  Is there a
> measurable decrease in perf overhead?  Why go to all this trouble for
> perf?  Who else will use local_try_cmpxchg()?

I'm assuming the local_xxx operations only have to be save wrt interrupts?
On x86 it is possible that an alternate instruction sequence
that doesn't use a locked instruction may actually be faster!

Although, maybe, any kind of locked cmpxchg just needs to ensure
the cache line isn't 'stolen', so apart from possible slight
delays on another cpu that gets a cache miss for the line in
all makes little difference.
The cache line miss costs a lot anyway, line bouncing more
and is best avoided.
So is there actually much of a benefit at all?

Clearly the try_cmpxchg helps - but that is a different issue.

David



RE: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-06 Thread David Laight
From: Uros Bizjak
> Sent: 06 April 2023 09:39
> 
> On Thu, Apr 6, 2023 at 10:26 AM David Laight  wrote:
> >
> > From: Dave Hansen
> > > Sent: 05 April 2023 17:37
> > >
> > > On 4/5/23 07:17, Uros Bizjak wrote:
> > > > Add generic and target specific support for local{,64}_try_cmpxchg
> > > > and wire up support for all targets that use local_t infrastructure.
> > >
> > > I feel like I'm missing some context.
> > >
> > > What are the actual end user visible effects of this series?  Is there a
> > > measurable decrease in perf overhead?  Why go to all this trouble for
> > > perf?  Who else will use local_try_cmpxchg()?
> >
> > I'm assuming the local_xxx operations only have to be save wrt interrupts?
> > On x86 it is possible that an alternate instruction sequence
> > that doesn't use a locked instruction may actually be faster!
> 
> Please note that "local" functions do not use lock prefix. Only atomic
> properties of cmpxchg instruction are exploited since it only needs to
> be safe wrt interrupts.

Gah, I was assuming that LOCK was implied - like it is for xchg
and all the bit instructions.

In any case I suspect it makes little difference unless the
locked variant affects the instruction pipeline.
In fact, you may want to stop the cacheline being invalidated
between the read and write in order to avoid an extra cache
line bounce.

David



RE: [PATCH v4] Kconfig: introduce HAS_IOPORT option and select it as necessary

2023-04-11 Thread David Laight
From: Geert Uytterhoeven
> Sent: 11 April 2023 09:50
> 
> Hi David,
> 
> On Wed, Apr 5, 2023 at 11:37 PM David Laight  wrote:
> > From: Linuxppc-dev Arnd Bergmann
> > > Sent: 05 April 2023 21:32
> > >
> > > On Wed, Apr 5, 2023, at 22:00, H. Peter Anvin wrote:
> > > > On April 5, 2023 8:12:38 AM PDT, Niklas Schnelle 
> > > >  wrote:
> > > >>On Thu, 2023-03-23 at 17:33 +0100, Niklas Schnelle wrote:
> > > >>> We introduce a new HAS_IOPORT Kconfig option to indicate support for 
> > > >>> I/O
> > > >>> Port access. In a future patch HAS_IOPORT=n will disable compilation 
> > > >>> of
> > > >>> the I/O accessor functions inb()/outb() and friends on architectures
> > > >>> which can not meaningfully support legacy I/O spaces such as s390.
> > > >>> >>
> > > >>Gentle ping. As far as I can tell this hasn't been picked to any tree
> > > >>sp far but also hasn't seen complains so I'm wondering if I should send
> > > >>a new version of the combined series of this patch plus the added
> > > >>HAS_IOPORT dependencies per subsystem or wait until this is picked up.
> > > >
> > > > You need this on a system supporting not just ISA but also PCI.
> > > >
> > > > Typically on non-x86 architectures this is simply mapped into a memory 
> > > > window.
> > >
> > > I'm pretty confident that the list is correct here, as the HAS_IOPORT
> > > symbol is enabled exactly for the architectures that have a way to
> > > map the I/O space. PCIe generally works fine without I/O space, the
> > > only exception are drivers for devices that were around as early PCI.
> >
> > Isn't there a difference between cpu that have inb()/outb() (probably
> > only x86?) and architectures (well computer designs) that can generate
> > PCI 'I/O' cycles by some means.
> > It isn't even just PCI I/O cycles, I've used an ARM cpu (SA1100)
> > that mapped a chunk of physical address space onto PCMCIA I/O cycles.
> >
> > If the hardware can map a PCI 'IO' bar into normal kernel address
> > space then the bar and accesses can be treated exactly like a memory bar.
> > This probably leaves x86 as the outlier where you need (IIRC) io_readl()
> > and friends that can generate in/out instructions for those accesses.
> >
> > There are also all the x86 ISA devices which need in/out instructions.
> > But (with the likely exception of the UART) they are pretty much
> > platform specific.
> >
> > So, to my mind at least, HAS_IOPORT is just the wrong question.
> 
> Not all PCI controllers support mapping the I/O bar in MMIO space, so
> in general you cannot say that CONFIG_PCI=y means CONFIG_HAS_IOPORT=y.

But a CONFIG_HAS_PCI_IO=y would imply CONFIG_HAS_IOPORT=y.
It is the former that is more interesting for driver support.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)


RE: [PATCH v2 0/5] locking: Introduce local{,64}_try_cmpxchg

2023-04-11 Thread David Laight
From: Dave Hansen
> Sent: 11 April 2023 14:44
> 
> On 4/11/23 04:35, Mark Rutland wrote:
> > I agree it'd be nice to have performance figures, but I think those would 
> > only
> > need to demonstrate a lack of a regression rather than a performance
> > improvement, and I think it's fairly clear from eyeballing the generated
> > instructions that a regression isn't likely.
> 
> Thanks for the additional context.
> 
> I totally agree that there's zero burden here to show a performance
> increase.  If anyone can think of a quick way to do _some_ kind of
> benchmark on the code being changed and just show that it's free of
> brown paper bags, it would be appreciated.  Nothing crazy, just think of
> one workload (synthetic or not) that will stress the paths being changed
> and run it with and without these changes.  Make sure there are not
> surprises.
> 
> I also agree that it's unlikely to be brown paper bag material.

The only thing I can think of is that, on x86, the locked
variant may actually be faster!
Both require exclusive access to the cache line (the unlocked
variant always does the write! [1]).
So if the cache line is contended between CPUs the unlocked
variant might ping-pong the cache line twice!
Of course, if the line is shared like that then performance
is horrid.

[1] I checked on an uncached PCIe address on which I can monitor
the TLP. The write always happens so you can use cmpxchg16b
with a 'known bad value' to do a 16 byte read as a single TLP
(without using an SSE register).

David



RE: [PATCH v2 3/3] treewide: use get_random_u32_between() when possible

2022-11-17 Thread David Laight
From: Theodore Ts'o
> Sent: 17 November 2022 15:43
...
> The problem with "between", "ranged", "spanning" is that they don't
> tell the reader whether we're dealing with an "open interval" or a
> "closed interval".  They are just different ways of saying that it's a
> range between, say, 0 and 20.  But it doesn't tell you whether it
> includes 0 or 20 or not.
> 
> The only way I can see for making it unambiguous is either to use the
> terminology "closed interval" or "inclusive".  And "open" and "closed"
> can have other meanings, so get_random_u32_inclusive() is going to be
> less confusing than get_random_u32_closed().

It has to be said that removing the extra function and requiring
the callers to use 'base + get_random_below(high [+1] - base)' is
likely to be the only way to succinctly make the code readable
and understandable.

Otherwise readers either have to look up another function to see
what it does or waste valuable brain cells on more trivia.

David




RE: [PATCH 2/4] fs: define a firmware security filesystem named fwsecurityfs

2022-11-21 Thread David Laight
From: James Bottomley
> Sent: 21 November 2022 14:03
...
> > Then how does the networking code handle the namespace stuff in
> > sysfs?
> > That seems to work today, or am I missing something?
> 
> have you actually tried?
> 
> jejb@lingrow:~> sudo unshare --net bash
> lingrow:/home/jejb # ls /sys/class/net/
> lo  tun0  tun10  wlan0
> lingrow:/home/jejb # ip link show
> 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
> default qlen 1000
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 
> So, as you see, I've entered a network namespace and ip link shows me
> the only interface I can see in that namespace (a down loopback) but
> sysfs shows me every interface on the system outside the namespace.

You have to remount /sys to get the restricted copy.
eg by running 'ip netns exec namespace command'.
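A sketch of that remount (needs root; 'demo-ns' is a made-up namespace name):

```shell
# Continuing the experiment above: unshare both the network and mount
# namespaces, then remount /sys so it reflects the new netns.
sudo unshare --net --mount sh -c '
	mount -t sysfs sysfs /sys
	ls /sys/class/net          # now shows only "lo"
'

# 'ip netns exec' does the equivalent sysfs remount for you:
sudo ip netns add demo-ns
sudo ip netns exec demo-ns ls /sys/class/net   # only "lo"
sudo ip netns del demo-ns
```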

David



RE: [PATCH 08/41] mm: introduce CONFIG_PER_VMA_LOCK

2023-01-11 Thread David Laight
From: Ingo Molnar
> Sent: 11 January 2023 09:54
> 
> * Michal Hocko  wrote:
> 
> > On Tue 10-01-23 16:44:42, Suren Baghdasaryan wrote:
> > > On Tue, Jan 10, 2023 at 4:39 PM Davidlohr Bueso  wrote:
> > > >
> > > > On Mon, 09 Jan 2023, Suren Baghdasaryan wrote:
> > > >
> > > > >This configuration variable will be used to build the support for VMA
> > > > >locking during page fault handling.
> > > > >
> > > > >This is enabled by default on supported architectures with SMP and MMU
> > > > >set.
> > > > >
> > > > >The architecture support is needed since the page fault handler is 
> > > > >called
> > > > >from the architecture's page faulting code which needs modifications to
> > > > >handle faults under VMA lock.
> > > >
> > > > I don't think that per-vma locking should be something that is 
> > > > user-configurable.
> > > > It should just be dependent on the arch. So maybe just remove 
> > > > CONFIG_PER_VMA_LOCK?
> > >
> > > Thanks for the suggestion! I would be happy to make that change if
> > > there are no objections. I think the only pushback might have been the
> > > vma size increase but with the latest optimization in the last patch
> > > maybe that's less of an issue?
> >
> > Has vma size ever been a real problem? Sure there might be a lot of those
> > but your patch increases it by rwsem (without the last patch) which is
> > something like 40B on top of 136B vma so we are talking about 400B in
> > total which even with wild mapcount limits shouldn't really be
> > prohibitive. With a default map count limit we are talking about 2M
> > increase at most (per address space).
> >
> > Or are you aware of any specific usecases where vma size is a real
> > problem?
> 
> 40 bytes for the rwsem, plus the patch also adds a 32-bit sequence counter:
> 
>   + int vm_lock_seq;
>   + struct rw_semaphore lock;
> 
> So it's +44 bytes.

Depending on whether vm_lock_seq lands in a padding hole or not,
it will be 40 or 48 bytes.

But if these structures are allocated individually (not as an array)
then it depends on how many items kmalloc() fits into a page (or 2, 4).

David




RE: [PATCH 12/41] mm: add per-VMA lock and helper functions to control it

2023-01-18 Thread David Laight
...
> > One thing that might be gnarly here is that I think you might not be
> > allowed to use up_read() to fully release ownership of an object -
> > from what I remember, I think that up_read() (unlike something like
> > spin_unlock()) can access the lock object after it's already been
> > acquired by someone else.
> 
> Yes, I think you are right. From a look into the code it seems that
> the UAF is quite unlikely as there is a ton of work to be done between
> vma_write_lock used to prepare vma for removal and actual removal.
> That doesn't make it less of a problem though.

All it takes is a hardware interrupt.
Especially if the softint code can also run.

David




RE: [PATCH V2] tools/perf/tests: Fix string substitutions in build id test

2023-01-19 Thread David Laight
From: Athira Rajeev
> Sent: 19 January 2023 11:31
...
> diff --git a/tools/perf/tests/shell/buildid.sh b/tools/perf/tests/shell/buildid.sh
> index aaf851108ca3..43e43e131be7 100755
> --- a/tools/perf/tests/shell/buildid.sh
> +++ b/tools/perf/tests/shell/buildid.sh
> @@ -66,7 +66,7 @@ check()
>   esac
>   echo "build id: ${id}"
> 
> - link=${build_id_dir}/.build-id/${id:0:2}/${id:2}
> + link=${build_id_dir}/.build-id/$(echo ${id}|cut -c 1-2)/$(echo ${id}|cut -c 3-)
>   echo "link: ${link}"

That is horrid, why not just use valid shell substitutions, eg:
id_file=${id#??}
id_dir=${id%$id_file}
link=$build_id_dir/.build-id/$id_dir/$id_file

...
> - check ${@: -1}
> + check $last

Since this is the end of the shell function you can avoid the eval
by doing:
shift $(($# - 1))
check $1
or maybe:
args="$*"
check ${args##* }

Those should be ok in all posix shells.
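For illustration, the suggested substitutions in action (the build id below is made up):

```shell
# Split a build id into directory and file parts with POSIX expansions.
id=abcdef123456
id_file=${id#??}            # drop the first two characters -> cdef123456
id_dir=${id%"$id_file"}     # drop that suffix from the end  -> ab
echo "$id_dir/$id_file"     # prints: ab/cdef123456

# Take the last positional parameter without bash's ${@: -1}:
set -- perf record -e cycles true
shift $(($# - 1))
echo "$1"                   # prints: true
```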

David



RE: [PATCH V3] tools/perf/tests: Fix string substitutions in build id test

2023-01-19 Thread David Laight
From: Athira Rajeev
> Sent: 19 January 2023 14:27
...
> The test script "tests/shell/buildid.sh" uses some of the
> string substitution ways which are supported in bash, but not in
> "sh" or other shells. Above error on line number 69 that reports
> "Bad substitution" is:

Looks better - assuming it works :-)

David



RE: [PATCH V3] tools/perf/tests: Fix string substitutions in build id test

2023-01-20 Thread David Laight
From: Arnaldo Carvalho de Melo
> Sent: 19 January 2023 17:00
> 
> On Thu, Jan 19, 2023 at 03:49:15PM +0000, David Laight wrote:
> > From: Athira Rajeev
> > > Sent: 19 January 2023 14:27
> > ...
> > > The test script "tests/shell/buildid.sh" uses some of the
> > > string substitution ways which are supported in bash, but not in
> > > "sh" or other shells. Above error on line number 69 that reports
> > > "Bad substitution" is:
> >
> > Looks better - assuming it works :-)
> 
> :-)
> 
> Can I take this as an:
> 
> Acked-by: David Laight 

I'm not sure that is worth anything.
You can add a Reviewed-by

David




RE: Calculating array sizes in C - was: Re: Build regressions/improvements in v6.2-rc1

2023-01-20 Thread David Laight
From: Segher Boessenkool
> Sent: 20 January 2023 10:54
...
> > > I suggest to file a bug against gcc complaining about a "spurious
> > > warning", and using "-Werror -Wno-error-sizeof-pointer-div" until gcc is
> > > adapted to not emit the warning about the pointer division if the result
> > > is not used.
> 
> Yeah.  If the first operand of a conditional operator is non-zero, the
> second operand is not evaluated, and if the first is zero, the third
> operand is not evaluated.  It is better if we do not warn about
> something we do not evaluate.  In cases like here where it is clear at
> compile time which branch is taken, that shouldn't be too hard.
> 
> Can someone please file a GCC PR?  With reduced testcase preferably.

It's not a bug.
All the operands of the conditional operator have to be valid.
It might be that the optimiser can discard one, but that happens
much later on.
Even the operands of __builtin_choose_expr() have to be valid - but can
have different types.

I'm not sure what the code is trying to do or why it is failing.
Was it a failure in userspace - where the extension that allows
sizeof (void) isn't enabled?

FWIW you can check for a compile-time NULL (or 0) with:
#define is_null(x) (sizeof *(0 ? (void *)(x) : (int *)0) != 1)

Although that is a compile-time error for non-NULL unless
'void *' arithmetic is allowed.

David




RE: [PATCH v4 02/24] powerpc/pseries: Fix alignment of PLPKS structures and buffers

2023-01-26 Thread David Laight
From: Segher Boessenkool
> Sent: 26 January 2023 17:19
> 
> On Thu, Jan 26, 2023 at 12:09:53AM +1100, Michael Ellerman wrote:
> > Andrew Donnellan  writes:
> > > A number of structures and buffers passed to PKS hcalls have alignment
> > > requirements, which could on occasion cause problems:
> > >
> > > - Authorisation structures must be 16-byte aligned and must not cross a
> > >   page boundary
> > >
> > > - Label structures must not cross page boundaries
> > >
> > > - Password output buffers must not cross page boundaries
> > >
> > > Round up the allocations of these structures/buffers to the next power of
> > > 2 to make sure this happens.
> >
> > It's not the *next* power of 2, it's the *nearest* power of 2, including
> > the initial value if it's already a power of 2.
> 
> It's not the nearest either, the nearest power of two to 65 is 64.  You
> could say "but, round up" to which I would say "round?"  :-P
> 
> "Adjust the allocation size to be the smallest power of two greater than
> or equal to the given size."
> 
> "Pad to a power of two" in shorthand.  "Padded to a power of two if
> necessary" if you want to emphasise it can be a no-op.

Changing the size passed to kzalloc() doesn't help.
The alignment depends on the allocator and is only required to be
a relatively small value (ARCH_KMALLOC_MINALIGN?) regardless of the size.

IIRC one of the allocators adds a small header to every item.
It won't return 16 byte aligned items at all.

David




RE: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-08-29 Thread David Laight
From: Nathan Chancellor
> Sent: 28 August 2019 19:45
...
> However, I think that -fno-builtin-* would be appropriate here because
> we are providing our own setjmp implementation, meaning clang should not
> be trying to do anything with the builtin implementation like building a
> declaration for it.

Isn't implementing setjmp impossible unless you tell the compiler that
your function is 'setjmp-like'?

For instance I think it all goes horribly wrong if the generated code
looks like:
push local_variable
// setup arguments to setjmp
call setjmp
pop local_variable
// check return value of setjmp

With a naive compiler and simple ABI setjmp just has to save the
return address, stack pointer and caller saved registers.
With modern compilers and ABI I doubt you can implement setjmp
without some help from the compiler.

It is probably best to avoid setjmp/longjmp completely.

David




RE: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-09-04 Thread David Laight
From: Nathan Chancellor [mailto:natechancel...@gmail.com]
> Sent: 04 September 2019 01:24
> On Tue, Sep 03, 2019 at 02:31:28PM -0500, Segher Boessenkool wrote:
> > On Mon, Sep 02, 2019 at 10:55:53PM -0700, Nathan Chancellor wrote:
> > > On Thu, Aug 29, 2019 at 09:59:48AM +, David Laight wrote:
> > > > From: Nathan Chancellor
> > > > > Sent: 28 August 2019 19:45
> > > > ...
> > > > > However, I think that -fno-builtin-* would be appropriate here because
> > > > > we are providing our own setjmp implementation, meaning clang should 
> > > > > not
> > > > > be trying to do anything with the builtin implementation like 
> > > > > building a
> > > > > declaration for it.
> > > >
> > > > Isn't implementing setjmp impossible unless you tell the compiler that
> > > > your function is 'setjmp-like'?
> > >
> > > No idea, PowerPC is the only architecture that does such a thing.
> >
> > Since setjmp can return more than once, yes, exciting things can happen
> > if you do not tell the compiler about this.
> >
> >
> > Segher
> >
> 
> Fair enough so I guess we are back to just outright disabling the
> warning.

Just disabling the warning won't stop the compiler generating code
that breaks a 'user' implementation of setjmp().

David




RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread David Laight
From: Adhemerval Zanella
> Sent: 21 April 2020 16:01
> 
> On 21/04/2020 11:39, Rich Felker wrote:
> > On Tue, Apr 21, 2020 at 12:28:25PM +0000, David Laight wrote:
> >> From: Nicholas Piggin
> >>> Sent: 20 April 2020 02:10
> >> ...
> >>>>> Yes, but does it really matter to optimize this specific usage case
> >>>>> for size? glibc, for instance, tries to leverage the syscall mechanism
> >>>>> by adding some complex pre-processor asm directives.  It optimizes
> >>>>> the syscall code size in most cases.  For instance, kill in static case
> >>>>> generates on x86_64:
> >>>>>
> >>>>>  <__kill>:
> >>>>>0:   b8 3e 00 00 00  mov$0x3e,%eax
> >>>>>5:   0f 05   syscall
> >>>>>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
> >>>>>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
> >>
> >> Hmmm... that cmp + jae is unnecessary here.
> >
> > It's not.. Rather the objdump was just mistakenly done without -r so
> > it looks like a nop jump rather than a conditional tail call to the
> > function that sets errno.
> >
> 
> Indeed, the output with -r is:
> 
>  <__kill>:
>0:   b8 3e 00 00 00  mov$0x3e,%eax
>5:   0f 05   syscall
>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>
> f: R_X86_64_PLT32   __syscall_error-0x4
>   13:   c3  retq

Yes, I probably should have remembered it looked like that :-)
...
> >> I also suspect it gets predicted very badly.
> >
> > I doubt that. This is a very standard idiom and the size of the offset
> > (which is necessarily 32-bit because it has a relocation on it) is
> > orthogonal to the condition on the jump.

Yes, it only gets mispredicted as badly as any other conditional jump.
I believe modern intel x86 will randomly predict it taken (regardless
of the direction) and then hit a TLB fault on text.unlikely :-)

> > FWIW a syscall like kill takes global kernel-side locks to be able to
> > address a target process by pid, and the rate of meaningful calls you
> > can make to it is very low (since it's bounded by time for target
> > process to act on the signal). Trying to optimize it for speed is
> > pointless, and even size isn't important locally (although in
> > aggregate, lots of wasted small size can add up to more pages = more
> > TLB entries = ...).
> 
> I agree and I would prefer to focus on code simplicity to have a
> platform neutral way to handle error and let the compiler optimize
> it than messy with assembly macros to squeeze this kind of
> micro-optimizations.

syscall entry does get micro-optimised.
Real speed-ups can probably be found by optimising other places.
I've a patch I need to resubmit that should improve the reading
of iov[] from user space.

David



RE: [musl] Powerpc Linux 'scv' system call ABI proposal take 2

2020-04-21 Thread David Laight
From: Nicholas Piggin
> Sent: 20 April 2020 02:10
...
> >> Yes, but does it really matter to optimize this specific usage case
> >> for size? glibc, for instance, tries to leverage the syscall mechanism
> >> by adding some complex pre-processor asm directives.  It optimizes
> >> the syscall code size in most cases.  For instance, kill in static case
> >> generates on x86_64:
> >>
> >>  <__kill>:
> >>0:   b8 3e 00 00 00  mov$0x3e,%eax
> >>5:   0f 05   syscall
> >>7:   48 3d 01 f0 ff ff   cmp$0xf001,%rax
> >>d:   0f 83 00 00 00 00   jae13 <__kill+0x13>

Hmmm... that cmp + jae is unnecessary here.
It is also a 32bit offset jump.
I also suspect it gets predicted very badly.

> >>   13:   c3  retq
> >>
> >> While on musl:
> >>
> >>  <kill>:
> >>0:  48 83 ec 08 sub$0x8,%rsp
> >>4:  48 63 ffmovslq %edi,%rdi
> >>7:  48 63 f6movslq %esi,%rsi
> >>a:  b8 3e 00 00 00  mov$0x3e,%eax
> >>f:  0f 05   syscall
> >>   11:  48 89 c7mov%rax,%rdi
> >>   14:  e8 00 00 00 00  callq  19 
> >>   19:  5a  pop%rdx
> >>   1a:  c3  retq
> >
> > Wow that's some extraordinarily bad codegen going on by gcc... The
> > sign-extension is semantically needed and I don't see a good way
> > around it (glibc's asm is kinda a hack taking advantage of kernel not
> > looking at high bits, I think), but the gratuitous stack adjustment
> > and refusal to generate a tail call isn't. I'll see if we can track
> > down what's going on and get it fixed.

A suitable cast might get rid of the sign extension.
Possibly just (unsigned int).

David


