RE: hardcoded SIGSEGV in __die() ?

2020-03-25 Thread David Laight
From: Joakim Tjernlund
> Sent: 23 March 2020 15:45
...
> > > I tried to follow that chain thinking it would end up sending a signal to 
> > > user space but I cannot
> see
> > > that happens. Seems to be related to debugging.
> > >
> > > In short, I cannot see any signal being delivered to user space. If so 
> > > that would explain why
> > > our user space process never dies.
> > > Is there a signal hidden in machine_check handler for SIGBUS I cannot see?
> > >
> >
> > Isn't it done in do_exit(), called from oops_end() ?
> 
> hmm, so it seems. The odd thing though is that do_exit takes an exit code, 
> not signal number.
> Also, feels a bit odd to force an exit(that we haven't seen happening) rather 
> than just a signal.

Isn't there something 'magic' that converts EFAULT into SIGSEGV?

David



RE: [RESEND PATCH v2 9/9] ath5k: Constify ioreadX() iomem argument (as in generic implementation)

2020-02-24 Thread David Laight
From: Geert Uytterhoeven
> Sent: 24 February 2020 12:54
> To: Krzysztof Kozlowski 
...
> > > > diff --git a/drivers/net/wireless/ath/ath5k/ahb.c 
> > > > b/drivers/net/wireless/ath/ath5k/ahb.c
> > > > index 2c9cec8b53d9..8bd01df369fb 100644
> > > > --- a/drivers/net/wireless/ath/ath5k/ahb.c
> > > > +++ b/drivers/net/wireless/ath/ath5k/ahb.c
> > > > @@ -138,18 +138,18 @@ static int ath_ahb_probe(struct platform_device 
> > > > *pdev)
> > > >
> > > > if (bcfg->devid >= AR5K_SREV_AR2315_R6) {
> > > > /* Enable WMAC AHB arbitration */
> > > > -   reg = ioread32((void __iomem *) AR5K_AR2315_AHB_ARB_CTL);
> > > > +   reg = ioread32((const void __iomem *) 
> > > > AR5K_AR2315_AHB_ARB_CTL);
> > >
> > > While I understand why the parameter of ioread32 should be const, I
> > > don't see a reason for these casts on the users' side. What does it
> > > bring except longer code to read?
> >
> > Because the argument is an int:
> >
> > drivers/net/wireless/ath/ath5k/ahb.c: In function ‘ath_ahb_probe’:
> > drivers/net/wireless/ath/ath5k/ahb.c:141:18: warning: passing argument 1 of 
> > ‘ioread32’ makes pointer
> from integer without a cast [-Wint-conversion]
> >reg = ioread32(AR5K_AR2315_AHB_ARB_CTL);
> 
> That's an argument for keeping the cast to "void __iomem *", not for
> adding the "const", right?

Or more likely change the definitions to use a struct for the layout.
That also stops the constants being used in the wrong place.

David



RE: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()

2020-02-05 Thread David Laight
From: Wei Yang
> Sent: 05 February 2020 09:59
...
> If it is me, I would like to take out these two similar logic out.
> 
> For example:
> 
>   if () {
>   } else if () {
>   } else {
>   goto out;
>   }

I'm pretty sure the kernel layout rules disallow 'else if'.
It is also pretty horrid unless the conditionals are all related
(so it is almost a switch statement).

David




RE: z constraint in powerpc inline assembly ?

2020-01-16 Thread David Laight
> You mean the mpc8xx , but I'm also using the mpc832x which has a e300c2
> core and is capable of executing 2 insns in parallel if not in the same
> Unit.

That should let you do a memory read and an add.
(I can't remember if the ppc has 'add from memory' but that is
likely to use both units anyway.)
An infinitely unrolled loop will then be 4 bytes/clock (for 32bit).
If you get to 3 bytes/clock for a real loop you are doing ok.

Remember, unroll too much and you displace other code from
the i-cache. Also the i-cache loads themselves kill you.
(A hot-cache benchmark won't see this...)

David



RE: z constraint in powerpc inline assembly ?

2020-01-16 Thread David Laight
From: Segher Boessenkool
> Sent: 16 January 2020 16:22
...
> > However a loop of 'add with carry' instructions may not be the
> > fastest code by any means.
> > Because the carry flag is needed for every 'adc' you can't do more
> > that one adc per clock.
> > This limits you to 8 bytes/clock on a 64bit system - even one
> > that can schedule multiple memory reads and lots of instructions
> > every clock.
> >
> > I don't know ppc, but on x86 you don't even get 1 adc per clock
> > until very recent (Haswell I think) cpus.
> > Sandy/Ivy bridge will do so if you add to alternate registers.
> 
> The carry bit is renamed just fine on all modern Power cpus.  On Power9
> there is an extra carry bit, precisely so you can do two interleaved
> chains.  And you can run lots of these insns at once, every cycle.

The limitation on old x86 was that each u-op could only have 2 inputs.
Since adc needs 3 it always took 2 clocks.
The first 'fix' still had an extra delay on the result register.

There is also a big problem of false dependencies against the flags.
PPC may not have this problem, but it makes it very difficult to
loop carry any of the flags.
Using 'dec' (which doesn't affect carry, but does set zero) is really slow.

Even though the latest x86 cpus have ADOX and ADCX (which use the
overflow and carry flags) and can run them in parallel, the LOOP 'dec,
jump non-zero' instruction is microcoded and serialising!
I have got 12 bytes/clock without too much unrolling, but it is hard
work and probably not worth the effort.

...
> Christophe uses a very primitive 32-bit cpu, not even superscalar.  A
> loop doing adde is pretty much optimal, probably wants some unrolling
> though.

Or interleaving so it does read_a, [read_b, adc_a, read_a, adc_b]* adc_a.
That might be enough to get the loop 'for free' if there are memory stalls.

> Do normal 64-bit adds, and in parallel also accumulate the values shifted
> right by 32 bits.  You can add 4G of them this way, and restore the 96-bit
> actual sum from these two accumulators, so that you can fold it to a proper
> ones' complement sum after the loop.

That is probably too many instructions per word - unless you are using
simd ones.

> But you can easily beat 8B/clock using vectors, or doing multiple addition
> chains (interleaved) in parallel.  Not that it helps, your limiting factor
> is the memory bandwidth anyway, if anything in the memory pipeline stalls
> all your optimisations are for nothing.

Yep, if the data isn't in the L1 cache anything complex is a waste of time.

Unrolling too much just makes the top/bottom code take too long and
then it dominates for a lot of 'real world' buffers.

David




RE: z constraint in powerpc inline assembly ?

2020-01-16 Thread David Laight
From: Christophe Leroy
> Sent: 16 January 2020 06:12
> 
> I'm trying to see if we could enhance TCP checksum calculations by
> splitting inline assembly blocks to give GCC the opportunity to mix it
> with other stuff, but I'm getting difficulties with the carry.

If you are trying to 'loop carry' the 'carry flag' with 'add with carry'
instructions you'll almost certainly need to write the loop in asm.
Since the loop itself is simple, this probably doesn't matter.

However a loop of 'add with carry' instructions may not be the
fastest code by any means.
Because the carry flag is needed for every 'adc' you can't do more
than one adc per clock.
This limits you to 8 bytes/clock on a 64bit system - even one
that can schedule multiple memory reads and lots of instructions
every clock.

I don't know ppc, but on x86 you don't even get 1 adc per clock
until very recent (Haswell I think) cpus.
Sandy/Ivy bridge will do so if you add to alternate registers.

For earlier cpu it is actually difficult to beat the 4 bytes/clock
you get by adding 32bit values to a 64bit register in C code.

One possibility is to do a normal add then shift the carry
into a separate register.
After 64 words use 'popcnt' to sum the carry bits.
With 2 accumulators (and carry shifts) you'd need to
break the loop every 1024 bytes.
This should beat 8 bytes/clock if you can execute more than
1 memory read, one add and one shift each clock.
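A rough C sketch of the single-accumulator version (not tested and not the
kernel csum code; the name and types are made up, and the final ones'
complement fold is left out):

	#include <stddef.h>
	#include <stdint.h>

	static uint64_t csum_words(const uint64_t *p, size_t nwords)
	{
		uint64_t sum = 0, carries = 0, bits = 0;
		size_t i;

		for (i = 0; i < nwords; i++) {
			uint64_t prev = sum;

			sum += p[i];			/* plain add, no adc */
			bits = (bits << 1) | (sum < prev);  /* save the carry bit */
			if ((i & 63) == 63) {		/* every 64 words */
				carries += __builtin_popcountll(bits);
				bits = 0;
			}
		}
		carries += __builtin_popcountll(bits);
		return sum + carries;			/* 16-bit fold omitted */
	}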

I've not tried this on an old x86 cpu - which would need a software
'popcnt'. It got close to 8 bytes/clock on Ivy bridge.
It almost certainly beats the 4 bytes/clock of the current x86-64
code on such systems.

David



RE: [RFT 00/13] iomap: Constify ioreadX() iomem argument

2020-01-08 Thread David Laight
From: Christophe Leroy
> Sent: 08 January 2020 08:49
...
> And as pointed by Arnd, the volatile is really only necessary for the
> dereference itself, should the arch use dereferencing.

I've had trouble with some versions of gcc and reading of 'volatile unsigned 
char *'.
It tended to follow the memory read with an extra mask with 0xff.
(I suspect that internally the value landed into a temporary 'int' variable.)

I got better code using memory barriers.
So putting an asm barrier that names the exact location of the memory read
either side of the read should have the desired effect without adding
extra instructions.
(You might think 'volatile' would mean that - but it doesn't.)
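For example, something like this (a sketch of the idea only, gcc-style asm
assumed; not an existing kernel accessor):

	/* compiler barrier tied to a single location */
	#define loc_barrier(x)	asm volatile("" : "+m" (x))

	static inline unsigned char read_once_byte(unsigned char *addr)
	{
		unsigned char val;

		loc_barrier(*addr);	/* force a fresh read of *addr */
		val = *addr;		/* plain, non-volatile access */
		loc_barrier(*addr);	/* keep the read at this point */
		return val;
	}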

David



RE: [PATCH] powerpc/tools: Don't quote $objdump in scripts

2019-10-25 Thread David Laight
From: Segher Boessenkool
> Sent: 24 October 2019 18:29
> On Thu, Oct 24, 2019 at 11:47:30AM +1100, Michael Ellerman wrote:
> > Some of our scripts are passed $objdump and then call it as
> > "$objdump". This doesn't work if it contains spaces because we're
> > using ccache, for example you get errors such as:
> >
> >   ./arch/powerpc/tools/relocs_check.sh: line 48: ccache ppc64le-objdump: No 
> > such file or directory
> >   ./arch/powerpc/tools/unrel_branch_check.sh: line 26: ccache 
> > ppc64le-objdump: No such file or directory
> >
> > Fix it by not quoting the string when we expand it, allowing the shell
> > to do the right thing for us.
> 
> This breaks things for people with spaces in their paths.  Why doesn't your
> user use something like  alias objdump="ccache ppc64le-objdump"  , instead?

Given that make doesn't handle spaces in filenames it is likely that a build
will have terrible issues if there are spaces in any directory names.
(It is a right PITA running make on a certain OS.)
For command paths, spaces can be replaced by ? relying on shell globbing
to restore the space.

OTOH rather than alias, put the name of a script containing:
#! /bin/sh
exec ccache ppc64le-objdump "$@"
into $objdump.

David




RE: passing NULL to clock_getres (VDSO): terminated by unexpected signal 11

2019-10-21 Thread David Laight
From: Thomas Gleixner
> Sent: 20 October 2019 20:53
> On Sun, 20 Oct 2019, Andreas Schwab wrote:
> > On Okt 20 2019, Thomas Gleixner wrote:
> >
> > > POSIX does not mention anything about the validity of the pointer handed 
> > > to
> > > clock_getres().
> >
> > Sure it does: "If the argument res is not NULL, the resolution of the
> > specified clock shall be stored in the location pointed to by res.  If
> > res is NULL, the clock resolution is not returned.".
> 
> Sigh, that makes a lot of sense - NOT.
> 
> But for the sake of making a non-sensical specification happy we can add a
> NULL pointer check for this. The interesting question is what should be
> returned in this case. The kernel returns EFAULT which is probably not
> POSIX compliant either.

The application won't see errno == EFAULT.
EFAULT gets converted to SIGSEGV (probably) in the return-to-user code path.

David




RE: [PATCH v5 03/23] PCI: hotplug: Add a flag for the movable BARs feature

2019-09-30 Thread David Laight
From: Bjorn Helgaas
> Sent: 27 September 2019 23:02
> On Fri, Aug 16, 2019 at 07:50:41PM +0300, Sergey Miroshnichenko wrote:
> > When hot-adding a device, the bridge may have windows not big enough (or
> > fragmented too much) for newly requested BARs to fit in. And expanding
> > these bridge windows may be impossible because blocked by "neighboring"
> > BARs and bridge windows.
> >
> > Still, it may be possible to allocate a memory region for new BARs with the
> > following procedure:
> >
> > 1) notify all the drivers which support movable BARs to pause and release
> >the BARs; the rest of the drivers are guaranteed that their devices will
> >not get BARs moved;
> >
> > 2) release all the bridge windows except of root bridges;
> >
> > 3) try to recalculate new bridge windows that will fit all the BAR types:
> >- fixed;
> >- immovable;
> >- movable;
> >- newly requested by hot-added devices;
> >
> > 4) if the previous step fails, disable BARs for one of the hot-added
> >devices and retry from step 3;
> >
> > 5) notify the drivers, so they remap BARs and resume.
> 
> You don't do the actual recalculation in *this* patch, but since you
> mention the procedure here, are we confident that we never make things
> worse?
> 
> It's possible that a hot-add will trigger this attempt to move things
> around, and it's possible that we won't find space for the new device
> even if we move things around.  But are we certain that every device
> that worked *before* the hot-add will still work *afterwards*?
> 
> Much of the assignment was probably done by the BIOS using different
> algorithms than Linux has, so I think there's some chance that the
> BIOS did a better job and if we lose that BIOS assignment, we might
> not be able to recreate it.

Yep, removing everything and starting again is probably OTT and most of the 
churn won't help.

I think you need to work out what can be moved in order to make the required
resources available to each bus and then make the required changes.

In the simplest case you are trying to add a resource below a bridge, so you
need to 'shuffle' everything allocated after that bridge to later addresses (etc).

Many devices that support address reassignment might not need to be moved - so
there is no point remapping them.

There is also the case where a device that is present but not currently in use
could be taken through a remove+insert sequence in order to change its resources.
Much easier to implement than 'remap while active'.
This would require a call into the driver (that can sleep) to ask whether it is
idle (and probably one at the end if the remove wasn't done).

David




RE: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-09-04 Thread David Laight
From: Nathan Chancellor [mailto:natechancel...@gmail.com]
> Sent: 04 September 2019 01:24
> On Tue, Sep 03, 2019 at 02:31:28PM -0500, Segher Boessenkool wrote:
> > On Mon, Sep 02, 2019 at 10:55:53PM -0700, Nathan Chancellor wrote:
> > > On Thu, Aug 29, 2019 at 09:59:48AM +, David Laight wrote:
> > > > From: Nathan Chancellor
> > > > > Sent: 28 August 2019 19:45
> > > > ...
> > > > > However, I think that -fno-builtin-* would be appropriate here because
> > > > > we are providing our own setjmp implementation, meaning clang should 
> > > > > not
> > > > > be trying to do anything with the builtin implementation like 
> > > > > building a
> > > > > declaration for it.
> > > >
> > > > Isn't implementing setjmp impossible unless you tell the compiler that
> > > > you function is 'setjmp-like' ?
> > >
> > > No idea, PowerPC is the only architecture that does such a thing.
> >
> > Since setjmp can return more than once, yes, exciting things can happen
> > if you do not tell the compiler about this.
> >
> >
> > Segher
> >
> 
> Fair enough so I guess we are back to just outright disabling the
> warning.

Just disabling the warning won't stop the compiler generating code
that breaks a 'user' implementation of setjmp().

David




RE: [PATCH] powerpc: Avoid clang warnings around setjmp and longjmp

2019-08-29 Thread David Laight
From: Nathan Chancellor
> Sent: 28 August 2019 19:45
...
> However, I think that -fno-builtin-* would be appropriate here because
> we are providing our own setjmp implementation, meaning clang should not
> be trying to do anything with the builtin implementation like building a
> declaration for it.

Isn't implementing setjmp impossible unless you tell the compiler that
your function is 'setjmp-like'?

For instance I think it all goes horribly wrong if the generated code
looks like:
	push local_variable
	// setup arguments to setjmp
	call setjmp
	pop local_variable
	// check return value of setjmp

With a naive compiler and simple ABI setjmp just has to save the
return address, stack pointer and caller saved registers.
With modern compilers and ABI I doubt you can implement setjmp
without some help from the compiler.

It is probably best to avoid setjmp/longjmp completely.

David




RE: [PATCH v1 1/2] open: add close_range()

2019-05-23 Thread David Laight
From:  Konstantin Khlebnikov
> Sent: 23 May 2019 17:22

>  > In addition, the syscall will also work for tasks that do not have procfs
>  > mounted and on kernels that do not have procfs support compiled in. In such
>  > situations the only way to make sure that all file descriptors are closed
>  > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
>  > OPEN_MAX trickery (cf. comment [8] on Rust).

Code using RLIMIT_NOFILE is broken.
It is easy to reduce the hard limit below that of an open fd.

David



RE: [PATCH] vsprintf: Do not break early boot with probing addresses

2019-05-15 Thread David Laight
From: Petr Mladek
> Sent: 15 May 2019 08:36
> On Tue 2019-05-14 14:37:51, Steven Rostedt wrote:
> >
> > [ Purple is a nice shade on the bike shed. ;-) ]
> >
> > On Tue, 14 May 2019 11:02:17 +0200
> > Geert Uytterhoeven  wrote:
> >
> > > On Tue, May 14, 2019 at 10:29 AM David Laight  
> > > wrote:
> > > > > And I like Steven's "(fault)" idea.
> > > > > How about this:
> > > > >
> > > > >   if ptr < PAGE_SIZE  -> "(null)"
> > > > >   if IS_ERR_VALUE(ptr)-> "(fault)"
> > > > >
> > > > >   -ss
> > > >
> > > > Or:
> > > > if (ptr < PAGE_SIZE)
> > > > return ptr ? "(null+)" : "(null)";
> >
> > Hmm, that is useful.
> >
> > > > if IS_ERR_VALUE(ptr)
> > > > return "(errno)"
> >
> > I still prefer "(fault)" as is pretty much all I would expect from a
> > pointer dereference, even if it is just bad parsing of, say, a parsing
> > an MAC address. "fault" is generic enough. "errno" will be confusing,
> > because that's normally a variable not a output.
> >
> > >
> > > Do we care about the value? "(-E%u)"?
> >
> > That too could be confusing. What would (-E22) be considered by a user
> > doing an sprintf() on some string. I know that would confuse me, or I
> > would think that it was what the %pX displayed, and wonder why it
> > displayed it that way. Whereas "(fault)" is quite obvious for any %p
> > use case.
> 
> This discussion clearly shows that it is hard to make anyone happy.
> 
> I considered switching to "(fault)" because there seems to be more
> people in favor of this.
> 
> But there is used also "(einval)" when an unsupported pointer
> modifier is passed. The idea is to show error codes that people
> are familiar with.
> 
> It might have been better to use the uppercase "(EFAULT)" and
> "(EINVAL)" to make it more obvious. But I wanted to follow
> the existing style with the lowercase "(null)".

Printing 'fault' when the code was (trying to) validate the
address was ok.
When the only check is for an -errno value it seems wrong as
most invalid addresses will actually fault (and panic).

The reason modern printf implementations generate "(null)" is that it is far too
easy for a diagnostic print to fail to test a pointer.
It also makes it easier when 'throwing in' printf while debugging
to add a single trace that will work regardless of whether a
call had succeeded or not.

With the Linux kernel putting errno values into pointers it
seems likely that most invalid pointers in printf will actually
be error values.
Printing the value will be helpful during debugging - as a
trace can be put after a call and show the parameters and result.

David




RE: [PATCH] vsprintf: Do not break early boot with probing addresses

2019-05-14 Thread David Laight
> And I like Steven's "(fault)" idea.
> How about this:
> 
>   if ptr < PAGE_SIZE  -> "(null)"
>   if IS_ERR_VALUE(ptr)-> "(fault)"
> 
>   -ss

Or:
	if ((unsigned long)ptr < PAGE_SIZE)
		return ptr ? "(null+)" : "(null)";
	if (IS_ERR_VALUE((unsigned long)ptr))
		return "(errno)";

David




RE: [PATCH] vsprintf: Do not break early boot with probing addresses

2019-05-13 Thread David Laight
From: christophe leroy
> Sent: 10 May 2019 18:35
> Le 10/05/2019 à 18:24, Steven Rostedt a écrit :
> > On Fri, 10 May 2019 10:42:13 +0200
> > Petr Mladek  wrote:
> >
> >>   static const char *check_pointer_msg(const void *ptr)
> >>   {
> >> -  char byte;
> >> -
> >>if (!ptr)
> >>return "(null)";
> >>
> >> -  if (probe_kernel_address(ptr, byte))
> >> +  if ((unsigned long)ptr < PAGE_SIZE || IS_ERR_VALUE(ptr))
> >>return "(efault)";

"efault" looks a bit like a spellling mistake for "default".

> > < PAGE_SIZE ?
> >
> > do you mean: < TASK_SIZE ?
> 
> I guess not.
> 
> Usually, < PAGE_SIZE means NULL pointer dereference (via the member of a
> struct)

Maybe the caller should pass in a short buffer so that you can return
"(err-%d)" or "(null+%#x)" ?

David



RE: [PATCH] vsprintf: Do not break early boot with probing addresses

2019-05-09 Thread David Laight
From: Michal Suchánek
> Sent: 09 May 2019 14:38
...
> > The problem is the combination of some new code called via printk(),
> > check_pointer() which calls probe_kernel_read(). That then calls
> > allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
> > (before we've patched features).
> 
> There is early_mmu_has_feature for this case. mmu_has_feature does not
> work before patching so parts of kernel that can run before patching
> must use the early_ variant which actually runs code reading the
> feature bitmap to determine the answer.

Does the early_ variant get patched so that it is reasonably
efficient after the 'patching' is done?
Or should there be a third version which gets patched across?

David




RE: [PATCH RFC v4 09/21] PCI: Mark immovable BARs with PCI_FIXED

2019-03-27 Thread David Laight
From: Bjorn Helgaas
> Sent: 26 March 2019 20:29
> 
> On Mon, Mar 11, 2019 at 04:31:10PM +0300, Sergey Miroshnichenko wrote:
> > If a PCIe device driver doesn't yet have support for movable BARs,
> > mark device's BARs with IORESOURCE_PCI_FIXED.
> 
> I'm hesitant about using IORESOURCE_PCI_FIXED for this purpose.  That
> was originally added to describe resources that can not be changed
> because they're hardwired in the device, e.g., legacy resources and
> Enhanced Allocation resources.
> 
> In general, I think the bits in res->flags should tell us things about
> the hardware.  This particular use would be something about the
> *driver*, and I think we should figure that out by looking at
> dev->driver.

There will also be drivers that don't support BARs being moved,
but may be in a state (ie not actually open) where they can go
through a remove-rescan sequence to allow the BAR to be moved.

This might even be true if the open count is non-zero.

David




RE: [PATCH 0/2] PCI/AER: Consistently use _OSC to determine who owns AER

2018-11-16 Thread David Laight
From: Alexandru Gagniuc
> Sent: 15 November 2018 23:16
...
> I've asked around a few people at Dell and they unanimously agree that
> _OSC is the correct way to determine ownership of AER.

This is all very well, but we have systems (they might be Dell ones)
where failure of a PCIe link (even when all the drivers are removed)
causes an NMI - and crashes the kernel.

There are other systems which have AER registers available for
some of the PCIe devices, but the BIOS ignores them and _OSC
doesn't let the kernel take control.
In this case the AER registers can contain useful information
after a read returns ~0u.
It would be useful to be able to get this information without
having to grovel through all the PCIe config space.

David




RE: powerpc/xmon: Relax frame size for clang

2018-11-02 Thread David Laight
From: Linuxppc-dev 
[mailto:linuxppc-dev-bounces+david.laight=aculab@lists.ozlabs.org] On 
Behalf Of
> Michael Ellerman
> Subject: Re: powerpc/xmon: Relax frame size for clang
> 
> On Wed, 2018-10-31 at 01:09:34 UTC, Joel Stanley wrote:
> > When building with clang (8 trunk, 7.0 release) the frame size limit is
> > hit:
> >
> >  arch/powerpc/xmon/xmon.c:452:12: warning: stack frame size of 2576
> >  bytes in function 'xmon_core' [-Wframe-larger-than=]
> >
> > Some investigation by Naveen indicates this is due to clang saving the
> > addresses to printf format strings on the stack.
> >
> > While this issue is investigated, bump up the frame size limit for xmon
> > when building with clang.
> >
> > Link: https://github.com/ClangBuiltLinux/linux/issues/252
> > Signed-off-by: Joel Stanley 

Would it be better to stop some of the functions being inlined?

Clearly clang shouldn't be using separate on-stack temporaries
for every printf() call.
That is indicative of a bigger problem.

David




RE: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed

2018-11-02 Thread David Laight
From: Paul E. McKenney
> Sent: 01 November 2018 17:02
...
> And there is a push to define C++ signed arithmetic as 2s complement,
> but there are still 1s complement systems with C compilers.  Just not
> C++ compilers.  Legacy...

Hmmm... I've used C compilers for DSPs where signed integer arithmetic
used the 'data registers' and would saturate, unsigned used the 'address
registers' and wrapped.
That was deliberate because it is much better to clip analogue values.

Then there was the annoying cobol run time that didn't update the
result variable if the result wouldn't fit.
Took a while to notice that the sum of a list of values was wrong at all!
That would be perfectly valid for C - if unexpected.

> > But for us using -fno-strict-overflow which actually defines signed
> > overflow

I wonder how much real code 'strict-overflow' gets rid of?
IIRC gcc silently turns loops like:
int i; for (i = 1; i != 0; i *= 2) ...
into infinite ones.
Which is never what is required.

David




RE: [PATCH] net: ethernet: fs-enet: Use generic CRC32 implementation

2018-07-24 Thread David Laight
From: Krzysztof Kozlowski
> Sent: 23 July 2018 17:20
> Use generic kernel CRC32 implementation because it:
> 1. Should be faster (uses lookup tables),

Are you sure?
The lookup tables are unlikely to be in the data cache and
the 6 cache misses kill performance.
(Not that it particularly matters when setting up multicast hash tables).

> 2. Removes duplicated CRC generation code,
> 3. Uses well-proven algorithm instead of coding it one more time.
...
> 
> Not tested on hardware.

Have you verified that the old and new functions give the
same result for a few mac addresses?
It is very easy to use the wrong bits in crc calculations
or generate the output in the wrong bit order.

David




RE: [PATCH] net: ethernet: fs-enet: Use generic CRC32 implementation

2018-07-24 Thread David Laight
From: Krzysztof Kozlowski
> Sent: 24 July 2018 12:12
...
> >> Not tested on hardware.
> >
> > Have you verified that the old and new functions give the
> > same result for a few mac addresses?
> > It is very easy to use the wrong bits in crc calculations
> > or generate the output in the wrong bit order.
> 
> I copied the original code and new one onto a different driver and run
> this in a loop for thousands of data input (although not all possible
> MAC combinations). The output was the same. I agree however that real
> testing would be important.

Since CRCs are linear you only need to check that each input
bit generates the correct output.
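For example (a hedged sketch: old_crc()/new_crc() are placeholders for the two
implementations, and the all-zero input is compared as well because the usual
non-zero CRC32 init value makes the function affine rather than strictly linear):

	static bool crc_impls_agree(void)
	{
		u8 addr[ETH_ALEN] = { 0 };
		int byte, bit;

		if (old_crc(addr, ETH_ALEN) != new_crc(addr, ETH_ALEN))
			return false;

		for (byte = 0; byte < ETH_ALEN; byte++) {
			for (bit = 0; bit < 8; bit++) {
				memset(addr, 0, ETH_ALEN);
				addr[byte] = 1u << bit;		/* single-bit input */
				if (old_crc(addr, ETH_ALEN) !=
				    new_crc(addr, ETH_ALEN))
					return false;
			}
		}
		return true;
	}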

David



RE: [PATCH v2] powerpc/64: Fix build failure with GCC 8.1

2018-05-29 Thread David Laight
From: Christophe LEROY
> Sent: 29 May 2018 10:37
...
>  -   strncpy(new_part->header.name, name, 12);
>  +   memcpy(new_part->header.name, name, strnlen(name,
>  sizeof(new_part->header.name)));
> >>>
> >>>
> >>> The comment for nvram_header.lgnth says:
> >>>
> >>>   /* Terminating null required only for names < 12 chars. */
> >>>
> >>> This will not terminate the string with a zero (the struct is
> >>> allocated with kmalloc).
> >>> So the original code is correct, the new one isn't.
> >>
> >> Right, then I have to first zeroize the destination.
> >
> > Using kzalloc() instead of kmalloc() will do.
> >
> > Still, papering around these warnings seems to obscure things, IMHO.
> > And it increases code size, as you had to add a call to strnlen().
> 
> Right but then, what is the best solution to elimate that warning ?

Time to add the I_really_mean_strncpy() function.

David



RE: RFC on writel and writel_relaxed

2018-03-29 Thread David Laight
From: Jason Gunthorpe
> Sent: 29 March 2018 15:45
...
> > > When talking about ordering between the devices, the relevant question
> > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > from the DEVICE_FOO. 'ordered' means that in this case
> > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > by BAR.
> >
> > Yes, and that isn't the case for arm because the writes can still be
> > buffered.
> 
> The statement is not about buffering, or temporal completion order, or
> the order of acks returning to the CPU. It is about pure transaction
> ordering inside the interconnect.
> 
> Can write BAR -> FOO pass write CPU -> FOO?

Almost certainly.
The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge.
The second cpu write then completes (to a different target).
That target then issues a peer to peer transfer that reaches the shared bridge.
I doubt the order of the transactions is guaranteed when it becomes 
'un-stalled'.

Of course, these are peer to peer transfers, and strange ones at that.
Normally you'd not be doing peer to peer transfers that access 'memory'
the cpu has just written to.

Requiring extra barriers in this case, or different functions for WC accesses
shouldn't really be an issue.

Even requiring a barrier between a write to dma coherent memory and a write
that starts dma isn't really onerous.
Even if it is a nop on all current architectures it is a good comment in the 
code.
It could even have a 'dev' argument.

David



RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 16:13
...
> > I've always wondered exactly what the twi;isync were for - always seemed
> > very heavy handed for most mmio reads.
> > Particularly if you are doing mmio reads from a data fifo.
> 
> If you do that you should use the "s" version of the accessors. Those
> will only do the above trick at the end of the access series. Also a
> FIFO needs special care about endianness anyway, so you should use
> those accessors regardless. (Hint: you never endian swap a FIFO even on
> BE on a LE device, unless something's been wired very badly in HW).

That was actually a 64 bit wide fifo connected to a 16bit wide PIO interface.
Reading the high address 'clocked' the fifo.
So the first 3 reads could happen in any order, but the 4th had to be last.
This is a small ppc and we shovel a lot of data through that fifo.

Whether it needed byteswapping depended completely on how our hardware people
had built the pcb (not made easy by some docs using the ibm bit numbering).
In fact it didn't.

While that driver only had to run on a very specific small ppc, generic drivers
might have similar issues.

I suspect that writel() is always (or should always be):
	barrier_before_writel()
	writel_relaxed()
	barrier_after_writel()
So if a driver needs to do multiple writes (without strong ordering)
it should be able to repeat the writel_relaxed() with only one set
of barriers.
Similarly for readl().
In addition a lesser barrier is probably enough between a readl_relaxed()
and a writel_relaxed() that is conditional on the read value.
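Something like this, for instance (a sketch only - whether wmb() is the right
pair of barriers here is exactly the question being discussed, and the ring
layout is made up):

	static void post_ring_entries(void __iomem *ring, const u32 *vals, int n)
	{
		int i;

		wmb();			/* one 'before' barrier for the batch */
		for (i = 0; i < n; i++)
			writel_relaxed(vals[i], ring + 4 * i);
		wmb();			/* one 'after' barrier instead of n */
	}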

David



RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 28 March 2018 10:56
...
> For example, let's say I have a device with a reset bit and the spec
> says the reset bit needs to be set for at least 10us.
> 
> This is wrong:
> 
>   writel(1, RESET_REG);
>   usleep(10);
>   writel(0, RESET_REG);
> 
> Because of write posting, the first write might arrive to the device
> right before the second one.
> 
> The typical "fix" is to turn that into:
> 
>   writel(1, RESET_REG);
>   readl(RESET_REG); /* Flush posted writes */

Would a writel(1, RESET_REG) here provide enough synchronisation?

>   usleep(10);
>   writel(0, RESET_REG);
> 
> *However* the issue here, at least on power, is that the CPU can issue
> that readl but doesn't necessarily wait for it to complete (ie, the
> data to return), before proceeding to the usleep. Now a usleep contains
> a bunch of loads and stores and is probably fine, but a udelay which
> just loops on the timebase may not be.
> 
> Thus we may still violate the timing requirement.

I've seem that sort of code (with udelay() and no read back) quite often.
How many were in linux I don't know.

For small delays I usually fix it by repeated writes (of the same value)
to the device register. That can guarantee very short intervals.

The only time I've actually seen buffered writes break timing was
between a 286 and an 8259 interrupt controller.
If you wrote to the mask then enabled interrupts the first IACK cycle
could be too close to the write and break the cycle recovery time.
That clobbered most of the interrupt controller registers.
That probably affected every 286 board ever built!
Not sure how much software added the required extra bus cycle.

> What we did inside readl, with the twi;isync sequence (which basically
> means, trap on return value with "trap never" as a condition, followed
> by isync that ensures all excpetion conditions are resolved), is force
> the CPU to "consume" the data from the read before moving on.
> 
> This effectively makes readl fully synchronous (we would probably avoid
> that if we were to implement a readl_relaxed).

I've always wondered exactly what the twi;isync were for - always seemed
very heavy handed for most mmio reads.
Particularly if you are doing mmio reads from a data fifo.

Perhaps there should be a writel_wait() that is allowed to do a read back
for such code paths?

David



RE: RFC on writel and writel_relaxed

2018-03-28 Thread David Laight
From: Will Deacon
> Sent: 28 March 2018 09:54
...
> > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > expand to outX, and outX is not ordered with respect to other types of
> > > memory.
> >
> > Ugh ?
> >
> > My understanding of HW at least is the exact opposite. outX is *more*
> > ordered if anything, than any other accessors. IO space is completely
> > synchronous, non posted and ordered afaik.
> 
> I'm just going by memory-barriers.txt:
> 
> 
>  (*) inX(), outX():
> 
>  [...]
> 
>  They are guaranteed to be fully ordered with respect to each other.
> 
>  They are not guaranteed to be fully ordered with respect to other types 
> of
>  memory and I/O operation.

A long time ago there was a document from Intel that said that inb/outb weren't
necessarily synchronised wrt memory accesses.
(Might be P-pro era).
However no processors actually behaved that way and more recent docs
say that inb/outb are fully ordered.

David



RE: RFC on writel and writel_relaxed

2018-03-27 Thread David Laight
> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong...

Certainly requiring driver writers to use explicit barriers should make
them understand when and why they are needed - and then put in the correct ones.

The problem I've had is that I have a good idea which barriers are needed
but find that readl/writel seem to contain a lot of extra ones.
Maybe they are required in some places, but the extra synchronising
instructions could easily have measurable performance effects on
hot paths.

Drivers are likely to contain sequences like:
read_io
if (...) return
write_mem
...
write_mem
barrier
write_mem
barrier
write_io
for things like ring updates.
Where the 'mem' might actually be in io space.
In such sequences not all the synchronising instructions are needed.
I'm not at all sure it is easy to get the right set.

David






RE: RFC on writel and writel_relaxed

2018-03-26 Thread David Laight
> > This is a super performance critical operation for most drivers and
> > directly impacts network performance.

Perhaps there ought to be writel_nobarrier() (etc) that never contain
any barriers at all.
This might mean that they are always just the memory operation,
but it would make it more obvious what the driver was doing.

The driver would then be explicitly responsible for all the rmb(), wmb()
and mmiowb() (etc).
Performance critical paths could then avoid all the extra barriers.

David




RE: RFC on writel and writel_relaxed

2018-03-22 Thread David Laight
From: Oliver
> Sent: 22 March 2018 05:24
...
> > No less painful was doing a byteswapping write to normal memory.
> 
> What was the problem? The reverse indexed load/store instructions are
> a little awkward to use, but they work...

Finding something that would generate the right instruction without any
barriers.
ISTR writing my own asm pattern.

David



RE: RFC on writel and writel_relaxed

2018-03-21 Thread David Laight
> x86 has compiler barrier inside the relaxed() API so that code does not
> get reordered. ARM64 architecturally guarantees device writes to be observed
> in order.

There are places where you don't even need a compile barrier between
every write.

I had horrid problems getting some ppc code (for a specific embedded SoC)
optimised to have no extra barriers.
I ended up just writing through 'pointer to volatile' and adding an
explicit 'eieio' between the block of writes and status read.

No less painful was doing a byteswapping write to normal memory.

David



RE: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-10 Thread David Laight
From: Matthew Wilcox
> Sent: 09 November 2017 19:44
> 
> On Fri, Nov 10, 2017 at 04:15:26AM +1100, Nicholas Piggin wrote:
> > So these semantics are what we're going with? Anything that does mmap() is
> > guaranteed of getting a 47-bit pointer and it can use the top 17 bits for
> > itself? Is intended to be cross-platform or just x86 and power specific?
> 
> It is x86 and powerpc specific.  The arm64 people have apparently stumbled
> across apps that expect to be able to use bit 48 for their own purposes.
> And their address space is 48 bit by default.  Oops.

(Do you mean 49bit?)

Aren't such apps just doomed to be broken?

ISTR there is something on (IIRC) sparc64 that does a 'match'
on the high address bits to make it much harder to overrun
one area into another.

David



RE: [PATCH] [net-next,v2] ibmvnic: Feature implementation of Vital Product Data (VPD) for the ibmvnic driver

2017-11-06 Thread David Laight
From: David Miller
> Sent: 04 November 2017 13:21
> From: Desnes Augusto Nunes do Rosario 
> Date: Wed,  1 Nov 2017 19:03:32 -0200
> 
> > +   substr = strnstr(adapter->vpd->buff, "RM", adapter->vpd->len);
> > +   if (!substr) {
> > +   dev_info(dev, "No FW level provided by VPD\n");
> > +   complete(>fw_done);
> > +   return;
> > +   }
> > +
> > +   /* get length of firmware level ASCII substring */
> > +   fw_level_len = *(substr + 2);
> > +
> > +   /* copy firmware version string from vpd into adapter */
> > +   ptr = strncpy((char *)adapter->fw_version,
> > + substr + 3, fw_level_len);
> 
> You have to be more careful here, making sure first that
> (substr + 2) < (adapter->vpd->buff + adapter->vpd->len),
> and next that (substr + 2 + fw_level_len) is in range
> as well.

And that the copy isn't longer than the target buffer.

David



RE: [PATCH 1/4] powerpc/tm: Add commandline option to disable hardware transactional memory

2017-10-23 Thread David Laight
From: Michael Neuling
> Sent: 21 October 2017 02:00
> To: David Laight; 'Breno Leitao'; Michael Ellerman
> Cc: stew...@linux.vnet.ibm.com; linuxppc-...@ozlabs.org; cyril...@gmail.com
> Subject: Re: [PATCH 1/4] powerpc/tm: Add commandline option to disable 
> hardware transactional memory
> 
> On Fri, 2017-10-20 at 12:58 +, David Laight wrote:
> > > > This patch adds a simple commandline option so that HTM can be
> > > > disabled at boot time.
> >
> > ISTM that being able to disable it after boot would be more useful.
> > (ie in a startup script)
> 
> I agree bug unfortunately that's impossible.
> 
> If a process is already running in tm suspend, there is no way to stop it 
> other
> than killing the process.  At that point you may as well kexec with a new
> cmdline option

Isn't that unlikely when the first rc scripts are being run?
Setting an early rc script is generally easier than adding a command line
parameter.

I don't know about ppc, but grub on x86 makes it painfully difficult
to have two boot entries that have different command line options.

David



RE: [PATCH 1/4] powerpc/tm: Add commandline option to disable hardware transactional memory

2017-10-20 Thread David Laight
> > This patch adds a simple commandline option so that HTM can be
> > disabled at boot time.

ISTM that being able to disable it after boot would be more useful.
(ie in a startup script)

David



RE: Adjusting further size determinations?

2017-10-18 Thread David Laight
From: SF Markus Elfring
>  Unpleasant consequences are possible in both cases.
> >> How much do you care to reduce the failure probability further?
> >
> > Zero.
> 
> I am interested to improve the software situation a bit more here.

There are probably better places to spend your time!

If you want 'security' for kmalloc() then:

#define KMALLOC_TYPE(type, flags) ((type *)kmalloc(sizeof(type), flags))
#define KMALLOC(ptr, flags) (*(ptr) = KMALLOC_TYPE(typeof(**(ptr)), flags))

and change:
	ptr = kmalloc(sizeof *ptr, flags);
to:
	KMALLOC(&ptr, flags);

But it is all churn for churn's sake.

David



RE: [PATCH] powerpc/lib/sstep: Fix count leading zeros instructions

2017-10-09 Thread David Laight
From: naveen.n@linux.vnet.ibm.com
> Sent: 09 October 2017 15:48
...
> This is about the behavior of the gcc builtin being undefined, rather
> than the actual cpu instruction itself.

I'd have hoped that the gcc builtin just generated the expected cpu instruction.
So it is only undefined because it is very cpu dependent.

More problematic here would be any cpu flag register settings.
eg: x86 would set the 'Z' flag for zero input.

David



RE: [PATCH] powerpc/lib/sstep: Fix count leading zeros instructions

2017-10-09 Thread David Laight
From: Segher Boessenkool
> Sent: 09 October 2017 15:21
> On Mon, Oct 09, 2017 at 01:49:26PM +, David Laight wrote:
> > From: Sandipan Das
> > > Sent: 09 October 2017 12:07
> > > According to the GCC documentation, the behaviour of __builtin_clz()
> > > and __builtin_clzl() is undefined if the value of the input argument
> > > is zero. Without handling this special case, these builtins have been
> > > used for emulating the following instructions:
> > >   * Count Leading Zeros Word (cntlzw[.])
> > >   * Count Leading Zeros Doubleword (cntlzd[.])
> > >
> > > This fixes the emulated behaviour of these instructions by adding an
> > > additional check for this special case.
> >
> > Presumably the result is undefined because the underlying cpu
> > instruction is used - and it's return value is implementation defined.
> 
> It is undefined because the result is undefined, and the compiler
> optimises based on that.  The return value of the builtin is undefined,
> not implementation defined.
> 
> The patch is correct.

But the code you are emulating might be relying on the (un)defined value
the cpu instruction gives for zero input rather than the input width.

Or, put another way, if the return value for a clz instruction with zero
argument is undefined (as it is on x86 - intel and amd may differ) then the
emulation can return any value, since correct code cannot depend on it.
So the conditional is not needed.

David




RE: [PATCH] powerpc/lib/sstep: Fix count leading zeros instructions

2017-10-09 Thread David Laight
From: Sandipan Das
> Sent: 09 October 2017 12:07
> According to the GCC documentation, the behaviour of __builtin_clz()
> and __builtin_clzl() is undefined if the value of the input argument
> is zero. Without handling this special case, these builtins have been
> used for emulating the following instructions:
>   * Count Leading Zeros Word (cntlzw[.])
>   * Count Leading Zeros Doubleword (cntlzd[.])
> 
> This fixes the emulated behaviour of these instructions by adding an
> additional check for this special case.

Presumably the result is undefined because the underlying cpu
instruction is used - and its return value is implementation defined.

Here you are emulating the cpu instruction - so executing one will
give the same answer as if the 'real' one were executed.

Indeed it might be worth an asm statement that definitely executes
the cpu instruction?
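Something like this perhaps (a powerpc-only sketch, gcc inline asm assumed;
not part of the posted patch):

	static inline unsigned long emulated_cntlzw(unsigned long val)
	{
		unsigned long ret;

		/* the real instruction is defined for zero input (it returns 32) */
		asm("cntlzw %0,%1" : "=r" (ret) : "r" (val));
		return ret;
	}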

David

> 
> Signed-off-by: Sandipan Das 
> ---
>  arch/powerpc/lib/sstep.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index 0f7e41bd7e88..ebbc0b92650c 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -1717,11 +1717,19 @@ int analyse_instr(struct instruction_op *op, const 
> struct pt_regs *regs,
>   * Logical instructions
>   */
>   case 26:/* cntlzw */
> - op->val = __builtin_clz((unsigned int) regs->gpr[rd]);
> + val = (unsigned int) regs->gpr[rd];
> + if (val == 0)
> + op->val = 32;
> + else
> + op->val = __builtin_clz(val);
>   goto logical_done;
>  #ifdef __powerpc64__
>   case 58:/* cntlzd */
> - op->val = __builtin_clzl(regs->gpr[rd]);
> + val = regs->gpr[rd];
> + if (val == 0)
> + op->val = 64;
> + else
> + op->val = __builtin_clzl(val);
>   goto logical_done;
>  #endif
>   case 28:/* and */
> --
> 2.13.6



RE: [PATCH v10 09/10] mm: stop zeroing memory during allocation in vmemmap

2017-10-06 Thread David Laight
From: Michal Hocko
> Sent: 06 October 2017 12:47
> On Fri 06-10-17 11:10:14, David Laight wrote:
> > From: Pavel Tatashin
> > > Sent: 05 October 2017 22:11
> > > vmemmap_alloc_block() will no longer zero the block, so zero memory
> > > at its call sites for everything except struct pages.  Struct page memory
> > > is zero'd by struct page initialization.
> >
> > It seems dangerous to change an allocator to stop zeroing memory.
> > It is probably saver to add a new function that doesn't zero
> > the memory and use that is the places where you don't want it
> > to be zeroed.
> 
> Not sure what you mean. memblock_virt_alloc_try_nid_raw is a new
> function which doesn't zero out...

You should probably leave vmemmap_alloc_block() zeroing the memory
so that existing calls don't have to be changed - apart from the
ones you are explicitly optimising.

David


RE: [PATCH v10 09/10] mm: stop zeroing memory during allocation in vmemmap

2017-10-06 Thread David Laight
From: Pavel Tatashin
> Sent: 05 October 2017 22:11
> vmemmap_alloc_block() will no longer zero the block, so zero memory
> at its call sites for everything except struct pages.  Struct page memory
> is zero'd by struct page initialization.

It seems dangerous to change an allocator to stop zeroing memory.
It is probably safer to add a new function that doesn't zero
the memory and use that in the places where you don't want it
to be zeroed.
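i.e. something along these lines (hypothetical names, not the actual
vmemmap/memblock API):

	void *alloc_block_raw(unsigned long size, int node);	/* new: no zeroing */

	void *alloc_block(unsigned long size, int node)		/* existing callers */
	{
		void *p = alloc_block_raw(size, node);

		if (p)
			memset(p, 0, size);
		return p;
	}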

David



RE: [PATCH 04/11] ia64: make dma_cache_sync a no-op

2017-10-04 Thread David Laight
From: Christoph Hellwig
> Sent: 03 October 2017 11:43
>
> ia64 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
> make any sense to do any work in dma_cache_sync given that it must be a
> no-op when dma_alloc_attrs returns coherent memory.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/ia64/include/asm/dma-mapping.h | 5 -
>  1 file changed, 5 deletions(-)
> 
> diff --git a/arch/ia64/include/asm/dma-mapping.h 
> b/arch/ia64/include/asm/dma-mapping.h
> index 3ce5ab4339f3..99dfc1aa9d3c 100644
> --- a/arch/ia64/include/asm/dma-mapping.h
> +++ b/arch/ia64/include/asm/dma-mapping.h
> @@ -48,11 +48,6 @@ static inline void
>  dma_cache_sync (struct device *dev, void *vaddr, size_t size,
>   enum dma_data_direction dir)
>  {
> - /*
> -  * IA-64 is cache-coherent, so this is mostly a no-op.  However, we do 
> need to
> -  * ensure that dma_cache_sync() enforces order, hence the mb().
> -  */
> - mb();
>  }

Are you sure about this one?
It looks as though you are doing a mechanical change for all architectures.
Some of them are probably stranger than you realise.

Even with cache coherent memory any cpu 'store/write buffer' may not
be snooped by dma reads.

Something needs to flush the store buffer between the last cpu write
to the dma buffer and the write (probably a device register) that
tells the device it can read the memory.

My guess from the comment is that dma_cache_synch() is expected to
include that barrier - and it might not be anywhere else.

David



RE: [PATCH 02/11] x86: make dma_cache_sync a no-op

2017-10-03 Thread David Laight
From: Christoph Hellwig
> Sent: 03 October 2017 11:43
> x86 does not implement DMA_ATTR_NON_CONSISTENT allocations, so it doesn't
> make any sense to do any work in dma_cache_sync given that it must be a
> no-op when dma_alloc_attrs returns coherent memory.

I believe it is just about possible to require an explicit
write flush on x86.
ISTR this can happen with something like write combining.

Whether this can actually happen in the kernel is another matter.

David



RE: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision

2017-09-28 Thread David Laight
From: Simon Guo
> Sent: 27 September 2017 19:34
...
> > On X86 all the AVX registers are caller saved, the system call
> > entry could issue the instruction that invalidates them all.
> > Kernel code running in the context of a user process could then
> > use the registers without saving them.
> > It would only need to set a mark to ensure they are invalidated
> > again on return to user (might be cheap enough to do anyway).
> > Dunno about PPC though.
> 
> I am not aware of any ppc instruction which can set a "mark" or provide
> any high granularity flag against single or subgroup of vec regs' validity.
> But ppc experts may want to correct me.

I was just thinking of a software flag.

David



RE: [PATCH v2 2/3] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision

2017-09-27 Thread David Laight
From: Segher Boessenkool
> Sent: 27 September 2017 10:28
...
> You also need nasty code to deal with the start and end of strings, with
> conditional branches and whatnot, which quickly overwhelms the benefit
> of using vector registers at all.  This tradeoff also changes with newer
> ISA versions.

The goal posts keep moving.
For instance with modern intel x86 cpus 'rep movsb' is by far the fastest
way to copy data (from cached memory).

> Things have to become *really* cheap before it will be good to often use
> vector registers in the kernel though.

I've had thoughts about this in the past.
If the vector registers belong to the current process then you might
get away with just saving the ones you want to use.
If they belong to a different process then you also need to tell the
FPU save code where you've saved the registers.
Then the IPI code can recover all the correct values.

On X86 all the AVX registers are caller saved, the system call
entry could issue the instruction that invalidates them all.
Kernel code running in the context of a user process could then
use the registers without saving them.
It would only need to set a mark to ensure they are invalidated
again on return to user (might be cheap enough to do anyway).
Dunno about PPC though.

David



RE: [PATCH] mm: fix RODATA_TEST failure "rodata_test: test data was not read only"

2017-09-25 Thread David Laight
From: Segher Boessenkool
> Sent: 25 September 2017 08:37
> On Sun, Sep 24, 2017 at 12:17:51PM -0700, Kees Cook wrote:
> > On Thu, Sep 21, 2017 at 2:37 AM, Christophe Leroy
> >  wrote:
> > > On powerpc, RODATA_TEST fails with message the following messages:
> > >
> > > [6.199505] Freeing unused kernel memory: 528K
> > > [6.203935] rodata_test: test data was not read only
> > >
> > > This is because GCC allocates it to .data section:
> > >
> > > c0695034 g O .data  0004 rodata_test_data
> >
> > Uuuh... that seems like a compiler bug. It's marked "const" -- it
> > should never end up in .data. I would argue that this has done exactly
> > what it was supposed to do, and shows that something has gone wrong.
> > It should always be const. Adding "static" should just change
> > visibility. (I'm not opposed to the static change, but it seems to
> > paper over a problem with the compiler...)
> 
> The compiler puts this item in .sdata, for 32-bit.  There is no .srodata,
> so if it wants to use a small data section, it must use .sdata .
> 
> Non-external, non-referenced symbols are not put in .sdata, that is the
> difference you see with the "static".
> 
> I don't think there is a bug here.  If you think there is, please open
> a GCC bug.

The .sxxx sections are for 'small' data that can be accessed (typically)
using small offsets from a global register.
This means that all the small data sections must be adjacent in the image.
So you can't really have read-only small data.

My guess is that the linker script is putting .srodata in with .sdata.

David



RE: [PATCH v2 3/3] powerpc:selftest update memcmp_64 selftest for VMX implementation

2017-09-25 Thread David Laight
From: wei.guo.si...@gmail.com
> Sent: 21 September 2017 00:35
> This patch adjust selftest memcmp_64 so that memcmp selftest can be
> compiled successfully.
...
>  #define ITERATIONS 1
> 
> +#define LARGE_SIZE (5 * 1024)
> +#define LARGE_ITERATIONS 1000
...

Measuring performance by doing a lot of iterations isn't ideal
and is pretty pointless.
Cold cache performance can be more useful.
Also you don't really want any dynamic branch prediction logic
tuned to the exact test you keep doing.

David




RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp

2017-09-20 Thread David Laight
From: Simon Guo
> Sent: 20 September 2017 10:57
> On Tue, Sep 19, 2017 at 10:12:50AM +, David Laight wrote:
> > From: wei.guo.si...@gmail.com
> > > Sent: 19 September 2017 11:04
> > > Currently memcmp() in powerpc will fall back to .Lshort (compare per byte
> > > mode) if either src or dst address is not 8 bytes aligned. It can be
> > > opmitized if both addresses are with the same offset with 8 bytes 
> > > boundary.
> > >
> > > memcmp() can align the src/dst address with 8 bytes firstly and then
> > > compare with .Llong mode.
> >
> > Why not mask both addresses with ~7 and mask/shift the read value to ignore
> > the unwanted high (BE) or low (LE) bits.
> >
> > The same can be done at the end of the compare with any final, partial word.
> 
> Yes. That will be better. A prototyping shows ~5% improvement on 32 bytes
> size comparison with v1. I will rework on v2.

Clearly you have to be careful to return the correct +1/-1 on mismatch.

For systems that can do misaligned transfers you can compare the first
word, then compare aligned words and finally the last word.
Rather like a memcpy() function I wrote (for NetBDSD) that copied
the last word first, then a whole number of words aligned at the start.
(Hope no one expected anything special for overlapping copies.)
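
As a hedged little-endian sketch of the 'mask and shift' idea (names are
illustrative, not the patch's code), the head word can be fetched as:

	static unsigned long load_head_word_le(const unsigned char *p)
	{
		unsigned long off = (unsigned long)p & 7;
		unsigned long w = *(const unsigned long *)((unsigned long)p & ~7UL);

		/* discard the unwanted low (LE) bytes before the real start */
		return w >> (off * 8);
	}

Reading the aligned word below the pointer stays within the same page, and
when both buffers share the same offset the shifted values can be compared
directly (getting the +1/-1 ordering right on a mismatch still needs the
care noted above).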

David



RE: [PATCH 0/5] [RFC] printk/ia64/ppc64/parisc64: let's deprecate %pF/%pf printk specifiers

2017-09-20 Thread David Laight
From: Helge Deller
> Sent: 19 September 2017 21:08
...
> > Using 'unsigned long' for any kind of pointer is an accident
> > waiting to happen.
> > It also makes it difficult to typecheck the function calls.
> > Using 'void *' isn't any better.
> > Either a pointer to an undefined struct, or a struct containing
> > a single 'char' member, is likely to be safest.
> 
> David, you might be right in most cases, but in this case I'd prefer
> unsigned long too. I think this will create the least amount of
> typecasts here.

I've not looked at the specifics case...

Another option is using a struct with a single member and
passing it by value.
This could be used for things like user-space pointers or
even errno values.
The only problem is old ABI where even small structures are
always passed by reference.
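
A hedged illustration of what I mean (made-up names, kernel __user
annotation assumed):

	typedef struct { void __user *p; } user_ptr_t;

	static inline user_ptr_t make_user_ptr(void __user *p)
	{
		/* still one word passed by value, but with full type checking */
		return (user_ptr_t){ .p = p };
	}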

David



RE: [PATCH 0/5] [RFC] printk/ia64/ppc64/parisc64: let's deprecate %pF/%pf printk specifiers

2017-09-19 Thread David Laight
From: Sergey Senozhatsky
> Sent: 19 September 2017 03:06
...
> I'll simply convert everything to `unsigned long'. including the
> dereference_function_descriptor() function [I believe there are
> still some casts happening when we pass addr from kernel/module
> dereference functions to dereference_function_descriptor(), or
> when we return `void *' back to symbol resolution code, etc.)
> besides, it seems that everything that uses
> dereference_function_descriptor() wants `unsigned long' anyway:

Using 'unsigned long' for any kind of pointer is an accident
waiting to happen.
It also makes it difficult to typecheck the function calls.
Using 'void *' isn't any better.
Either a pointer to an undefined struct, or a struct containing
a single 'char' member, is likely to be safest.

David



RE: [PATCH v1 1/3] powerpc: Align bytes before fall back to .Lshort in powerpc memcmp

2017-09-19 Thread David Laight
From: wei.guo.si...@gmail.com
> Sent: 19 September 2017 11:04
> Currently memcmp() in powerpc will fall back to .Lshort (compare per byte
> mode) if either src or dst address is not 8 bytes aligned. It can be
> opmitized if both addresses are with the same offset with 8 bytes boundary.
> 
> memcmp() can align the src/dst address with 8 bytes firstly and then
> compare with .Llong mode.

Why not mask both addresses with ~7 and mask/shift the read value to ignore
the unwanted high (BE) or low (LE) bits.

The same can be done at the end of the compare with any final, partial word.

David
 


RE: [PATCH kernel] powerpc/powernv: Update comment about shifting IOV BAR

2017-09-14 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 14 September 2017 04:40
> On Thu, 2017-09-14 at 13:18 +1000, Alexey Kardashevskiy wrote:
> > On 14/09/17 13:07, Benjamin Herrenschmidt wrote:
> > > On Thu, 2017-09-14 at 12:45 +1000, Alexey Kardashevskiy wrote:
> > > > On 31/08/17 13:34, Alexey Kardashevskiy wrote:
> > > > > From: Benjamin Herrenschmidt 
> > > >
> > > > Oops, this was not right :)
> > > >
> > > > Anyway, Ben, please comment. Thanks.
> > >
> > > This is incorrect, we can do hotplug behind switches afaik.
> >
> > Do we have an actual system which allows this?
> 
> Tuleta no ?

You can logically 'hotplug' PCI(e) on any system [1].

The 'problem' is that whatever enumerates the PCI(e) at system
powerup doesn't normally assign extra resources to bridges to allow
for devices that aren't present at boot time.
So you can normally only replace cards with ones that use the same
(or less) resources, or that are not behind any bridges.
This is problematic if you have a docking station connected via
a bridge.

[1] Apart from some annoying x86 Dell servers we have which generate
an NMI when the PCIe link goes down (when we reprogram the fpga).
They also fail to boot if a link doesn't come up...

David



RE: [PATCH 1/2] powerpc/xmon: hdec is now 64bits

2017-08-30 Thread David Laight
From: Balbir Singh
> Sent: 30 August 2017 01:28
> ISA 300 defines hypervisor decrementer to be 64 bits in length.
> This patch extends the print format for all archs to be 64 bits
> 
> Signed-off-by: Balbir Singh 
> ---
>  arch/powerpc/xmon/xmon.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> index 9e68f1d..1b26d53 100644
> --- a/arch/powerpc/xmon/xmon.c
> +++ b/arch/powerpc/xmon/xmon.c
> @@ -1749,7 +1749,7 @@ static void dump_206_sprs(void)
> 
>   printf("sdr1   = %.16lx  hdar  = %.16lx hdsisr = %.8x\n",
>   mfspr(SPRN_SDR1), mfspr(SPRN_HDAR), mfspr(SPRN_HDSISR));
> - printf("hsrr0  = %.16lx hsrr1  = %.16lx hdec = %.8x\n",
> + printf("hsrr0  = %.16lx hsrr1  = %.16lx hdec = %.16lx\n",
>   mfspr(SPRN_HSRR0), mfspr(SPRN_HSRR1), mfspr(SPRN_HDEC));
>   printf("lpcr   = %.16lx  pcr   = %.16lx lpidr = %.8x\n",
>   mfspr(SPRN_LPCR), mfspr(SPRN_PCR), mfspr(SPRN_LPID));

On the face of it the patch doesn't do what the commit message says.
Not only that, it is really silly to print a 32bit value with 8 extra
leading zero digits.

Something more subtle was also wrong:
There were 3 mfspr() calls, 2 printed with %lx and one with %x.
That ought to generate a warning from gcc.

David



RE: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-17 Thread David Laight
From: Alex Williamson
> Sent: 16 August 2017 17:56
...
> Firmware pissing match...  Processors running with 8k or less page size
> fall within the recommendations of the PCI spec for register alignment
> of MMIO regions of the device and this whole problem becomes less of an
> issue.

Actually if qemu is causing the MSI-X table accesses to fault, why doesn't
it just lie to the guest about the physical address of the MSI-X table?
Then mmio access to anything in the same physical page will just work.

It has already been pointed out that you can't actually police the
interrupts that are raised without host hardware support.

Actually, putting other vectors in the MSI-X table is boring, most
drivers will ignore unexpected interrupts.
Much more interesting are physical memory addresses and accessible IO
addresses.
Of course, a lot of boards have PCI master capability and can probably
be persuaded to do writes to specific location anyway.

David



RE: [RFC PATCH v5 0/5] vfio-pci: Add support for mmapping MSI-X table

2017-08-15 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 15 August 2017 02:34
> On Tue, 2017-08-15 at 09:16 +0800, Jike Song wrote:
> > > Taking a step back, though, why does vfio-pci perform this check in the
> > > first place? If a malicious guest already has control of a device, any
> > > kind of interrupt spoofing it could do by fiddling with the MSI-X
> > > message address/data it could simply do with a DMA write anyway, so the
> > > security argument doesn't stand up in general (sure, not all PCIe
> > > devices may be capable of arbitrary DMA, but that seems like more of a
> > > tenuous security-by-obscurity angle to me).
> 
> I tried to make that point for years, thanks for re-iterating it :-)

Indeed, we have an FPGA based PCIe card where the MSI-X table is just a
piece of PCIe accessible memory.
The device driver has to read the MSI-X table and write the address+data
values to other registers which are then used to raise the interrupt.
(Ok, I've written a better interrupt generator so we don't do that
any more.)

David



RE: [v6 11/15] arm64/kasan: explicitly zero kasan shadow memory

2017-08-08 Thread David Laight
From: Pasha Tatashin
> Sent: 08 August 2017 12:49
> Thank you for looking at this change. What you described was in my
> previous iterations of this project.
> 
> See for example here: https://lkml.org/lkml/2017/5/5/369
> 
> I was asked to remove that flag, and only zero memory in place when
> needed. Overall the current approach is better everywhere else in the
> kernel, but it adds a little extra code to kasan initialization.

Perhaps you could #define the function prototype(s?) so that the flags
are not passed unless it is a kasan build?

David



RE: [PATCH v9 14/14] powerpc: rewrite local_t using soft_irq

2017-08-04 Thread David Laight
From: Nicholas Piggin
> Sent: 04 August 2017 10:04
> On Fri, 04 Aug 2017 11:40:43 +1000
> Benjamin Herrenschmidt  wrote:
> 
> > On Fri, 2017-08-04 at 03:50 +1000, Nicholas Piggin wrote:
> > > Hey, so... why are any of these implemented in asm? We should
> > > just do them all in C, right? I looked a bit harder at code gen
> > > and a couple of them are still emitting larx/stcx.
> >
> > As long as we can guarantee that the C compiler won't play games
> > moving stuff around. But yes, I tend to agree.
> 
> 
> I believe so. I mean we already depend on the same pattern for any
> other sequence of local_irq_disable(); c code; local_irq_enable();
> so we'd have other problems if we couldn't.

I'd guess that a "memory" clobber on the irq_disable/enable would be enough.
It could be restricted to the memory area being updated.
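
Purely as an illustration of what I mean (not the actual ppc soft-irq code):

	/* blanket compiler barrier: the "memory" clobber stops the compiler
	 * moving the protected accesses across the enable/disable */
	static inline void soft_irq_barrier(void)
	{
		asm volatile("" : : : "memory");
	}

	/* or restricted to a single object, using an "m" constraint instead */
	#define barrier_on(x)	asm volatile("" : "+m" (x))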

David



RE: [RESEND][PATCH V10 0/3] powernv : Add support for OPAL-OCC command/response interface

2017-08-01 Thread David Laight
From: Shilpasri G Bhat
> Sent: 31 July 2017 08:43
> In P9, OCC (On-Chip-Controller) supports shared memory based
> commad-response interface. Within the shared memory there is an OPAL
  ^ typo
> command buffer and OCC response buffer that can be used to send
> inband commands to OCC. The following commands are supported:
...

David


RE: [PATCH 1/5] Fix packed and aligned attribute warnings.

2017-07-31 Thread David Laight
From: SZ Lin
> Sent: 29 July 2017 08:24
...
> diff --git a/drivers/char/tpm/tpm_ibmvtpm.h b/drivers/char/tpm/tpm_ibmvtpm.h
> index 91dfe766d080..9f708ca3dc84 100644
> --- a/drivers/char/tpm/tpm_ibmvtpm.h
> +++ b/drivers/char/tpm/tpm_ibmvtpm.h
> @@ -25,7 +25,7 @@ struct ibmvtpm_crq {
>   __be16 len;
>   __be32 data;
>   __be64 reserved;
> -} __attribute__((packed, aligned(8)));
> +} __packed __aligned(8);

You can't need __packed and __aligned(8) on that structure.
There are no gaps and you are saying it is always aligned.

So just remove the pointless attributes.

David



RE: [RFC Part1 PATCH v3 07/17] x86/mm: Include SEV for encryption memory attribute changes

2017-07-28 Thread David Laight
From: Borislav Petkov
> Sent: 27 July 2017 15:59
> On Mon, Jul 24, 2017 at 02:07:47PM -0500, Brijesh Singh wrote:
> > From: Tom Lendacky 
> >
> > The current code checks only for sme_active() when determining whether
> > to perform the encryption attribute change.  Include sev_active() in this
> > check so that memory attribute changes can occur under SME and SEV.
> >
> > Signed-off-by: Tom Lendacky 
> > Signed-off-by: Brijesh Singh 
> > ---
> >  arch/x86/mm/pageattr.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> > index dfb7d65..b726b23 100644
> > --- a/arch/x86/mm/pageattr.c
> > +++ b/arch/x86/mm/pageattr.c
> > @@ -1781,8 +1781,8 @@ static int __set_memory_enc_dec(unsigned long addr, 
> > int numpages, bool enc)
> > unsigned long start;
> > int ret;
> >
> > -   /* Nothing to do if the SME is not active */
> > -   if (!sme_active())
> > +   /* Nothing to do if SME and SEV are not active */
> > +   if (!sme_active() && !sev_active())
> 
> This is the second place which does
> 
>   if (!SME && !SEV)
> 
> I wonder if, instead of sprinking those, we should have a
> 
>   if (mem_enc_active())
> 
> or so which unifies all those memory encryption logic tests and makes
> the code more straightforward for readers who don't have to pay
> attention to SME vs SEV ...

If any of the code paths are 'hot' it would make sense to be checking
a single memory location.
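
Something like the following hedged sketch (names made up) is what I have
in mind:

	static unsigned long mem_enc_flags;	/* bit 0 = SME, bit 1 = SEV */

	static inline bool mem_enc_active(void)
	{
		/* hot path: one load and one test instead of two calls */
		return mem_enc_flags != 0;
	}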

David



RE: [RFC Part1 PATCH v3 13/17] x86/io: Unroll string I/O when SEV is active

2017-07-27 Thread David Laight
From: Brijesh Singh
> Sent: 26 July 2017 21:07
...
> I am not sure if I understand your concern.
> 
> Are you commenting on amount of code duplication ? If so, I can certainly 
> improve
> and use the similar macro used into header file to generate the functions 
> body.

If you are careful the real functions could expand the inline functions
that get used when SEV is compiled out.

Oh, if you are looking at this, can you fix memcpy_to_io()
so that it is never 'rep movsb'?

David



RE: [PATCH 3/4] powerpc: pseries: only store the device node basename in full_name

2017-07-26 Thread David Laight
From: Rob Herring
> Sent: 25 July 2017 22:44
> With dependencies on full_name containing the entire device node path
> removed, stop storing the full_name in nodes created by
> dlpar_configure_connector() and pSeries_reconfig_add_node().
...
>   dn = kzalloc(sizeof(*dn), GFP_KERNEL);
>   if (!dn)
>   return NULL;
> 
>   name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
> - dn->full_name = kasprintf(GFP_KERNEL, "%s/%s", path, name);
> + dn->full_name = kasprintf(GFP_KERNEL, "%s", name);

Isn't this just strdup()?
Perhaps you should rename full_name to name - since it is no longer 'full'?

Maybe you could do a single malloc() for both 'dn' and the name?
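
Something along these lines, as a hedged sketch (ignoring how the node is
eventually freed):

	static struct device_node *alloc_node_with_name(const char *name)
	{
		struct device_node *dn;

		/* one allocation covers the node and its name storage */
		dn = kzalloc(sizeof(*dn) + strlen(name) + 1, GFP_KERNEL);
		if (!dn)
			return NULL;

		dn->full_name = (char *)(dn + 1);
		strcpy((char *)(dn + 1), name);
		return dn;
	}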

David



RE: [PATCH v3 2/5] powerpc/lib/sstep: Add popcnt instruction emulation

2017-07-25 Thread David Laight
From: Linuxppc-dev 
[mailto:linuxppc-dev-bounces+david.laight=aculab@lists.ozlabs.org] On 
Behalf Of
> Matt Brown
> Sent: 25 July 2017 04:33
> To: linuxppc-dev@lists.ozlabs.org
> Subject: [PATCH v3 2/5] powerpc/lib/sstep: Add popcnt instruction emulation
> 
> This adds emulations for the popcntb, popcntw, and popcntd instructions.
> Tested for correctness against the popcnt{b,w,d} instructions on ppc64le.
> 
> Signed-off-by: Matt Brown 
> ---
> v3:
>   - optimised using the Giles-Miller method of side-ways addition
> v2:
>   - fixed opcodes
>   - fixed typecasting
>   - fixed bitshifting error for both 32 and 64bit arch
> ---
>  arch/powerpc/lib/sstep.c | 40 +++-
>  1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
> index 87d277f..c1f9cdb 100644
> --- a/arch/powerpc/lib/sstep.c
> +++ b/arch/powerpc/lib/sstep.c
> @@ -612,6 +612,32 @@ static nokprobe_inline void do_cmpb(struct pt_regs 
> *regs, unsigned long v1,
>   regs->gpr[rd] = out_val;
>  }
> 
> +/*
> + * The size parameter is used to adjust the equivalent popcnt instruction.
> + * popcntb = 8, popcntw = 32, popcntd = 64
> + */
> +static nokprobe_inline void do_popcnt(struct pt_regs *regs, unsigned long v1,
> + int size, int ra)
> +{
> + unsigned long long out = v1;
> +
> + out = (0x5555555555555555 & out) + (0x5555555555555555 & (out >> 1));
> + out = (0x3333333333333333 & out) + (0x3333333333333333 & (out >> 2));
> + out = (0x0f0f0f0f0f0f0f0f & out) + (0x0f0f0f0f0f0f0f0f & (out >> 4));
> + if (size == 8) {/* popcntb */
> + regs->gpr[ra] = out;

I'm pretty sure you need to mask the result with 7.

David



RE: [RFC Part1 PATCH v3 13/17] x86/io: Unroll string I/O when SEV is active

2017-07-25 Thread David Laight
From: Brijesh Singh
> Sent: 24 July 2017 20:08
> From: Tom Lendacky 
> 
> Secure Encrypted Virtualization (SEV) does not support string I/O, so
> unroll the string I/O operation into a loop operating on one element at
> a time.
> 
> Signed-off-by: Tom Lendacky 
> Signed-off-by: Brijesh Singh 
> ---
>  arch/x86/include/asm/io.h | 26 ++
>  1 file changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index e080a39..2f3c002 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -327,14 +327,32 @@ static inline unsigned type in##bwl##_p(int port)   
> \
>   \
>  static inline void outs##bwl(int port, const void *addr, unsigned long 
> count) \
>  {

Is it even worth leaving these as inline functions?
Given the speed of IO cycles it is unlikely that the cost of calling a real
function will be significant.
The code bloat reduction will be significant.

David



RE: [PATCH] tty: Fix TIOCGPTPEER ioctl definition

2017-07-11 Thread David Laight
From: Linuxppc-dev  Gleb Fotengauer-Malinovskiy
> Sent: 11 July 2017 01:12
> This ioctl does nothing to justify an _IOC_READ or _IOC_WRITE flag
> because it doesn't copy anything from/to userspace to access the
> argument.
> 
> Fixes: 54ebbfb1 ("tty: add TIOCGPTPEER ioctl")
...
> -#define TIOCGPTPEER  _IOR('T', 0x41, int) /* Safely open the slave */
> +#define TIOCGPTPEER  _IO('T', 0x41) /* Safely open the slave */

This is a user API change. When was the ioctl added?

David



RE: [PATCH v2 08/10] powerpc/opal: Add opal_async_wait_response_interruptible() to opal-async

2017-07-10 Thread David Laight
From: Cyril Bur
> Sent: 10 July 2017 02:31
> This patch adds an _interruptible version of opal_async_wait_response().
> This is useful when a long running OPAL call is performed on behalf of a
> userspace thread, for example, the opal_flash_{read,write,erase}
> functions performed by the powernv-flash MTD driver.
> 
> It is foreseeable that these functions would take upwards of two minutes
> causing the wait_event() to block long enough to cause hung task
> warnings. Furthermore, wait_event_interruptible() is preferable as
> otherwise there is no way for signals to stop the process which is going
> to be confusing in userspace.

ISTM that if you are doing (something like) a flash full device erase
(that really can take minutes) it isn't actually an interruptible
operation - the flash chip will still be busy.
So allowing the user process be interrupted just leaves a big mess.

OTOH the 'hung task' warning isn't the only problem with uninterruptible
sleeps - the processes also count towards the 'load average'.
Some software believes the 'load average' is a meaningful value.

It would be more generally useful for tasks to be able to sleep
uninterruptibly without counting towards the 'load average' or triggering
the 'task stuck' warning.
(I've code somewhere that sleeps interruptibly unless there is a signal
pending when it sleeps uninterruptibly.)
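
Roughly this, as an illustrative sketch (names made up):

	static int wait_for_completion_example(wait_queue_head_t *wq, bool *done)
	{
		/* if a signal is already pending, sleep uninterruptibly so the
		 * wait still completes; otherwise allow signals to interrupt */
		if (signal_pending(current)) {
			wait_event(*wq, *done);
			return 0;
		}
		return wait_event_interruptible(*wq, *done);
	}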

WRT flash erases, 'whole device' erases aren't significantly quicker
than sector by sector erases.
The latter can be interrupted between sectors.
I'm not sure you'd want to do writes than lock down enough kernel
memory to take even a second to complete.

David



RE: [PATCH v5 2/7] powerpc/vmlinux.lds: Align __init_begin to 16M

2017-06-29 Thread David Laight
From: Balbir Singh
> Sent: 28 June 2017 18:04
> For CONFIG_STRICT_KERNEL_RWX align __init_begin to 16M.
> We use 16M since its the larger of 2M on radix and 16M
> on hash for our linear mapping. The plan is to have
> .text, .rodata and everything upto __init_begin marked
> as RX. Note we still have executable read only data.
> We could further align rodata to another 16M boundary.
> I've used keeping text plus rodata as read-only-executable
> as a trade-off to doing read-only-executable for text and
> read-only for rodata.
...

Doesn't this go against 'address space randomisation'?
(Yes I realise a PIC kernel is probably non-trivial to compile
and load.)

David




RE: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer

2017-06-27 Thread David Laight
From: Frederic Barrat
> Sent: 26 June 2017 19:09
> P9 has support for PCI peer-to-peer, enabling a device to write in the
> mmio space of another device directly, without interrupting the CPU.
> 
> This patch adds support for it on powernv, by adding a new API to be
> called by drivers. The pnv_pci_set_p2p(...) call configures an
> 'initiator', i.e the device which will issue the mmio operation, and a
> 'target', i.e. the device on the receiving side.
...

Two questions:

1) How does the driver get the correct address to give to the 'initiator'
   in order to perform an access to the 'target'?

2) Surely the API call the driver makes should be architecture neutral,
   returning an error on other architectures.

At least some x86 cpus also support peer-to-peer writes,
I believe they can work between cpu chips.
PCIe bridges might support them (or be configurable to support them).

David



RE: [PATCH] powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user()

2017-06-26 Thread David Laight
From: Michael Ellerman
> Sent: 26 June 2017 14:34
..
> Al also pointed out that inlining copy_to/from_user() is probably of little or
> no benefit, which is correct
...

I was a bit horrified at the x86-64 versions of copy_to/from_user() as well.
With the code that (tries to) reject kernel pointers that cross stack frame
boundaries, I'm fairly sure they expand to a lot of code.

I also suspect the cost of that kernel address check is not insignificant
especially if it has to walk down several stack frames.
(Never mind what happens to code compiled without stack frames.)

David




RE: [PATCH] soc/qman: Sleep instead of stuck hacking jiffies.

2017-06-26 Thread David Laight
From: Karim Eshapa
> Sent: 25 June 2017 16:14
> Use msleep() instead of stucking with
> long delay will be more efficient.
...
> --- a/drivers/soc/fsl/qbman/qman.c
> +++ b/drivers/soc/fsl/qbman/qman.c
> @@ -1084,11 +1084,7 @@ static int drain_mr_fqrni(struct qm_portal *p)
>* entries well before the ring has been fully consumed, so
>* we're being *really* paranoid here.
>*/
> - u64 now, then = jiffies;
> -
> - do {
> - now = jiffies;
> - } while ((then + 10000) > now);
> + msleep(1);
...
How is that in any way equivalent?
If HZ is 1000 the old code loops for 10 seconds.
If HZ is 250 (common for some distros) it loops for 40 seconds.

Clearly both are horrid, but it isn't at all clear that a 1ms sleep
is performing the same job.
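
If the intent really is a fixed wall-clock wait (10 seconds, say), an
HZ-independent way to write the original loop would be something like:

	unsigned long timeout = jiffies + 10 * HZ;

	while (time_before(jiffies, timeout))
		cpu_relax();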

My guess is that this code is never called, and broken if actually called.

David



RE: [PATCH] powerpc/powernv: Enable PCI peer-to-peer

2017-06-02 Thread David Laight
From: Benjamin Herrenschmidt
> Sent: 31 May 2017 03:26
...
> We might also need a way to provide the "initiator" with translated DMA
> addresses that allow to target the receiver.

Related to that is the ability to allocate memory that two (or more)
PCIe devices can DMA to/from.
That can be useful if peer-to-peer transfers are not available.

David



RE: [PATCH 3/3] powerpc/8xx: xmon compile fix

2017-05-26 Thread David Laight
From:  Michael Ellerman
> Sent: 26 May 2017 08:24
> Nicholas Piggin  writes:
> > diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
> > index f11f65634aab..438fdb0fb142 100644
> > --- a/arch/powerpc/xmon/xmon.c
> > +++ b/arch/powerpc/xmon/xmon.c
> > @@ -1242,14 +1242,16 @@ bpt_cmds(void)
> >  {
> > int cmd;
> > unsigned long a;
> > -   int mode, i;
> > +   int i;
> > struct bpt *bp;
> > -   const char badaddr[] = "Only kernel addresses are permitted "
> > -   "for breakpoints\n";
> >
> > cmd = inchar();
> > switch (cmd) {
> > -#ifndef CONFIG_8xx
> > +#ifndef CONFIG_PPC_8xx
> > +   int mode;
> > +   const char badaddr[] = "Only kernel addresses are permitted "
> > +   "for breakpoints\n";
> > +
> > case 'd':   /* bd - hardware data breakpoint */
> > mode = 7;
> > cmd = inchar();
> 
> GCC 7 rejects this:
> 
>   arch/powerpc/xmon/xmon.c: In function bpt_cmds:
>   arch/powerpc/xmon/xmon.c:1252:13: error: statement will never be executed 
> [-Werror=switch-
> unreachable]
> const char badaddr[] = "Only kernel addresses are permitted for 
> breakpoints\n";
>^~~

Try 'static' ?

David



RE: RFC: better timer interface

2017-05-23 Thread David Laight
From: Thomas Gleixner
> Sent: 23 May 2017 12:59
> On Tue, 23 May 2017, David Laight wrote:
> 
> > From: Thomas Gleixner
> > > Sent: 21 May 2017 19:15
> > ...
> > > > timer_start(timer, ms, abs)
> > >
> > > I'm not even sure, whether we need absolute timer wheel timers at
> > > all, because most use cases are relative to now.
> >
> > Posix requires absolute timers for some userspace calls
> > (annoying because the code often wants relative).
> 
> Posix is completely irrelevant here. These timers are purely kernel
> internal.

Somehow pthread_cond_timedwait() has to be implemented.
Doing so without kernel timers that use absolute 'wall clock' time is tricky.

David



RE: RFC: better timer interface

2017-05-23 Thread David Laight
From: Thomas Gleixner
> Sent: 21 May 2017 19:15
...
> > timer_start(timer, ms, abs)
> 
> I'm not even sure, whether we need absolute timer wheel timers at
> all, because most use cases are relative to now.

Posix requires absolute timers for some userspace calls
(annoying because the code often wants relative).

OTOH how much conditional code is there for the 'abs' argument?
And is there any code that doesn't pass a constant?

Certainly worth a separate timer_start_abs(timer, wall_time)
function since you can't correctly map a wall_time timer
to a jiffies one.

David



RE: [PATCH 3/3] EDAC: mv64x60: replace in_le32/out_le32 with ioread32/iowrite32

2017-05-19 Thread David Laight
From: Arnd Bergmann
> Sent: 17 May 2017 22:40
> 
> On Wed, May 17, 2017 at 11:16 PM, Chris Packham
>  wrote:
> > On 18/05/17 06:18, Borislav Petkov wrote:
> > One thing I would like confirmation on is is in_le32 -> ioread32 the
> > correct change? I tossed up between ioread32 and readl. Looking at
> > mv643xx_eth.c which supports both the MV643XX and Orion it's using readl
> > so perhaps I should be using that.
> 
> There is no easy answer: on powerpc, readl is used for PCI,
> while in_le32 is used for on-chip devices, so in_le32 is the
> right one in principle. The main difference is that readl can
> work with CONFIG_EEH on pseries, but in_le32 is cheaper.
> 
> On ARM and most other architectures, readl is used for both
> PCI and on-chip devices, so that's what portable code tends
> to use.
> 
> ioread32 is required to behave the same way as readl
> on all __iomem pointers returned from ioremap(), but
> is an extern function on powerpc and can be more
> expensive when CONFIG_GENERIC_IOMAP is set.

What about x86?
Isn't ioread32() an extern function that checks for 'io' addresses
that need 'inb' (etc) instructions rather than memory ones?
If we know a PCI slave isn't 'io' should be be using ioread32() or readl()?
Don't some architectures have different enforced barriers in both these?

David





RE: [PATCH 2/9] timers: provide a "modern" variant of timers

2017-05-19 Thread David Laight
From: Christoph Hellwig
> Sent: 16 May 2017 12:48
>
> The new callback gets a pointer to the timer_list itself, which can
> then be used to get the containing structure using container_of
> instead of casting from and to unsigned long all the time.

What about sensible drivers that put some other value in the 'data'
field?

Perhaps it ought to have been 'void *data'.

Seems retrograde to be passing the address of the timer structure
(which, in principle, the callers know nothing about).

So I wouldn't call it 'modern', just different.

David



RE: [PATCH] spin loop primitives for busy waiting

2017-05-12 Thread David Laight
From: Linus Torvalds
> Sent: 11 May 2017 19:48
...
> The one question I have is about "spin_on_cond()": since you
> explicitly document that the "no spinning" case is expected to be the
> default, I really think that the default implementation should be
> along the lines if
> 
>   #define spin_on_cond(cond) do { \
> if (unlikely(!(cond))) { \
> spin_begin(); do spin_cpu_relax(); while (!(cond)); spin_end(); \
> } \
>   } while (0)
> 
> which will actually result in better code generation even if
> spin_begin/end() are no-ops, and just generally optimizes for the
> right behavior (ie do the spinning out-of-line, since by definition it
> cannot be performance-critical after the first iteration).

At least some versions of gcc convert while (cond) {body}
into if (cond) { do {body} while (cond); } even when 'cond'
is a non-trivial expression and 'body' is trivial.
The code-bloat is silly.
No point enforcing the 'optimisation' here.

David


RE: [PATCH v2] powerpc/kprobes: refactor kprobe_lookup_name for safer string operations

2017-05-04 Thread David Laight
From: Paul Clarke
> Sent: 04 May 2017 16:07
...
> > +   if ((c = strnchr(name, MODULE_NAME_LEN, ':')) != NULL) {
> 
> Shouldn't this be MODULE_NAME_LEN + 1, since the ':' can come after a module 
> name of length
> MODULE_NAME_LEN?

No, because MODULE_NAME_LEN includes the terminating '\0'.

David



RE: [PATCH v2] powerpc/kprobes: refactor kprobe_lookup_name for safer string operations

2017-05-04 Thread David Laight
From: Naveen N. Rao [mailto:naveen.n@linux.vnet.ibm.com]
> Sent: 04 May 2017 11:25
> Use safer string manipulation functions when dealing with a
> user-provided string in kprobe_lookup_name().
> 
> Reported-by: David Laight <david.lai...@aculab.com>
> Signed-off-by: Naveen N. Rao <naveen.n@linux.vnet.ibm.com>
> ---
> Changed to ignore return value of 0 from strscpy(), as suggested by
> Masami.

let's see what this code looks like;

>   char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
>   bool dot_appended = false;
> + const char *c;
> + ssize_t ret = 0;
> + int len = 0;
> +
> + if ((c = strnchr(name, MODULE_NAME_LEN, ':')) != NULL) {

I don't like unnecessary assignments in conditionals.

> + c++;
> + len = c - name;
> + memcpy(dot_name, name, len);
> + } else
> + c = name;
> +
> + if (*c != '\0' && *c != '.') {
> + dot_name[len++] = '.';
>   dot_appended = true;

If you don't append a dot, then you can always just lookup
the original string.

>   }
> + ret = strscpy(dot_name + len, c, KSYM_NAME_LEN);
> + if (ret > 0)
> + addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name);

I'm not sure you need 'ret' here at all.

> + /* Fallback to the original non-dot symbol lookup */
> + if (!addr && dot_appended)
>   addr = (kprobe_opcode_t *)kallsyms_lookup_name(name);

We can bikeshed this function to death:

/* The function name must start with a '.'.
 * If it doesn't then we insert one. */
c = strnchr(name, MODULE_NAME_LEN, ':');
if (c && c[1] && c[1] != '.') {
	/* Insert a '.' after the ':' */
	c++;
	len = c - name;
	memcpy(dot_name, name, len);
} else {
	if (name[0] == '.')
		goto check_name;
	/* Insert a '.' before name */
	c = name;
	len = 0;
}

dot_name[len++] = '.';
if (strscpy(dot_name + len, c, KSYM_NAME_LEN) > 0) {
	addr = (kprobe_opcode_t *)kallsyms_lookup_name(dot_name);
	if (addr)
		return addr;
}
/* Symbol with extra '.' not found, fallback to original name */

check_name:
return (kprobe_opcode_t *)kallsyms_lookup_name(name);

David



RE: [PATCH] powerpc/xive: Fix/improve verbose debug output

2017-04-28 Thread David Laight
From: Michael Ellerman
> Sent: 28 April 2017 07:34
...
> Not sure what you mean. You can enable pr_debug()s individually, by
> function, by module, by file, or for the whole kernel.

It is sort of a shame that you can't turn pr_info() off in the same way.

David



RE: [PATCH] kallsyms: optimize kallsyms_lookup_name() for a few cases

2017-04-25 Thread David Laight
From: Naveen N. Rao
> Sent: 25 April 2017 17:18
> 1. Fail early for invalid/zero length symbols.
> 2. Detect names of the form <module:symbol> and skip checking for kernel
> symbols in that case.
> 
> Signed-off-by: Naveen N. Rao 
> ---
> Masami, Michael,
> I have added two very simple checks here, which I felt is good to have,
> rather than the elaborate checks in the previous version. Given the
> change in module code to use strnchr(), the checks are now safe and
> further tests are not probably not that useful.
...
> + if (strnchr(name, MODULE_NAME_LEN, ':'))
> + return module_kallsyms_lookup_name(name);

Should that be MODULE_NAME_LEN - 1 ?

David



RE: [PATCH 8/8] selftests: x86: override clean in lib.mk to fix warnings

2017-04-24 Thread David Laight
From: Linuxppc-dev Michael Ellerman
> Shuah Khan  writes:
> 
> > Add override for lib.mk clean to fix the following warnings from clean
> > target run.
> >
> > Makefile:44: warning: overriding recipe for target 'clean'
> > ../lib.mk:55: warning: ignoring old recipe for target 'clean'
> >
> > Signed-off-by: Shuah Khan 
> > ---
> >  tools/testing/selftests/x86/Makefile | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/tools/testing/selftests/x86/Makefile 
> > b/tools/testing/selftests/x86/Makefile
> > index 38e0a9c..4d27550 100644
> > --- a/tools/testing/selftests/x86/Makefile
> > +++ b/tools/testing/selftests/x86/Makefile
> > @@ -40,8 +40,9 @@ all_32: $(BINARIES_32)
> >
> >  all_64: $(BINARIES_64)
> >
> > -clean:
> > +override define CLEAN
> > $(RM) $(BINARIES_32) $(BINARIES_64)
> > +endef
> 
> Simpler as:
> 
> EXTRA_CLEAN := $(BINARIES_32) $(BINARIES_64)

Actually for builds that insist on crapping all over the source tree I've used:

clean:
rm -rf `cat .cvsignore 2>/dev/null`

David



RE: [PATCH 2/8] selftests: lib.mk: define CLEAN macro to allow Makefiles to override clean

2017-04-24 Thread David Laight
From: Shuah Khan
> Sent: 22 April 2017 00:15
> Define CLEAN macro to allow Makefiles to override common clean target
> in lib.mk. This will help fix the following failures:
> 
> warning: overriding recipe for target 'clean'
> ../lib.mk:55: warning: ignoring old recipe for target 'clean'
> 
> Signed-off-by: Shuah Khan 
> ---
>  tools/testing/selftests/lib.mk | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
> index 775c589..959273c 100644
> --- a/tools/testing/selftests/lib.mk
> +++ b/tools/testing/selftests/lib.mk
> @@ -51,8 +51,12 @@ endef
>  emit_tests:
>   $(EMIT_TESTS)
> 
> -clean:
> +define CLEAN
>   $(RM) -r $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES) 
> $(EXTRA_CLEAN)
> +endef
> +
> +clean:
> + $(CLEAN)

If might be easier to do something like:

ifneq ($(NO_CLEAN),y)
clean:
	$(RM) -r $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES) $(EXTRA_CLEAN)
endif

David



RE: [PATCH v3 4/7] powerpc: kprobes: use safer string functions in kprobe_lookup_name()

2017-04-21 Thread David Laight
From: Naveen N. Rao
> Sent: 19 April 2017 13:51
...
>   dot_name[0] = '\0';
> - strncat(dot_name, name, sizeof(dot_name) - 1);
> + strlcat(dot_name, name, sizeof(dot_name));
...

Is that really zeroing the first byte just so it can append to it?

David



RE: [PATCH] powerpc/32: Move entry_32 functions just after HEAD functions.

2017-04-19 Thread David Laight
From: Christophe Leroy
> By default, PPC8xx PINs an ITLB on the first 8M of memory in order
> to avoid any ITLB miss on kernel code.
> However, with some debug functions like DEBUG_PAGEALLOC and
> (soon to come) DEBUG_RODATA, the PINned TLB is invalidated soon
> after startup so ITLB missed start to happen also on the kernel code.
> 
> In order to avoid any ITLB miss in a critical section, we have to
> ensure that their is no page boundary crossed between the setup of
> a new value in SRR0/SRR1 and the associated RFI. This cannot be done
> easily if entry_32 functions sits in the middle of other .text
> functions. By placing entry_32 just after the .head section (as already
> done for entry_64 on PPC64), we can more easily ensure the issue
> doesn't happen.

Shouldn't this be done by putting all the functions that 'matter'
into a named section instead of relying on the order of the input files?
(Which is what I think this is doing.)
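
Something like this purely illustrative sketch (the section name is made
up), with the linker script then placing that section immediately after
the head text:

	#define __pinned_text	__attribute__((__section__(".text.pinned")))

	/* functions that must stay inside the pinned ITLB region get tagged */
	void __pinned_text critical_exception_return(void)
	{
		/* ... */
	}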

David



RE: [PATCH v2 1/5] kprobes: convert kprobe_lookup_name() to a function

2017-04-19 Thread David Laight
From: Naveen N. Rao
> Sent: 19 April 2017 09:09
> To: David Laight; Michael Ellerman
> Cc: linux-ker...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; Masami 
> Hiramatsu; Ingo Molnar
> Subject: RE: [PATCH v2 1/5] kprobes: convert kprobe_lookup_name() to a 
> function
> 
> Excerpts from David Laight's message of April 18, 2017 18:22:
> > From: Naveen N. Rao
> >> Sent: 12 April 2017 11:58
> > ...
> >> +kprobe_opcode_t *kprobe_lookup_name(const char *name)
> >> +{
> > ...
> >> +  char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
> >> +  const char *modsym;
> >> +  bool dot_appended = false;
> >> +  if ((modsym = strchr(name, ':')) != NULL) {
> >> +  modsym++;
> >> +  if (*modsym != '\0' && *modsym != '.') {
> >> +  /* Convert to <module:.symbol> */
> >> +  strncpy(dot_name, name, modsym - name);
> >> +  dot_name[modsym - name] = '.';
> >> +  dot_name[modsym - name + 1] = '\0';
> >> +  strncat(dot_name, modsym,
> >> +  sizeof(dot_name) - (modsym - name) - 2);
> >> +  dot_appended = true;
> >
> > If the ':' is 'a way down' name[] then although the strncpy() won't
> > overrun dot_name[] the rest of the code can.
> 
> Nice catch, thanks David!
> We need to be validating the length of 'name'. I'll put out a patch for
> that.

Silent truncation is almost certainly wrong here.

> As an aside, I'm not sure I follow what you mean when you say that the
> strncpy() won't overrun dot_name[]. If we have a name[] longer than
> sizeof(dot_name) with the ':' after that, the strncpy() can also overrun
> dot_name[].

Yes, that should just be a memcpy(), as should the strncat().

Using strncpy() where the length is other than the size of the target buffer
should be banned. Not that it ever does what people expect.
strncat() is even worse.

David






RE: [PATCH v2 1/5] kprobes: convert kprobe_lookup_name() to a function

2017-04-18 Thread David Laight
From: Naveen N. Rao
> Sent: 12 April 2017 11:58
...
> +kprobe_opcode_t *kprobe_lookup_name(const char *name)
> +{
...
> + char dot_name[MODULE_NAME_LEN + 1 + KSYM_NAME_LEN];
> + const char *modsym;
> + bool dot_appended = false;
> + if ((modsym = strchr(name, ':')) != NULL) {
> + modsym++;
> + if (*modsym != '\0' && *modsym != '.') {
> + /* Convert to <module:.symbol> */
> + strncpy(dot_name, name, modsym - name);
> + dot_name[modsym - name] = '.';
> + dot_name[modsym - name + 1] = '\0';
> + strncat(dot_name, modsym,
> + sizeof(dot_name) - (modsym - name) - 2);
> + dot_appended = true;

If the ':' is 'a way down' name[] then although the strncpy() won't
overrun dot_name[] the rest of the code can.

The strncat() call is particularly borked.

David



RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread David Laight
From: Daniel Axtens
> Sent: 15 March 2017 22:30
> Hi David,
> 
> > While not part of this change, the unrolled loops look as though
> > they just destroy the cpu cache.
> > I'd like to be convinced that anything does CRC over long enough buffers
> > to make it a gain at all.
> >
> > With modern (not that modern now) superscalar cpus you can often
> > get the loop instructions 'for free'.
> > Sometimes pipelining the loop is needed to get full throughput.
> > Unlike the IP checksum, you don't even have to 'loop carry' the
> > cpu carry flag.
> 
> Internal testing on a NVMe device with T10DIF enabled on 4k blocks
> shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic
> uses over 60% of CPU time - with these patches CRC drops to single
> digits.
> 
> I should probably have lead with that, sorry.

I'm not doubting that using the cpu instruction for crcs gives a
massive performance boost.

Just that the heavily unrolled loop is unlikely to help overall.
Some 'cold cache' tests on shorter buffers might be illuminating.
 
> FWIW, the original patch showed a 3.7x gain on btrfs as well -
> 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")
> 
> When Anton wrote the original code he had access to IBM's internal
> tooling for looking at how instructions flow through the various stages
> of the CPU, so I trust it's pretty much optimal from that point of view.

Doesn't mean he used it :-)

David




RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread David Laight
From: Linuxppc-dev Daniel Axtens
> Sent: 15 March 2017 12:38
> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
> 
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
...

While not part of this change, the unrolled loops look as though
they just destroy the cpu cache.
I'd like to be convinced that anything does CRC over long enough buffers
to make it a gain at all.

With modern (not that modern now) superscalar cpus you can often
get the loop instructions 'for free'.
Sometimes pipelining the loop is needed to get full throughput.
Unlike the IP checksum, you don't even have to 'loop carry' the
cpu carry flag.

David



RE: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'

2017-03-14 Thread David Laight
From: Linuxppc-dev Till Smejkal
> Sent: 13 March 2017 22:14
> The only way until now to create a new memory map was via the exported
> function 'mm_alloc'. Unfortunately, this function not only allocates a new
> memory map, but also completely initializes it. However, with the
> introduction of first class virtual address spaces, some initialization
> steps done in 'mm_alloc' are not applicable to the memory maps needed for
> this feature and hence would lead to errors in the kernel code.
> 
> Instead of introducing a new function that can allocate and initialize
> memory maps for first class virtual address spaces and potentially
> duplicate some code, I decided to split the mm_alloc function as well as
> the 'mm_init' function that it uses.
> 
> Now there are four functions exported instead of only one. The new
> 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one
> want to have the old behavior of mm_alloc one can use the newly introduced
> function 'mm_alloc_and_setup' which not only allocates a new mm_struct but
> also fully initializes it.
...

That looks like bugs waiting to happen.
You need unchanged code to fail to compile.

David




RE: [PATCH] powerpc: Avoid panic during boot due to divide by zero in init_cache_info()

2017-03-06 Thread David Laight
From: Segher Boessenkool
> Sent: 06 March 2017 14:18
> On Mon, Mar 06, 2017 at 01:03:19PM +0100, Gabriel Paubert wrote:
> > > > > The PowerPC divw etc. instructions do not trap by themselves, but 
> > > > > recent
> > > > > GCC inserts trap instructions on code paths that are always undefined
> > > > > behaviour (like, dividing by zero).
> > > >
> > > > Is it systematic or does it depend from, e.g., optimization levels?
> > >
> > > In this case it needs -fisolate-erroneous-paths-dereference which is
> > > default at -O2 and higher.
> >
> > Great, another optimization-dependent behaviour. :-(
> 
> It makes the "behaviour" for undefined behaviour *less* surprising.
> It does not change anything else: malformed programs stay malformed,
> correct programs do exactly what they did before, too.

Yep, 'undefined behaviour' is exactly that.
It doesn't mean 'undefined result', or 'maybe a signal'.
Wiping the disk and targeting the user with an ICBM are both valid.

David



RE: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-06 Thread David Laight
From: Peter Zijlstra
> Sent: 06 March 2017 11:22
> To: Madhavan Srinivasan
> Cc: Wang Nan; Alexander Shishkin; linux-ker...@vger.kernel.org; Arnaldo 
> Carvalho de Melo; Alexei
> Starovoitov; Ingo Molnar; Stephane Eranian; Sukadev Bhattiprolu; 
> linuxppc-dev@lists.ozlabs.org
> Subject: Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of 
> perf_mem_data_src
> 
> On Mon, Mar 06, 2017 at 04:13:08PM +0530, Madhavan Srinivasan wrote:
> > From: Sukadev Bhattiprolu 
> >
> > perf_mem_data_src is an union that is initialized via the ->val field
> > and accessed via the bitmap fields. For this to work on big endian
> > platforms, we also need a big-endian represenation of perf_mem_data_src.
> 
> Doesn't this break interpreting the data on a different endian machine?

Best to avoid bitfields if you ever care about the bit order.

David



RE: [PATCH v2 0/2] Allow configurable stack size (especially 32k on PPC64)

2017-02-24 Thread David Laight
From: Hamish Martin
> Sent: 24 February 2017 00:52
> This patch series adds the ability to configure the THREAD_SHIFT value and
> thereby alter the stack size on powerpc systems. We are particularly 
> interested
> in configuring for a 32k stack on PPC64.
> 
> Using an NXP T2081 (e6500 PPC64 cores) we are observing stack overflows as a
> result of applying a DTS overlay containing some I2C devices. Our scenario is
> an ethernet switch chassis with plug-in cards. The I2C is driven from the 
> T2081
> through a PCA9548 mux on the main board. When we detect insertion of the 
> plugin
> card we schedule work for a call to of_overlay_create() to install a DTS
> overlay for the plugin board. This DTS overlay contains a further PCA9548 mux
> with more devices hanging off it including a PCA9539 GPIO expander. The
> ultimate installed I2C tree is:
> 
> T2081 --- PCA9548 MUX --- PCA9548 MUX --- PCA9539 GPIO Expander
> 
> When we install the overlay the devices described in the overlay are probed 
> and
> we see a large number of stack frames used as a result. If this is coupled 
> with
> an interrupt happening that requires moderate to high stack use we observe
> stack corruption. Here is an example long stack (from a 4.10-rc8 kernel) that
> does not show corruption but does demonstrate the length and frame sizes
> involved.
...

ISTM that the device probe needs to be iterative rather than recursive so
that deeply nested buses don't require deep stacks.

Switching stacks on interrupt entry would also make it much less likely that
you'll get an unexpected stack corruption.

The problem with just doubling the stack size is that code will just eat
it all up and, in a few years, something will hit the limit again.

David




RE: [RFC][PATCH] powerpc/64s: optimise syscall entry with relon hypercalls

2017-02-13 Thread David Laight
From: Nicholas Piggin
> Sent: 10 February 2017 18:23
> After bc3551257a ("powerpc/64: Allow for relocation-on interrupts from
> guest to host"), a getppid() system call goes from 307 cycles to 358
> cycles (+17%). This is due significantly to the scratch SPR used by the
> hypercall.
> 
> It turns out there are a some volatile registers common to both system
> call and hypercall (in particular, r12, cr0, ctr), which can be used to
> avoid the SPR and some other overheads for the system call case. This
> brings getppid to 320 cycles (+4%).
...
> + * syscall register convention is in Documentation/powerpc/syscall64-abi.txt
> + *
> + * For hypercalls, the register convention is as follows:
> + * r0 volatile
> + * r1-2 nonvolatile
> + * r3 volatile parameter and return value for status
> + * r4-r10 volatile input and output value
> + * r11 volatile hypercall number and output value
> + * r12 volatile
> + * r13-r31 nonvolatile
> + * LR nonvolatile
> + * CTR volatile
> + * XER volatile
> + * CR0-1 CR5-7 volatile
> + * CR2-4 nonvolatile
> + * Other registers nonvolatile
> + *
> + * The intersection of volatile registers that don't contain possible
> + * inputs is: r12, cr0, xer, ctr. We may use these as scratch regs
> + * upon entry without saving.

Except that they must surely be set to some known value on exit in order
to avoid leaking information to the guest.

David



RE: [PATCH] block: sed-opal: reduce stack size of ioctl handler

2017-02-09 Thread David Laight
From: Arnd Bergmann
> Sent: 08 February 2017 21:15
>
> When CONFIG_KASAN is in use, the sed_ioctl function uses unusually large 
> stack,
> as each possible ioctl argument gets its own stack area plus redzone:

Why not do a single copy_from_user() at the top of sed_ioctl() based on
the _IOC_DIR() and __IOC_SIZE() values?

Something like:
int sed_ioctl(..., unsigned int cmd, void __user *arg)
{
	u64 buf[??]; /* or a union */
	unsigned int cmd_sz = _IOC_SIZE(cmd);

	if ((_IOC_DIR(cmd) & (_IOC_WRITE | _IOC_READ)) && cmd_sz > sizeof buf)
		return -EINVAL;

	if (_IOC_DIR(cmd) & _IOC_WRITE) {
		if (copy_from_user(buf, arg, cmd_sz))
			return -EFAULT;
	} else {
		if (_IOC_DIR(cmd) & _IOC_READ)
			memset(buf, 0, cmd_sz);
	}

	switch (cmd) {
	...
		rval = ...
	...
	}

	if (rval >= 0 && (_IOC_DIR(cmd) & _IOC_READ)
	    && copy_to_user(arg, buf, cmd_sz))
		return -EFAULT;

	return rval;
}

David



RE: ibmvtpm byteswapping inconsistency

2017-02-02 Thread David Laight
From: Michal Suchánek
> Sent: 02 February 2017 11:30
...
> The word is marked correctly as __be64 in that patch because count and
> handle are swapped to BE when saved to it and the whole word is then
> swapped again when loaded. If you just load ((u64)IBMVTPM_VALID_CMD <<
> 56 | ((u64)VTPM_TPM_COMMAND << 48) | ((u64)count << 32) |
> ibmvtpm->rtce_dma_handle in a register it works equally well
> without any __be and swaps involved.

And that version will almost certainly generate much better code.

David



RE: ibmvtpm byteswapping inconsistency

2017-01-30 Thread David Laight
From: Tyrel Datwyler
> Sent: 27 January 2017 18:03
> On 01/26/2017 05:50 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2017-01-26 at 17:42 -0800, Tyrel Datwyler wrote:
> >> On 01/26/2017 12:22 PM, Michal Suchnek wrote:
> >>> Hello,
> >>>
> >>> building ibmvtpm I noticed gcc warning complaining that second word
> >>> of
> >>> struct ibmvtpm_crq in tpm_ibmvtpm_suspend is uninitialized.
> >>>
> >>> The structure is defined as
> >>>
> >>> struct ibmvtpm_crq {
> >>> u8 valid;
> >>> u8 msg;
> >>> __be16 len;
> >>> __be32 data;
> >>> __be64 reserved;
> >>> } __attribute__((packed, aligned(8)));
> >>>
> >>> initialized as
> >>>
> >>> struct ibmvtpm_crq crq;
> >>> u64 *buf = (u64 *) &crq;
> >>> ...
> >>> crq.valid = (u8)IBMVTPM_VALID_CMD;
> >>> crq.msg = (u8)VTPM_PREPARE_TO_SUSPEND;
> >>>
> >>> and submitted with
> >>>
> >>> rc = ibmvtpm_send_crq(ibmvtpm->vdev, cpu_to_be64(buf[0]),
> >>>   cpu_to_be64(buf[1]));

Isn't the real fubar here the use of that memory layout structure at all?
It would probably all be better if the call looked like:
rc = ibmvtpm_send_crq(ibmvtpm->vdev, MAKE_REQ(IBMVTPM_VALID_CMD,
VTPM_PREPARE_TO_SUSPEND, xxx_len, xxx_data), 0);
and MAKE_REQ() did all the required endian-independent shifts to generate
the correct 64bit value.
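
As a hedged sketch based on the shifts quoted above, MAKE_REQ() could be
no more than:

	/* explicit shifts build the same 64bit value on either endianness,
	 * so no __be types or swaps are needed */
	#define MAKE_REQ(valid, msg, len, data)				\
		(((u64)(valid) << 56) | ((u64)(msg) << 48) |		\
		 ((u64)(len) << 32) | (u64)(data))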

David



RE: ibmvtpm byteswapping inconsistency

2017-01-27 Thread David Laight
From: Michal Suchánek
> building ibmvtpm I noticed gcc warning complaining that second word of
> struct ibmvtpm_crq in tpm_ibmvtpm_suspend is uninitialized.
> 
> The structure is defined as
> 
> struct ibmvtpm_crq {
> u8 valid;
> u8 msg;
> __be16 len;
> __be32 data;
> __be64 reserved;
> } __attribute__((packed, aligned(8)));
> 
> initialized as
> 
> struct ibmvtpm_crq crq;
> u64 *buf = (u64 *) &crq;
...

Hrummfff
What is that attribute for, seems pretty confusing and pointless to me.

I also suspect that if you want to access it as two 64bit words it
ought to be a union.

David



RE: gcc trunk fails to build kernel on PowerPC64 due to oprofile warnings

2017-01-26 Thread David Laight
From: Anton Blanchard
> Sent: 25 January 2017 23:01
> gcc trunk has failed to build PowerPC64 kernels for a month or so. The issue
> is in oprofile, which is common code but ends up being sucked into
> arch/powerpc and therefore subject to the -Werror applied to arch/powerpc:
...
> linux/arch/powerpc/oprofile/../../../drivers/oprofile/oprofile_stats.c:55:25: 
> error: %d directive
> output may be truncated writing between 1 and 11 bytes into a region of size 
> 7 [-Werror=format-
> truncation=]
>snprintf(buf, 10, "cpu%d", i);

FFS these warnings are getting OTT.
The compiler needs to be able to track the domain of integers before applying
some of these warnings.

David


