Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-27 Thread Sean Christopherson
On Fri, Aug 27, 2021, Mathieu Desnoyers wrote:
> > So there are effectively three reasons we want a delay:
> > 
> >  1. To allow sched_setaffinity() to coincide with ioctl(KVM_RUN) before KVM
> > can enter the guest so that the guest doesn't need an arch-specific
> > VM-Exit source.
> > 
> >  2. To let ioctl(KVM_RUN) make its way back to the test before the next
> > round of migration.
> > 
> >  3. To ensure the read-side can make forward progress, e.g. if
> > sched_getcpu() involves a syscall.
> > 
> > 
> > After looking at KVM for arm64 and s390, #1 is a bit tenuous because x86 is
> > the only arch that currently uses xfer_to_guest_mode_work(), i.e. the test
> > could be tweaked to be overtly x86-specific.  But since a delay is needed
> > for #2 and #3, I'd prefer to rely on it for #1 as well in the hopes that
> > this test provides coverage for arm64 and/or s390 if they're ever converted
> > to use the common xfer_to_guest_mode_work().
> 
> Now that we have this understanding of why we need the delay, it would be
> good to write this down in a comment within the test.

Ya, I'll get a new version out next week.

> Does it reproduce if we randomize the delay to have it picked randomly from
> 0us to 100us (with 1us step)? It would remove a lot of the need for
> arch-specific magic delay values.

My less-than-scientific testing shows that it can reproduce at delays up to
~500us, but above ~10us the reproducibility starts to drop.  The bug still
reproduces reliably, it just takes more iterations, and obviously the test
runs a bit slower.

Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)?
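
For reference, a sketch of how that delay would slot into the migration
thread's loop, based on the snippets quoted elsewhere in this thread (names
such as seq_cnt and NR_TASK_MIGRATIONS are assumed, not final):

	for (i = 0; i < NR_TASK_MIGRATIONS; i++) {
		cpu = i % 2;	/* illustrative: bounce between two CPUs */
		CPU_SET(cpu, &allowed_mask);

		atomic_inc(&seq_cnt);	/* odd => readers must retry */
		r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
		TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
			    errno, strerror(errno));
		atomic_inc(&seq_cnt);	/* even => readers may complete */

		CPU_CLR(cpu, &allowed_mask);

		/* The 1-10us delay discussed above. */
		usleep((i % 10) + 1);
	}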


Re: [PATCH] net: spider_net: switch from 'pci_' to 'dma_' API

2021-08-27 Thread Geoff Levand
Hi Christophe,

On 8/27/21 12:56 PM, Christophe JAILLET wrote:
> It has *not* been compile tested because I don't have the needed
> configuration or cross-compiler.

The powerpc ppc64_defconfig has CONFIG_SPIDER_NET set. My
tdd-builder Docker image has the needed gcc-powerpc-linux-gnu
cross compiler to build ppc64_defconfig:

  https://hub.docker.com/r/glevand/tdd-builder

-Geoff
 


Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-27 Thread Mathieu Desnoyers
- On Aug 27, 2021, at 7:23 PM, Sean Christopherson sea...@google.com wrote:

> On Fri, Aug 27, 2021, Mathieu Desnoyers wrote:
[...]
>> Does it reproduce if we randomize the delay to have it picked randomly from
>> 0us to 100us (with 1us step)? It would remove a lot of the need for
>> arch-specific magic delay values.
> 
> My less-than-scientific testing shows that it can reproduce at delays up to
> ~500us, but above ~10us the reproducibility starts to drop.  The bug still
> reproduces reliably, it just takes more iterations, and obviously the test
> runs a bit slower.
> 
> Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)?

Works for me, thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[PATCH] net: spider_net: switch from 'pci_' to 'dma_' API

2021-08-27 Thread Christophe JAILLET
In [1], Christoph Hellwig has proposed to remove the wrappers in
include/linux/pci-dma-compat.h.

Some reasons why this API should be removed have been given by Julia
Lawall in [2].

A coccinelle script has been used to perform the needed transformation.
Only relevant parts are given below.

@@ @@
-PCI_DMA_BIDIRECTIONAL
+DMA_BIDIRECTIONAL

@@ @@
-PCI_DMA_TODEVICE
+DMA_TO_DEVICE

@@ @@
-PCI_DMA_FROMDEVICE
+DMA_FROM_DEVICE

@@
expression e1, e2, e3, e4;
@@
-pci_map_single(e1, e2, e3, e4)
+dma_map_single(&e1->dev, e2, e3, e4)

@@
expression e1, e2, e3, e4;
@@
-pci_unmap_single(e1, e2, e3, e4)
+dma_unmap_single(&e1->dev, e2, e3, e4)

@@
expression e1, e2;
@@
-pci_dma_mapping_error(e1, e2)
+dma_mapping_error(&e1->dev, e2)


[1]: 
https://lore.kernel.org/kernel-janitors/20200421081257.ga131...@infradead.org/
[2]: 
https://lore.kernel.org/kernel-janitors/alpine.DEB.2.22.394.2007120902170.2424@hadrien/

Signed-off-by: Christophe JAILLET 
---
It has *not* been compile tested because I don't have the needed
configuration or cross-compiler. However, the modification is completely
mechanical and done by coccinelle.
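
Concretely, the transformation turns a mapping call such as:

	buf = pci_map_single(card->pdev, skb->data, skb->len, PCI_DMA_TODEVICE);

into:

	buf = dma_map_single(&card->pdev->dev, skb->data, skb->len, DMA_TO_DEVICE);

as can be seen in the hunks below.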
---
 drivers/net/ethernet/toshiba/spider_net.c | 27 +--
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/toshiba/spider_net.c b/drivers/net/ethernet/toshiba/spider_net.c
index 087f0af56c50..66d4e024d11e 100644
--- a/drivers/net/ethernet/toshiba/spider_net.c
+++ b/drivers/net/ethernet/toshiba/spider_net.c
@@ -354,9 +354,10 @@ spider_net_free_rx_chain_contents(struct spider_net_card *card)
descr = card->rx_chain.head;
do {
if (descr->skb) {
-   pci_unmap_single(card->pdev, descr->hwdescr->buf_addr,
+   dma_unmap_single(&card->pdev->dev,
+descr->hwdescr->buf_addr,
 SPIDER_NET_MAX_FRAME,
-PCI_DMA_BIDIRECTIONAL);
+DMA_BIDIRECTIONAL);
dev_kfree_skb(descr->skb);
descr->skb = NULL;
}
@@ -411,9 +412,9 @@ spider_net_prepare_rx_descr(struct spider_net_card *card,
if (offset)
skb_reserve(descr->skb, SPIDER_NET_RXBUF_ALIGN - offset);
/* iommu-map the skb */
-   buf = pci_map_single(card->pdev, descr->skb->data,
-   SPIDER_NET_MAX_FRAME, PCI_DMA_FROMDEVICE);
-   if (pci_dma_mapping_error(card->pdev, buf)) {
+   buf = dma_map_single(&card->pdev->dev, descr->skb->data,
+SPIDER_NET_MAX_FRAME, DMA_FROM_DEVICE);
+   if (dma_mapping_error(&card->pdev->dev, buf)) {
dev_kfree_skb_any(descr->skb);
descr->skb = NULL;
if (netif_msg_rx_err(card) && net_ratelimit())
@@ -653,8 +654,9 @@ spider_net_prepare_tx_descr(struct spider_net_card *card,
dma_addr_t buf;
unsigned long flags;
 
-   buf = pci_map_single(card->pdev, skb->data, skb->len, PCI_DMA_TODEVICE);
-   if (pci_dma_mapping_error(card->pdev, buf)) {
+   buf = dma_map_single(&card->pdev->dev, skb->data, skb->len,
+DMA_TO_DEVICE);
+   if (dma_mapping_error(&card->pdev->dev, buf)) {
if (netif_msg_tx_err(card) && net_ratelimit())
dev_err(&card->netdev->dev, "could not iommu-map packet (%p, %i). "
  "Dropping packet\n", skb->data, skb->len);
@@ -666,7 +668,8 @@ spider_net_prepare_tx_descr(struct spider_net_card *card,
descr = card->tx_chain.head;
if (descr->next == chain->tail->prev) {
spin_unlock_irqrestore(&chain->lock, flags);
-   pci_unmap_single(card->pdev, buf, skb->len, PCI_DMA_TODEVICE);
+   dma_unmap_single(&card->pdev->dev, buf, skb->len,
+DMA_TO_DEVICE);
return -ENOMEM;
}
hwdescr = descr->hwdescr;
@@ -822,8 +825,8 @@ spider_net_release_tx_chain(struct spider_net_card *card, int brutal)
 
/* unmap the skb */
if (skb) {
-   pci_unmap_single(card->pdev, buf_addr, skb->len,
-   PCI_DMA_TODEVICE);
+   dma_unmap_single(&card->pdev->dev, buf_addr, skb->len,
+DMA_TO_DEVICE);
dev_consume_skb_any(skb);
}
}
@@ -1165,8 +1168,8 @@ spider_net_decode_one_descr(struct spider_net_card *card)
/* unmap descriptor */
hw_buf_addr = hwdescr->buf_addr;
hwdescr->buf_addr = 0x;
-   pci_unmap_single(card->pdev, hw_buf_addr,
-   SPIDER_NET_MAX_FRAME, PCI_DMA_FROMDEVICE);
+   dma_unmap_single(&card->pdev->dev, hw_buf_addr, SPIDER_NET_MAX_FRAME,
+DMA_FROM_DEVICE);
 
if ( (status == 

Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-08-27 Thread Ricardo Neri
On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote:
> On Tue, 10 Aug 2021 at 16:41, Ricardo Neri
>  wrote:
> > @@ -9540,6 +9629,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> > nr_running == 1)
> > continue;
> >
> > +   /* Make sure we only pull tasks from a CPU of lower priority */
> > +   if ((env->sd->flags & SD_ASYM_PACKING) &&
> > +   sched_asym_prefer(i, env->dst_cpu) &&
> > +   nr_running == 1)
> > +   continue;
> 
> This really looks similar to the test above for SD_ASYM_CPUCAPACITY.
> More generally speaking SD_ASYM_PACKING and SD_ASYM_CPUCAPACITY share
> a lot of common policy and I wonder if at some point we could not
> merge their behavior in LB

I would like to confirm with you that you are not expecting this merge
as part of this series, right?

Thanks and BR,
Ricardo


Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-08-27 Thread Ricardo Neri
On Fri, Aug 27, 2021 at 05:17:22PM +0200, Vincent Guittot wrote:
> On Fri, 27 Aug 2021 at 16:50, Peter Zijlstra  wrote:
> >
> > On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote:
> > > > +/**
> > > > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> > > > + * @dst_cpu:   Destination CPU of the load balancing
> > > > + * @sds:   Load-balancing data with statistics of the local group
> > > > + * @sgs:   Load-balancing statistics of the candidate busiest group
> > > > + * @sg:    The candidate busiest group
> > > > + *
> > > > + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> > > > + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> > > > + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> > > > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> > > > + *
> > > > + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs
> > > > + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> > > > + * between the number of busy CPUs is 2 or more. If the difference is 1,
> > > > + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings,
> > > > + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> > > > + * has lower priority.
> > > > + */
> > > > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > > > +   struct sg_lb_stats *sgs,
> > > > +   struct sched_group *sg)
> > > > +{
> > > > +#ifdef CONFIG_SCHED_SMT
> > > > +   bool local_is_smt, sg_is_smt;
> > > > +   int sg_busy_cpus;
> > > > +
> > > > +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> > > > +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> > > > +
> > > > +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > > > +
> > > > +   if (!local_is_smt) {
> > > > +   /*
> > > > +* If we are here, @dst_cpu is idle and does not have SMT
> > > > +* siblings. Pull tasks if candidate group has two or more
> > > > +* busy CPUs.
> > > > +*/
> > > > +   if (sg_is_smt && sg_busy_cpus >= 2)
> > > > +   return true;
> > > > +
> > > > +   /*
> > > > +* @dst_cpu does not have SMT siblings. @sg may have SMT
> > > > +* siblings and only one is busy. In such case, @dst_cpu
> > > > +* can help if it has higher priority and is idle.
> > > > +*/
> > > > +   return !sds->local_stat.group_util &&
> > >
> > > sds->local_stat.group_util can't be used to decide if a CPU or group
> > > of CPUs is idle. util_avg is usually not null when a CPU becomes idle,
> > > and you can have to wait more than 300ms before it becomes null.
> > > At the opposite, the utilization of a CPU can be null but a task with
> > > null utilization has just woken up on it.
> > > Utilization is used to reflect the average work of the CPU or group of
> > > CPUs but not the current state.
> >
> > If you want immediate idle, sgs->nr_running == 0 or sgs->idle_cpus ==
> > sgs->group_weight come to mind.
> 
> yes, I have the same in mind

Thank you very much Vincent and Peter for the feedback! I'll look at
using these stats to determine immediate idle.
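
For example (a sketch of that suggestion only, not a final patch), the
group_util test quoted above could become:

	/* @dst_cpu can help only if all CPUs of the local group are idle. */
	return (sds->local_stat.idle_cpus == sds->local->group_weight) &&
	       sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);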

> 
> >
> > > > +  sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > > > +   }
> > > > +
> > > > +   /* @dst_cpu has SMT siblings. */
> > > > +
> > > > +   if (sg_is_smt) {
> > > > +   int local_busy_cpus = sds->local->group_weight -
> > > > + sds->local_stat.idle_cpus;
> > > > +   int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> > > > +
> > > > +   /* Local can always help to even the number of busy CPUs. */
> > >
> > > default behavior of the load balance already tries to even the number
> > > of idle CPUs.
> >
> > Right, but I suppose this is because we're trapped here and have to deal
> > with the SMT-SMT case too. Ricardo, can you clarify?
> 
> IIUC, this function is used in sg_lb_stats to set
> sgs->group_asym_packing which is then used to set the group state to
> group_asym_packing and force asym migration.
> But if we only want to even the number of busy CPUs between the groups,
> we should not need to set the group state to group_asym_packing.

Yes, what Vincent describes is the intent. Then, I think it is probably
true that it is not necessary to even the number of idle CPUs here.
> 
> >
> > > > +   if (busy_cpus_delta >= 2)
> > > > +   return true;
> > > > +
> > > > +   if (busy_cpus_delta == 1)
> > > > + 

Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs

2021-08-27 Thread Mathieu Desnoyers



- On Aug 26, 2021, at 7:54 PM, Sean Christopherson sea...@google.com wrote:

> On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
>> - On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com 
>> wrote:
>> >> >> +  r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> >> +  TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> >> +  errno, strerror(errno));
>> >> >> +  atomic_inc(&seq_cnt);
>> >> >> +
>> >> >> +  CPU_CLR(cpu, &allowed_mask);
>> >> >> +
>> >> >> +  /*
>> >> >> +   * Let the read-side get back into KVM_RUN to improve the odds
>> >> >> +   * of task migration coinciding with KVM's run loop.
>> >> > 
>> >> > This comment should be about increasing the odds of letting the seqlock
>> >> > read-side complete. Otherwise, the delay between the two back-to-back
>> >> > atomic_inc is so small that the seqlock read-side may never have time to
>> >> > complete reading the rseq cpu id and the sched_getcpu() call, and can
>> >> > retry forever.
>> > 
>> > Hmm, but that's not why there's a delay.  I'm not arguing that a livelock
>> > isn't possible (though that syscall would have to be screaming fast),
>> 
>> I don't think we have the same understanding of the livelock scenario. AFAIU
>> the livelock would be caused by a too-small delay between the two consecutive
>> atomic_inc() from one loop iteration to the next compared to the time it
>> takes to perform a read-side critical section of the seqlock. Back-to-back
>> atomic_inc can be performed very quickly, so I doubt that the sched_getcpu
>> implementation has good odds of being fast enough to complete in that narrow
>> window, leading to lots of read-side seqlock retries.
> 
> Ooooh, yeah, brain fart on my side.  I was thinking of the two atomic_inc() in
> the same loop iteration and completely ignoring the next iteration.  Yes, I
> 100% agree that a delay to ensure forward progress is needed.  An assertion in
> main() that the reader completes at least some reasonable number of KVM_RUNs
> is also probably a good idea, e.g. to rule out a false pass due to the reader
> never making forward progress.

Agreed.
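
For concreteness, such an assertion could look like this at the end of
main() (a sketch only; the threshold of half the migrations is arbitrary):

	/*
	 * Sanity check that the test was able to enter the guest a reasonable
	 * number of times, e.g. didn't get stalled for its entire runtime.
	 */
	TEST_ASSERT(i > (NR_TASK_MIGRATIONS / 2),
		    "Only performed %d KVM_RUNs, task stalled too much?", i);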

> 
> FWIW, the do-while loop does make forward progress without a delay, but at
> ~50% throughput, give or take.

I did not expect absolutely no progress, but a significant slow down of
the read-side.

> 
>> > but the primary motivation is very much to allow the read-side enough time
>> > to get back into KVM proper.
>> 
>> I'm puzzled by your statement. AFAIU, let's say we don't have the delay,
>> then we have:
>> 
>> migration thread                     KVM_RUN/read-side thread
>> ----------------------------------------------------------------------
>>                                      - ioctl(KVM_RUN)
>> - atomic_inc_seq_cst()
>> - sched_setaffinity
>> - atomic_inc_seq_cst()
>>                                      - a = atomic_load() & ~1
>>                                      - smp_rmb()
>>                                      - b = LOAD_ONCE(__rseq_abi->cpu_id);
>>                                      - sched_getcpu()
>>                                      - smp_rmb()
>>                                      - re-load seqcnt/compare (succeeds)
>>                                        - Can only succeed if the entire
>>                                          read-side happens while the seqcnt
>>                                          is in an even numbered state.
>>                                      - if (a != b) abort()
>>   /* no delay. Even counter state is very short. */
>> - atomic_inc_seq_cst()
>>   /* Let's suppose the lack of delay causes the
>>      setaffinity to complete too early compared
>>      with the KVM_RUN ioctl */
>> - sched_setaffinity
>> - atomic_inc_seq_cst()
>> 
>>   /* no delay. Even counter state is very short. */
>> - atomic_inc_seq_cst()
>>   /* Then a setaffinity from a following
>>      migration thread loop will run
>>      concurrently with KVM_RUN */
>>                                      - ioctl(KVM_RUN)
>> - sched_setaffinity
>> - atomic_inc_seq_cst()
>> 
>> As pointed out here, if the first setaffinity runs too early compared with
>> KVM_RUN, a following setaffinity will run concurrently with it. However, the
>> fact that the even counter state is very short may very well hurt progress
>> of the read seqlock.
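
Restated as code, the read-side pattern in the diagram is roughly the
following (a sketch of the pattern only; names such as seq_cnt are assumed,
not the exact test source):

	do {
		/* Snapshot the counter; an odd value forces a retry. */
		snapshot = atomic_read(&seq_cnt) & ~1;
		smp_rmb();
		rseq_cpu = READ_ONCE(__rseq_abi.cpu_id);
		cpu = sched_getcpu();
		smp_rmb();
	} while (snapshot != atomic_read(&seq_cnt));

	TEST_ASSERT(rseq_cpu == cpu,
		    "rseq CPU = %d, sched CPU = %d", rseq_cpu, cpu);

The loop can only exit while the counter is even and unchanged, so the
shorter the even-counter window, the more often this retries.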
> 
> *sigh*
> 
> Several hours later, I think I finally have my head wrapped around everything.
> 
> Due to the way the test is written and because of how KVM's run loop works,
> TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set before KVM
> actually enters the guest, otherwise KVM will exit to userspace without
> touching the flag, i.e. it will be handled by the 

Re: [FSL P50x0] lscpu reports wrong values since the RC1 of kernel 5.13

2021-08-27 Thread Christian Zigotzky



> On 26. Aug 2021, at 05:43, Srikar Dronamraju wrote:
> 
> * Christian Zigotzky  [2021-08-16 14:29:21]:
> 
> 
> Hi Christian,
> 
>> I tested the RC6 of kernel 5.14 today and unfortunately the issue still
>> exists. We have figured out that only P5040 SoCs are affected. [1]
>> P5020 SoCs display the correct values.
>> Please check the CPU changes in the PowerPC updates 5.13-1 and 5.13-2.
>> 
> 
> Thanks for reporting the issue.
> Would it be possible to try
> https://lore.kernel.org/linuxppc-dev/20210821092419.167454-3-sri...@linux.vnet.ibm.com/t/#u

Hi Srikar,

This patch works! Thanks a lot!

Cheers,
Christian

> 
> If the above patch is not helping, then can you please collect the output of
> 
> cat /sys/devices/system/cpu/cpu*/topology/core_siblings
> 
> Were all the CPUs online at the time of boot?
> Did we do any CPU online/offline operations post boot?
> 
> If we did CPU online/offline, can you capture the output just after the
> boot along with the lscpu output.
> 
> Since this is being seen on a few SoCs, can you summarize the difference
> between P5040 and P5020.
>> 
>> [1] https://forum.hyperion-entertainment.com/viewtopic.php?p=53775#p53775
>> 
>> 
>>> On 09 August 2021 um 02:37 pm, Christian Zigotzky wrote:
>>> Hi All,
>>> 
>>> Lscpu reports wrong values [1] since the RC1 of kernel 5.13 on my FSL
>>> P5040 Cyrus+ board (A-EON AmigaOne X5000). [2]
>>> The differences are:
>>> 
>>> Since the RC1 of kernel 5.13 (wrong values):
>>> 
>>> Core(s) per socket: 1
>>> Socket(s): 3
>>> 
> 
> I know that the socket count was off by 1, but I can't explain how it's off
> by 2 here.
> 
>>> Before (correct values):
>>> 
>>> Core(s) per socket: 4
>>> Socket(s): 1
>>> 
> 
> -- 
> Thanks and Regards
> Srikar Dronamraju



Re: [PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-08-27 Thread Leonardo Brás
Hello Fred,

On Fri, 2021-08-27 at 19:06 +0200, Frederic Barrat wrote:
> 
> I think it looks ok now as it was mostly me who was misunderstanding one
> part of the previous iteration.
> Reviewed-by: Frederic Barrat 
> 

Thank you for reviewing this series!

> Sorry for the late review, I was enjoying some time off. And thanks for
> that series, I believe it should help with those bugs complaining about
> lack of DMA space. It was also very educational for me, thanks to you
> and Alexey for your detailed answers.

Working on this series caused me to learn a lot about how DMA and IOMMU
works in Power, and it was a great experience. Thanks to Alexey who
helped and guided me through this!

Best regards,
Leonardo Bras



Re: [PATCH v6 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-08-27 Thread Frederic Barrat




On 17/08/2021 08:39, Leonardo Bras wrote:

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than available for the
default DMA window, and also get access to much larger pagesizes
(16MB as implemented in qemu vs 4k from the default DMA window), causing a
significant increase in the maximum amount of memory that can be IOMMU
mapped at the same time.
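
As a rough illustration (window sizes assumed for the example, not taken
from the patch): with the same budget of 2^25 preallocated TCEs, 4k pages
cover 2^25 * 4 KiB = 128 GiB, while 16MB pages cover 2^25 * 16 MiB =
512 TiB, i.e. 4096 times more memory for the same number of TCEs.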

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---



I think it looks ok now as it was mostly me who was misunderstanding one 
part of the previous iteration.

Reviewed-by: Frederic Barrat 

Sorry for the late review, I was enjoying some time off. And thanks for 
that series, I believe it should help with those bugs complaining about 
lack of DMA space. It was also very educational for me, thanks to you 
and Alexey for your detailed answers.


  Fred



  arch/powerpc/platforms/pseries/iommu.c | 89 +-
  1 file changed, 74 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index e11c00b2dc1e..0eccc29f5573 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -375,6 +375,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
  /* protects initializing window twice for same device */
  static DEFINE_MUTEX(direct_window_init_mutex);
  #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
  
  static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,

unsigned long num_pfn, const void *arg)
@@ -940,6 +941,7 @@ static int find_existing_ddw_windows(void)
return 0;
  
  	find_existing_ddw_windows_named(DIRECT64_PROPNAME);

+   find_existing_ddw_windows_named(DMA64_PROPNAME);
  
  	return 0;

  }
@@ -1226,14 +1228,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
struct ddw_create_response create;
int page_shift;
u64 win_addr;
+   const char *win_name;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64;
bool ddw_enabled = false;
struct failed_ddw_pdn *fpdn;
-   bool default_win_removed = false;
+   bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+   struct pci_dn *pci = PCI_DN(pdn);
+   struct iommu_table *tbl = pci->table_group->tables[0];
  
  	dn = of_find_node_by_type(NULL, "ibm,pmemory");

pmem_present = dn != NULL;
@@ -1242,6 +1247,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
mutex_lock(&direct_window_init_mutex);
  
	if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {

+   direct_mapping = (len >= max_ram_len);
ddw_enabled = true;
goto out_unlock;
}
@@ -1322,8 +1328,8 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
  query.page_size);
goto out_failed;
}
-   /* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
+
+
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1339,13 +1345,25 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
dev_info(>dev, "Skipping ibm,pmemory");
}
  
+	/* check if the available block * number of ptes will map everything */

if (query.largest_available_block < (1ULL << (len - page_shift))) {
dev_dbg(>dev,

Re: [PATCH v6 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name

2021-08-27 Thread Frederic Barrat




On 17/08/2021 08:39, Leonardo Bras wrote:

Update remove_dma_window() so it can be used to remove DDW with a given
property name.

This enables the creation of new property names for DDW, so we can
have different usage for it, like indirect mapping.

Also, add return values to it so we can check if the property was found
while removing the active DDW. This allows skipping the remaining property
names while reducing the impact of multiple property names.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---



Reviewed-by: Frederic Barrat 




  arch/powerpc/platforms/pseries/iommu.c | 18 ++
  1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a47f59a8f107..901f290999d0 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -844,31 +844,33 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
__remove_dma_window(np, ddw_avail, liobn);
  }
  
-static void remove_ddw(struct device_node *np, bool remove_prop)

+static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_name)
  {
struct property *win;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
int ret = 0;
  
+	win = of_find_property(np, win_name, NULL);

+   if (!win)
+   return -EINVAL;
+
ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
 &ddw_avail[0], DDW_APPLICABLE_SIZE);
if (ret)
-   return;
+   return 0;
  
-	win = of_find_property(np, DIRECT64_PROPNAME, NULL);

-   if (!win)
-   return;
  
  	if (win->length >= sizeof(struct dynamic_dma_window_prop))

remove_dma_window(np, ddw_avail, win);
  
  	if (!remove_prop)

-   return;
+   return 0;
  
  	ret = of_remove_property(np, win);

if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
+   return 0;
  }
  
  static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift)

@@ -921,7 +923,7 @@ static int find_existing_ddw_windows(void)
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
continue;
}
  
@@ -1565,7 +1567,7 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti

 * we have to remove the property when releasing
 * the device node.
 */
-   remove_ddw(np, false);
+   remove_ddw(np, false, DIRECT64_PROPNAME);
if (pci && pci->table_group)
iommu_pseries_free_group(pci->table_group,
np->full_name);



Re: [PATCH v6 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2021-08-27 Thread Frederic Barrat




On 17/08/2021 08:39, Leonardo Bras wrote:

Code used to create a ddw property that was previously scattered in
enable_ddw() is now gathered in ddw_property_create(), which deals with
allocation and filling of the property, leaving it ready for
of_property_add(), which now occurs in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().

This change requires of_remove_property() in case anything fails after
of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
in all memory, which looks like the most expensive operation, only if
everything else succeeds.

Also, the error path got remove_ddw() replaced by a new helper
__remove_dma_window(), which only removes the new DDW with an rtas-call.
For this, a new helper clean_dma_window() was needed to clean anything
that could be left if walk_system_ram_range() fails.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---



Thanks for fixing the error paths
Reviewed-by: Frederic Barrat 



  arch/powerpc/platforms/pseries/iommu.c | 129 -
  1 file changed, 84 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index b34b473bbdc1..00392582fe10 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -795,17 +795,10 @@ static int __init disable_ddw_setup(char *str)
  
  early_param("disable_ddw", disable_ddw_setup);
  
-static void remove_dma_window(struct device_node *np, u32 *ddw_avail,

- struct property *win)
+static void clean_dma_window(struct device_node *np, struct dynamic_dma_window_prop *dwp)
  {
-   struct dynamic_dma_window_prop *dwp;
-   u64 liobn;
int ret;
  
-	dwp = win->value;

-   liobn = (u64)be32_to_cpu(dwp->liobn);
-
-   /* clear the whole window, note the arg is in kernel pages */
ret = tce_clearrange_multi_pSeriesLP(0,
1ULL << (be32_to_cpu(dwp->window_shift) - PAGE_SHIFT), dwp);
if (ret)
@@ -814,18 +807,39 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail,
else
pr_debug("%pOF successfully cleared tces in window.\n",
 np);
+}
+
+/*
+ * Call only if DMA window is clean.
+ */
+static void __remove_dma_window(struct device_node *np, u32 *ddw_avail, u64 liobn)
+{
+   int ret;
  
  	ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);

if (ret)
-   pr_warn("%pOF: failed to remove direct window: rtas returned "
+   pr_warn("%pOF: failed to remove DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
-   pr_debug("%pOF: successfully removed direct window: rtas returned "
+   pr_debug("%pOF: successfully removed DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
  }
  
+static void remove_dma_window(struct device_node *np, u32 *ddw_avail,

+ struct property *win)
+{
+   struct dynamic_dma_window_prop *dwp;
+   u64 liobn;
+
+   dwp = win->value;
+   liobn = (u64)be32_to_cpu(dwp->liobn);
+
+   clean_dma_window(np, dwp);
+   __remove_dma_window(np, ddw_avail, liobn);
+}
+
  static void remove_ddw(struct device_node *np, bool remove_prop)
  {
struct property *win;
@@ -1153,6 +1167,35 @@ static int iommu_get_page_shift(u32 query_page_size)
return 0;
  }
  
+static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,
+   u32 page_shift, u32 window_shift)
+{
+   struct dynamic_dma_window_prop *ddwprop;
+   struct property *win64;
+
+   win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+   if (!win64)
+   return NULL;
+
+   win64->name = kstrdup(propname, GFP_KERNEL);
+   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+   win64->value = ddwprop;
+   win64->length = sizeof(*ddwprop);
+   if (!win64->name || !win64->value) {
+   kfree(win64->name);
+   kfree(win64->value);
+   kfree(win64);
+   return NULL;
+   }
+
+   ddwprop->liobn = cpu_to_be32(liobn);
+   ddwprop->dma_base = cpu_to_be64(dma_addr);
+   ddwprop->tce_shift = 

Re: [PATCH v6 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

2021-08-27 Thread Frederic Barrat




On 17/08/2021 08:39, Leonardo Bras wrote:

Having a function to check if the iommu table has any allocation helps
decide whether a tbl can be reset for using a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...).

iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This causes iommu_table_release_pages() to
become unnecessary, given it is only used to remove reserved memory for
testing.

Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check it in the new helper.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---


Looks ok to me now, thanks!
Reviewed-by: Frederic Barrat 



  arch/powerpc/include/asm/iommu.h |  1 +
  arch/powerpc/kernel/iommu.c  | 61 
  2 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..bf3b84128525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
   */
  extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);
  
  #define IOMMU_TABLE_GROUP_MAX_TABLES	2
  
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c

index 2af89a5e379f..ed98ad63633e 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -690,32 +690,24 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
  
-	tbl->it_reserved_start = res_start;

-   tbl->it_reserved_end = res_end;
-
-   /* Check if res_start..res_end isn't empty and overlaps the table */
-   if (res_start && res_end &&
-   (tbl->it_offset + tbl->it_size < res_start ||
-res_end < tbl->it_offset))
-   return;
+   if (res_start < tbl->it_offset)
+   res_start = tbl->it_offset;
  
-	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)

-   set_bit(i - tbl->it_offset, tbl->it_map);
-}
+   if (res_end > (tbl->it_offset + tbl->it_size))
+   res_end = tbl->it_offset + tbl->it_size;
  
-static void iommu_table_release_pages(struct iommu_table *tbl)

-{
-   int i;
+   /* Check if res_start..res_end is a valid range in the table */
+   if (res_start >= res_end) {
+   tbl->it_reserved_start = tbl->it_offset;
+   tbl->it_reserved_end = tbl->it_offset;
+   return;
+   }
  
-	/*

-* In case we have reserved the first bit, we should not emit
-* the warning below.
-*/
-   if (tbl->it_offset == 0)
-   clear_bit(0, tbl->it_map);
+   tbl->it_reserved_start = res_start;
+   tbl->it_reserved_end = res_end;
  
  	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)

-   clear_bit(i - tbl->it_offset, tbl->it_map);
+   set_bit(i - tbl->it_offset, tbl->it_map);
  }
  
  /*

@@ -779,6 +771,22 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
return tbl;
  }
  
+bool iommu_table_in_use(struct iommu_table *tbl)

+{
+   unsigned long start = 0, end;
+
+   /* ignore reserved bit0 */
+   if (tbl->it_offset == 0)
+   start = 1;
+   end = tbl->it_reserved_start - tbl->it_offset;
+   if (find_next_bit(tbl->it_map, end, start) != end)
+   return true;
+
+   start = tbl->it_reserved_end - tbl->it_offset;
+   end = tbl->it_size;
+   return find_next_bit(tbl->it_map, end, start) != end;
+}
+
  static void iommu_table_free(struct kref *kref)
  {
struct iommu_table *tbl;
@@ -795,10 +803,8 @@ static void iommu_table_free(struct kref *kref)
  
  	iommu_debugfs_del(tbl);
  
-	iommu_table_release_pages(tbl);

-
/* verify that table contains no entries */
-   if (!bitmap_empty(tbl->it_map, tbl->it_size))
+   if (iommu_table_in_use(tbl))
pr_warn("%s: Unexpected TCEs\n", __func__);
  
  	/* free bitmap */

@@ -1099,14 +1105,9 @@ int iommu_take_ownership(struct iommu_table *tbl)
for (i = 0; i < tbl->nr_pools; i++)
spin_lock_nest_lock(&tbl->pools[i].lock, &tbl->large_pool.lock);
  
-	iommu_table_release_pages(tbl);

-
-   if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+   if (iommu_table_in_use(tbl)) {
pr_err("iommu_tce: it_map is not empty");
ret = -EBUSY;
-   /* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
-   iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-   tbl->it_reserved_end);
} else {

[RFC PATCH 6/6] powerpc/microwatt: Stop building the hash MMU code

2021-08-27 Thread Nicholas Piggin
Microwatt is radix-only, so stop selecting the hash MMU code.

This saves 20kB compressed dtbImage and 56kB vmlinux size.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/configs/microwatt_defconfig | 1 -
 arch/powerpc/platforms/Kconfig.cputype   | 1 +
 arch/powerpc/platforms/microwatt/Kconfig | 1 -
 3 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index bf5f2e5905eb..6fbad42f9e3d 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -26,7 +26,6 @@ CONFIG_PPC_MICROWATT=y
 # CONFIG_PPC_OF_BOOT_TRAMPOLINE is not set
 CONFIG_CPU_FREQ=y
 CONFIG_HZ_100=y
-# CONFIG_PPC_MEM_KEYS is not set
 # CONFIG_SECCOMP is not set
 # CONFIG_MQ_IOSCHED_KYBER is not set
 # CONFIG_COREDUMP is not set
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 9b90fc501ed4..b9acd6c68c81 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -371,6 +371,7 @@ config SPE
 config PPC_64S_HASH_MMU
bool "Hash MMU Support"
depends on PPC_BOOK3S_64
+   depends on !PPC_MICROWATT
default y
select PPC_MM_SLICES
help
diff --git a/arch/powerpc/platforms/microwatt/Kconfig b/arch/powerpc/platforms/microwatt/Kconfig
index e0ff2cfc1ca0..4f0ce292a6ce 100644
--- a/arch/powerpc/platforms/microwatt/Kconfig
+++ b/arch/powerpc/platforms/microwatt/Kconfig
@@ -6,7 +6,6 @@ config PPC_MICROWATT
select PPC_XICS
select PPC_ICS_NATIVE
select PPC_ICP_NATIVE
-   select PPC_HASH_MMU_NATIVE if PPC_64S_HASH_MMU
select PPC_UDBG_16550
select ARCH_RANDOM
help
-- 
2.23.0



[RFC PATCH 5/6] powerpc/microwatt: select POWER9_CPU

2021-08-27 Thread Nicholas Piggin
Microwatt implements a subset of ISA v3.0 which is equivalent to
the POWER9_CPU selection.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/configs/microwatt_defconfig | 1 +
 arch/powerpc/platforms/microwatt/Kconfig | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index a08b739123da..bf5f2e5905eb 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -14,6 +14,7 @@ CONFIG_EMBEDDED=y
 # CONFIG_COMPAT_BRK is not set
 # CONFIG_SLAB_MERGE_DEFAULT is not set
 CONFIG_PPC64=y
+CONFIG_POWER9_CPU=y
 # CONFIG_PPC_KUEP is not set
 # CONFIG_PPC_KUAP is not set
 CONFIG_CPU_LITTLE_ENDIAN=y
diff --git a/arch/powerpc/platforms/microwatt/Kconfig b/arch/powerpc/platforms/microwatt/Kconfig
index 823192e9d38a..e0ff2cfc1ca0 100644
--- a/arch/powerpc/platforms/microwatt/Kconfig
+++ b/arch/powerpc/platforms/microwatt/Kconfig
@@ -2,6 +2,7 @@
 config PPC_MICROWATT
depends on PPC_BOOK3S_64 && !SMP
bool "Microwatt SoC platform"
+   select POWER9_CPU
select PPC_XICS
select PPC_ICS_NATIVE
select PPC_ICP_NATIVE
-- 
2.23.0



[RFC PATCH 4/6] powerpc/64s: Make hash MMU code build configurable

2021-08-27 Thread Nicholas Piggin
Introduce a new option CONFIG_PPC_64S_HASH_MMU which allows the 64s hash
MMU code to be compiled out if radix is selected and the minimum
supported CPU type is POWER9 or higher, and KVM is not selected.

This saves 128kB kernel image size (90kB text) on powernv_defconfig
minus KVM, 350kB on pseries_defconfig minus KVM, 40kB on a tiny config.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/book3s/64/mmu.h  | 22 +-
 .../include/asm/book3s/64/tlbflush-hash.h |  7 ++
 arch/powerpc/include/asm/book3s/pgtable.h |  4 ++
 arch/powerpc/include/asm/mmu.h| 38 +--
 arch/powerpc/include/asm/mmu_context.h|  2 +
 arch/powerpc/include/asm/paca.h   |  8 +++
 arch/powerpc/kernel/asm-offsets.c |  2 +
 arch/powerpc/kernel/dt_cpu_ftrs.c | 10 ++-
 arch/powerpc/kernel/entry_64.S|  4 +-
 arch/powerpc/kernel/exceptions-64s.S  | 16 +
 arch/powerpc/kernel/mce.c |  2 +-
 arch/powerpc/kernel/mce_power.c   | 10 ++-
 arch/powerpc/kernel/paca.c| 18 ++---
 arch/powerpc/kernel/process.c | 13 ++--
 arch/powerpc/kernel/prom.c|  2 +
 arch/powerpc/kernel/setup_64.c|  4 ++
 arch/powerpc/kexec/core_64.c  |  4 +-
 arch/powerpc/kexec/ranges.c   |  4 ++
 arch/powerpc/kvm/Kconfig  |  1 +
 arch/powerpc/mm/book3s64/Makefile | 17 +++--
 arch/powerpc/mm/book3s64/hash_pgtable.c   |  1 -
 arch/powerpc/mm/book3s64/hash_utils.c | 10 ---
 .../{hash_hugetlbpage.c => hugetlbpage.c} |  6 ++
 arch/powerpc/mm/book3s64/mmu_context.c| 16 +
 arch/powerpc/mm/book3s64/pgtable.c| 22 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  4 ++
 arch/powerpc/mm/book3s64/slb.c| 16 -
 arch/powerpc/mm/copro_fault.c |  2 +
 arch/powerpc/mm/fault.c   | 17 +
 arch/powerpc/mm/pgtable.c | 10 ++-
 arch/powerpc/platforms/Kconfig.cputype| 20 +-
 arch/powerpc/platforms/cell/Kconfig   |  1 +
 arch/powerpc/platforms/maple/Kconfig  |  1 +
 arch/powerpc/platforms/microwatt/Kconfig  |  2 +-
 arch/powerpc/platforms/pasemi/Kconfig |  1 +
 arch/powerpc/platforms/powermac/Kconfig   |  1 +
 arch/powerpc/platforms/powernv/Kconfig|  2 +-
 arch/powerpc/platforms/powernv/idle.c |  2 +
 arch/powerpc/platforms/powernv/setup.c|  2 +
 arch/powerpc/platforms/pseries/lpar.c | 68 ++-
 arch/powerpc/platforms/pseries/lparcfg.c  |  2 +-
 arch/powerpc/platforms/pseries/mobility.c |  6 ++
 arch/powerpc/platforms/pseries/ras.c  |  4 ++
 arch/powerpc/platforms/pseries/reconfig.c |  2 +
 arch/powerpc/platforms/pseries/setup.c|  6 +-
 arch/powerpc/xmon/xmon.c  |  8 ++-
 47 files changed, 310 insertions(+), 111 deletions(-)
 rename arch/powerpc/mm/book3s64/{hash_hugetlbpage.c => hugetlbpage.c} (95%)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index d01e3401581d..db4e0efd229b 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -940,6 +940,7 @@ config PPC_MEM_KEYS
prompt "PowerPC Memory Protection Keys"
def_bool y
depends on PPC_BOOK3S_64
+   depends on PPC_64S_HASH_MMU
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
help
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index c02f42d1031e..857dc88b0043 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -98,7 +98,9 @@ typedef struct {
 * from EA and new context ids to build the new VAs.
 */
mm_context_id_t id;
+#ifdef CONFIG_PPC_64S_HASH_MMU
mm_context_id_t extended_id[TASK_SIZE_USER64/TASK_CONTEXT_SIZE];
+#endif
};
 
/* Number of bits in the mm_cpumask */
@@ -110,7 +112,9 @@ typedef struct {
/* Number of user space windows opened in process mm_context */
atomic_t vas_windows;
 
+#ifdef CONFIG_PPC_64S_HASH_MMU
struct hash_mm_context *hash_context;
+#endif
 
void __user *vdso;
/*
@@ -133,6 +137,7 @@ typedef struct {
 #endif
 } mm_context_t;
 
+#ifdef CONFIG_PPC_64S_HASH_MMU
 static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
 {
return ctx->hash_context->user_psize;
@@ -187,11 +192,22 @@ static inline struct subpage_prot_table *mm_ctx_subpage_prot(mm_context_t *ctx)
 }
 #endif
 
+#endif
+
 /*
  * The current system page and segment sizes
  */
-extern int mmu_linear_psize;
+#if defined(CONFIG_PPC_RADIX_MMU) && !defined(CONFIG_PPC_64S_HASH_MMU)
+#ifdef CONFIG_PPC_64K_PAGES
+#define mmu_virtual_psize MMU_PAGE_64K
+#else
+#define mmu_virtual_psize MMU_PAGE_4K

[RFC PATCH 3/6] powerpc/pseries: Stop selecting PPC_HASH_MMU_NATIVE

2021-08-27 Thread Nicholas Piggin
The pseries platform does not use the native hash code but the PAPR
virtualised hash interfaces. Remove PPC_HASH_MMU_NATIVE. This
requires moving some tlbiel code from hash_native.c to hash_utils.c.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   4 -
 arch/powerpc/mm/book3s64/hash_native.c| 104 --
 arch/powerpc/mm/book3s64/hash_utils.c | 104 ++
 arch/powerpc/platforms/pseries/Kconfig|   1 -
 4 files changed, 104 insertions(+), 109 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 215973b4cb26..d2e80f178b6d 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -14,7 +14,6 @@ enum {
TLB_INVAL_SCOPE_LPID = 1,   /* invalidate TLBs for current LPID */
 };
 
-#ifdef CONFIG_PPC_NATIVE
 static inline void tlbiel_all(void)
 {
/*
@@ -30,9 +29,6 @@ static inline void tlbiel_all(void)
else
hash__tlbiel_all(TLB_INVAL_SCOPE_GLOBAL);
 }
-#else
-static inline void tlbiel_all(void) { BUG(); }
-#endif
 
 static inline void tlbiel_all_lpid(bool radix)
 {
diff --git a/arch/powerpc/mm/book3s64/hash_native.c b/arch/powerpc/mm/book3s64/hash_native.c
index 52e170bd95ae..83afd08351f1 100644
--- a/arch/powerpc/mm/book3s64/hash_native.c
+++ b/arch/powerpc/mm/book3s64/hash_native.c
@@ -43,110 +43,6 @@
 
 static DEFINE_RAW_SPINLOCK(native_tlbie_lock);
 
-static inline void tlbiel_hash_set_isa206(unsigned int set, unsigned int is)
-{
-   unsigned long rb;
-
-   rb = (set << PPC_BITLSHIFT(51)) | (is << PPC_BITLSHIFT(53));
-
-   asm volatile("tlbiel %0" : : "r" (rb));
-}
-
-/*
- * tlbiel instruction for hash, set invalidation
- * i.e., r=1 and is=01 or is=10 or is=11
- */
-static __always_inline void tlbiel_hash_set_isa300(unsigned int set, unsigned int is,
-   unsigned int pid,
-   unsigned int ric, unsigned int prs)
-{
-   unsigned long rb;
-   unsigned long rs;
-   unsigned int r = 0; /* hash format */
-
-   rb = (set << PPC_BITLSHIFT(51)) | (is << PPC_BITLSHIFT(53));
-   rs = ((unsigned long)pid << PPC_BITLSHIFT(31));
-
-   asm volatile(PPC_TLBIEL(%0, %1, %2, %3, %4)
-: : "r"(rb), "r"(rs), "i"(ric), "i"(prs), "i"(r)
-: "memory");
-}
-
-
-static void tlbiel_all_isa206(unsigned int num_sets, unsigned int is)
-{
-   unsigned int set;
-
-   asm volatile("ptesync": : :"memory");
-
-   for (set = 0; set < num_sets; set++)
-   tlbiel_hash_set_isa206(set, is);
-
-   ppc_after_tlbiel_barrier();
-}
-
-static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
-{
-   unsigned int set;
-
-   asm volatile("ptesync": : :"memory");
-
-   /*
-* Flush the partition table cache if this is HV mode.
-*/
-   if (early_cpu_has_feature(CPU_FTR_HVMODE))
-   tlbiel_hash_set_isa300(0, is, 0, 2, 0);
-
-   /*
-* Now invalidate the process table cache. UPRT=0 HPT modes (what
-* current hardware implements) do not use the process table, but
-* add the flushes anyway.
-*
-* From ISA v3.0B p. 1078:
-* The following forms are invalid.
-*  * PRS=1, R=0, and RIC!=2 (The only process-scoped
-*HPT caching is of the Process Table.)
-*/
-   tlbiel_hash_set_isa300(0, is, 0, 2, 1);
-
-   /*
-* Then flush the sets of the TLB proper. Hash mode uses
-* partition scoped TLB translations, which may be flushed
-* in !HV mode.
-*/
-   for (set = 0; set < num_sets; set++)
-   tlbiel_hash_set_isa300(set, is, 0, 0, 0);
-
-   ppc_after_tlbiel_barrier();
-
-   asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT "; isync" : : :"memory");
-}
-
-void hash__tlbiel_all(unsigned int action)
-{
-   unsigned int is;
-
-   switch (action) {
-   case TLB_INVAL_SCOPE_GLOBAL:
-   is = 3;
-   break;
-   case TLB_INVAL_SCOPE_LPID:
-   is = 2;
-   break;
-   default:
-   BUG();
-   }
-
-   if (early_cpu_has_feature(CPU_FTR_ARCH_300))
-   tlbiel_all_isa300(POWER9_TLB_SETS_HASH, is);
-   else if (early_cpu_has_feature(CPU_FTR_ARCH_207S))
-   tlbiel_all_isa206(POWER8_TLB_SETS, is);
-   else if (early_cpu_has_feature(CPU_FTR_ARCH_206))
-   tlbiel_all_isa206(POWER7_TLB_SETS, is);
-   else
-   WARN(1, "%s called on pre-POWER7 CPU\n", __func__);
-}
-
 static inline unsigned long  ___tlbie(unsigned long vpn, int psize,
int apsize, int ssize)
 {
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index e35faefa8e59..43c05cdbba73 

[RFC PATCH 2/6] powerpc: Rename PPC_NATIVE to PPC_HASH_MMU_NATIVE

2021-08-27 Thread Nicholas Piggin
PPC_NATIVE now only controls the native HPT code, so rename it to be
more descriptive. Restrict it to Book3S only.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/book3s64/Makefile  | 2 +-
 arch/powerpc/mm/book3s64/hash_utils.c  | 2 +-
 arch/powerpc/platforms/52xx/Kconfig| 2 +-
 arch/powerpc/platforms/Kconfig | 4 ++--
 arch/powerpc/platforms/cell/Kconfig| 2 +-
 arch/powerpc/platforms/chrp/Kconfig| 2 +-
 arch/powerpc/platforms/embedded6xx/Kconfig | 2 +-
 arch/powerpc/platforms/maple/Kconfig   | 2 +-
 arch/powerpc/platforms/microwatt/Kconfig   | 2 +-
 arch/powerpc/platforms/pasemi/Kconfig  | 2 +-
 arch/powerpc/platforms/powermac/Kconfig| 2 +-
 arch/powerpc/platforms/powernv/Kconfig | 2 +-
 arch/powerpc/platforms/pseries/Kconfig | 2 +-
 13 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/Makefile b/arch/powerpc/mm/book3s64/Makefile
index 1b56d3af47d4..319f4b7f3357 100644
--- a/arch/powerpc/mm/book3s64/Makefile
+++ b/arch/powerpc/mm/book3s64/Makefile
@@ -6,7 +6,7 @@ CFLAGS_REMOVE_slb.o = $(CC_FLAGS_FTRACE)
 
 obj-y  += hash_pgtable.o hash_utils.o slb.o \
   mmu_context.o pgtable.o hash_tlb.o
-obj-$(CONFIG_PPC_NATIVE)   += hash_native.o
+obj-$(CONFIG_PPC_HASH_MMU_NATIVE)  += hash_native.o
 obj-$(CONFIG_PPC_RADIX_MMU)+= radix_pgtable.o radix_tlb.o
 obj-$(CONFIG_PPC_4K_PAGES) += hash_4k.o
 obj-$(CONFIG_PPC_64K_PAGES)+= hash_64k.o
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index ac5720371c0d..e35faefa8e59 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1091,7 +1091,7 @@ void __init hash__early_init_mmu(void)
ps3_early_mm_init();
else if (firmware_has_feature(FW_FEATURE_LPAR))
hpte_init_pseries();
-   else if (IS_ENABLED(CONFIG_PPC_NATIVE))
+   else if (IS_ENABLED(CONFIG_PPC_HASH_MMU_NATIVE))
hpte_init_native();
 
if (!mmu_hash_ops.hpte_insert)
diff --git a/arch/powerpc/platforms/52xx/Kconfig b/arch/powerpc/platforms/52xx/Kconfig
index 99d60acc20c8..b72ed2950ca8 100644
--- a/arch/powerpc/platforms/52xx/Kconfig
+++ b/arch/powerpc/platforms/52xx/Kconfig
@@ -34,7 +34,7 @@ config PPC_EFIKA
bool "bPlan Efika 5k2. MPC5200B based computer"
depends on PPC_MPC52xx
select PPC_RTAS
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
 
 config PPC_LITE5200
bool "Freescale Lite5200 Eval Board"
diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index e02d29a9d12f..d41dad227de8 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -40,9 +40,9 @@ config EPAPR_PARAVIRT
 
  In case of doubt, say Y
 
-config PPC_NATIVE
+config PPC_HASH_MMU_NATIVE
bool
-   depends on PPC_BOOK3S_32 || PPC64
+   depends on PPC_BOOK3S
help
  Support for running natively on the hardware, i.e. without
  a hypervisor. This option is not user-selectable but should
diff --git a/arch/powerpc/platforms/cell/Kconfig b/arch/powerpc/platforms/cell/Kconfig
index cb70c5f25bc6..db4465c51b56 100644
--- a/arch/powerpc/platforms/cell/Kconfig
+++ b/arch/powerpc/platforms/cell/Kconfig
@@ -8,7 +8,7 @@ config PPC_CELL_COMMON
select PPC_DCR_MMIO
select PPC_INDIRECT_PIO
select PPC_INDIRECT_MMIO
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
select PPC_RTAS
select IRQ_EDGE_EOI_HANDLER
 
diff --git a/arch/powerpc/platforms/chrp/Kconfig b/arch/powerpc/platforms/chrp/Kconfig
index 9b5c5505718a..ff30ed579a39 100644
--- a/arch/powerpc/platforms/chrp/Kconfig
+++ b/arch/powerpc/platforms/chrp/Kconfig
@@ -11,6 +11,6 @@ config PPC_CHRP
select RTAS_ERROR_LOGGING
select PPC_MPC106
select PPC_UDBG_16550
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
select FORCE_PCI
default y
diff --git a/arch/powerpc/platforms/embedded6xx/Kconfig b/arch/powerpc/platforms/embedded6xx/Kconfig
index 4c6d703a4284..c54786f8461e 100644
--- a/arch/powerpc/platforms/embedded6xx/Kconfig
+++ b/arch/powerpc/platforms/embedded6xx/Kconfig
@@ -55,7 +55,7 @@ config MVME5100
select FORCE_PCI
select PPC_INDIRECT_PCI
select PPC_I8259
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
select PPC_UDBG_16550
help
  This option enables support for the Motorola (now Emerson) MVME5100
diff --git a/arch/powerpc/platforms/maple/Kconfig b/arch/powerpc/platforms/maple/Kconfig
index 86ae210bee9a..7fd84311ade5 100644
--- a/arch/powerpc/platforms/maple/Kconfig
+++ b/arch/powerpc/platforms/maple/Kconfig
@@ -9,7 +9,7 @@ config PPC_MAPLE
select GENERIC_TBSYNC
select PPC_UDBG_16550
select PPC_970_NAP
-   select PPC_NATIVE
+   select 

[RFC PATCH 1/6] powerpc: Remove unused FW_FEATURE_NATIVE references

2021-08-27 Thread Nicholas Piggin
FW_FEATURE_NATIVE_ALWAYS and FW_FEATURE_NATIVE_POSSIBLE are always
zero and never do anything. Remove them.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/firmware.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 7604673787d6..df9e4aae917e 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -79,8 +79,6 @@ enum {
FW_FEATURE_POWERNV_ALWAYS = 0,
FW_FEATURE_PS3_POSSIBLE = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
FW_FEATURE_PS3_ALWAYS = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
-   FW_FEATURE_NATIVE_POSSIBLE = 0,
-   FW_FEATURE_NATIVE_ALWAYS = 0,
FW_FEATURE_POSSIBLE =
 #ifdef CONFIG_PPC_PSERIES
FW_FEATURE_PSERIES_POSSIBLE |
@@ -90,9 +88,6 @@ enum {
 #endif
 #ifdef CONFIG_PPC_PS3
FW_FEATURE_PS3_POSSIBLE |
-#endif
-#ifdef CONFIG_PPC_NATIVE
-   FW_FEATURE_NATIVE_ALWAYS |
 #endif
0,
FW_FEATURE_ALWAYS =
@@ -104,9 +99,6 @@ enum {
 #endif
 #ifdef CONFIG_PPC_PS3
FW_FEATURE_PS3_ALWAYS &
-#endif
-#ifdef CONFIG_PPC_NATIVE
-   FW_FEATURE_NATIVE_ALWAYS &
 #endif
FW_FEATURE_POSSIBLE,
 
-- 
2.23.0



[RFC PATCH 0/6] powerpc: Make hash MMU code build configurable

2021-08-27 Thread Nicholas Piggin
Now that there's a platform that can make good use of it, here's
a series that can prevent the hash MMU code being built for 64s
platforms that don't need it.

Thanks,
Nick

Nicholas Piggin (6):
  powerpc: Remove unused FW_FEATURE_NATIVE references
  powerpc: Rename PPC_NATIVE to PPC_HASH_MMU_NATIVE
  powerpc/pseries: Stop selecting PPC_HASH_MMU_NATIVE
  powerpc/64s: Make hash MMU code build configurable
  powerpc/microwatt: select POWER9_CPU
  powerpc/microwatt: Stop building the hash MMU code

 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/configs/microwatt_defconfig  |   2 +-
 arch/powerpc/include/asm/book3s/64/mmu.h  |  22 +++-
 .../include/asm/book3s/64/tlbflush-hash.h |   7 ++
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   4 -
 arch/powerpc/include/asm/book3s/pgtable.h |   4 +
 arch/powerpc/include/asm/firmware.h   |   8 --
 arch/powerpc/include/asm/mmu.h|  38 +-
 arch/powerpc/include/asm/mmu_context.h|   2 +
 arch/powerpc/include/asm/paca.h   |   8 ++
 arch/powerpc/kernel/asm-offsets.c |   2 +
 arch/powerpc/kernel/dt_cpu_ftrs.c |  10 +-
 arch/powerpc/kernel/entry_64.S|   4 +-
 arch/powerpc/kernel/exceptions-64s.S  |  16 +++
 arch/powerpc/kernel/mce.c |   2 +-
 arch/powerpc/kernel/mce_power.c   |  10 +-
 arch/powerpc/kernel/paca.c|  18 ++-
 arch/powerpc/kernel/process.c |  13 +-
 arch/powerpc/kernel/prom.c|   2 +
 arch/powerpc/kernel/setup_64.c|   4 +
 arch/powerpc/kexec/core_64.c  |   4 +-
 arch/powerpc/kexec/ranges.c   |   4 +
 arch/powerpc/kvm/Kconfig  |   1 +
 arch/powerpc/mm/book3s64/Makefile |  19 +--
 arch/powerpc/mm/book3s64/hash_native.c| 104 
 arch/powerpc/mm/book3s64/hash_pgtable.c   |   1 -
 arch/powerpc/mm/book3s64/hash_utils.c | 116 --
 .../{hash_hugetlbpage.c => hugetlbpage.c} |   6 +
 arch/powerpc/mm/book3s64/mmu_context.c|  16 +++
 arch/powerpc/mm/book3s64/pgtable.c|  22 +++-
 arch/powerpc/mm/book3s64/radix_pgtable.c  |   4 +
 arch/powerpc/mm/book3s64/slb.c|  16 ---
 arch/powerpc/mm/copro_fault.c |   2 +
 arch/powerpc/mm/fault.c   |  17 +++
 arch/powerpc/mm/pgtable.c |  10 +-
 arch/powerpc/platforms/52xx/Kconfig   |   2 +-
 arch/powerpc/platforms/Kconfig|   4 +-
 arch/powerpc/platforms/Kconfig.cputype|  21 +++-
 arch/powerpc/platforms/cell/Kconfig   |   3 +-
 arch/powerpc/platforms/chrp/Kconfig   |   2 +-
 arch/powerpc/platforms/embedded6xx/Kconfig|   2 +-
 arch/powerpc/platforms/maple/Kconfig  |   3 +-
 arch/powerpc/platforms/microwatt/Kconfig  |   2 +-
 arch/powerpc/platforms/pasemi/Kconfig |   3 +-
 arch/powerpc/platforms/powermac/Kconfig   |   3 +-
 arch/powerpc/platforms/powernv/Kconfig|   2 +-
 arch/powerpc/platforms/powernv/idle.c |   2 +
 arch/powerpc/platforms/powernv/setup.c|   2 +
 arch/powerpc/platforms/pseries/Kconfig|   1 -
 arch/powerpc/platforms/pseries/lpar.c |  68 +-
 arch/powerpc/platforms/pseries/lparcfg.c  |   2 +-
 arch/powerpc/platforms/pseries/mobility.c |   6 +
 arch/powerpc/platforms/pseries/ras.c  |   4 +
 arch/powerpc/platforms/pseries/reconfig.c |   2 +
 arch/powerpc/platforms/pseries/setup.c|   6 +-
 arch/powerpc/xmon/xmon.c  |   8 +-
 56 files changed, 427 insertions(+), 240 deletions(-)
 rename arch/powerpc/mm/book3s64/{hash_hugetlbpage.c => hugetlbpage.c} (95%)

-- 
2.23.0



Re: [RFC PATCH] powerpc: Investigate static_call concept

2021-08-27 Thread Segher Boessenkool
On Fri, Aug 27, 2021 at 04:18:47PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 27, 2021 at 09:45:37AM +, Christophe Leroy wrote:
> > This RFC is to validate the concept of static_call on powerpc.
> > 
> > Highly copied from x86.
> > 
> > It replaces ppc_md.get_irq() which is called at every IRQ, by
> > a static call.
> 
> The code looks saner, but does it actually improve performance? I'm
> thinking the double branch also isn't free.

It isn't, but it is very cheap, while the branch-to-count is not, even
*if* it is correctly predicted.

> The paranoid in me would've made it:
> 
>   BUG_ON(patch_branch(...));
> 
> just to make sure to notice the target not fitting. Ohh, patch_branch()
> doesn't return the create_branch() error, perhaps that wants to be
> fixed?

Should that be allowed to fail ever?  I.e., should a failure be a fatal
error?  Sounds very fragile otherwise.


Segher


Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-08-27 Thread Vincent Guittot
On Fri, 27 Aug 2021 at 16:50, Peter Zijlstra  wrote:
>
> On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote:
> > > +/**
> > > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can 
> > > pull tasks
> > > + * @dst_cpu:   Destination CPU of the load balancing
> > > + * @sds:   Load-balancing data with statistics of the local group
> > > + * @sgs:   Load-balancing statistics of the candidate busiest group
> > > + * @sg:    The candidate busiest group
> > > + *
> > > + * Check the state of the SMT siblings of both @sds::local and @sg and 
> > > decide
> > > + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, 
> > > it can
> > > + * pull tasks if two or more of the SMT siblings of @sg are busy. If 
> > > only one
> > > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> > > + *
> > > + * If both @dst_cpu and @sg have SMT siblings, even the number of idle 
> > > CPUs
> > > + * between @sds::local and @sg. Thus, pull tasks from @sg if the 
> > > difference
> > > + * between the number of busy CPUs is 2 or more. If the difference is of 
> > > 1,
> > > + * only pull if @dst_cpu has higher priority. If @sg does not have SMT 
> > > siblings
> > > + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and 
> > > @sg
> > > + * has lower priority.
> > > + */
> > > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > > +   struct sg_lb_stats *sgs,
> > > +   struct sched_group *sg)
> > > +{
> > > +#ifdef CONFIG_SCHED_SMT
> > > +   bool local_is_smt, sg_is_smt;
> > > +   int sg_busy_cpus;
> > > +
> > > +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> > > +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> > > +
> > > +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > > +
> > > +   if (!local_is_smt) {
> > > +   /*
> > > +* If we are here, @dst_cpu is idle and does not have SMT
> > > +* siblings. Pull tasks if candidate group has two or more
> > > +* busy CPUs.
> > > +*/
> > > +   if (sg_is_smt && sg_busy_cpus >= 2)
> > > +   return true;
> > > +
> > > +   /*
> > > +* @dst_cpu does not have SMT siblings. @sg may have SMT
> > > +* siblings and only one is busy. In such case, @dst_cpu
> > > +* can help if it has higher priority and is idle.
> > > +*/
> > > +   return !sds->local_stat.group_util &&
> >
> > sds->local_stat.group_util can't be used to decide if a CPU or group
> > of CPUs is idle. util_avg is usually not null when a CPU becomes idle
> > and you can have to wait  more than 300ms before it becomes Null
> > At the opposite, the utilization of a CPU can be null but a task with
> > null utilization has just woken up on it.
> > Utilization is used to reflect the average work of the CPU or group of
> > CPUs but not the current state
>
> If you want immediate idle, sgs->nr_running == 0 or sgs->idle_cpus ==
> sgs->group_weight come to mind.

yes, I have the same in mind

>
> > > +  sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > > +   }
> > > +
> > > +   /* @dst_cpu has SMT siblings. */
> > > +
> > > +   if (sg_is_smt) {
> > > +   int local_busy_cpus = sds->local->group_weight -
> > > + sds->local_stat.idle_cpus;
> > > +   int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> > > +
> > > +   /* Local can always help to even the number of busy CPUs. */
> >
> > default behavior of the load balance already tries to even the number
> > of idle CPUs.
>
> Right, but I suppose this is because we're trapped here and have to deal
> with the SMT-SMT case too. Ricardo, can you clarify?

IIUC, this function is used in sg_lb_stats to set
sgs->group_asym_packing which is then used to set the group state to
group_asym_packing and force asym migration.
But if we only want to even the number of busy CPUs between the groups,
we should not need to set the group state to group_asym_packing.

>
> > > +   if (busy_cpus_delta >= 2)
> > > +   return true;
> > > +
> > > +   if (busy_cpus_delta == 1)
> > > +   return sched_asym_prefer(dst_cpu,
> > > +sg->asym_prefer_cpu);
> > > +
> > > +   return false;
> > > +   }
> > > +
> > > +   /*
> > > +* @sg does not have SMT siblings. Ensure that @sds::local does 
> > > not end
> > > +* up with more than one busy SMT sibling and only pull tasks if 
> > > there
> > > +* are not busy CPUs. As CPUs move in and out of idle state 
> > > frequently,
> > > +* also check the group utilization to smooth the decision.

Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-08-27 Thread Peter Zijlstra
On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote:
> > +/**
> > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull 
> > tasks
> > + * @dst_cpu:   Destination CPU of the load balancing
> > + * @sds:   Load-balancing data with statistics of the local group
> > + * @sgs:   Load-balancing statistics of the candidate busiest group
> > + * @sg:    The candidate busiest group
> > + *
> > + * Check the state of the SMT siblings of both @sds::local and @sg and 
> > decide
> > + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it 
> > can
> > + * pull tasks if two or more of the SMT siblings of @sg are busy. If only 
> > one
> > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> > + *
> > + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs
> > + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> > + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> > + * only pull if @dst_cpu has higher priority. If @sg does not have SMT 
> > siblings
> > + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> > + * has lower priority.
> > + */
> > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > +   struct sg_lb_stats *sgs,
> > +   struct sched_group *sg)
> > +{
> > +#ifdef CONFIG_SCHED_SMT
> > +   bool local_is_smt, sg_is_smt;
> > +   int sg_busy_cpus;
> > +
> > +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> > +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> > +
> > +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > +
> > +   if (!local_is_smt) {
> > +   /*
> > +* If we are here, @dst_cpu is idle and does not have SMT
> > +* siblings. Pull tasks if candidate group has two or more
> > +* busy CPUs.
> > +*/
> > +   if (sg_is_smt && sg_busy_cpus >= 2)
> > +   return true;
> > +
> > +   /*
> > +* @dst_cpu does not have SMT siblings. @sg may have SMT
> > +* siblings and only one is busy. In such case, @dst_cpu
> > +* can help if it has higher priority and is idle.
> > +*/
> > +   return !sds->local_stat.group_util &&
> 
> sds->local_stat.group_util can't be used to decide if a CPU or group
> of CPUs is idle. util_avg is usually not null when a CPU becomes idle
> and you can have to wait  more than 300ms before it becomes Null
> At the opposite, the utilization of a CPU can be null but a task with
> null utilization has just woken up on it.
> Utilization is used to reflect the average work of the CPU or group of
> CPUs but not the current state

If you want immediate idle, sgs->nr_running == 0 or sgs->idle_cpus ==
sgs->group_weight come to mind.
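
(For illustration, an instantaneous idle check along those lines could be a
helper like the sketch below; the helper name is invented here, and the
fields are the sg_lb_stats ones quoted in this thread.)

	/* Sketch: "the local group is idle right now", per the suggestion above. */
	static inline bool local_group_is_idle(struct sd_lb_stats *sds)
	{
		/* every CPU in the local group is idle at this instant */
		return sds->local_stat.idle_cpus == sds->local_stat.group_weight;
	}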

> > +  sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > +   }
> > +
> > +   /* @dst_cpu has SMT siblings. */
> > +
> > +   if (sg_is_smt) {
> > +   int local_busy_cpus = sds->local->group_weight -
> > + sds->local_stat.idle_cpus;
> > +   int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> > +
> > +   /* Local can always help to even the number of busy CPUs. */
> 
> default behavior of the load balance already tries to even the number
> of idle CPUs.

Right, but I suppose this is because we're trapped here and have to deal
with the SMT-SMT case too. Ricardo, can you clarify?

> > +   if (busy_cpus_delta >= 2)
> > +   return true;
> > +
> > +   if (busy_cpus_delta == 1)
> > +   return sched_asym_prefer(dst_cpu,
> > +sg->asym_prefer_cpu);
> > +
> > +   return false;
> > +   }
> > +
> > +   /*
> > +* @sg does not have SMT siblings. Ensure that @sds::local does not 
> > end
> > +* up with more than one busy SMT sibling and only pull tasks if 
> > there
> > +* are not busy CPUs. As CPUs move in and out of idle state 
> > frequently,
> > +* also check the group utilization to smooth the decision.
> > +*/
> > +   if (!sds->local_stat.group_util)
> 
> same comment as above about the meaning of group_util == 0
> 
> > +   return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > +
> > +   return false;
> > +#else
> > +   /* Always return false so that callers deal with non-SMT cases. */
> > +   return false;
> > +#endif
> > +}
> > +
> >  static inline bool
> >  sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct 
> > sg_lb_stats *sgs,
> >struct sched_group *group)
> >  {
> > +   /* Only do SMT checks if either local or candidate have SMT siblings */

Re: [PATCH] ppc: add "-z notext" flag to disable diagnostic

2021-08-27 Thread Segher Boessenkool
Hi!

On Sat, Aug 14, 2021 at 12:34:15PM -0700, Fāng-ruì Sòng wrote:
> On Sat, Aug 14, 2021 at 5:59 AM Segher Boessenkool
>  wrote:
> >
> > On Fri, Aug 13, 2021 at 01:05:08PM -0700, Fangrui Song wrote:
> > > Text relocations are considered very awful by linker developers.
> >
> > By very few linker developers.

> https://groups.google.com/g/generic-abi/c/Ckq19PfLxyk/m/uW29sgkoAgAJ
> Ali Bahrami: "My opinion is that no one wants text relocations, but
> not labeling them if they do exist doesn't seem right. I find the
> presence of DF_TEXTREL very interesting when diagnosing various
> issues."

I don't know who that is, and that post has no context.

> https://gcc.gnu.org/legacy-ml/gcc/2016-04/msg00138.html
> ( "So why not simply create 'dynamic text relocations' then?  Is that
> possible with a pure linker change?" )
> Cary Coutant: "Ugh. Besides being a bad idea from a performance point
> of view, it's not even always possible to do. Depending on the
> architecture, a direct reference from an executable to a variable in a
> shared library may not have the necessary reach."

That is about a very specific kind of relocation.

> binutils-gdb commit "Add linker option: --warn-shared-textrel to
> produce warnings when adding a DT_TEXTREL to a shared object."
> Nick Clifton

That does not say text relocations are bad.  Of course you want to know
if they are there, for various reasons, like, if they are disallowed on
some systems.

> https://www.openwall.com/lists/musl/2020/09/26/3
> Szabolcs Nagy: "nice.  and gcc passes -z text for static pie code so
> that case should not end up with text rels."

That does not say text relocations are bad.

> Someone wrote "Overall, the overhead of processing text relocations
> can cause serious performance degradation." in Solaris' Linker and
> Libraries Guide.

In process startup, sure.  And it can make those processes run faster
as well.  That is the tradeoff with *all* relocations; you can make any
code without any relocations.  Relocations are a tradeoff, like most
things.

> > How would this be a benefit to security?
> 
> https://flameeyes.blog/2016/01/16/textrels-text-relocations-and-their-impact-on-hardening-techniques/

This means that those "hardening techniques" have some serious
weaknesses, that is all.  And hardening is not part of security
anyway; it is impact mitigation.

> FWIW I contributed a glibc patch allowing TEXTREL to co-exist with ifunc.
> It requires temporarily mapping the text segment W^X.

What does W^X mean here?  It normally means no mapping is both writable
and executable at the same time.

> > > There are no text relocations, therefore no need for -z notext.
> >
> > This is a choice by the compiler, nothing more.  It saves some process
> > startup time, and allows slightly more maps to be shared by processes
> > that run the same images.  But it is a tradeoff, so it might change; and
> > of course it is not an ABI requirement.

> Text relocations are generally awful.

Great arguments, thanks!  :-P

> GNU ld and gold's traditional "add DF_TEXTREL on-demand" behavior made
> such user errors easy to make.

That has no bearing on if text relocations are useful or not.

> I understand that kernels are special applications where we apply
> relocations once and many user-space objections can be less of a
> concern/ignored.
> However, the Linux kernel is already in a position where many linker
> options are controlled and thus should specify -z notext to make
> the intention explicit, or fix the problems (I think x86-64 is good;
> that said, powerpc
> has a higher cost using PC-relative instructions, so paying the one-shot relocation
> time cost probably isn't a bad choice)

I have no idea what you mean.


Segher


Re: [RFC PATCH] powerpc: Investigate static_call concept

2021-08-27 Thread Peter Zijlstra
On Fri, Aug 27, 2021 at 09:45:37AM +, Christophe Leroy wrote:
> This RFC is to validate the concept of static_call on powerpc.
> 
> Highly copied from x86.
> 
> It replaces ppc_md.get_irq() which is called at every IRQ, by
> a static call.

The code looks saner, but does it actually improve performance? I'm
thinking the double branch also isn't free.

> When updating the call, we just replace the instruction at the
> trampoline address by a relative jump to the function.
> 
> For the time being, the case of out-of-range functions is not handled.

The paranoid in me would've made it:

BUG_ON(patch_branch(...));

just to make sure to notice the target not fitting. Ohh, patch_branch()
doesn't return the create_branch() error, perhaps that wants to be
fixed?

Did you see the arm64 variant that deals with out-of-range functions in
their trampoline?

  
https://lore.kernel.org/linux-arm-kernel/20201120082103.4840-1-a...@kernel.org/

Not exactly 'nice' but supposedly that works.

> +#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \
> + asm(".pushsection .text, \"ax\" \n" \
> + ".align 4   \n" \
> + ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \
> + STATIC_CALL_TRAMP_STR(name) ":  \n" \
> + "   blr \n" \
> + ".type " STATIC_CALL_TRAMP_STR(name) ", @function   \n" \
> + ".size " STATIC_CALL_TRAMP_STR(name) ", . - " 
> STATIC_CALL_TRAMP_STR(name) " \n" \
> + ".popsection\n")
> +

Since you support CALL_NULL_TRAMP, your patch function below:

> +void arch_static_call_transform(void *site, void *tramp, void *func, bool 
> tail)
> +{
> + mutex_lock(&text_mutex);
> +
> + if (tramp)
> + patch_branch(tramp, (unsigned long)func, 0);
> +
> + mutex_unlock(&text_mutex);
> +}
> +EXPORT_SYMBOL_GPL(arch_static_call_transform);

Ought to patch in "blr" when !func to be consistent :-)

I'm thinking that your core kernel text all fits in the native range and
only modules need out-of-range ?


Re: [PATCH v2 0/2] Kconfig symbol fixes on powerpc

2021-08-27 Thread Michael Ellerman
On Thu, 19 Aug 2021 13:39:52 +0200, Lukas Bulwahn wrote:
> The script ./scripts/checkkconfigsymbols.py warns on invalid references to
> Kconfig symbols (often, minor typos, name confusions or outdated references).
> 
> This patch series addresses all issues reported by
> ./scripts/checkkconfigsymbols.py in ./drivers/usb/ for Kconfig and Makefile
> files. Issues in the Kconfig and Makefile files indicate some shortcomings in
> the overall build definitions, and often are true actionable issues to 
> address.
> 
> [...]

Patch 2 applied to powerpc/fixes.

[2/2] powerpc: rectify selection to ARCH_ENABLE_SPLIT_PMD_PTLOCK
  https://git.kernel.org/powerpc/c/310d2e83cb9b7f1e7232319880e3fcb57592fa10

cheers


Re: [PATCH v1] powerpc/64s: Fix scv implicit soft-mask table for relocated kernels

2021-08-27 Thread Michael Ellerman
On Fri, 20 Aug 2021 20:34:31 +1000, Nicholas Piggin wrote:
> The implict soft-mask table addresses get relocated if they use a
> relative symbol like a label. This is right for code that runs relocated
> but not for unrelocated. The scv interrupt vectors run unrelocated, so
> absolute addresses are required for their soft-mask table entry.
> 
> This fixes crashing with relocated kernels, usually an asynchronous
> interrupt hitting in the scv handler, then hitting the trap that checks
> whether r1 is in userspace.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/64s: Fix scv implicit soft-mask table for relocated kernels
  https://git.kernel.org/powerpc/c/787c70f2f9990b5a197320152d2fc32cd8a6ad1a

cheers


Re: [PATCH] powerpc/xive: Do not mark xive_request_ipi() as __init

2021-08-27 Thread Michael Ellerman
On Mon, 16 Aug 2021 11:57:11 -0700, Nathan Chancellor wrote:
> Compiling ppc64le_defconfig with clang-14 shows a modpost warning:
> 
> WARNING: modpost: vmlinux.o(.text+0xa74e0): Section mismatch in
> reference from the function xive_setup_cpu_ipi() to the function
> .init.text:xive_request_ipi()
> The function xive_setup_cpu_ipi() references
> the function __init xive_request_ipi().
> This is often because xive_setup_cpu_ipi lacks a __init
> annotation or the annotation of xive_request_ipi is wrong.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/xive: Do not mark xive_request_ipi() as __init
  https://git.kernel.org/powerpc/c/3f78c90f9eb2e228f44ecc8f4377753f0e11dbab

cheers


Re: [PATCH v2] powerpc/mm: Fix set_memory_*() against concurrent accesses

2021-08-27 Thread Michael Ellerman
On Wed, 18 Aug 2021 22:05:18 +1000, Michael Ellerman wrote:
> Laurent reported that STRICT_MODULE_RWX was causing intermittent crashes
> on one of his systems:
> 
>   kernel tried to execute exec-protected page (c00804073278) - exploit 
> attempt? (uid: 0)
>   BUG: Unable to handle kernel instruction fetch
>   Faulting instruction address: 0xc00804073278
>   Oops: Kernel access of bad area, sig: 11 [#1]
>   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>   Modules linked in: drm virtio_console fuse drm_panel_orientation_quirks ...
>   CPU: 3 PID: 44 Comm: kworker/3:1 Not tainted 5.14.0-rc4+ #12
>   Workqueue: events control_work_handler [virtio_console]
>   NIP:  c00804073278 LR: c00804073278 CTR: c01e9de0
>   REGS: c0002e4ef7e0 TRAP: 0400   Not tainted  (5.14.0-rc4+)
>   MSR:  80004280b033   CR: 24002822 XER: 
> 200400cf
>   ...
>   NIP fill_queue+0xf0/0x210 [virtio_console]
>   LR  fill_queue+0xf0/0x210 [virtio_console]
>   Call Trace:
> fill_queue+0xb4/0x210 [virtio_console] (unreliable)
> add_port+0x1a8/0x470 [virtio_console]
> control_work_handler+0xbc/0x1e8 [virtio_console]
> process_one_work+0x290/0x590
> worker_thread+0x88/0x620
> kthread+0x194/0x1a0
> ret_from_kernel_thread+0x5c/0x64
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/mm: Fix set_memory_*() against concurrent accesses
  https://git.kernel.org/powerpc/c/9f7853d7609d59172eecfc5e7ccf503bc1b690bd

cheers


Re: [PATCH v2] powerpc/32s: Fix random crashes by adding isync() after locking/unlocking KUEP

2021-08-27 Thread Michael Ellerman
On Wed, 18 Aug 2021 06:49:29 + (UTC), Christophe Leroy wrote:
> Commit b5efec00b671 ("powerpc/32s: Move KUEP locking/unlocking in C")
> removed the 'isync' instruction after adding/removing NX bit in user
> segments. The reasoning behind this change was that when setting the
> NX bit we don't mind it taking effect with delay as the kernel never
> executes text from userspace, and when clearing the NX bit this is
> to return to userspace and then the 'rfi' should synchronise the
> context.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/32s: Fix random crashes by adding isync() after locking/unlocking 
KUEP
  https://git.kernel.org/powerpc/c/ef486bf448a057a6e2d50e40ae879f7add6585da

cheers


Re: [PATCH] powerpc/pseries: Fix build error when NUMA=n

2021-08-27 Thread Laurent Dufour

Le 27/08/2021 à 15:15, Michael Ellerman a écrit :

On Mon, 16 Aug 2021 14:10:32 +1000, Michael Ellerman wrote:

As reported by lkp, if NUMA=n we see a build error:

arch/powerpc/platforms/pseries/hotplug-cpu.c: In function 
'pseries_cpu_hotplug_init':
arch/powerpc/platforms/pseries/hotplug-cpu.c:1022:8: error: 
'node_to_cpumask_map' undeclared
 1022 |node_to_cpumask_map[node]);

Use cpumask_of_node() which has an empty stub for NUMA=n, and when
NUMA=y does a lookup from node_to_cpumask_map[].

[...]


Applied to powerpc/next.

[1/1] powerpc/pseries: Fix build error when NUMA=n
   https://git.kernel.org/powerpc/c/8b893ef190b0c440877de04f767efca4bf4d6af8

cheers



Thanks, Michael, for fixing my bugs !


Re: [PATCH -next] selftests/powerpc: Remove duplicated include from tm-poison.c

2021-08-27 Thread Michael Ellerman
On Fri, 26 Mar 2021 14:48:08 +0800, Zheng Yongjun wrote:
> Remove duplicated include.
> 
> 
> 
> 

Applied to powerpc/next.

[1/1] selftests/powerpc: Remove duplicated include from tm-poison.c
  https://git.kernel.org/powerpc/c/6af0b5570b59ce8dd1608a8e48f59eff3f4bdd04

cheers


Re: [PATCH] [v2] arch: powerpc: Remove duplicate includes

2021-08-27 Thread Michael Ellerman
On Tue, 23 Mar 2021 14:29:05 +0800, Wan Jiabing wrote:
> mmu-hash.h: asm/bug.h has been included at line 12, so remove
> the duplicate one at line 21.
> interrupt.c: asm/interrupt.h has been included at line 12, so
> remove the duplicate one at line 10.
> time.c: linux/sched/clock.h has been included at line 33, so
> remove the duplicate one at line 56 and move sched/cputime.h
> under the sched include segment.
> 
> [...]

Applied to powerpc/next.

[1/1] arch: powerpc: Remove duplicate includes
  https://git.kernel.org/powerpc/c/e225c4d6bc389701f2f63fc246420a1da3465ab5

cheers


Re: [PATCH v2 0/4] Some improvements on regs usage

2021-08-27 Thread Michael Ellerman
On Sat, 7 Aug 2021 09:02:35 +0800, sxwj...@me.com wrote:
> From: Xiongwei Song 
> 
> When CONFIG_4xx=y or CONFIG_BOOKE=y, currently in code we reference dsisr
> to get interrupt reasons and reference dar to get exception address.
> However, in reference manuals, esr is used for interrupt reasons and dear
> is used for exception address, so the patchset changes dsisr -> esr,
> dar -> dear for CONFIG_4xx=y or CONFIG_BOOKE=y.
> 
> [...]

Applied to powerpc/next.

[1/4] powerpc: Optimize register usage for esr register
  https://git.kernel.org/powerpc/c/4f8e78c0757e3c5a65d9d8ac76e2434c71a78f5a
[2/4] powerpc/64e: Get esr offset with _ESR macro
  https://git.kernel.org/powerpc/c/cfa47772ca8d53d7a6c9b331a7f6e7c4c9827214
[3/4] powerpc: Optimize register usage for dear register
  https://git.kernel.org/powerpc/c/4872cbd0ca35ca5b20d52e2539e7e1950f126e7b
[4/4] powerpc/64e: Get dear offset with _DEAR macro
  https://git.kernel.org/powerpc/c/d9db6e420268b2d561731468a31f0b15e2e9a145

cheers


Re: [PATCH] powerpc/head_check: use stdout for error messages

2021-08-27 Thread Michael Ellerman
On Sun, 15 Aug 2021 15:23:34 -0700, Randy Dunlap wrote:
> Prefer stderr instead of stdout for error messages.
> This is a good practice and can help CI error detecting and
> reporting (0day in this case).
> 
> 
> 
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/head_check: use stdout for error messages
  https://git.kernel.org/powerpc/c/47c258d71ebfc832a760a1dc6540cf3c33968023

cheers


Re: (subset) [PATCH v2 00/60] KVM: PPC: Book3S HV P9: entry/exit optimisations

2021-08-27 Thread Michael Ellerman
On Thu, 12 Aug 2021 02:00:34 +1000, Nicholas Piggin wrote:
> This reduces radix guest full entry/exit latency on POWER9 and POWER10
> by 2x.
> 
> Nested HV guests should see smaller improvements in their L1 entry/exit,
> but this is also combined with most L0 speedups also applying to nested
> entry. nginx localhost throughput test in a SMP nested guest is improved
> about 10% (in a direct guest it doesn't change much because it uses XIVE
> for IPIs) when L0 and L1 are patched.
> 
> [...]

Applied to powerpc/next.

[01/60] KVM: PPC: Book3S HV: Initialise vcpu MSR with MSR_ME

https://git.kernel.org/powerpc/c/fd42b7b09c602c904452c0c3e5955ca21d8e387a
[02/60] KVM: PPC: Book3S HV: Remove TM emulation from POWER7/8 path

https://git.kernel.org/powerpc/c/daac40e8d7a63ab8608132a7cfce411986feac32
[03/60] KVM: PPC: Book3S HV P9: Fixes for TM softpatch interrupt NIP

https://git.kernel.org/powerpc/c/4782e0cd0d184d727ad3b0cfe20d1d44d9f98239
[04/60] KVM: PPC: Book3S HV Nested: Fix TM softpatch HFAC interrupt emulation

https://git.kernel.org/powerpc/c/d82b392d9b3556b63e3f9916cf057ea847e173a9
[05/60] KVM: PPC: Book3S HV Nested: Sanitise vcpu registers

https://git.kernel.org/powerpc/c/7487cabc7ed2f7716bf304e4e97c57fd995cf70e
[06/60] KVM: PPC: Book3S HV Nested: Make nested HFSCR state accessible

https://git.kernel.org/powerpc/c/8b210a880b35ba75eb42b79dfd65e369c1feb119
[07/60] KVM: PPC: Book3S HV Nested: Stop forwarding all HFUs to L1

https://git.kernel.org/powerpc/c/7c3ded5735141ff4d049747c9f76672a8b737c49
[08/60] KVM: PPC: Book3S HV Nested: save_hv_return_state does not require trap 
argument

https://git.kernel.org/powerpc/c/f2e29db156523bf08a0524e0f4306a641912c6d9
[09/60] KVM: PPC: Book3S HV Nested: Reflect guest PMU in-use to L0 when guest 
SPRs are live

https://git.kernel.org/powerpc/c/1782663897945a5cf28e564ba5eed730098e9aa4
[10/60] powerpc/64s: Remove WORT SPR from POWER9/10

https://git.kernel.org/powerpc/c/0c8fb653d487d2873f5eefdcaf4cecff4e103828

cheers


Re: [PATCH] powerpc/pseries: Fix build error when NUMA=n

2021-08-27 Thread Michael Ellerman
On Mon, 16 Aug 2021 14:10:32 +1000, Michael Ellerman wrote:
> As reported by lkp, if NUMA=n we see a build error:
> 
>arch/powerpc/platforms/pseries/hotplug-cpu.c: In function 
> 'pseries_cpu_hotplug_init':
>arch/powerpc/platforms/pseries/hotplug-cpu.c:1022:8: error: 
> 'node_to_cpumask_map' undeclared
> 1022 |node_to_cpumask_map[node]);
> 
> Use cpumask_of_node() which has an empty stub for NUMA=n, and when
> NUMA=y does a lookup from node_to_cpumask_map[].
> 
> [...]
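
(The "empty stub" behaviour the fix relies on looks roughly like this sketch,
simplified from the topology headers:)

	#ifdef CONFIG_NUMA
	#define cpumask_of_node(node)	((node) == NUMA_NO_NODE ?		\
					 cpu_all_mask : node_to_cpumask_map[node])
	#else
	#define cpumask_of_node(node)	(cpu_online_mask)	/* NUMA=n stub */
	#endif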

Applied to powerpc/next.

[1/1] powerpc/pseries: Fix build error when NUMA=n
  https://git.kernel.org/powerpc/c/8b893ef190b0c440877de04f767efca4bf4d6af8

cheers


Re: [PATCH v2] ppc: add "-z notext" flag to disable diagnostic

2021-08-27 Thread Michael Ellerman
On Fri, 13 Aug 2021 13:05:11 -0700, Bill Wendling wrote:
> Object files used to link .tmp_vmlinux.kallsyms1 have many R_PPC64_ADDR64
> relocations in non-SHF_WRITE sections. There are many text relocations (e.g. 
> in
> .rela___ksymtab_gpl+* and .rela__mcount_loc sections) in a -pie link and are
> disallowed by LLD:
> 
>   ld.lld: error: can't create dynamic relocation R_PPC64_ADDR64 against local 
> symbol in readonly segment; recompile object files with -fPIC or pass 
> '-Wl,-z,notext' to allow text relocations in the output
>   >>> defined in arch/powerpc/kernel/head_64.o
>   >>> referenced by arch/powerpc/kernel/head_64.o:(__restart_table+0x10)
> 
> [...]

Applied to powerpc/next.

[1/1] ppc: add "-z notext" flag to disable diagnostic
  https://git.kernel.org/powerpc/c/0355785313e2191be4e1108cdbda94ddb0238c48

cheers


Re: [PATCH v2 0/2] Kconfig symbol fixes on powerpc

2021-08-27 Thread Michael Ellerman
On Thu, 19 Aug 2021 13:39:52 +0200, Lukas Bulwahn wrote:
> The script ./scripts/checkkconfigsymbols.py warns on invalid references to
> Kconfig symbols (often, minor typos, name confusions or outdated references).
> 
> This patch series addresses all issues reported by
> ./scripts/checkkconfigsymbols.py in ./drivers/usb/ for Kconfig and Makefile
> files. Issues in the Kconfig and Makefile files indicate some shortcomings in
> the overall build definitions, and often are true actionable issues to 
> address.
> 
> [...]

Patch 1 applied to powerpc/next.

[1/2] powerpc: kvm: remove obsolete and unneeded select
  https://git.kernel.org/powerpc/c/c26d4c5d4f0df7207da3975458261927f9305465

cheers


Re: [PATCH v4 1/3] powerpc/perf: Use stack siar instead of mfspr

2021-08-27 Thread Michael Ellerman
On Wed, 18 Aug 2021 22:45:54 +0530, Kajol Jain wrote:
> Minor optimization in the 'perf_instruction_pointer' function code by
> making use of stack siar instead of mfspr.
> 
> 
> 
> 

Applied to powerpc/next.

[1/3] powerpc/perf: Use stack siar instead of mfspr
  https://git.kernel.org/powerpc/c/b1643084d164cea0c107a39bcdf0119fc52619af
[2/3] powerpc/perf: Drop the case of returning 0 as instruction pointer
  https://git.kernel.org/powerpc/c/cc90c6742ef5b438f4cb86029d7a794bd0a44a06
[3/3] powerpc/perf: Fix the check for SIAR value
  https://git.kernel.org/powerpc/c/3c69a5f3fa3e312689ec218b5059784d49d7

cheers


Re: [PATCH] powerpc/perf/hv-gpci: Fix the logic to compute counter value from the hcall result buffer.

2021-08-27 Thread Michael Ellerman
On Fri, 13 Aug 2021 13:51:58 +0530, Kajol Jain wrote:
> H_GetPerformanceCounterInfo (0xF080) hcall returns the counter data in the
> result buffer. Result buffer has specific format defined in the PAPR
> specification. One of the field is counter offset and width of the counter
> data returned.
> 
> > Counter data are returned in an unsigned char array. To
> > get the final counter data, these values should be left-shifted
> > a byte at a time. But commit 220a0c609ad17 ("powerpc/perf: Add support
> > for the hv gpci (get performance counter info) interface") made the
> > shifting bitwise. Because of this, hcall counter values could end up
> > on the lower side, which messes up the counter prev-vs-now calculation. This
> > led to huge counter values being reported.
> 
> [...]
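
(For illustration, the byte-at-a-time accumulation described above looks like
this sketch; 'data', 'offset' and 'width' stand in for the PAPR result-buffer
fields, and this is not the applied patch:)

	static u64 gpci_counter_value(const unsigned char *data,
				      unsigned int offset, unsigned int width)
	{
		u64 val = 0;
		unsigned int i;

		for (i = 0; i < width; i++)
			val = (val << 8) | data[offset + i];	/* shift a byte, not a bit */

		return val;
	}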

Applied to powerpc/next.

[1/1] powerpc/perf/hv-gpci: Fix the logic to compute counter value from the 
hcall result buffer.
  https://git.kernel.org/powerpc/c/f9addd85fbfacf0d155e83dbee8696d6df5ed0c7

cheers


Re: [PATCH v2 0/3] powerpc: mpc855_ads defconfig fixes

2021-08-27 Thread Michael Ellerman
On Tue, 17 Aug 2021 14:24:04 +0930, Joel Stanley wrote:
> v2: fix typos, split out mtd fix from savedefconfig patch
> 
> The first patch fixes a build warning I noticed when testing something
> unrelated.
> 
> The second fixes a regression where the MTD partition support dropped out
> of the config. I have enabled the dependency so this is still part of
> the config.
> 
> [...]

Applied to powerpc/next.

[1/3] powerpc/config: Fix IPV6 warning in mpc855_ads
  https://git.kernel.org/powerpc/c/c5ac55b6cbc604ad4144c380feae20f69453df91
[2/3] powerpc/config: Renable MTD_PHYSMAP_OF
  https://git.kernel.org/powerpc/c/d0e28a6145c3455b69991245e7f6147eb914b34a
[3/3] powerpc/configs: Regenerate mpc885_ads_defconfig
  https://git.kernel.org/powerpc/c/87e0d46bf68913f4c87dba94aadc00da658a874b

cheers


Re: [PATCH v2 1/2] selftests/powerpc: Add missing clobbered register to to ptrace TM tests

2021-08-27 Thread Michael Ellerman
On Thu, 29 Jul 2021 14:13:16 +1000, Jordan Niethe wrote:
> ISA v3.1 removes TM but includes a synthetic implementation for
> backwards compatibility.  With this implementation,  the tests
> ptrace-tm-spd-gpr and ptrace-tm-gpr should never be able to make any
> forward progress and eventually should be killed by the timeout.
> Instead on a P10 running in P9 mode, ptrace_tm_gpr fails like so:
> 
> test: ptrace_tm_gpr
> tags: git_version:unknown
> Starting the child
> ...
> ...
> GPR[27]: 1 Expected: 2
> GPR[28]: 1 Expected: 2
> GPR[29]: 1 Expected: 2
> GPR[30]: 1 Expected: 2
> GPR[31]: 1 Expected: 2
> [FAIL] Test FAILED on line 98
> failure: ptrace_tm_gpr
> selftests:  ptrace-tm-gpr [FAIL]
> 
> [...]

Applied to powerpc/next.

[1/2] selftests/powerpc: Add missing clobbered register to to ptrace TM tests
  https://git.kernel.org/powerpc/c/c95278a0534449efc64ac8169382bce217963be2
[2/2] selftests: Skip TM tests on synthetic TM implementations
  https://git.kernel.org/powerpc/c/e42edf9b9d126bb1c743f2e7984877ba27f09fe7

cheers


Re: [PATCH] powerpc/tau: Add 'static' storage qualifier to 'tau_work' definition

2021-08-27 Thread Michael Ellerman
On Thu, 19 Aug 2021 10:46:54 +1000, Finn Thain wrote:
> This patch prevents the following sparse warning.
> 
> arch/powerpc/kernel/tau_6xx.c:199:1: sparse: sparse: symbol 'tau_work'
> was not declared. Should it be static?
> 
> 
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/tau: Add 'static' storage qualifier to 'tau_work' definition
  https://git.kernel.org/powerpc/c/6cd717fe9b3a787f8e8f9d0bc9b7634a9c104b3e

cheers


Re: [PATCH v2 0/3] KVM: PPC: Book3S HV: kvmhv_copy_tofrom_guest_radix changes

2021-08-27 Thread Michael Ellerman
On Thu, 5 Aug 2021 18:26:13 -0300, Fabiano Rosas wrote:
> This series contains the fix for __kvmhv_copy_tofrom_guest_radix plus
> a couple of changes.
> 
> - Patch 1: The fix itself. I thought a smaller fix upfront would be
>better to facilitate any backports.
> 
> - Patch 2: Adds a sanity check to the low level function. Since the
>hcall API already enforces that the effective address must
>be on quadrant 0, moving the checks from the high level
>function would only add overhead to the hcall path, so I
>opted for a lightweight check at the low level.
> 
> [...]

Applied to powerpc/next.

[1/3] KVM: PPC: Book3S HV: Fix copy_tofrom_guest routines
  https://git.kernel.org/powerpc/c/5d7d6dac8fe99ed59eee2300e4a03370f94d5222
[2/3] KVM: PPC: Book3S HV: Add sanity check to copy_tofrom_guest
  https://git.kernel.org/powerpc/c/c232461c0c3b0aca637f0d7658a7f5d0bb393a4e
[3/3] KVM: PPC: Book3S HV: Stop exporting symbols from book3s_64_mmu_radix
  https://git.kernel.org/powerpc/c/0eb596f1e6103ebe122792a425b88c5dc21c4087

cheers


Re: [PATCH v2 0/2] W=1 fixes

2021-08-27 Thread Michael Ellerman
On Mon, 23 Aug 2021 11:00:37 +0200, Cédric Le Goater wrote:
> These are the remaining patches needed to compile the ppc kernel with
> W=1. Audit issues are now being addressed by Christophe in patch:
> 
> [v2] powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC
> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/dc14509a28a993738b1325211f412be72a4f9b1e.1629701132.git.christophe.le...@csgroup.eu/
> 
> Thanks,
> 
> [...]

Applied to powerpc/next.

[1/2] powerpc/prom: Fix unused variable ‘reserve_map’ when CONFIG_PPC32 is not 
set
  https://git.kernel.org/powerpc/c/3accc0faef081b6813967b34f7d05a3edb855cbd
[2/2] powerpc/compat_sys: Declare syscalls
  https://git.kernel.org/powerpc/c/cc47ad409ba9cc950e9c492c8ba653dabd392148

cheers


Re: [PATCH 0/6] W=1 fixes

2021-08-27 Thread Michael Ellerman
On Thu, 19 Aug 2021 14:56:50 +0200, Cédric Le Goater wrote:
> With this small series, I could compile the ppc kernel with W=1. There
> are certainly other configs which will need more fixes but it's a good
> start.
> 
> The last 2 patches look hacky. Christophe, could you help with these
> to find a better place to include the declarations ?
> 
> [...]

Patches 2-4 applied to powerpc/next.

[2/6] powerpc/pseries/vas: Declare pseries_vas_fault_thread_fn() as static
  https://git.kernel.org/powerpc/c/4cb266074aa17e9cafed3a92e9f43b161516569f
[3/6] KVM: PPC: Book3S PR: Declare kvmppc_handle_exit_pr()
  https://git.kernel.org/powerpc/c/cb53a93e33e108bfec00a13cd12696e1c0ccc8b6
[4/6] KVM: PPC: Book3S PR: Remove unused variable
  https://git.kernel.org/powerpc/c/b352ddae7b2ccd2cb29f495ca0a7c9b6848b623f

cheers


Re: [PATCH v4 1/4] powerpc/ptdump: Use DEFINE_SHOW_ATTRIBUTE()

2021-08-27 Thread Michael Ellerman
On Thu, 8 Jul 2021 16:49:40 + (UTC), Christophe Leroy wrote:
> Use DEFINE_SHOW_ATTRIBUTE() instead of open coding
> open() and fops.
> 
> 
> 
> 

Applied to powerpc/next.

[1/4] powerpc/ptdump: Use DEFINE_SHOW_ATTRIBUTE()
  https://git.kernel.org/powerpc/c/11f27a7fa4ca27935de74e3eb052bdc430d5f8d8
[2/4] powerpc/ptdump: Remove unused 'page_size' parameter
  https://git.kernel.org/powerpc/c/64b87b0c70e0fd28352895cba3c0a9631e0072dd
[3/4] powerpc/ptdump: Reduce level numbers by 1 in note_page() and add p4d level
  https://git.kernel.org/powerpc/c/cf98d2b6eea6a1b2c43f85680ad58fcc3ea9496b
[4/4] powerpc/ptdump: Convert powerpc to GENERIC_PTDUMP
  https://git.kernel.org/powerpc/c/e084728393a58e7fdafeee2e6b20e0faff09b557

cheers


Re: [PATCH v3] powerpc/booke: Avoid link stack corruption in several places

2021-08-27 Thread Michael Ellerman
On Tue, 24 Aug 2021 07:56:26 + (UTC), Christophe Leroy wrote:
> Use bcl 20,31,+4 instead of bl in order to preserve link stack.
> 
> See commit c974809a26a1 ("powerpc/vdso: Avoid link stack corruption
> in __get_datapage()") for details.
> 
> 
> 
> [...]
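
(The idiom being preserved is the classic get-PC sequence; a sketch assuming
GCC inline asm, not taken from the applied patch:)

	/* Read the current address without unbalancing the link stack:
	 * 'bcl 20,31,$+4' is special-cased by the branch predictor as
	 * "not a real call", unlike a plain 'bl'. */
	static unsigned long current_address(void)
	{
		unsigned long pc;

		asm volatile("bcl 20,31,$+4\n\t"
			     "mflr %0"		/* pc = address of this mflr */
			     : "=r" (pc) : : "lr");
		return pc;
	}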

Applied to powerpc/next.

[1/1] powerpc/booke: Avoid link stack corruption in several places
  https://git.kernel.org/powerpc/c/f5007dbf4da729baa850b33a64dc3cc53757bdf8

cheers


Re: (subset) [PATCH v3 1/3] powerpc: Remove MSR_PR check in interrupt_exit_{user/kernel}_prepare()

2021-08-27 Thread Michael Ellerman
On Mon, 23 Aug 2021 08:24:20 + (UTC), Christophe Leroy wrote:
> In those hot functions that are called at every interrupt, any saved
> cycle is worth it.
> 
> interrupt_exit_user_prepare() and interrupt_exit_kernel_prepare() are
> called from three places:
> - From entry_32.S
> - From interrupt_64.S
> - From interrupt_exit_user_restart() and interrupt_exit_kernel_restart()
> 
> [...]

Applied to powerpc/next.

[2/3] powerpc: Refactor verification of MSR_RI
  https://git.kernel.org/powerpc/c/806c0e6e7e97adc17389c8dc1f52d4736f49299b

cheers


Re: [PATCH v2] powerpc: Avoid link stack corruption in misc asm functions

2021-08-27 Thread Michael Ellerman
On Tue, 24 Aug 2021 07:56:35 + (UTC), Christophe Leroy wrote:
> bl;mflr is used at several places to get code position.
> 
> Use bcl 20,31,+4 instead of bl in order to preserve link stack.
> 
> See commit c974809a26a1 ("powerpc/vdso: Avoid link stack corruption
> in __get_datapage()") for details.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc: Avoid link stack corruption in misc asm functions
  https://git.kernel.org/powerpc/c/33e1402435cb9f3021439a15935ea2dc69ec1844

cheers


Re: [PATCH v1 1/2] powerpc: Use lwarx/ldarx directly instead of PPC_LWARX/LDARX macros

2021-08-27 Thread Michael Ellerman
On Tue, 2 Mar 2021 08:48:11 + (UTC), Christophe Leroy wrote:
> Force the eh flag at 0 on PPC32.
> 

Patch 1 applied to powerpc/next.

[1/2] powerpc: Use lwarx/ldarx directly instead of PPC_LWARX/LDARX macros
  https://git.kernel.org/powerpc/c/9401f4e46cf6965e23738f70e149172344a01eef

cheers


Re: [PATCH] powerpc/syscalls: Simplify do_mmap2()

2021-08-27 Thread Michael Ellerman
On Fri, 25 Jun 2021 10:58:33 + (UTC), Christophe Leroy wrote:
> When shift is nul, operations remain valid so no test needed.
> 
> And 'ret' is unnecessary.
> 
> And use IS_ALIGNED() to check alignment, that's more clear.
> 
> 
> [...]
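
(The simplified shape reads roughly like this sketch; see the commit below
for the final form:)

	static inline long do_mmap2(unsigned long addr, size_t len,
				    unsigned long prot, unsigned long flags,
				    unsigned long fd, unsigned long off, int shift)
	{
		if (!IS_ALIGNED(off, 1 << shift))
			return -EINVAL;

		return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> shift);
	}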

Applied to powerpc/next.

[1/1] powerpc/syscalls: Simplify do_mmap2()
  https://git.kernel.org/powerpc/c/316389e904f968d24d44cd96a6d171ee1ef269cf

cheers


Re: [PATCH] powerpc/syscalls: Remove __NR__exit

2021-08-27 Thread Michael Ellerman
On Mon, 23 Aug 2021 06:45:20 + (UTC), Christophe Leroy wrote:
> __NR_exit is nowhere used. On most architectures it was removed by
> commit 135ab6ec8fda ("[PATCH] remove remaining errno and
> __KERNEL_SYSCALLS__ references") but not on powerpc.
> 
> powerpc removed __KERNEL_SYSCALLS__ in commit 3db03b4afb3e ("[PATCH]
> rename the provided execve functions to kernel_execve"), but __NR_exit
> was left over.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/syscalls: Remove __NR__exit
  https://git.kernel.org/powerpc/c/a00ea5b6f2bbef8b004b0b7228c61680a50c7c3f

cheers


Re: [PATCH] powerpc/ptrace: Make user_mode() common to PPC32 and PPC64

2021-08-27 Thread Michael Ellerman
On Tue, 17 Aug 2021 16:00:14 + (UTC), Christophe Leroy wrote:
> Today we have:
> 
>   #ifdef __powerpc64__
> >   #define user_mode(regs) ((((regs)->msr) >> MSR_PR_LG) & 0x1)
>   #else
>   #define user_mode(regs) (((regs)->msr & MSR_PR) != 0)
>   #endif
> 
> [...]
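
(The unification is essentially the PPC32 form promoted to both, as a sketch;
the final form is in the commit below:)

	#define user_mode(regs) (((regs)->msr & MSR_PR) != 0)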

Applied to powerpc/next.

[1/1] powerpc/ptrace: Make user_mode() common to PPC32 and PPC64
  https://git.kernel.org/powerpc/c/19e932eb6ea47f4f37513eb2ae0daee19117765c

cheers


Re: [PATCH] powerpc/32: indirect function call use bctrl rather than blrl in ret_from_kernel_thread

2021-08-27 Thread Michael Ellerman
On Fri, 20 Aug 2021 05:16:05 + (UTC), Christophe Leroy wrote:
> Copied from commit 89bbe4c798bc ("powerpc/64: indirect function call
> use bctrl rather than blrl in ret_from_kernel_thread")
> 
> blrl is not recommended to use as an indirect function call, as it may
> corrupt the link stack predictor.
> 
> This is not a performance critical path but this should be fixed for
> consistency.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/32: indirect function call use bctrl rather than blrl in 
ret_from_kernel_thread
  https://git.kernel.org/powerpc/c/113ec9ccc8049c3772f0eab46b62c5d6654c09f7

cheers


Re: [PATCH] powerpc/audit: Simplify syscall_get_arch()

2021-08-27 Thread Michael Ellerman
On Fri, 20 Aug 2021 09:39:14 + (UTC), Christophe Leroy wrote:
> Make use of is_32bit_task() and CONFIG_CPU_LITTLE_ENDIAN
> to simplify syscall_get_arch().
> 
> 
> 
> 
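
(A sketch of the simplified helper; the exact final form is in the commit
below:)

	static inline int syscall_get_arch(struct task_struct *task)
	{
		if (is_32bit_task())
			return AUDIT_ARCH_PPC;
		else if (IS_ENABLED(CONFIG_CPU_LITTLE_ENDIAN))
			return AUDIT_ARCH_PPC64LE;
		else
			return AUDIT_ARCH_PPC64;
	}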

Applied to powerpc/next.

[1/1] powerpc/audit: Simplify syscall_get_arch()
  https://git.kernel.org/powerpc/c/770cec16cdc9d1e67896dd2214a4fec9a3fa

cheers


Re: [PATCH] powerpc/audit: Avoid unneccessary #ifdef in syscall_get_arguments()

2021-08-27 Thread Michael Ellerman
On Fri, 20 Aug 2021 09:28:19 + (UTC), Christophe Leroy wrote:
> Use is_32bit_task() which already handles CONFIG_COMPAT.
> 
> 
> 
> 

Applied to powerpc/next.

[1/1] powerpc/audit: Avoid unneccessary #ifdef in syscall_get_arguments()
  https://git.kernel.org/powerpc/c/898a1ef06ad4a2a8e3c9490ce62624fcd1a7b1f8

cheers


Re: [PATCH] powerpc/32: Remove unneccessary calculations in load_up_{fpu/altivec}

2021-08-27 Thread Michael Ellerman
On Wed, 18 Aug 2021 08:47:28 + (UTC), Christophe Leroy wrote:
> No need to re-read SPRN_THREAD, we can calculate thread address
> from current (r2).
> 
> And remove a reload of value 1 into r4 as r4 is already 1.
> 
> 
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/32: Remove unneccessary calculations in load_up_{fpu/altivec}
  https://git.kernel.org/powerpc/c/51ed00e71f0130e0f3534b8e7d78facd16829426

cheers


Re: [PATCH] powerpc/doc: Fix htmldocs errors

2021-08-27 Thread Michael Ellerman
On Wed, 25 Aug 2021 09:54:47 +0530, Aneesh Kumar K.V wrote:
> Fix make htmldocs related errors with the newly added associativity.rst
> doc file.
> 
> 
> 
> 

Applied to powerpc/next.

[1/1] powerpc/doc: Fix htmldocs errors
  https://git.kernel.org/powerpc/c/f50da6edbf1ebf35dd8070847bfab5cb988d472b

cheers


Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-08-27 Thread Vincent Guittot
On Tue, 10 Aug 2021 at 16:41, Ricardo Neri
 wrote:
>
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
>
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
>
> Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> siblings of both dst_cpu and the CPUs in the candidate busiest group.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Signed-off-by: Ricardo Neri 
> ---
> Changes since v3:
>   * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> powerpc folks showed that this patch should not impact them. Also, more
> recent powerpc processor no longer use asym_packing. (PeterZ)
>   * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
>   * Removed unnecessary check for local CPUs when the local group has zero
> utilization. (Joel)
>   * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> the fact that it deals with SMT cases.
>   * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
> that callers can deal with non-SMT cases.
>
> Changes since v2:
>   * Reworded the commit message to reflect updates in code.
>   * Corrected misrepresentation of dst_cpu as the CPU doing the load
> balancing. (PeterZ)
>   * Removed call to arch_asym_check_smt_siblings() as it is now called in
> sched_asym().
>
> Changes since v1:
>   * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> tasks. Instead, reclassify the candidate busiest group, as it
> may still be selected. (PeterZ)
>   * Avoid an expensive and unnecessary call to cpumask_weight() when
> determining if a sched_group is comprised of SMT siblings.
> (PeterZ).
> ---
>  kernel/sched/fair.c | 95 +
>  1 file changed, 95 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd411cefb63f..8a1a2a43732c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8531,10 +8531,99 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
>  }
>
> +/**
> + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull 
> tasks
> + * @dst_cpu:   Destination CPU of the load balancing
> + * @sds:   Load-balancing data with statistics of the local group
> + * @sgs:   Load-balancing statistics of the candidate busiest group
> + * @sg:    The candidate busiest group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs
> + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> + * only pull if @dst_cpu has higher priority. If @sg does not have SMT 
> siblings
> + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> + * has lower priority.
> + */
> +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> +   struct sg_lb_stats *sgs,
> +   struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +   bool local_is_smt, sg_is_smt;
> +   int sg_busy_cpus;
> +
> +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> +   if (!local_is_smt) {
> +   /*
> +* If we are here, @dst_cpu is idle and does not have SMT
> +* siblings. Pull tasks if candidate group has two or more
> +* busy CPUs.
> +*/
> +   if (sg_is_smt && sg_busy_cpus >= 2)
> +   return true;
> +
> +   /*
> +* @dst_cpu does not have SMT siblings. @sg may have SMT
> +* siblings and only one is busy. In such case, @dst_cpu
> +* can help if it has higher priority and is idle.
> +*/
> +   return !sds->local_stat.group_util &&

sds->local_stat.group_util can't be used to decide if a CPU or group
of CPUs is idle. util_avg is usually not null when a CPU becomes idle
and you can have to wait more than 300ms before it becomes Null.

[RFC PATCH] powerpc: Investigate static_call concept

2021-08-27 Thread Christophe Leroy
This RFC is to validate the concept of static_call on powerpc.

Highly copied from x86.

It replaces ppc_md.get_irq() which is called at every IRQ, by
a static call.

When updating the call, we just replace the instruction at the
trampoline address by a relative jump to the function.

For the time being, the case of out-of-range functions is not handled.

Note that even if we don't immediately use static calls in powerpc
code, supporting static calls would immediately benefit to core
functionnalities using it, like tracing.

Tested on powerpc 8xx.

With the patch:

 <__SCT__ppc_md_get_irq>:
   0:   4e 80 00 20 blr         <== Replaced by 'b ' at runtime
...
0038 <__do_irq>:
  38:   94 21 ff f0 stwu    r1,-16(r1)
  3c:   7c 08 02 a6 mflr    r0
  40:   90 01 00 14 stw r0,20(r1)
  44:   48 00 00 01 bl  44 <__do_irq+0xc>
44: R_PPC_REL24 __SCT__ppc_md_get_irq
...

Before the patch:

0038 <__do_irq>:
  38:   3d 20 00 00 lis r9,0
3a: R_PPC_ADDR16_HA ppc_md+0x20
  3c:   94 21 ff f0 stwu    r1,-16(r1)
  40:   81 29 00 00 lwz r9,0(r9)
42: R_PPC_ADDR16_LO ppc_md+0x20
  44:   7c 08 02 a6 mflr    r0
  48:   90 01 00 14 stw r0,20(r1)
  4c:   7d 29 03 a6 mtctr   r9
  50:   4e 80 04 21 bctrl
...

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/machdep.h |  3 +++
 arch/powerpc/include/asm/static_call.h | 25 +
 arch/powerpc/kernel/Makefile   |  2 +-
 arch/powerpc/kernel/irq.c  |  2 +-
 arch/powerpc/kernel/setup-common.c |  4 
 arch/powerpc/kernel/static_call.c  | 16 
 7 files changed, 51 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/include/asm/static_call.h
 create mode 100644 arch/powerpc/kernel/static_call.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 36b72d972568..c3930ea63e59 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -247,6 +247,7 @@ config PPC
select HAVE_SOFTIRQ_ON_OWN_STACK
	select HAVE_STACKPROTECTOR		if PPC32 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2)
	select HAVE_STACKPROTECTOR		if PPC64 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13)
+   select HAVE_STATIC_CALL
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HUGETLB_PAGE_SIZE_VARIABLE   if PPC_BOOK3S_64 && HUGETLB_PAGE
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 764f2732a821..ac9712312b76 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -295,5 +296,7 @@ static inline void log_error(char *buf, unsigned int err_type, int fatal)
 #define machine_late_initcall(mach, fn)		__define_machine_initcall(mach, fn, 7)
 #define machine_late_initcall_sync(mach, fn)	__define_machine_initcall(mach, fn, 7s)
 
+DECLARE_STATIC_CALL(ppc_md_get_irq, *ppc_md.get_irq);
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_MACHDEP_H */
diff --git a/arch/powerpc/include/asm/static_call.h b/arch/powerpc/include/asm/static_call.h
new file mode 100644
index ..335ee4ceaef9
--- /dev/null
+++ b/arch/powerpc/include/asm/static_call.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_STATIC_CALL_H
+#define _ASM_POWERPC_STATIC_CALL_H
+
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)  \
+   asm(".pushsection .text, \"ax\" \n" \
+   ".align 4   \n" \
+   ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \
+   STATIC_CALL_TRAMP_STR(name) ":  \n" \
+   "   b " #func " \n" \
+   ".type " STATIC_CALL_TRAMP_STR(name) ", @function   \n" \
+   ".size " STATIC_CALL_TRAMP_STR(name) ", . - " 
STATIC_CALL_TRAMP_STR(name) " \n" \
+   ".popsection\n")
+
+#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)   \
+   asm(".pushsection .text, \"ax\" \n" \
+   ".align 4   \n" \
+   ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \
+   STATIC_CALL_TRAMP_STR(name) ":  \n" \
+   "   blr \n" \
+   ".type " STATIC_CALL_TRAMP_STR(name) ", @function   \n" 

Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-08-27 Thread Michael Ellerman
Nathan Chancellor  writes:
> On Thu, Aug 26, 2021 at 11:54:17AM -0700, Nathan Chancellor wrote:
>> On Thu, Aug 26, 2021 at 01:21:39PM +1000, Michael Ellerman wrote:
>> > Nathan Chancellor  writes:
>> > > On Tue, Apr 13, 2021 at 04:38:10PM +, Christophe Leroy wrote:
>> > >> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
>> > >> flexibility to GCC.
>> > ...
>> > >
>> > > This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
> > > > flexibility to WARN_ON/__WARN_FLAGS() with asm goto") causes a WARN_ON in
>> > > klist_add_tail to trigger over and over on boot when compiling with
>> > > clang:
>> > >
...
>> > 
>> > This patch seems to fix it. Not sure if that's just papering over it 
>> > though.
>> > 
>> > diff --git a/arch/powerpc/include/asm/bug.h 
>> > b/arch/powerpc/include/asm/bug.h
>> > index 1ee0f22313ee..75fcb4370d96 100644
>> > --- a/arch/powerpc/include/asm/bug.h
>> > +++ b/arch/powerpc/include/asm/bug.h
>> > @@ -119,7 +119,7 @@ __label_warn_on:   
>> > \
>> >\
>> >WARN_ENTRY(PPC_TLNEI " %4, 0",  \
>> >   BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), 
>> > \
>> > - __label_warn_on, "r" (x));   \
>> > + __label_warn_on, "r" (!!(x))); \
>> >break;  \
>> >  __label_warn_on:  \
>> >__ret_warn_on = true;   \
>> > 
>> 
>> Thank you for the in-depth explanation and triage! I have gone ahead and
>> created a standalone reproducer that shows this based on the
>> preprocessed file and opened https://bugs.llvm.org/show_bug.cgi?id=51634
>> so the LLVM developers can take a look.
>
> Based on Eli Friedman's comment in the bug, would something like this
> work and not regress GCC? I noticed that the BUG_ON macro does a cast as
> well. Nick pointed out to me privately that we have run into what seems
> like a similar issue before in commit 6c58f25e6938 ("riscv/atomic: Fix
> sign extension for RV64I").

Yes, I read that comment this morning, and then landed at the same fix
via digging through the history of our bug code.

We in fact fixed a similar bug 16 years ago :}

32818c2eb6b8 ("[PATCH] ppc64: Fix issue with gcc 4.0 compiled kernels")

Which is when we first started adding the cast to long.
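
(A minimal stand-alone analogue of the underlying issue, hypothetical and
ppc64-only: the trap instruction compares all 64 bits of the register, so
the asm operand has to be widened before it reaches the "r" constraint.)

	static void trap_if_nonzero(int v)
	{
		/* (long)v yields a fully-defined 64-bit register value; with a
		 * bare 'int', clang may leave the upper 32 bits undefined and
		 * the full-width tdnei comparison can trap spuriously. */
		asm goto("tdnei %0, 0" : : "r" ((long)v) : : do_trap);
		return;
	do_trap:
		__builtin_trap();
	}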


> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
> index 1ee0f22313ee..35022667f57d 100644
> --- a/arch/powerpc/include/asm/bug.h
> +++ b/arch/powerpc/include/asm/bug.h
> @@ -119,7 +119,7 @@ __label_warn_on:  
>   \
> \
> WARN_ENTRY(PPC_TLNEI " %4, 0",  \
>BUGFLAG_WARNING | 
> BUGFLAG_TAINT(TAINT_WARN), \
> -  __label_warn_on, "r" (x));   \
> +  __label_warn_on, "r" ((__force long)(x))); 
>   \
> break;  \
>  __label_warn_on:   \
> __ret_warn_on = true;   \


Yeah that fixes the clang build for me.

For GCC it seems to generate the same code in the simple cases:

void test_warn_on_ulong(unsigned long b)
{
WARN_ON(b);
}

void test_warn_on_int(int b)
{
WARN_ON(b);
}

I get:

0020 <.test_warn_on_ulong>:
  20:   0b 03 00 00 tdnei   r3,0
  24:   4e 80 00 20 blr
  28:   60 00 00 00 nop
  2c:   60 00 00 00 nop

0030 <.test_warn_on_int>:
  30:   0b 03 00 00 tdnei   r3,0
  34:   4e 80 00 20 blr

Both before and after the change. So that's good.

For:

void test_warn_on_int_addition(int b)
{
WARN_ON(b+1);
}

Before I get:

0040 <.test_warn_on_int_addition>:
  40:   38 63 00 01 addi    r3,r3,1
  44:   0b 03 00 00 tdnei   r3,0
  48:   4e 80 00 20 blr

vs after:

0040 <.test_warn_on_int_addition>:
  40:   38 63 00 01 addi    r3,r3,1
  44:   7c 63 07 b4 extsw   r3,r3
  48:   0b 03 00 00 tdnei   r3,0
  4c:   4e 80 00 20 blr


So there's an extra sign extension we don't need, but I guess we can
probably live with that.

cheers


Re: [PATCH v2 1/4] powerpc/64: handle MSR EE and RI in interrupt entry wrapper

2021-08-27 Thread Daniel Axtens
Hi,

> Similarly to the system call change in the previous patch, the mtmsrd to

I don't actually know what patch this was - I assume it's from a series
that has since been applied?

> enable RI can be combined with the mtmsrd to enable EE for interrupts
> which enable the latter, which tend to be the important synchronous
> interrupts (i.e., page faults).
>
> Do this by enabling EE and RI together at the beginning of the entry
> wrapper if PACA_IRQ_HARD_DIS is clear, and just enabling RI if it is set
> (which means something wanted EE=0).


> diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
> index 6b800d3e2681..e3228a911b35 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -148,9 +148,21 @@ static inline void interrupt_enter_prepare(struct pt_regs *regs, struct interrup
>  #endif
>  
>  #ifdef CONFIG_PPC64
> - if (irq_soft_mask_set_return(IRQS_ALL_DISABLED) == IRQS_ENABLED)
> + bool trace_enable = false;
> +
> + if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS)) {
> + if (irq_soft_mask_set_return(IRQS_DISABLED) == IRQS_ENABLED)

In the previous code we had IRQS_ALL_DISABLED, now we just have
IRQS_DISABLED. Is that intentional?

> + trace_enable = true;
> + } else {
> + irq_soft_mask_set(IRQS_DISABLED);
> + }
> + /* If the interrupt was taken with HARD_DIS set, don't enable MSR[EE] */
> + if (local_paca->irq_happened & PACA_IRQ_HARD_DIS)
> + __hard_RI_enable();
> + else
> + __hard_irq_enable();
> + if (trace_enable)
>   trace_hardirqs_off();
> - local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
>  
>   if (user_mode(regs)) {
>   CT_WARN_ON(ct_state() != CONTEXT_USER);


> @@ -901,11 +892,6 @@ INT_DEFINE_BEGIN(system_reset)
>   IVEC=0x100
>   IAREA=PACA_EXNMI
>   IVIRT=0 /* no virt entry point */
> - /*
> -  * MSR_RI is not enabled, because PACA_EXNMI and nmi stack is
> -  * being used, so a nested NMI exception would corrupt it.
> -  */
> - ISET_RI=0
>   ISTACK=0
>   IKVM_REAL=1
>  INT_DEFINE_END(system_reset)

> @@ -986,8 +972,6 @@ EXC_COMMON_BEGIN(system_reset_common)

Right before this change, there's a comment that reads:

/*
 * Increment paca->in_nmi then enable MSR_RI. SLB or MCE will be able
 * to recover, but nested NMI will notice in_nmi and not recover
...

You've taken out the bit that enables MSR_RI, which means the comment is
no longer accurate.

Beyond that, I'm still trying to follow all the various changes in code
flows. It seems to make at least vague sense: we move the setting of
MSR_RI from the early asm to interrupt*enter_prepare. I'm just
struggling to convince myself that the RI states all line up by the
time we hit the underlying handler.

Kind regards,
Daniel



Re: [PATCH v15 10/12] swiotlb: Add restricted DMA pool initialization

2021-08-27 Thread Andy Shevchenko
On Thu, Jun 24, 2021 at 6:59 PM Claire Chang  wrote:
>
> Add the initialization function to create restricted DMA pools from
> matching reserved-memory nodes.
>
> Regardless of swiotlb setting, the restricted DMA pool is preferred if
> available.
>
> The restricted DMA pools provide a basic level of protection against the
> DMA overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system
> needs to provide a way to lock down the memory access, e.g., MPU.





> +static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
> +   struct device *dev)
> +{
> +   struct io_tlb_mem *mem = rmem->priv;
> +   unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;
> +
> +   /*
> +* Since multiple devices can share the same pool, the private data,
> +* io_tlb_mem struct, will be initialized by the first device attached
> +* to it.
> +*/

> +   if (!mem) {

Can it be rather

if (mem)
  goto out_assign;

or so?
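
I.e., an untested sketch, reusing the code from your patch as-is:

static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
				    struct device *dev)
{
	struct io_tlb_mem *mem = rmem->priv;
	unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;

	/* Only the first attached device initializes the shared pool. */
	if (mem)
		goto out_assign;

	mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL);
	if (!mem)
		return -ENOMEM;

	set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
			     rmem->size >> PAGE_SHIFT);
	swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false);
	mem->force_bounce = true;
	mem->for_alloc = true;

	rmem->priv = mem;

	if (IS_ENABLED(CONFIG_DEBUG_FS)) {
		mem->debugfs = debugfs_create_dir(rmem->name, debugfs_dir);
		swiotlb_create_debugfs_files(mem);
	}

out_assign:
	dev->dma_io_tlb_mem = mem;
	return 0;
}

It saves one level of indentation for the bulk of the function.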

> +   mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL);
> +   if (!mem)
> +   return -ENOMEM;
> +
> +   set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> +rmem->size >> PAGE_SHIFT);

Below you are using a macro from pfn.h, but not here; I think it's PFN_DOWN().
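
I.e., a sketch of what I mean:

	set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
			     PFN_DOWN(rmem->size));

PFN_DOWN() expands to x >> PAGE_SHIFT, so this is behaviourally
identical; it just keeps the pfn.h macro usage consistent.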

> +   swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false);
> +   mem->force_bounce = true;
> +   mem->for_alloc = true;
> +
> +   rmem->priv = mem;
> +
> +   if (IS_ENABLED(CONFIG_DEBUG_FS)) {
> +   mem->debugfs =
> +   debugfs_create_dir(rmem->name, debugfs_dir);
> +   swiotlb_create_debugfs_files(mem);
> +   }
> +   }
> +
> +   dev->dma_io_tlb_mem = mem;
> +
> +   return 0;
> +}
> +
> +static void rmem_swiotlb_device_release(struct reserved_mem *rmem,
> +   struct device *dev)
> +{
> +   dev->dma_io_tlb_mem = io_tlb_default_mem;
> +}
> +
> +static const struct reserved_mem_ops rmem_swiotlb_ops = {
> +   .device_init = rmem_swiotlb_device_init,
> +   .device_release = rmem_swiotlb_device_release,
> +};
> +
> +static int __init rmem_swiotlb_setup(struct reserved_mem *rmem)
> +{
> +   unsigned long node = rmem->fdt_node;
> +
> +   if (of_get_flat_dt_prop(node, "reusable", NULL) ||
> +   of_get_flat_dt_prop(node, "linux,cma-default", NULL) ||
> +   of_get_flat_dt_prop(node, "linux,dma-default", NULL) ||
> +   of_get_flat_dt_prop(node, "no-map", NULL))
> +   return -EINVAL;
> +
> +   if (PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) {
> +   pr_err("Restricted DMA pool must be accessible within the linear mapping.");
> +   return -EINVAL;
> +   }
> +
> +   rmem->ops = &rmem_swiotlb_ops;
> +   pr_info("Reserved memory: created restricted DMA pool at %pa, size %ld MiB\n",
> +   &rmem->base, (unsigned long)rmem->size / SZ_1M);

Oh là là, besides the explicit cast, which I believe can be avoided, %ld
!= unsigned long. Can you check the printk-formats.rst document?
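
For instance (just a sketch; per printk-formats.rst, %lu is the
specifier that matches unsigned long):

	pr_info("Reserved memory: created restricted DMA pool at %pa, size %lu MiB\n",
		&rmem->base, (unsigned long)(rmem->size / SZ_1M));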

> +   return 0;
> +}
> +
> +RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", rmem_swiotlb_setup);
>  #endif /* CONFIG_DMA_RESTRICTED_POOL */

-- 
With Best Regards,
Andy Shevchenko