Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices

2018-06-06 Thread Christoph Hellwig
On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote:
> Pls work on a long term solution. Short term needs can be served by
> enabling the iommu platform in qemu.

So, I spent some time looking at converting virtio to dma ops overrides,
and the current virtio spec, and the sad through I have to tell is that
both the spec and the Linux implementation are complete and utterly fucked
up.

Both in the flag naming and the implementation there is an implication
of DMA API == IOMMU, which is fundamentally wrong.

The DMA API does a few different things:

 a) address translation

This does include IOMMUs.  But it also includes random offsets
between PCI bars and system memory that we see on various
platforms.  Worse so some of these offsets might be based on
banks, e.g. on the broadcom bmips platform.  It also deals
with bitmask in physical addresses related to memory encryption
like AMD SEV.  I'd be really curious how for example the
Intel virtio based NIC is going to work on any of those
plaforms.

  b) coherency

On many architectures DMA is not cache coherent, and we need
to invalidate and/or write back cache lines before doing
DMA.  Again, I wonder how this is every going to work with
hardware based virtio implementations.  Even worse I think this
is actually broken at least for VIVT event for virtualized
implementations.  E.g. a KVM guest is going to access memory
using different virtual addresses than qemu, vhost might throw
in another different address space.

  c) bounce buffering

Many DMA implementations can not address all physical memory
due to addressing limitations.  In such cases we copy the
DMA memory into a known addressable bounc buffer and DMA
from there.

  d) flushing write combining buffers or similar

On some hardware platforms we need workarounds to e.g. read
from a certain mmio address to make sure DMA can actually
see memory written by the host.

All of this is bypassed by virtio by default despite generally being
platform issues, not particular to a given device.


Re: [PATCH 08/11] macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver

2018-06-06 Thread Finn Thain
On Wed, 6 Jun 2018, Geert Uytterhoeven wrote:

> So 4.17 mac_defconfig won't work on PMU_68K_V1 machines? Perhaps that 
> should be fixed first.
> 

AFAICT there's nothing to fix.

$ grep PMU arch/m68k/configs/mac_defconfig 
CONFIG_ADB_PMU68K=y

This works fine for PMU_68K_V2 machines. But if you try to boot a 
PMU_68K_V1 machine, you will have trouble starting the kernel. Sometimes 
you can boot by fiddling with ADB devices, but not always. This is not a 
new problem. The via-pmu68k driver has never worked on PMU_68K_V1 
machines.

So as far as PMU_68K_V1 is concerned, there's no need to bisect any of the 
changes under review. To answer your question, "Shouldn't that be a 
separate patch?", all I can say is "not to my knowledge". But I'm probably 
missing something.

-- 


[v2, 10/10] dpaa_eth: add the get_ts_info interface for ethtool

2018-06-06 Thread Yangbo Lu
Added the get_ts_info interface for ethtool to check
the timestamping capability.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- Removed ifdef for hw timestamp.
---
 drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c |   39 
 1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c 
b/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
index 2f933b6..3184c8f 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c
@@ -32,6 +32,9 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include 
+#include 
+#include 
+#include 
 
 #include "dpaa_eth.h"
 #include "mac.h"
@@ -515,6 +518,41 @@ static int dpaa_set_rxnfc(struct net_device *dev, struct 
ethtool_rxnfc *cmd)
return ret;
 }
 
+static int dpaa_get_ts_info(struct net_device *net_dev,
+   struct ethtool_ts_info *info)
+{
+   struct device *dev = net_dev->dev.parent;
+   struct device_node *mac_node = dev->of_node;
+   struct device_node *fman_node = NULL, *ptp_node = NULL;
+   struct platform_device *ptp_dev = NULL;
+   struct qoriq_ptp *ptp = NULL;
+
+   info->phc_index = -1;
+
+   fman_node = of_get_parent(mac_node);
+   if (fman_node)
+   ptp_node = of_parse_phandle(fman_node, "ptimer-handle", 0);
+
+   if (ptp_node)
+   ptp_dev = of_find_device_by_node(ptp_node);
+
+   if (ptp_dev)
+   ptp = platform_get_drvdata(ptp_dev);
+
+   if (ptp)
+   info->phc_index = ptp->phc_index;
+
+   info->so_timestamping = SOF_TIMESTAMPING_TX_HARDWARE |
+   SOF_TIMESTAMPING_RX_HARDWARE |
+   SOF_TIMESTAMPING_RAW_HARDWARE;
+   info->tx_types = (1 << HWTSTAMP_TX_OFF) |
+(1 << HWTSTAMP_TX_ON);
+   info->rx_filters = (1 << HWTSTAMP_FILTER_NONE) |
+  (1 << HWTSTAMP_FILTER_ALL);
+
+   return 0;
+}
+
 const struct ethtool_ops dpaa_ethtool_ops = {
.get_drvinfo = dpaa_get_drvinfo,
.get_msglevel = dpaa_get_msglevel,
@@ -530,4 +568,5 @@ static int dpaa_set_rxnfc(struct net_device *dev, struct 
ethtool_rxnfc *cmd)
.set_link_ksettings = dpaa_set_link_ksettings,
.get_rxnfc = dpaa_get_rxnfc,
.set_rxnfc = dpaa_set_rxnfc,
+   .get_ts_info = dpaa_get_ts_info,
 };
-- 
1.7.1



[v2, 09/10] dpaa_eth: add support for hardware timestamping

2018-06-06 Thread Yangbo Lu
This patch is to add hardware timestamping support
for dpaa_eth. On Rx, timestamping is enabled for
all frames. On Tx, we only instruct the hardware
to timestamp the frames marked accordingly by the
stack.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- Removed ifdef for timestamp code.
- Minor fixes for code style.
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c |  101 ++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |3 +
 2 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c 
b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index fd43f98..bd589ac 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -1168,7 +1168,7 @@ static int dpaa_eth_init_tx_port(struct fman_port *port, 
struct dpaa_fq *errq,
buf_prefix_content.priv_data_size = buf_layout->priv_data_size;
buf_prefix_content.pass_prs_result = true;
buf_prefix_content.pass_hash_result = true;
-   buf_prefix_content.pass_time_stamp = false;
+   buf_prefix_content.pass_time_stamp = true;
buf_prefix_content.data_align = DPAA_FD_DATA_ALIGNMENT;
 
params.specific_params.non_rx_params.err_fqid = errq->fqid;
@@ -1210,7 +1210,7 @@ static int dpaa_eth_init_rx_port(struct fman_port *port, 
struct dpaa_bp **bps,
buf_prefix_content.priv_data_size = buf_layout->priv_data_size;
buf_prefix_content.pass_prs_result = true;
buf_prefix_content.pass_hash_result = true;
-   buf_prefix_content.pass_time_stamp = false;
+   buf_prefix_content.pass_time_stamp = true;
buf_prefix_content.data_align = DPAA_FD_DATA_ALIGNMENT;
 
rx_p = _params.rx_params;
@@ -1592,6 +1592,16 @@ static int dpaa_eth_refill_bpools(struct dpaa_priv *priv)
return 0;
 }
 
+static int dpaa_get_tstamp_ns(struct net_device *net_dev, u64 *ns,
+ struct fman_port *port, const void *data)
+{
+   if (!fman_port_get_tstamp_field(port, data, ns)) {
+   be64_to_cpus(ns);
+   return 0;
+   }
+   return -EINVAL;
+}
+
 /* Cleanup function for outgoing frame descriptors that were built on Tx path,
  * either contiguous frames or scatter/gather ones.
  * Skb freeing is not handled here.
@@ -1607,14 +1617,29 @@ static int dpaa_eth_refill_bpools(struct dpaa_priv 
*priv)
 {
const enum dma_data_direction dma_dir = DMA_TO_DEVICE;
struct device *dev = priv->net_dev->dev.parent;
+   struct skb_shared_hwtstamps shhwtstamps;
dma_addr_t addr = qm_fd_addr(fd);
const struct qm_sg_entry *sgt;
struct sk_buff **skbh, *skb;
int nr_frags, i;
+   u64 ns;
 
skbh = (struct sk_buff **)phys_to_virt(addr);
skb = *skbh;
 
+   if (priv->tx_tstamp && skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
+   memset(, 0, sizeof(shhwtstamps));
+
+   if (!dpaa_get_tstamp_ns(priv->net_dev, ,
+   priv->mac_dev->port[TX],
+   (void *)skbh)) {
+   shhwtstamps.hwtstamp = ns_to_ktime(ns);
+   skb_tstamp_tx(skb, );
+   } else {
+   dev_warn(dev, "dpaa_get_tstamp_ns failed!\n");
+   }
+   }
+
if (unlikely(qm_fd_get_format(fd) == qm_fd_sg)) {
nr_frags = skb_shinfo(skb)->nr_frags;
dma_unmap_single(dev, addr, qm_fd_get_offset(fd) +
@@ -2086,6 +2111,11 @@ static int dpaa_start_xmit(struct sk_buff *skb, struct 
net_device *net_dev)
if (unlikely(err < 0))
goto skb_to_fd_failed;
 
+   if (priv->tx_tstamp && skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
+   fd.cmd |= FM_FD_CMD_UPD;
+   skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+   }
+
if (likely(dpaa_xmit(priv, percpu_stats, queue_mapping, ) == 0))
return NETDEV_TX_OK;
 
@@ -2227,6 +2257,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct 
qman_portal *portal,
struct qman_fq *fq,
const struct qm_dqrr_entry *dq)
 {
+   struct skb_shared_hwtstamps *shhwtstamps;
struct rtnl_link_stats64 *percpu_stats;
struct dpaa_percpu_priv *percpu_priv;
const struct qm_fd *fd = >fd;
@@ -2240,6 +2271,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct 
qman_portal *portal,
struct sk_buff *skb;
int *count_ptr;
void *vaddr;
+   u64 ns;
 
fd_status = be32_to_cpu(fd->status);
fd_format = qm_fd_get_format(fd);
@@ -2304,6 +2336,18 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct 
qman_portal *portal,
if (!skb)
return qman_cb_dqrr_consume;
 
+   if (priv->rx_tstamp) {
+   shhwtstamps = skb_hwtstamps(skb);
+ 

[v2, 08/10] fsl/fman: define frame description command UPD

2018-06-06 Thread Yangbo Lu
Defined frame description command FM_FD_CMD_UPD for
prepended data updating.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
---
 drivers/net/ethernet/freescale/fman/fman.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.h 
b/drivers/net/ethernet/freescale/fman/fman.h
index bfa02e0..935c317 100644
--- a/drivers/net/ethernet/freescale/fman/fman.h
+++ b/drivers/net/ethernet/freescale/fman/fman.h
@@ -41,6 +41,7 @@
 /* Frame queue Context Override */
 #define FM_FD_CMD_FCO   0x8000
 #define FM_FD_CMD_RPD   0x4000  /* Read Prepended Data */
+#define FM_FD_CMD_UPD  0x2000  /* Update Prepended Data */
 #define FM_FD_CMD_DTC   0x1000  /* Do L4 Checksum */
 
 /* TX-Port: Unsupported Format */
-- 
1.7.1



[v2, 07/10] fsl/fman_port: support getting timestamp field

2018-06-06 Thread Yangbo Lu
This patch is to add fman_port_get_tstamp_field() interface
to get timestamp field data.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
---
 drivers/net/ethernet/freescale/fman/fman_port.c |   12 
 drivers/net/ethernet/freescale/fman/fman_port.h |3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c 
b/drivers/net/ethernet/freescale/fman/fman_port.c
index ce6e24c..86f0094 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.c
+++ b/drivers/net/ethernet/freescale/fman/fman_port.c
@@ -1731,6 +1731,18 @@ int fman_port_get_hash_result_offset(struct fman_port 
*port, u32 *offset)
 }
 EXPORT_SYMBOL(fman_port_get_hash_result_offset);
 
+int fman_port_get_tstamp_field(struct fman_port *port, const void *data,
+  u64 *tstamp)
+{
+   if (port->buffer_offsets.time_stamp_offset == ILLEGAL_BASE)
+   return -EINVAL;
+
+   *tstamp = *(u64 *)(data + port->buffer_offsets.time_stamp_offset);
+
+   return 0;
+}
+EXPORT_SYMBOL(fman_port_get_tstamp_field);
+
 static int fman_port_probe(struct platform_device *of_dev)
 {
struct fman_port *port;
diff --git a/drivers/net/ethernet/freescale/fman/fman_port.h 
b/drivers/net/ethernet/freescale/fman/fman_port.h
index e86ca6a..d10e48d 100644
--- a/drivers/net/ethernet/freescale/fman/fman_port.h
+++ b/drivers/net/ethernet/freescale/fman/fman_port.h
@@ -153,6 +153,9 @@ int fman_port_cfg_buf_prefix_content(struct fman_port *port,
 
 int fman_port_get_hash_result_offset(struct fman_port *port, u32 *offset);
 
+int fman_port_get_tstamp_field(struct fman_port *port, const void *data,
+  u64 *tstamp);
+
 struct fman_port *fman_port_bind(struct device *dev);
 
 #endif /* __FMAN_PORT_H */
-- 
1.7.1



[v2, 06/10] fsl/fman: add set_tstamp interface

2018-06-06 Thread Yangbo Lu
This patch is to add set_tstamp interface for memac,
dtsec, and 10GEC controllers to configure HW timestamping.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
---
 drivers/net/ethernet/freescale/fman/fman_dtsec.c |   27 ++
 drivers/net/ethernet/freescale/fman/fman_dtsec.h |1 +
 drivers/net/ethernet/freescale/fman/fman_memac.c |5 
 drivers/net/ethernet/freescale/fman/fman_memac.h |1 +
 drivers/net/ethernet/freescale/fman/fman_tgec.c  |   21 +
 drivers/net/ethernet/freescale/fman/fman_tgec.h  |1 +
 drivers/net/ethernet/freescale/fman/mac.c|3 ++
 drivers/net/ethernet/freescale/fman/mac.h|1 +
 8 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman_dtsec.c 
b/drivers/net/ethernet/freescale/fman/fman_dtsec.c
index 57b1e2b..1ca543a 100644
--- a/drivers/net/ethernet/freescale/fman/fman_dtsec.c
+++ b/drivers/net/ethernet/freescale/fman/fman_dtsec.c
@@ -123,11 +123,13 @@
 #define DTSEC_ECNTRL_R100M 0x0008
 #define DTSEC_ECNTRL_QSGMIIM   0x0001
 
+#define TCTRL_TTSE 0x0040
 #define TCTRL_GTS  0x0020
 
 #define RCTRL_PAL_MASK 0x001f
 #define RCTRL_PAL_SHIFT16
 #define RCTRL_GHTX 0x0400
+#define RCTRL_RTSE 0x0040
 #define RCTRL_GRS  0x0020
 #define RCTRL_MPROM0x0008
 #define RCTRL_RSF  0x0004
@@ -1136,6 +1138,31 @@ int dtsec_set_allmulti(struct fman_mac *dtsec, bool 
enable)
return 0;
 }
 
+int dtsec_set_tstamp(struct fman_mac *dtsec, bool enable)
+{
+   struct dtsec_regs __iomem *regs = dtsec->regs;
+   u32 rctrl, tctrl;
+
+   if (!is_init_done(dtsec->dtsec_drv_param))
+   return -EINVAL;
+
+   rctrl = ioread32be(>rctrl);
+   tctrl = ioread32be(>tctrl);
+
+   if (enable) {
+   rctrl |= RCTRL_RTSE;
+   tctrl |= TCTRL_TTSE;
+   } else {
+   rctrl &= ~RCTRL_RTSE;
+   tctrl &= ~TCTRL_TTSE;
+   }
+
+   iowrite32be(rctrl, >rctrl);
+   iowrite32be(tctrl, >tctrl);
+
+   return 0;
+}
+
 int dtsec_del_hash_mac_address(struct fman_mac *dtsec, enet_addr_t *eth_addr)
 {
struct dtsec_regs __iomem *regs = dtsec->regs;
diff --git a/drivers/net/ethernet/freescale/fman/fman_dtsec.h 
b/drivers/net/ethernet/freescale/fman/fman_dtsec.h
index 1a689ad..5149d96 100644
--- a/drivers/net/ethernet/freescale/fman/fman_dtsec.h
+++ b/drivers/net/ethernet/freescale/fman/fman_dtsec.h
@@ -56,5 +56,6 @@ int dtsec_set_exception(struct fman_mac *dtsec,
 int dtsec_del_hash_mac_address(struct fman_mac *dtsec, enet_addr_t *eth_addr);
 int dtsec_get_version(struct fman_mac *dtsec, u32 *mac_version);
 int dtsec_set_allmulti(struct fman_mac *dtsec, bool enable);
+int dtsec_set_tstamp(struct fman_mac *dtsec, bool enable);
 
 #endif /* __DTSEC_H */
diff --git a/drivers/net/ethernet/freescale/fman/fman_memac.c 
b/drivers/net/ethernet/freescale/fman/fman_memac.c
index 446a97b..bc6eb30 100644
--- a/drivers/net/ethernet/freescale/fman/fman_memac.c
+++ b/drivers/net/ethernet/freescale/fman/fman_memac.c
@@ -964,6 +964,11 @@ int memac_set_allmulti(struct fman_mac *memac, bool enable)
return 0;
 }
 
+int memac_set_tstamp(struct fman_mac *memac, bool enable)
+{
+   return 0; /* Always enabled. */
+}
+
 int memac_del_hash_mac_address(struct fman_mac *memac, enet_addr_t *eth_addr)
 {
struct memac_regs __iomem *regs = memac->regs;
diff --git a/drivers/net/ethernet/freescale/fman/fman_memac.h 
b/drivers/net/ethernet/freescale/fman/fman_memac.h
index b5a5033..b2c671e 100644
--- a/drivers/net/ethernet/freescale/fman/fman_memac.h
+++ b/drivers/net/ethernet/freescale/fman/fman_memac.h
@@ -58,5 +58,6 @@ int memac_set_exception(struct fman_mac *memac,
 int memac_add_hash_mac_address(struct fman_mac *memac, enet_addr_t *eth_addr);
 int memac_del_hash_mac_address(struct fman_mac *memac, enet_addr_t *eth_addr);
 int memac_set_allmulti(struct fman_mac *memac, bool enable);
+int memac_set_tstamp(struct fman_mac *memac, bool enable);
 
 #endif /* __MEMAC_H */
diff --git a/drivers/net/ethernet/freescale/fman/fman_tgec.c 
b/drivers/net/ethernet/freescale/fman/fman_tgec.c
index 284735d..4070593 100644
--- a/drivers/net/ethernet/freescale/fman/fman_tgec.c
+++ b/drivers/net/ethernet/freescale/fman/fman_tgec.c
@@ -44,6 +44,7 @@
 #define TGEC_TX_IPG_LENGTH_MASK0x03ff
 
 /* Command and Configuration Register (COMMAND_CONFIG) */
+#define CMD_CFG_EN_TIMESTAMP   0x0010
 #define CMD_CFG_NO_LEN_CHK 0x0002
 #define CMD_CFG_PAUSE_IGNORE   0x0100
 #define CMF_CFG_CRC_FWD0x0040
@@ -588,6 +589,26 @@ int tgec_set_allmulti(struct fman_mac *tgec, bool enable)
return 0;
 }
 
+int tgec_set_tstamp(struct 

[v2, 05/10] arm64: dts: fsl: move ptp timer out of fman

2018-06-06 Thread Yangbo Lu
This patch is to move ptp timer node out of fman.
Because ptp timer will be probed by ptp_qoriq driver,
it should be an independent device in case of conflict
memory mapping.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- Fixed address-cells for ptp-timer.
---
 arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi 
b/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi
index 4dd0676..a56a408 100644
--- a/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi
+++ b/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi
@@ -11,13 +11,14 @@ fman0: fman@1a0 {
#size-cells = <1>;
cell-index = <0>;
compatible = "fsl,fman";
-   ranges = <0x0 0x0 0x1a0 0x10>;
-   reg = <0x0 0x1a0 0x0 0x10>;
+   ranges = <0x0 0x0 0x1a0 0xfe000>;
+   reg = <0x0 0x1a0 0x0 0xfe000>;
interrupts = ,
 ;
clocks = < 3 0>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x800 0x10>;
+   ptimer-handle = <_timer0>;
 
muram@0 {
compatible = "fsl,fman-muram";
@@ -73,9 +74,10 @@ fman0: fman@1a0 {
compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
reg = <0xfd000 0x1000>;
};
+};
 
-   ptp_timer0: ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-   };
+ptp_timer0: ptp-timer@1afe000 {
+   compatible = "fsl,fman-ptp-timer";
+   reg = <0x0 0x1afe000 0x0 0x1000>;
+   interrupts = ;
 };
-- 
1.7.1



[v2, 04/10] powerpc/mpc85xx: move ptp timer out of fman in dts

2018-06-06 Thread Yangbo Lu
This patch is to move ptp timer node out of fman.
Because ptp timer will be probed by ptp_qoriq driver,
it should be an independent device in case of conflict
memory mapping.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
---
 arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi   |   14 --
 arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi   |   14 --
 arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi  |   14 --
 arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi  |   14 --
 arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi |   14 --
 5 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi 
b/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi
index abd01d4..6b124f7 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi
@@ -37,12 +37,13 @@ fman0: fman@40 {
#size-cells = <1>;
cell-index = <0>;
compatible = "fsl,fman";
-   ranges = <0 0x40 0x10>;
-   reg = <0x40 0x10>;
+   ranges = <0 0x40 0xfe000>;
+   reg = <0x40 0xfe000>;
interrupts = <96 2 0 0>, <16 2 1 1>;
clocks = < 3 0>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x40 0xc>;
+   ptimer-handle = <_timer0>;
 
muram@0 {
compatible = "fsl,fman-muram";
@@ -93,9 +94,10 @@ fman0: fman@40 {
reg = <0x87000 0x1000>;
status = "disabled";
};
+};
 
-   ptp_timer0: ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-   };
+ptp_timer0: ptp-timer@4fe000 {
+   compatible = "fsl,fman-ptp-timer";
+   reg = <0x4fe000 0x1000>;
+   interrupts = <96 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi 
b/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi
index debea75..b80aaf5 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi
@@ -37,12 +37,13 @@ fman1: fman@50 {
#size-cells = <1>;
cell-index = <1>;
compatible = "fsl,fman";
-   ranges = <0 0x50 0x10>;
-   reg = <0x50 0x10>;
+   ranges = <0 0x50 0xfe000>;
+   reg = <0x50 0xfe000>;
interrupts = <97 2 0 0>, <16 2 1 0>;
clocks = < 3 1>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x60 0xc>;
+   ptimer-handle = <_timer1>;
 
muram@0 {
compatible = "fsl,fman-muram";
@@ -93,9 +94,10 @@ fman1: fman@50 {
reg = <0x87000 0x1000>;
status = "disabled";
};
+};
 
-   ptp_timer1: ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-   };
+ptp_timer1: ptp-timer@5fe000 {
+   compatible = "fsl,fman-ptp-timer";
+   reg = <0x5fe000 0x1000>;
+   interrupts = <97 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi 
b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi
index 3a20e0d..d3720fd 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi
@@ -37,12 +37,13 @@ fman0: fman@40 {
#size-cells = <1>;
cell-index = <0>;
compatible = "fsl,fman";
-   ranges = <0 0x40 0x10>;
-   reg = <0x40 0x10>;
+   ranges = <0 0x40 0xfe000>;
+   reg = <0x40 0xfe000>;
interrupts = <96 2 0 0>, <16 2 1 1>;
clocks = < 3 0>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x800 0x10>;
+   ptimer-handle = <_timer0>;
 
muram@0 {
compatible = "fsl,fman-muram";
@@ -98,9 +99,10 @@ fman0: fman@40 {
compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio";
reg = <0xfd000 0x1000>;
};
+};
 
-   ptp_timer0: ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-   };
+ptp_timer0: ptp-timer@4fe000 {
+   compatible = "fsl,fman-ptp-timer";
+   reg = <0x4fe000 0x1000>;
+   interrupts = <96 2 0 0>;
 };
diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi 
b/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi
index 82750ac..ae34c20 100644
--- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi
+++ b/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi
@@ -37,12 +37,13 @@ fman1: fman@50 {
#size-cells = <1>;
cell-index = <1>;
compatible = "fsl,fman";
-   ranges = <0 0x50 0x10>;
-   reg = <0x50 0x10>;
+   ranges = <0 0x50 0xfe000>;
+   reg = <0x50 0xfe000>;
interrupts = <97 2 0 0>, <16 2 1 0>;
clocks = < 3 1>;
clock-names = "fmanclk";
fsl,qman-channel-range = <0x820 0x10>;
+   ptimer-handle = <_timer1>;
 
muram@0 {
compatible = "fsl,fman-muram";
@@ -98,9 +99,10 @@ 

[v2, 03/10] dt-binding: ptp_qoriq: add DPAA FMan support

2018-06-06 Thread Yangbo Lu
This patch is to add bindings description for DPAA
FMan 1588 timer, and also remove its description in
fsl-fman dt-bindings document.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
---
 Documentation/devicetree/bindings/net/fsl-fman.txt |   25 +---
 .../devicetree/bindings/ptp/ptp-qoriq.txt  |   15 +--
 2 files changed, 13 insertions(+), 27 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/fsl-fman.txt 
b/Documentation/devicetree/bindings/net/fsl-fman.txt
index df873d1..74603dd 100644
--- a/Documentation/devicetree/bindings/net/fsl-fman.txt
+++ b/Documentation/devicetree/bindings/net/fsl-fman.txt
@@ -356,30 +356,7 @@ ethernet@e {
 
 FMan IEEE 1588 Node
 
-DESCRIPTION
-
-The FMan interface to support IEEE 1588
-
-
-PROPERTIES
-
-- compatible
-   Usage: required
-   Value type: 
-   Definition: A standard property.
-   Must include "fsl,fman-ptp-timer".
-
-- reg
-   Usage: required
-   Value type: 
-   Definition: A standard property.
-
-EXAMPLE
-
-ptp-timer@fe000 {
-   compatible = "fsl,fman-ptp-timer";
-   reg = <0xfe000 0x1000>;
-};
+Refer to Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
 
 =
 FMan MDIO Node
diff --git a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt 
b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
index 0f569d8..c5d0e79 100644
--- a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
+++ b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt
@@ -2,7 +2,8 @@
 
 General Properties:
 
-  - compatible   Should be "fsl,etsec-ptp"
+  - compatible   Should be "fsl,etsec-ptp" for eTSEC
+ Should be "fsl,fman-ptp-timer" for DPAA FMan
   - reg  Offset and length of the register set for the device
   - interrupts   There should be at least two interrupts. Some devices
  have as many as four PTP related interrupts.
@@ -43,14 +44,22 @@ Clock Properties:
   value, which will be directly written in those bits, that is why,
   according to reference manual, the next clock sources can be used:
 
+  For eTSEC,
   <0> - external high precision timer reference clock (TSEC_TMR_CLK
 input is used for this purpose);
   <1> - eTSEC system clock;
   <2> - eTSEC1 transmit clock;
   <3> - RTC clock input.
 
-  When this attribute is not used, eTSEC system clock will serve as
-  IEEE 1588 timer reference clock.
+  For DPAA FMan,
+  <0> - external high precision timer reference clock (TMR_1588_CLK)
+  <1> - MAC system clock (1/2 FMan clock)
+  <2> - reserved
+  <3> - RTC clock oscillator
+
+  When this attribute is not used, the IEEE 1588 timer reference clock
+  will use the eTSEC system clock (for Gianfar) or the MAC system
+  clock (for DPAA).
 
 Example:
 
-- 
1.7.1



[v2, 02/10] ptp: support DPAA FMan 1588 timer in ptp_qoriq

2018-06-06 Thread Yangbo Lu
This patch is to support DPAA (Data Path Acceleration Architecture)
1588 timer by adding "fsl,fman-ptp-timer" compatible, sharing
interrupt with FMan, adding FSL_DPAA_ETH dependency, and fixing
up register offset.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
---
 drivers/ptp/Kconfig   |2 +-
 drivers/ptp/ptp_qoriq.c   |  104 ++---
 include/linux/fsl/ptp_qoriq.h |   38 ---
 3 files changed, 98 insertions(+), 46 deletions(-)

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 474c988..d137c48 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -43,7 +43,7 @@ config PTP_1588_CLOCK_DTE
 
 config PTP_1588_CLOCK_QORIQ
tristate "Freescale QorIQ 1588 timer as PTP clock"
-   depends on GIANFAR
+   depends on GIANFAR || FSL_DPAA_ETH
depends on PTP_1588_CLOCK
default y
help
diff --git a/drivers/ptp/ptp_qoriq.c b/drivers/ptp/ptp_qoriq.c
index 1468a16..c4e3545 100644
--- a/drivers/ptp/ptp_qoriq.c
+++ b/drivers/ptp/ptp_qoriq.c
@@ -39,11 +39,12 @@
 /* Caller must hold qoriq_ptp->lock. */
 static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp)
 {
+   struct qoriq_ptp_registers *regs = _ptp->regs;
u64 ns;
u32 lo, hi;
 
-   lo = qoriq_read(_ptp->regs->tmr_cnt_l);
-   hi = qoriq_read(_ptp->regs->tmr_cnt_h);
+   lo = qoriq_read(>ctrl_regs->tmr_cnt_l);
+   hi = qoriq_read(>ctrl_regs->tmr_cnt_h);
ns = ((u64) hi) << 32;
ns |= lo;
return ns;
@@ -52,16 +53,18 @@ static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp)
 /* Caller must hold qoriq_ptp->lock. */
 static void tmr_cnt_write(struct qoriq_ptp *qoriq_ptp, u64 ns)
 {
+   struct qoriq_ptp_registers *regs = _ptp->regs;
u32 hi = ns >> 32;
u32 lo = ns & 0x;
 
-   qoriq_write(_ptp->regs->tmr_cnt_l, lo);
-   qoriq_write(_ptp->regs->tmr_cnt_h, hi);
+   qoriq_write(>ctrl_regs->tmr_cnt_l, lo);
+   qoriq_write(>ctrl_regs->tmr_cnt_h, hi);
 }
 
 /* Caller must hold qoriq_ptp->lock. */
 static void set_alarm(struct qoriq_ptp *qoriq_ptp)
 {
+   struct qoriq_ptp_registers *regs = _ptp->regs;
u64 ns;
u32 lo, hi;
 
@@ -70,16 +73,18 @@ static void set_alarm(struct qoriq_ptp *qoriq_ptp)
ns -= qoriq_ptp->tclk_period;
hi = ns >> 32;
lo = ns & 0x;
-   qoriq_write(_ptp->regs->tmr_alarm1_l, lo);
-   qoriq_write(_ptp->regs->tmr_alarm1_h, hi);
+   qoriq_write(>alarm_regs->tmr_alarm1_l, lo);
+   qoriq_write(>alarm_regs->tmr_alarm1_h, hi);
 }
 
 /* Caller must hold qoriq_ptp->lock. */
 static void set_fipers(struct qoriq_ptp *qoriq_ptp)
 {
+   struct qoriq_ptp_registers *regs = _ptp->regs;
+
set_alarm(qoriq_ptp);
-   qoriq_write(_ptp->regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
-   qoriq_write(_ptp->regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
+   qoriq_write(>fiper_regs->tmr_fiper1, qoriq_ptp->tmr_fiper1);
+   qoriq_write(>fiper_regs->tmr_fiper2, qoriq_ptp->tmr_fiper2);
 }
 
 /*
@@ -89,16 +94,17 @@ static void set_fipers(struct qoriq_ptp *qoriq_ptp)
 static irqreturn_t isr(int irq, void *priv)
 {
struct qoriq_ptp *qoriq_ptp = priv;
+   struct qoriq_ptp_registers *regs = _ptp->regs;
struct ptp_clock_event event;
u64 ns;
u32 ack = 0, lo, hi, mask, val;
 
-   val = qoriq_read(_ptp->regs->tmr_tevent);
+   val = qoriq_read(>ctrl_regs->tmr_tevent);
 
if (val & ETS1) {
ack |= ETS1;
-   hi = qoriq_read(_ptp->regs->tmr_etts1_h);
-   lo = qoriq_read(_ptp->regs->tmr_etts1_l);
+   hi = qoriq_read(>etts_regs->tmr_etts1_h);
+   lo = qoriq_read(>etts_regs->tmr_etts1_l);
event.type = PTP_CLOCK_EXTTS;
event.index = 0;
event.timestamp = ((u64) hi) << 32;
@@ -108,8 +114,8 @@ static irqreturn_t isr(int irq, void *priv)
 
if (val & ETS2) {
ack |= ETS2;
-   hi = qoriq_read(_ptp->regs->tmr_etts2_h);
-   lo = qoriq_read(_ptp->regs->tmr_etts2_l);
+   hi = qoriq_read(>etts_regs->tmr_etts2_h);
+   lo = qoriq_read(>etts_regs->tmr_etts2_l);
event.type = PTP_CLOCK_EXTTS;
event.index = 1;
event.timestamp = ((u64) hi) << 32;
@@ -130,16 +136,16 @@ static irqreturn_t isr(int irq, void *priv)
hi = ns >> 32;
lo = ns & 0x;
spin_lock(_ptp->lock);
-   qoriq_write(_ptp->regs->tmr_alarm2_l, lo);
-   qoriq_write(_ptp->regs->tmr_alarm2_h, hi);
+   qoriq_write(>alarm_regs->tmr_alarm2_l, lo);
+   qoriq_write(>alarm_regs->tmr_alarm2_h, hi);
spin_unlock(_ptp->lock);
qoriq_ptp->alarm_value = ns;
} else {
-

[v2, 01/10] fsl/fman: share the event interrupt

2018-06-06 Thread Yangbo Lu
This patch is to share fman event interrupt because
the 1588 timer driver will also use this interrupt.

Signed-off-by: Yangbo Lu 
---
Changes for v2:
- None.
---
 drivers/net/ethernet/freescale/fman/fman.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fman/fman.c 
b/drivers/net/ethernet/freescale/fman/fman.c
index 9530405..c415ac6 100644
--- a/drivers/net/ethernet/freescale/fman/fman.c
+++ b/drivers/net/ethernet/freescale/fman/fman.c
@@ -2801,7 +2801,8 @@ static irqreturn_t fman_irq(int irq, void *handle)
of_node_put(muram_node);
of_node_put(fm_node);
 
-   err = devm_request_irq(_dev->dev, irq, fman_irq, 0, "fman", fman);
+   err = devm_request_irq(_dev->dev, irq, fman_irq, IRQF_SHARED,
+  "fman", fman);
if (err < 0) {
dev_err(_dev->dev, "%s: irq %d allocation failed (error = 
%d)\n",
__func__, irq, err);
-- 
1.7.1



[v2, 00/10] Support DPAA PTP clock and timestamping

2018-06-06 Thread Yangbo Lu
This patchset is to support DPAA FMAN PTP clock and HW timestamping.
- The patch #1 to patch #5 are to support DPAA FMAN 1588 timer in
  ptp_qoriq driver.
- The patch #6 to patch #10 are to add HW timestamping support in
  DPAA ethernet driver.

Yangbo Lu (10):
  fsl/fman: share the event interrupt
  ptp: support DPAA FMan 1588 timer in ptp_qoriq
  dt-binding: ptp_qoriq: add DPAA FMan support
  powerpc/mpc85xx: move ptp timer out of fman in dts
  arm64: dts: fsl: move ptp timer out of fman
  fsl/fman: add set_tstamp interface
  fsl/fman_port: support getting timestamp field
  fsl/fman: define frame description command UPD
  dpaa_eth: add support for hardware timestamping
  dpaa_eth: add the get_ts_info interface for ethtool

 Documentation/devicetree/bindings/net/fsl-fman.txt |   25 +-
 .../devicetree/bindings/ptp/ptp-qoriq.txt  |   15 +++-
 arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi   |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi|   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi|   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi   |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi   |   14 ++-
 arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi  |   14 ++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c |  101 ++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |3 +
 drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c |   39 
 drivers/net/ethernet/freescale/fman/fman.c |3 +-
 drivers/net/ethernet/freescale/fman/fman.h |1 +
 drivers/net/ethernet/freescale/fman/fman_dtsec.c   |   27 +
 drivers/net/ethernet/freescale/fman/fman_dtsec.h   |1 +
 drivers/net/ethernet/freescale/fman/fman_memac.c   |5 +
 drivers/net/ethernet/freescale/fman/fman_memac.h   |1 +
 drivers/net/ethernet/freescale/fman/fman_port.c|   12 +++
 drivers/net/ethernet/freescale/fman/fman_port.h|3 +
 drivers/net/ethernet/freescale/fman/fman_tgec.c|   21 
 drivers/net/ethernet/freescale/fman/fman_tgec.h|1 +
 drivers/net/ethernet/freescale/fman/mac.c  |3 +
 drivers/net/ethernet/freescale/fman/mac.h  |1 +
 drivers/ptp/Kconfig|2 +-
 drivers/ptp/ptp_qoriq.c|  104 ---
 include/linux/fsl/ptp_qoriq.h  |   38 ++--
 26 files changed, 375 insertions(+), 115 deletions(-)



[PATCH v8 5/5] powerpc:selftest update memcmp_64 selftest for VMX implementation

2018-06-06 Thread wei . guo . simon
From: Simon Guo 

This patch reworked selftest memcmp_64 so that memcmp selftest can
cover more test cases.

It adds testcases for:
- memcmp over 4K bytes size.
- s1/s2 with different/random offset on 16 bytes boundary.
- enter/exit_vmx_ops pairness.

Signed-off-by: Simon Guo 
---
 .../selftests/powerpc/copyloops/asm/ppc_asm.h  |  4 +-
 .../selftests/powerpc/stringloops/asm/ppc-opcode.h | 39 +
 .../selftests/powerpc/stringloops/asm/ppc_asm.h| 25 ++
 .../testing/selftests/powerpc/stringloops/memcmp.c | 98 +-
 4 files changed, 142 insertions(+), 24 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h

diff --git a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h 
b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
index 5ffe04d..dfce161 100644
--- a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h
@@ -36,11 +36,11 @@
li  r3,0
blr
 
-FUNC_START(enter_vmx_copy)
+FUNC_START(enter_vmx_ops)
li  r3,1
blr
 
-FUNC_START(exit_vmx_copy)
+FUNC_START(exit_vmx_ops)
blr
 
 FUNC_START(memcpy_power7)
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h 
b/tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h
new file mode 100644
index 000..9de413c
--- /dev/null
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h
@@ -0,0 +1,39 @@
+/*
+ * Copyright 2009 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * provides masks and opcode images for use by code generation, emulation
+ * and for instructions that older assemblers might not know about
+ */
+#ifndef _ASM_POWERPC_PPC_OPCODE_H
+#define _ASM_POWERPC_PPC_OPCODE_H
+
+
+#  define stringify_in_c(...)  __VA_ARGS__
+#  define ASM_CONST(x) x
+
+
+#define PPC_INST_VCMPEQUD_RC   0x10c7
+#define PPC_INST_VCMPEQUB_RC   0x1006
+
+#define __PPC_RC21 (0x1 << 10)
+
+/* macros to insert fields into opcodes */
+#define ___PPC_RA(a)   (((a) & 0x1f) << 16)
+#define ___PPC_RB(b)   (((b) & 0x1f) << 11)
+#define ___PPC_RS(s)   (((s) & 0x1f) << 21)
+#define ___PPC_RT(t)   ___PPC_RS(t)
+
+#define VCMPEQUD_RC(vrt, vra, vrb) stringify_in_c(.long 
PPC_INST_VCMPEQUD_RC | \
+ ___PPC_RT(vrt) | ___PPC_RA(vra) | \
+ ___PPC_RB(vrb) | __PPC_RC21)
+
+#define VCMPEQUB_RC(vrt, vra, vrb) stringify_in_c(.long 
PPC_INST_VCMPEQUB_RC | \
+ ___PPC_RT(vrt) | ___PPC_RA(vra) | \
+ ___PPC_RB(vrb) | __PPC_RC21)
+
+#endif /* _ASM_POWERPC_PPC_OPCODE_H */
diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h 
b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
index 136242e..d2c0a91 100644
--- a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
+++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h
@@ -1,4 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _PPC_ASM_H
+#define __PPC_ASM_H
 #include 
 
 #ifndef r1
@@ -6,3 +8,26 @@
 #endif
 
 #define _GLOBAL(A) FUNC_START(test_ ## A)
+#define _GLOBAL_TOC(A) FUNC_START(test_ ## A)
+
+#define CONFIG_ALTIVEC
+
+#define R14 r14
+#define R15 r15
+#define R16 r16
+#define R17 r17
+#define R18 r18
+#define R19 r19
+#define R20 r20
+#define R21 r21
+#define R22 r22
+#define R29 r29
+#define R30 r30
+#define R31 r31
+
+#define STACKFRAMESIZE 256
+#define STK_REG(i) (112 + ((i)-14)*8)
+
+#define BEGIN_FTR_SECTION
+#define END_FTR_SECTION_IFSET(val)
+#endif
diff --git a/tools/testing/selftests/powerpc/stringloops/memcmp.c 
b/tools/testing/selftests/powerpc/stringloops/memcmp.c
index 8250db2..b5cf717 100644
--- a/tools/testing/selftests/powerpc/stringloops/memcmp.c
+++ b/tools/testing/selftests/powerpc/stringloops/memcmp.c
@@ -2,20 +2,40 @@
 #include 
 #include 
 #include 
+#include 
 #include "utils.h"
 
 #define SIZE 256
 #define ITERATIONS 1
 
+#define LARGE_SIZE (5 * 1024)
+#define LARGE_ITERATIONS 1000
+#define LARGE_MAX_OFFSET 32
+#define LARGE_SIZE_START 4096
+
+#define MAX_OFFSET_DIFF_S1_S2 48
+
+int vmx_count;
+int enter_vmx_ops(void)
+{
+   vmx_count++;
+   return 1;
+}
+
+void exit_vmx_ops(void)
+{
+   vmx_count--;
+}
 int test_memcmp(const void *s1, const void *s2, size_t n);
 
 /* test all offsets and lengths */
-static void test_one(char *s1, char *s2)
+static void test_one(char *s1, char *s2, unsigned long max_offset,
+   unsigned long size_start, unsigned long max_size)
 {
unsigned long offset, size;
 
-   for (offset = 0; offset < SIZE; offset++) {
-   for (size = 0; size < (SIZE-offset); size++) {
+   for (offset = 

[PATCH v8 4/5] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()

2018-06-06 Thread wei . guo . simon
From: Simon Guo 

This patch is based on the previous VMX patch on memcmp().

To optimize ppc64 memcmp() with VMX instruction, we need to think about
the VMX penalty brought with: If kernel uses VMX instruction, it needs
to save/restore current thread's VMX registers. There are 32 x 128 bits
VMX registers in PPC, which means 32 x 16 = 512 bytes for load and store.

The major concern regarding the memcmp() performance in kernel is KSM,
who will use memcmp() frequently to merge identical pages. So it will
make sense to take some measures/enhancement on KSM to see whether any
improvement can be done here.  Cyril Bur indicates that the memcmp() for
KSM has a higher possibility to fail (unmatch) early in previous bytes
in following mail.
https://patchwork.ozlabs.org/patch/817322/#1773629
And I am taking a follow-up on this with this patch.

Per some testing, it shows KSM memcmp() will fail early at previous 32
bytes.  More specifically:
- 76% cases will fail/unmatch before 16 bytes;
- 83% cases will fail/unmatch before 32 bytes;
- 84% cases will fail/unmatch before 64 bytes;
So 32 bytes looks a better choice than other bytes for pre-checking.

The early failure is also true for memcmp() for non-KSM case. With a
non-typical call load, it shows ~73% cases fail before first 32 bytes.

This patch adds a 32 bytes pre-checking firstly before jumping into VMX
operations, to avoid the unnecessary VMX penalty. It is not limited to
KSM case. And the testing shows ~20% improvement on memcmp() average
execution time with this patch.

And note the 32B pre-checking is only performed when the compare size
is long enough (>=4K currently) to allow VMX operation.

The detail data and analysis is at:
https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md

Signed-off-by: Simon Guo 
---
 arch/powerpc/lib/memcmp_64.S | 57 +++-
 1 file changed, 46 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index be2f792..844d8e7 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -404,8 +404,27 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 #ifdef CONFIG_ALTIVEC
 .Lsameoffset_vmx_cmp:
/* Enter with src/dst addrs has the same offset with 8 bytes
-* align boundary
+* align boundary.
+*
+* There is an optimization based on following fact: memcmp()
+* prones to fail early at the first 32 bytes.
+* Before applying VMX instructions which will lead to 32x128bits
+* VMX regs load/restore penalty, we compare the first 32 bytes
+* so that we can catch the ~80% fail cases.
 */
+
+   li  r0,4
+   mtctr   r0
+.Lsameoffset_prechk_32B_loop:
+   LD  rA,0,r3
+   LD  rB,0,r4
+   cmpld   cr0,rA,rB
+   addir3,r3,8
+   addir4,r4,8
+   bne cr0,.LcmpAB_lightweight
+   addir5,r5,-8
+   bdnz.Lsameoffset_prechk_32B_loop
+
ENTER_VMX_OPS
beq cr1,.Llong_novmx_cmp
 
@@ -482,16 +501,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 #endif
 
 .Ldiffoffset_8bytes_make_align_start:
-#ifdef CONFIG_ALTIVEC
-BEGIN_FTR_SECTION
-   /* only do vmx ops when the size equal or greater than 4K bytes */
-   cmpdi   cr5,r5,VMX_THRESH
-   bge cr5,.Ldiffoffset_vmx_cmp
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
-
-.Ldiffoffset_novmx_cmp:
-#endif
-
/* now try to align s1 with 8 bytes */
rlwinm  r6,r3,3,26,28
beq .Ldiffoffset_align_s1_8bytes
@@ -515,6 +524,17 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 
 .Ldiffoffset_align_s1_8bytes:
/* now s1 is aligned with 8 bytes. */
+#ifdef CONFIG_ALTIVEC
+BEGIN_FTR_SECTION
+   /* only do vmx ops when the size equal or greater than 4K bytes */
+   cmpdi   cr5,r5,VMX_THRESH
+   bge cr5,.Ldiffoffset_vmx_cmp
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
+
+.Ldiffoffset_novmx_cmp:
+#endif
+
+
cmpdi   cr5,r5,31
ble cr5,.Lcmp_lt32bytes
 
@@ -526,6 +546,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
 
 #ifdef CONFIG_ALTIVEC
 .Ldiffoffset_vmx_cmp:
+   /* perform a 32 bytes pre-checking before
+* enable VMX operations.
+*/
+   li  r0,4
+   mtctr   r0
+.Ldiffoffset_prechk_32B_loop:
+   LD  rA,0,r3
+   LD  rB,0,r4
+   cmpld   cr0,rA,rB
+   addir3,r3,8
+   addir4,r4,8
+   bne cr0,.LcmpAB_lightweight
+   addir5,r5,-8
+   bdnz.Ldiffoffset_prechk_32B_loop
+
ENTER_VMX_OPS
beq cr1,.Ldiffoffset_novmx_cmp
 
-- 
1.8.3.1



[PATCH v8 3/5] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision

2018-06-06 Thread wei . guo . simon
From: Simon Guo 

This patch add VMX primitives to do memcmp() in case the compare size
is equal or greater than 4K bytes. KSM feature can benefit from this.

Test result with following test program(replace the "^>" with ""):
--
># cat tools/testing/selftests/powerpc/stringloops/memcmp.c
>#include 
>#include 
>#include 
>#include 
>#include "utils.h"
>#define SIZE (1024 * 1024 * 900)
>#define ITERATIONS 40

int test_memcmp(const void *s1, const void *s2, size_t n);

static int testcase(void)
{
char *s1;
char *s2;
unsigned long i;

s1 = memalign(128, SIZE);
if (!s1) {
perror("memalign");
exit(1);
}

s2 = memalign(128, SIZE);
if (!s2) {
perror("memalign");
exit(1);
}

for (i = 0; i < SIZE; i++)  {
s1[i] = i & 0xff;
s2[i] = i & 0xff;
}
for (i = 0; i < ITERATIONS; i++) {
int ret = test_memcmp(s1, s2, SIZE);

if (ret) {
printf("return %d at[%ld]! should have returned 
zero\n", ret, i);
abort();
}
}

return 0;
}

int main(void)
{
return test_harness(testcase, "memcmp");
}
--
Without this patch (but with the first patch "powerpc/64: Align bytes
before fall back to .Lshort in powerpc64 memcmp()." in the series):
4.726728762 seconds time elapsed
  ( +-  3.54%)
With VMX patch:
4.234335473 seconds time elapsed
  ( +-  2.63%)
There is ~+10% improvement.

Testing with unaligned and different offset version (make s1 and s2 shift
random offset within 16 bytes) can archieve higher improvement than 10%..

Signed-off-by: Simon Guo 
---
 arch/powerpc/include/asm/asm-prototypes.h |   4 +-
 arch/powerpc/lib/copypage_power7.S|   4 +-
 arch/powerpc/lib/memcmp_64.S  | 241 +-
 arch/powerpc/lib/memcpy_power7.S  |   6 +-
 arch/powerpc/lib/vmx-helper.c |   4 +-
 5 files changed, 248 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index d9713ad..31fdcee 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -49,8 +49,8 @@ void __trace_hcall_exit(long opcode, unsigned long retval,
 /* VMX copying */
 int enter_vmx_usercopy(void);
 int exit_vmx_usercopy(void);
-int enter_vmx_copy(void);
-void * exit_vmx_copy(void *dest);
+int enter_vmx_ops(void);
+void *exit_vmx_ops(void *dest);
 
 /* Traps */
 long machine_check_early(struct pt_regs *regs);
diff --git a/arch/powerpc/lib/copypage_power7.S 
b/arch/powerpc/lib/copypage_power7.S
index 8fa73b7..e38f956 100644
--- a/arch/powerpc/lib/copypage_power7.S
+++ b/arch/powerpc/lib/copypage_power7.S
@@ -57,7 +57,7 @@ _GLOBAL(copypage_power7)
std r4,-STACKFRAMESIZE+STK_REG(R30)(r1)
std r0,16(r1)
stdur1,-STACKFRAMESIZE(r1)
-   bl  enter_vmx_copy
+   bl  enter_vmx_ops
cmpwi   r3,0
ld  r0,STACKFRAMESIZE+16(r1)
ld  r3,STK_REG(R31)(r1)
@@ -100,7 +100,7 @@ _GLOBAL(copypage_power7)
addir3,r3,128
bdnz1b
 
-   b   exit_vmx_copy   /* tail call optimise */
+   b   exit_vmx_ops/* tail call optimise */
 
 #else
li  r0,(PAGE_SIZE/128)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 5776f91..be2f792 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -9,6 +9,7 @@
  */
 #include 
 #include 
+#include 
 
 #define off8   r6
 #define off16  r7
@@ -27,12 +28,73 @@
 #define LH lhbrx
 #define LW lwbrx
 #define LD ldbrx
+#define LVSlvsr
+#define VPERM(_VRT,_VRA,_VRB,_VRC) \
+   vperm _VRT,_VRB,_VRA,_VRC
 #else
 #define LH lhzx
 #define LW lwzx
 #define LD ldx
+#define LVSlvsl
+#define VPERM(_VRT,_VRA,_VRB,_VRC) \
+   vperm _VRT,_VRA,_VRB,_VRC
 #endif
 
+#define VMX_THRESH 4096
+#define ENTER_VMX_OPS  \
+   mflrr0; \
+   std r3,-STACKFRAMESIZE+STK_REG(R31)(r1); \
+   std r4,-STACKFRAMESIZE+STK_REG(R30)(r1); \
+   std r5,-STACKFRAMESIZE+STK_REG(R29)(r1); \
+   std r0,16(r1); \
+   stdur1,-STACKFRAMESIZE(r1); \
+   bl  enter_vmx_ops; \
+   cmpwi   cr1,r3,0; \
+   ld  r0,STACKFRAMESIZE+16(r1); \
+   ld  r3,STK_REG(R31)(r1); \
+   ld  r4,STK_REG(R30)(r1); \
+   ld  r5,STK_REG(R29)(r1); \
+   addir1,r1,STACKFRAMESIZE; \
+   mtlrr0
+
+#define EXIT_VMX_OPS \
+   mflrr0; \
+   std r3,-STACKFRAMESIZE+STK_REG(R31)(r1); \
+   std r4,-STACKFRAMESIZE+STK_REG(R30)(r1); \
+   std r5,-STACKFRAMESIZE+STK_REG(R29)(r1); \
+  

[PATCH v8 2/5] powerpc: add vcmpequd/vcmpequb ppc instruction macro

2018-06-06 Thread wei . guo . simon
From: Simon Guo 

Some old tool chains don't know about instructions like vcmpequd.

This patch adds .long macro for vcmpequd and vcmpequb, which is
a preparation to optimize ppc64 memcmp with VMX instructions.

Signed-off-by: Simon Guo 
---
 arch/powerpc/include/asm/ppc-opcode.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 18883b8..1866a97 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -366,6 +366,8 @@
 #define PPC_INST_STFDX 0x7c0005ae
 #define PPC_INST_LVX   0x7cce
 #define PPC_INST_STVX  0x7c0001ce
+#define PPC_INST_VCMPEQUD  0x10c7
+#define PPC_INST_VCMPEQUB  0x1006
 
 /* macros to insert fields into opcodes */
 #define ___PPC_RA(a)   (((a) & 0x1f) << 16)
@@ -396,6 +398,7 @@
 #define __PPC_BI(s)(((s) & 0x1f) << 16)
 #define __PPC_CT(t)(((t) & 0x0f) << 21)
 #define __PPC_SPR(r)   r) & 0x1f) << 16) | r) >> 5) & 0x1f) << 11))
+#define __PPC_RC21 (0x1 << 10)
 
 /*
  * Only use the larx hint bit on 64bit CPUs. e500v1/v2 based CPUs will treat a
@@ -567,4 +570,12 @@
   ((IH & 0x7) << 21))
 #define PPC_INVALIDATE_ERATPPC_SLBIA(7)
 
+#define VCMPEQUD_RC(vrt, vra, vrb) stringify_in_c(.long PPC_INST_VCMPEQUD 
| \
+ ___PPC_RT(vrt) | ___PPC_RA(vra) | \
+ ___PPC_RB(vrb) | __PPC_RC21)
+
+#define VCMPEQUB_RC(vrt, vra, vrb) stringify_in_c(.long PPC_INST_VCMPEQUB 
| \
+ ___PPC_RT(vrt) | ___PPC_RA(vra) | \
+ ___PPC_RB(vrb) | __PPC_RC21)
+
 #endif /* _ASM_POWERPC_PPC_OPCODE_H */
-- 
1.8.3.1



[PATCH v8 1/5] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()

2018-06-06 Thread wei . guo . simon
From: Simon Guo 

Currently memcmp() 64bytes version in powerpc will fall back to .Lshort
(compare per byte mode) if either src or dst address is not 8 bytes aligned.
It can be opmitized in 2 situations:

1) if both addresses are with the same offset with 8 bytes boundary:
memcmp() can compare the unaligned bytes within 8 bytes boundary firstly
and then compare the rest 8-bytes-aligned content with .Llong mode.

2)  If src/dst addrs are not with the same offset of 8 bytes boundary:
memcmp() can align src addr with 8 bytes, increment dst addr accordingly,
 then load src with aligned mode and load dst with unaligned mode.

This patch optmizes memcmp() behavior in the above 2 situations.

Tested with both little/big endian. Performance result below is based on
little endian.

Following is the test result with src/dst having the same offset case:
(a similar result was observed when src/dst having different offset):
(1) 256 bytes
Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
- without patch
29.773018302 seconds time elapsed   
   ( +- 0.09% )
- with patch
16.485568173 seconds time elapsed   
   ( +-  0.02% )
-> There is ~+80% percent improvement

(2) 32 bytes
To observe performance impact on < 32 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
---
 #include 
 #include "utils.h"

-#define SIZE 256
+#define SIZE 32
 #define ITERATIONS 1

 int test_memcmp(const void *s1, const void *s2, size_t n);


- Without patch
0.244746482 seconds time elapsed
  ( +-  0.36%)
- with patch
0.215069477 seconds time elapsed
  ( +-  0.51%)
-> There is ~+13% improvement

(3) 0~8 bytes
To observe <8 bytes performance impact, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c with following:
---
 #include 
 #include "utils.h"

-#define SIZE 256
-#define ITERATIONS 1
+#define SIZE 8
+#define ITERATIONS 100

 int test_memcmp(const void *s1, const void *s2, size_t n);
---
- Without patch
   1.845642503 seconds time elapsed 
 ( +- 0.12% )
- With patch
   1.849767135 seconds time elapsed 
 ( +- 0.26% )
-> They are nearly the same. (-0.2%)

Signed-off-by: Simon Guo 
---
 arch/powerpc/lib/memcmp_64.S | 140 ---
 1 file changed, 133 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b..5776f91 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -24,28 +24,41 @@
 #define rH r31
 
 #ifdef __LITTLE_ENDIAN__
+#define LH lhbrx
+#define LW lwbrx
 #define LD ldbrx
 #else
+#define LH lhzx
+#define LW lwzx
 #define LD ldx
 #endif
 
+/*
+ * There are 2 categories for memcmp:
+ * 1) src/dst has the same offset to the 8 bytes boundary. The handlers
+ * are named like .Lsameoffset_
+ * 2) src/dst has different offset to the 8 bytes boundary. The handlers
+ * are named like .Ldiffoffset_
+ */
 _GLOBAL(memcmp)
cmpdi   cr1,r5,0
 
-   /* Use the short loop if both strings are not 8B aligned */
-   or  r6,r3,r4
+   /* Use the short loop if the src/dst addresses are not
+* with the same offset of 8 bytes align boundary.
+*/
+   xor r6,r3,r4
andi.   r6,r6,7
 
-   /* Use the short loop if length is less than 32B */
-   cmpdi   cr6,r5,31
+   /* Fall back to short loop if compare at aligned addrs
+* with less than 8 bytes.
+*/
+   cmpdi   cr6,r5,7
 
beq cr1,.Lzero
-   bne .Lshort
-   bgt cr6,.Llong
+   bgt cr6,.Lno_short
 
 .Lshort:
mtctr   r5
-
 1: lbz rA,0(r3)
lbz rB,0(r4)
subf.   rC,rB,rA
@@ -78,11 +91,89 @@ _GLOBAL(memcmp)
li  r3,0
blr
 
+.Lno_short:
+   dcbt0,r3
+   dcbt0,r4
+   bne .Ldiffoffset_8bytes_make_align_start
+
+
+.Lsameoffset_8bytes_make_align_start:
+   /* attempt to compare bytes not aligned with 8 bytes so that
+* rest comparison can run based on 8 bytes alignment.
+*/
+   andi.   r6,r3,7
+
+   /* Try to compare the first double word which is not 8 bytes aligned:
+* load the first double word at (src & ~7UL) and shift left appropriate
+* bits before comparision.
+*/
+   rlwinm  r6,r3,3,26,28
+   beq .Lsameoffset_8bytes_aligned
+   clrrdi  r3,r3,3
+   clrrdi  r4,r4,3
+   LD  rA,0,r3
+   LD  rB,0,r4
+   sld rA,rA,r6
+   sld rB,rB,r6
+   cmpld   cr0,rA,rB
+   srwir6,r6,3
+   bne cr0,.LcmpAB_lightweight
+   subfic  r6,r6,8
+   subf.   

[PATCH v8 0/5] powerpc/64: memcmp() optimization

2018-06-06 Thread wei . guo . simon
From: Simon Guo 

There is some room to optimize memcmp() in powerpc 64 bits version for
following 2 cases:
(1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
memcmp() can align them and go with .Llong comparision mode without
fallback to .Lshort comparision mode do compare buffer byte by byte.
(2) VMX instructions can be used to speed up for large size comparision,
currently the threshold is set for 4K bytes. Notes the VMX instructions
will lead to VMX regs save/load penalty. This patch set includes a
patch to add a 32 bytes pre-checking to minimize the penalty.

It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp 
performance for POWER8). Thanks Cyril Bur's information.
This patch set also updates memcmp selftest case to make it compiled and
incorporate large size comparison case.

v7 -> v8:
- define memcmp with _GLOBAL_TOC() instead of _GLOBAL() to fix TOC issue.
add _GLOBAL_TOC() definition into selftest so that it can be compiled.
- use mfocrf/mtocrf instead of mcrf to save/restore CR0

v6 -> v7:
- add vcmpequd/vcmpequdb .long macro
- add CPU_FTR pair so that Power7 won't invoke Altivec instrs.
- rework some instructions for higher performance or more readable.

v5 -> v6:
- correct some comments/commit messsage.
- rename VMX_OPS_THRES to VMX_THRESH

v4 -> v5:
- Expand 32 bytes prechk to src/dst different offset case, and remove
KSM specific label/comment.

v3 -> v4:
- Add 32 bytes pre-checking before using VMX instructions.

v2 -> v3:
- add optimization for src/dst with different offset against 8 bytes
boundary.
- renamed some label names.
- reworked some comments from Cyril Bur, such as fill the pipeline, 
and use VMX when size == 4K.
- fix a bug of enter/exit_vmx_ops pairness issue. And revised test 
case to test whether enter/exit_vmx_ops are paired.

v1 -> v2:
- update 8bytes unaligned bytes comparison method.
- fix a VMX comparision bug.
- enhanced the original memcmp() selftest.
- add powerpc/64 to subject/commit message.


Simon Guo (5):
  powerpc/64: Align bytes before fall back to .Lshort in powerpc64
memcmp()
  powerpc: add vcmpequd/vcmpequb ppc instruction macro
  powerpc/64: enhance memcmp() with VMX instruction for long bytes
comparision
  powerpc/64: add 32 bytes prechecking before using VMX optimization on
memcmp()
  powerpc:selftest update memcmp_64 selftest for VMX implementation

 arch/powerpc/include/asm/asm-prototypes.h  |   4 +-
 arch/powerpc/include/asm/ppc-opcode.h  |  11 +
 arch/powerpc/lib/copypage_power7.S |   4 +-
 arch/powerpc/lib/memcmp_64.S   | 414 -
 arch/powerpc/lib/memcpy_power7.S   |   6 +-
 arch/powerpc/lib/vmx-helper.c  |   4 +-
 .../selftests/powerpc/copyloops/asm/ppc_asm.h  |   4 +-
 .../selftests/powerpc/stringloops/asm/ppc-opcode.h |  39 ++
 .../selftests/powerpc/stringloops/asm/ppc_asm.h|  25 ++
 .../testing/selftests/powerpc/stringloops/memcmp.c |  98 +++--
 10 files changed, 568 insertions(+), 41 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h

-- 
1.8.3.1



Re: [PATCH v4 1/2] powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64

2018-06-06 Thread Nicholas Piggin
On Wed,  6 Jun 2018 14:21:08 + (UTC)
Christophe Leroy  wrote:

> scaled cputime is only meaningfull when the processor has
> SPURR and/or PURR, which means only on PPC64.
> 
> Removing it on PPC32 significantly reduces the size of
> vtime_account_system() and vtime_account_idle() on an 8xx:
> 
> Before:
>  l F .text00a8 vtime_delta
> 0280 g F .text010c vtime_account_system
> 038c g F .text0048 vtime_account_idle
> 
> After:
> (vtime_delta gets inlined in the two functions)
> 01d8 g F .text00a0 vtime_account_system
> 0278 g F .text0038 vtime_account_idle
> 
> In terms of performance, we also get approximatly 5% improvement on task 
> switch:
> The following small benchmark app is run with perf stat:
> 
> void *thread(void *arg)
> {
>   int i;
> 
>   for (i = 0; i < atoi((char*)arg); i++)
>   pthread_yield();
> }
> 
> int main(int argc, char **argv)
> {
>   pthread_t th1, th2;
> 
>   pthread_create(, NULL, thread, argv[1]);
>   pthread_create(, NULL, thread, argv[1]);
>   pthread_join(th1, NULL);
>   pthread_join(th2, NULL);
> 
>   return 0;
> }
> 
> Before the patch:
> 
> ~# perf stat chrt -f 98 ./sched 10
> 
>  Performance counter stats for 'chrt -f 98 ./sched 10':
> 
>8622.166272  task-clock (msec) #0.955 CPUs utilized
> 200027  context-switches  #0.023 M/sec
> 
> After the patch:
> 
> ~# perf stat chrt -f 98 ./sched 10
> 
>  Performance counter stats for 'chrt -f 98 ./sched 10':
> 
>8207.090048  task-clock (msec) #0.958 CPUs utilized
> 200025  context-switches  #0.024 M/sec
> 
> Signed-off-by: Christophe Leroy 

This looks okay to me. Nice numbers.

> ---
>  v4:
>   - Using the correct symbol CONFIG_ARCH_HAS_SCALED_CPUTIME instead of 
> ARCH_HAS_SCALED_CPUTIME
>   - Grouped CONFIG_ARCH_HAS_SCALED_CPUTIME related code in dedicated 
> functions to reduce the number of #ifdefs
>   - Integrated read_spurr() directly into the related function.
>  v3: Rebased following modifications in xmon.c
>  v2: added ifdefs in xmon to fix compilation error
> 
>  arch/powerpc/Kconfig  |   2 +-
>  arch/powerpc/include/asm/accounting.h |   4 ++
>  arch/powerpc/include/asm/cputime.h|   1 -
>  arch/powerpc/kernel/time.c| 111 
> +-
>  arch/powerpc/xmon/xmon.c  |   4 ++
>  5 files changed, 77 insertions(+), 45 deletions(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index b62a16e2c7cc..735398fd390d 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -142,7 +142,7 @@ config PPC
>   select ARCH_HAS_PHYS_TO_DMA
>   select ARCH_HAS_PMEM_APIif PPC64
>   select ARCH_HAS_MEMBARRIER_CALLBACKS
> - select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
> + select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
> && PPC64

I wonder if we could make this depend on PPC_PSERIES or even
PPC_SPLPAR as well? (That would be for a later patch)

Thanks,
Nick


Re: powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap

2018-06-06 Thread Michael Ellerman
On Wed, 2018-06-06 at 01:40:08 UTC, Nicholas Piggin wrote:
> There is a typo in f1cb8f9beb ("powerpc/64s/radix: avoid ptesync after
> set_pte and ptep_set_access_flags") config ifdef, which results in the
> necessary ptesync not being issued after vmalloc.
> 
> This causes random kernel faults in module load, bpf load, anywhere
> that vmalloc mappings are used.
> 
> After correcting the code, this survives a guest kernel booting
> hundreds of times where previously there would be a crash every few
> boots (I haven't noticed the crash on host, perhaps due to different
> TLB and page table walking behaviour in hardware).
> 
> A memory clobber is also added to the flush, just to be sure it won't
> be reordered with the pte set or the subsequent mapping access.
> 
> Fixes: f1cb8f9beb ("powerpc/64s/radix: avoid ptesync after set_pte and 
> ptep_set_access_flags")
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/ff5bc793e47b537bf3e904fada585e

cheers


Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision

2018-06-06 Thread Segher Boessenkool
On Wed, Jun 06, 2018 at 02:42:27PM +0800, Simon Guo wrote:
> I now felt unformatable to use mcrf like:
> mcrf7,0
> 
> since I cannot 100% confident that compiler will not use CR7 or other
> CR# in exit_vmx_ops().

It wasn't clear to me this macro boils down to a function call.

You can use CR2,CR3,CR4, but you'll need to save and restore those at
the start and end of function then, which is just as nasty.

Better is to restructure some code so you don't need that CR field
there anymore.

> Can we switch back to mfocrf/mtocrf with correct CR0 value?
>mfocrf  r5,128
> ...
>mtocrf  128,r5

Sure, I'm not your boss ;-)  It seems a shame to me to have this 12 or
whatever cycle delay here, since the whole point of the patch is to
make things faster, that's all (but it still is faster, right, you
tested it).


Segher


[PATCH v4 1/2] powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64

2018-06-06 Thread Christophe Leroy
scaled cputime is only meaningfull when the processor has
SPURR and/or PURR, which means only on PPC64.

Removing it on PPC32 significantly reduces the size of
vtime_account_system() and vtime_account_idle() on an 8xx:

Before:
 l F .text  00a8 vtime_delta
0280 g F .text  010c vtime_account_system
038c g F .text  0048 vtime_account_idle

After:
(vtime_delta gets inlined in the two functions)
01d8 g F .text  00a0 vtime_account_system
0278 g F .text  0038 vtime_account_idle

In terms of performance, we also get approximatly 5% improvement on task switch:
The following small benchmark app is run with perf stat:

void *thread(void *arg)
{
int i;

for (i = 0; i < atoi((char*)arg); i++)
pthread_yield();
}

int main(int argc, char **argv)
{
pthread_t th1, th2;

pthread_create(, NULL, thread, argv[1]);
pthread_create(, NULL, thread, argv[1]);
pthread_join(th1, NULL);
pthread_join(th2, NULL);

return 0;
}

Before the patch:

~# perf stat chrt -f 98 ./sched 10

 Performance counter stats for 'chrt -f 98 ./sched 10':

   8622.166272  task-clock (msec) #0.955 CPUs utilized
200027  context-switches  #0.023 M/sec

After the patch:

~# perf stat chrt -f 98 ./sched 10

 Performance counter stats for 'chrt -f 98 ./sched 10':

   8207.090048  task-clock (msec) #0.958 CPUs utilized
200025  context-switches  #0.024 M/sec

Signed-off-by: Christophe Leroy 
---
 v4:
  - Using the correct symbol CONFIG_ARCH_HAS_SCALED_CPUTIME instead of 
ARCH_HAS_SCALED_CPUTIME
  - Grouped CONFIG_ARCH_HAS_SCALED_CPUTIME related code in dedicated functions 
to reduce the number of #ifdefs
  - Integrated read_spurr() directly into the related function.
 v3: Rebased following modifications in xmon.c
 v2: added ifdefs in xmon to fix compilation error

 arch/powerpc/Kconfig  |   2 +-
 arch/powerpc/include/asm/accounting.h |   4 ++
 arch/powerpc/include/asm/cputime.h|   1 -
 arch/powerpc/kernel/time.c| 111 +-
 arch/powerpc/xmon/xmon.c  |   4 ++
 5 files changed, 77 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b62a16e2c7cc..735398fd390d 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -142,7 +142,7 @@ config PPC
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_APIif PPC64
select ARCH_HAS_MEMBARRIER_CALLBACKS
-   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE
+   select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC64
select ARCH_HAS_SG_CHAIN
select ARCH_HAS_STRICT_KERNEL_RWX   if ((PPC_BOOK3S_64 || PPC32) && 
!RELOCATABLE && !HIBERNATION)
select ARCH_HAS_TICK_BROADCAST  if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/accounting.h 
b/arch/powerpc/include/asm/accounting.h
index 3abcf98ed2e0..c607c5d835cc 100644
--- a/arch/powerpc/include/asm/accounting.h
+++ b/arch/powerpc/include/asm/accounting.h
@@ -15,8 +15,10 @@ struct cpu_accounting_data {
/* Accumulated cputime values to flush on ticks*/
unsigned long utime;
unsigned long stime;
+#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
unsigned long utime_scaled;
unsigned long stime_scaled;
+#endif
unsigned long gtime;
unsigned long hardirq_time;
unsigned long softirq_time;
@@ -25,8 +27,10 @@ struct cpu_accounting_data {
/* Internal counters */
unsigned long starttime;/* TB value snapshot */
unsigned long starttime_user;   /* TB value on exit to usermode */
+#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
unsigned long startspurr;   /* SPURR value snapshot */
unsigned long utime_sspurr; /* ->user_time when ->startspurr set */
+#endif
 };
 
 #endif
diff --git a/arch/powerpc/include/asm/cputime.h 
b/arch/powerpc/include/asm/cputime.h
index bc4903badb3f..a48c7b5e5cf9 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -62,7 +62,6 @@ static inline void arch_vtime_task_switch(struct task_struct 
*prev)
struct cpu_accounting_data *acct0 = get_accounting(prev);
 
acct->starttime = acct0->starttime;
-   acct->startspurr = acct0->startspurr;
 }
 #endif
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 70f145e02487..7a9f4e2f22c8 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -171,19 +171,6 @@ static void calc_cputime_factors(void)
__cputime_usec_factor = res.result_low;
 }
 
-/*
- * Read the SPURR on systems that have it, otherwise the PURR,
- * or if that doesn't exist return the timebase value passed in.
- */
-static unsigned long read_spurr(unsigned long 

[PATCH v4 2/2] powerpc/time: no steal_time when CONFIG_PPC_SPLPAR is not selected

2018-06-06 Thread Christophe Leroy
If CONFIG_PPC_SPLPAR is not selected, steal_time will always
be NUL, so accounting it is pointless

Signed-off-by: Christophe Leroy 
---
 v4: removed the check in vtime_account_system(), the compiler removes the code 
regardless.
 v3: new

 arch/powerpc/kernel/time.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 7a9f4e2f22c8..eda78b1ed7d3 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -412,9 +412,6 @@ void vtime_flush(struct task_struct *tsk)
if (acct->gtime)
account_guest_time(tsk, cputime_to_nsecs(acct->gtime));
 
-   if (acct->steal_time)
-   account_steal_time(cputime_to_nsecs(acct->steal_time));
-
if (acct->idle_time)
account_idle_time(cputime_to_nsecs(acct->idle_time));
 
@@ -431,13 +428,17 @@ void vtime_flush(struct task_struct *tsk)
 
acct->utime = 0;
acct->gtime = 0;
-   acct->steal_time = 0;
acct->idle_time = 0;
acct->stime = 0;
acct->hardirq_time = 0;
acct->softirq_time = 0;
 
vtime_flush_scaled(tsk, acct);
+
+   if (IS_ENABLED(CONFIG_PPC_SPLPAR) && acct->steal_time) {
+   account_steal_time(cputime_to_nsecs(acct->steal_time));
+   acct->steal_time = 0;
+   }
 }
 
 #else /* ! CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
-- 
2.13.3



RE: [PATCH 09/10] dpaa_eth: add support for hardware timestamping

2018-06-06 Thread Y.b. Lu
Hi Richard,

> -Original Message-
> From: Richard Cochran [mailto:richardcoch...@gmail.com]
> Sent: Tuesday, June 5, 2018 9:58 PM
> To: Y.b. Lu 
> Cc: net...@vger.kernel.org; Madalin-cristian Bucur
> ; Rob Herring ; Shawn Guo
> ; David S . Miller ;
> devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org
> Subject: Re: [PATCH 09/10] dpaa_eth: add support for hardware timestamping
> 
> On Tue, Jun 05, 2018 at 03:35:28AM +, Y.b. Lu wrote:
> > [Y.b. Lu] Actually these timestamping codes affected DPAA networking
> performance in our previous performance test.
> > That's why we used ifdef for it.
> 
> How much does time stamping hurt performance?
> 
> If the time stamping is compiled in but not enabled at run time, does it still
> affect performace?

[Y.b. Lu] I can't remember and find the old data since it had been a long time.
I just did the iperf test today between two 10G ports. I didn’t see any 
performance changes with timestamping code 
So, let's me remove the ifdef in next version.
Thanks a lot.


> 
> Thanks,
> Richard


[PATCH v2] powerpc: Add support for function error injection

2018-06-06 Thread Naveen N. Rao
We implement regs_set_return_value() and override_function_with_return() 
for this purpose.

On powerpc, a return from a function (blr) just branches to the location
contained in the link register. So, we can just update pt_regs rather
than redirecting execution to a dummy function that returns.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/error-injection.h | 13 +
 arch/powerpc/include/asm/ptrace.h  |  5 +
 arch/powerpc/lib/Makefile  |  2 ++
 arch/powerpc/lib/error-inject.c| 12 
 5 files changed, 33 insertions(+)
 create mode 100644 arch/powerpc/include/asm/error-injection.h
 create mode 100644 arch/powerpc/lib/error-inject.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f674006dea2f..8e61bc0f25cb 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -186,6 +186,7 @@ config PPC
select HAVE_EBPF_JITif PPC64
select HAVE_EFFICIENT_UNALIGNED_ACCESS  if !(CPU_LITTLE_ENDIAN && 
POWER7_CPU)
select HAVE_FTRACE_MCOUNT_RECORD
+   select HAVE_FUNCTION_ERROR_INJECTION
select HAVE_FUNCTION_GRAPH_TRACER
select HAVE_FUNCTION_TRACER
select HAVE_GCC_PLUGINS
diff --git a/arch/powerpc/include/asm/error-injection.h 
b/arch/powerpc/include/asm/error-injection.h
new file mode 100644
index ..740c3075bdf4
--- /dev/null
+++ b/arch/powerpc/include/asm/error-injection.h
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#ifndef _ASM_ERROR_INJECTION_H
+#define _ASM_ERROR_INJECTION_H
+
+#include 
+#include 
+#include 
+#include 
+
+void override_function_with_return(struct pt_regs *regs);
+
+#endif /* _ASM_ERROR_INJECTION_H */
diff --git a/arch/powerpc/include/asm/ptrace.h 
b/arch/powerpc/include/asm/ptrace.h
index e4923686e43a..c0705296c2f0 100644
--- a/arch/powerpc/include/asm/ptrace.h
+++ b/arch/powerpc/include/asm/ptrace.h
@@ -101,6 +101,11 @@ static inline long regs_return_value(struct pt_regs *regs)
return -regs->gpr[3];
 }
 
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long 
rc)
+{
+   regs->gpr[3] = rc;
+}
+
 #ifdef __powerpc64__
 #define user_mode(regs) regs)->msr) >> MSR_PR_LG) & 0x1)
 #else
diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index 653901042ad7..e0ed195eaa4b 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -14,6 +14,8 @@ obj-y += string.o alloc.o code-patching.o feature-fixups.o
 
 obj-$(CONFIG_PPC32)+= div64.o copy_32.o crtsavres.o
 
+obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o
+
 # See corresponding test in arch/powerpc/Makefile
 # 64-bit linker creates .sfpr on demand for final link (vmlinux),
 # so it is only needed for modules, and only for older linkers which
diff --git a/arch/powerpc/lib/error-inject.c b/arch/powerpc/lib/error-inject.c
new file mode 100644
index ..6d3d57f68024
--- /dev/null
+++ b/arch/powerpc/lib/error-inject.c
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#include 
+#include 
+#include 
+
+void override_function_with_return(struct pt_regs *regs)
+{
+   /* Emulate 'blr' instruction */
+   regs->nip = regs->link;
+}
+NOKPROBE_SYMBOL(override_function_with_return);
-- 
2.17.0



Re: [PATCH 2/2] powerpc: Add support for function error injection

2018-06-06 Thread Naveen N. Rao

Naveen N. Rao wrote:

Michael Ellerman wrote:


I guess if it doesn't already apply to tip you should rebase it. You've
probably missed 4.18 anyway.


Oh ok. I just tried and it seems to apply just fine. I'll post v2 after 
giving this a quick test.


I didn't post a v2 since I have decided against using this approach. The 
reason for that is Masami's series to remove jprobes. The discussion 
there reminded me that we can't easily override functions with kprobes 
on powerpc. Though it works for this particular scenario, we would just 
be setting a bad example.


As such, I won't be changing generic code, but will simply make the 
necessary changes in powerpc code.


Sorry for the noise.

- Naveen




Re: [PATCH 08/11] macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver

2018-06-06 Thread Geert Uytterhoeven
Hi Finn,

On Wed, Jun 6, 2018 at 8:57 AM, Finn Thain  wrote:
> On Mon, 4 Jun 2018, Geert Uytterhoeven wrote:
>> > Don't call pmu_shutdown() or pmu_restart() on early PowerBooks: the
>> > PMU device found in these PowerBooks isn't supported.
>>
>> Shouldn't that be a separate patch?
>>
>> > --- a/arch/m68k/mac/misc.c
>> > +++ b/arch/m68k/mac/misc.c
>>
>> > @@ -477,9 +445,8 @@ void mac_poweroff(void)
>> >macintosh_config->adb_type == MAC_ADB_CUDA) {
>> > cuda_shutdown();
>> >  #endif
>> > -#ifdef CONFIG_ADB_PMU68K
>> > -   } else if (macintosh_config->adb_type == MAC_ADB_PB1
>> > -   || macintosh_config->adb_type == MAC_ADB_PB2) {
>> > +#ifdef CONFIG_ADB_PMU
>> > +   } else if (macintosh_config->adb_type == MAC_ADB_PB2) {
>> > pmu_shutdown();
>> >  #endif
>> > }
>> > @@ -519,9 +486,8 @@ void mac_reset(void)
>> >macintosh_config->adb_type == MAC_ADB_CUDA) {
>> > cuda_restart();
>> >  #endif
>> > -#ifdef CONFIG_ADB_PMU68K
>> > -   } else if (macintosh_config->adb_type == MAC_ADB_PB1
>> > -   || macintosh_config->adb_type == MAC_ADB_PB2) {
>> > +#ifdef CONFIG_ADB_PMU
>> > +   } else if (macintosh_config->adb_type == MAC_ADB_PB2) {
>> > pmu_restart();
>> >  #endif
>> > } else if (CPU_IS_030) {
>>
>
> The stability problem here is bigger than just pmu_restart() and
> pmu_shutdown(), so those hunks are insufficient. You need to prevent the
> via-pmu68k driver from loading in the first place and to drop all the
> PMU_68K_V1 cases.
>
> But splitting this patch in that way creates potential merge conflicts,
> which is a hassle. E.g. this hunk:
>
> -   
> -   switch (macintosh_config->adb_type) {
> -   case MAC_ADB_PB1:
> -   pmu_kind = PMU_68K_V1;
> -   break;
> -   case MAC_ADB_PB2:
> -   pmu_kind = PMU_68K_V2;
> -   break;
> -   default:
> -   pmu_kind = PMU_UNKNOWN;
> -   return -ENODEV;
> -   }
> -   ...
>
> would get split over two patches.
>
> The way I see it, having no PMU driver for PMU_68K_V1 machines is a bug.
> And loading any of the existing PMU drivers on that hardware is also a
> bug. These bugs are equivalent in that either one means you can't use the
> keyboard, trackpad etc. So there's no value in cherry-picking the other
> bug.
>
> If you are using v4.17 on a PMU_68K_V1 machine, you probably already have
> CONFIG_ADB_PMU68K=n. With that config, here's nothing to be gained from
> bisecting these changes.
>
> Does that make sense? Did I overlook other possible reason(s) to split up
> this patch?

So 4.17 mac_defconfig won't work on PMU_68K_V1 machines?
Perhaps that should be fixed first.

Gr{oetje,eeting}s,

Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 08/11] macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver

2018-06-06 Thread Finn Thain
On Mon, 4 Jun 2018, Geert Uytterhoeven wrote:

> 
> > Don't call pmu_shutdown() or pmu_restart() on early PowerBooks: the 
> > PMU device found in these PowerBooks isn't supported.
> 
> Shouldn't that be a separate patch?
> 
> > --- a/arch/m68k/mac/misc.c
> > +++ b/arch/m68k/mac/misc.c
> 
> > @@ -477,9 +445,8 @@ void mac_poweroff(void)
> >macintosh_config->adb_type == MAC_ADB_CUDA) {
> > cuda_shutdown();
> >  #endif
> > -#ifdef CONFIG_ADB_PMU68K
> > -   } else if (macintosh_config->adb_type == MAC_ADB_PB1
> > -   || macintosh_config->adb_type == MAC_ADB_PB2) {
> > +#ifdef CONFIG_ADB_PMU
> > +   } else if (macintosh_config->adb_type == MAC_ADB_PB2) {
> > pmu_shutdown();
> >  #endif
> > }
> > @@ -519,9 +486,8 @@ void mac_reset(void)
> >macintosh_config->adb_type == MAC_ADB_CUDA) {
> > cuda_restart();
> >  #endif
> > -#ifdef CONFIG_ADB_PMU68K
> > -   } else if (macintosh_config->adb_type == MAC_ADB_PB1
> > -   || macintosh_config->adb_type == MAC_ADB_PB2) {
> > +#ifdef CONFIG_ADB_PMU
> > +   } else if (macintosh_config->adb_type == MAC_ADB_PB2) {
> > pmu_restart();
> >  #endif
> > } else if (CPU_IS_030) {
> 

The stability problem here is bigger than just pmu_restart() and 
pmu_shutdown(), so those hunks are insufficient. You need to prevent the 
via-pmu68k driver from loading in the first place and to drop all the 
PMU_68K_V1 cases.

But splitting this patch in that way creates potential merge conflicts, 
which is a hassle. E.g. this hunk:

-   
-   switch (macintosh_config->adb_type) {
-   case MAC_ADB_PB1:
-   pmu_kind = PMU_68K_V1;
-   break;
-   case MAC_ADB_PB2:
-   pmu_kind = PMU_68K_V2;
-   break;
-   default:
-   pmu_kind = PMU_UNKNOWN;
-   return -ENODEV;
-   }
-   ...

would get split over two patches.

The way I see it, having no PMU driver for PMU_68K_V1 machines is a bug. 
And loading any of the existing PMU drivers on that hardware is also a 
bug. These bugs are equivalent in that either one means you can't use the 
keyboard, trackpad etc. So there's no value in cherry-picking the other 
bug.

If you are using v4.17 on a PMU_68K_V1 machine, you probably already have 
CONFIG_ADB_PMU68K=n. With that config, here's nothing to be gained from 
bisecting these changes.

Does that make sense? Did I overlook other possible reason(s) to split up 
this patch?

-- 


Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization

2018-06-06 Thread Simon Guo
Hi Naveen,
On Wed, Jun 06, 2018 at 12:06:09PM +0530, Naveen N. Rao wrote:
> Simon Guo wrote:
> >Hi Michael,
> >On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:
> >>Hi Simon,
> >>
> >>wei.guo.si...@gmail.com writes:
> >>> From: Simon Guo 
> >>>
> >>> There is some room to optimize memcmp() in powerpc 64 bits version for
> >>> following 2 cases:
> >>> (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
> >>> memcmp() can align them and go with .Llong comparision mode without
> >>> fallback to .Lshort comparision mode do compare buffer byte by byte.
> >>> (2) VMX instructions can be used to speed up for large size comparision,
> >>> currently the threshold is set for 4K bytes. Notes the VMX instructions
> >>> will lead to VMX regs save/load penalty. This patch set includes a
> >>> patch to add a 32 bytes pre-checking to minimize the penalty.
> >>>
> >>> It did the similar with glibc commit dec4a7105e (powerpc:
> >>Improve memcmp > performance for POWER8). Thanks Cyril Bur's
> >>information.
> >>> This patch set also updates memcmp selftest case to make it compiled and
> >>> incorporate large size comparison case.
> >>
> >>I'm seeing a few crashes with this applied, I haven't had time to look
> >>into what is happening yet, sorry.
> >>
> >
> >The bug is due to memcmp() invokes a C function enter_vmx_ops()
> >who will load some PIC value based on r2.
> >
> >memcmp() doesn't use r2 and if the memcmp() is invoked from kernel
> >itself, everything is fine. But if memcmp() is invoked from
> >modules[test_user_copy], r2 will be required to be setup
> >correctly. Otherwise the enter_vmx_ops() will refer to an
> >incorrect/unexisting data location based on wrong r2 value.
> >
> >Following patch will fix this issue:
> >
> >diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
> >index 5eba49744a5a..24d093fa89bb 100644
> >--- a/arch/powerpc/lib/memcmp_64.S
> >+++ b/arch/powerpc/lib/memcmp_64.S
> >@@ -102,7 +102,7 @@
> >  * 2) src/dst has different offset to the 8 bytes boundary. The handlers
> >  * are named like .Ldiffoffset_
> >  */
> >-_GLOBAL(memcmp)
> >+_GLOBAL_TOC(memcmp)
> >cmpdi   cr1,r5,0
> >
> >/* Use the short loop if the src/dst addresses are not
> >--
> >
> >It means the memcmp() fun entry will have additional 2 instructions. Is there
> >any way to save these 2 instructions when the memcmp() is actually invoked
> >from kernel itself?
> 
> That will be the case. We will end up entering the function via the
> local entry point skipping the first two instructions. The Global
> entry point is only used for cross-module calls.
> 

Yes. Thanks :)

- Simon


Re: Problems building ppc images in v4.14.y and v4.16.y using gcc 7.3.0 / 8.1.0 from kernel.org

2018-06-06 Thread Christophe LEROY




Le 05/06/2018 à 21:47, Arnd Bergmann a écrit :

On Tue, Jun 5, 2018 at 6:06 PM, Guenter Roeck  wrote:

On Tue, Jun 05, 2018 at 04:31:00PM +0200, Arnd Bergmann wrote:

On Tue, Jun 5, 2018 at 3:52 PM, Guenter Roeck  wrote:

Hi Arnd,

when using the ppc64 compiler from kernel.org, I see the following problems
when trying to compile ppc:allnoconfig in v4.14.y or v4.16.y.

gcc 7.3.0: Compilation of kernel.cpu.o hangs

The problem goes away if I apply the following two patches (tested with
4.16.y)

17a2f1ced028 cpu/hotplug: Merge cpuhp_bp_states and cpuhp_ap_states
fcb3029a8d89 cpu/hotplug: Fix unused function warning


This is probably the same as
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84038

I thought I had included the fix in my builds.


Guess not.


I probably had it in one build and then forgot about it when I did a
rebuild of 7.3 :(

I'm still planning to do a new set of gcc-7.3 binaries (or maybe 7.4
if that gets
released soon) and should try to remember doing that.




I think it may have cached the flags from the other compiler version.
"make mrproper" prior to "make defconfig" took care of the issue.

However, that doesn't really help - I get lots of
 error: 'sys_spu_create' alias between functions of incompatible types
 error: 'strncpy' output truncated before terminating nul
if I try to use gcc 8.1.0.

Oh well. I'll try gcc 6.4.0 next.


On the upside, those two errors are just a result of arch/power/*/*.c getting
built with -Werror, they are warnings that gcc-8 introduced that we should
either shut up or fix.


They are fixed in next:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=2479bfc9bc600dcce7f932d52dcfa8d677c41f93
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=c95998811807d897ca112ea62d66716ed733d058

Christophe



 Arnd



Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision

2018-06-06 Thread Simon Guo
Hi segher,
On Wed, May 30, 2018 at 05:03:21PM +0800, Simon Guo wrote:
> On Wed, May 30, 2018 at 03:35:40AM -0500, Segher Boessenkool wrote:
> > On Wed, May 30, 2018 at 04:14:02PM +0800, Simon Guo wrote:
> > > Hi Segher,
> > > On Mon, May 28, 2018 at 06:05:59AM -0500, Segher Boessenkool wrote:
> > > > On Fri, May 25, 2018 at 12:07:34PM +0800, wei.guo.si...@gmail.com wrote:
> > > > > + /* save and restore cr0 */
> > > > > + mfocrf  r5,64
> > > > > + EXIT_VMX_OPS
> > > > > + mtocrf  64,r5
> > > > > + b   .LcmpAB_lightweight
> > > > 
> > > > That's cr1, not cr0.  You can use mcrf instead, it is cheaper (esp. if
> > > > you have it in a non-volatile CR field before so you need only one, if 
> > > > any).
> > > > 
> > > You are right :) How about using mtcr/mfcr instead, I think they are
> > > fast as well and more readable.
> > 
> > Those are much worse than m[ft]ocrf.
> > 
> > You probably should just shuffle things around so that EXIT_VMX_OPS
> > does not clobber the CR field you need to keep.
> Let me use mcrf then :)

I now felt unformatable to use mcrf like:
mcrf7,0

since I cannot 100% confident that compiler will not use CR7 or other
CR# in exit_vmx_ops().

Can we switch back to mfocrf/mtocrf with correct CR0 value?
   mfocrf  r5,128
...
   mtocrf  128,r5

Thanks,
- Simon


Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization

2018-06-06 Thread Naveen N. Rao

Simon Guo wrote:

Hi Michael,
On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:

Hi Simon,

wei.guo.si...@gmail.com writes:
> From: Simon Guo 
>
> There is some room to optimize memcmp() in powerpc 64 bits version for
> following 2 cases:
> (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
> memcmp() can align them and go with .Llong comparision mode without
> fallback to .Lshort comparision mode do compare buffer byte by byte.
> (2) VMX instructions can be used to speed up for large size comparision,
> currently the threshold is set for 4K bytes. Notes the VMX instructions
> will lead to VMX regs save/load penalty. This patch set includes a
> patch to add a 32 bytes pre-checking to minimize the penalty.
>
> It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp 
> performance for POWER8). Thanks Cyril Bur's information.

> This patch set also updates memcmp selftest case to make it compiled and
> incorporate large size comparison case.

I'm seeing a few crashes with this applied, I haven't had time to look
into what is happening yet, sorry.



The bug is due to memcmp() invokes a C function enter_vmx_ops() who will load 
some PIC value based on r2.


memcmp() doesn't use r2 and if the memcmp() is invoked from kernel
itself, everything is fine. But if memcmp() is invoked from modules[test_user_copy], 
r2 will be required to be setup correctly. Otherwise the enter_vmx_ops() will refer 
to an incorrect/unexisting data location based on wrong r2 value.


Following patch will fix this issue:

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 5eba49744a5a..24d093fa89bb 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -102,7 +102,7 @@
  * 2) src/dst has different offset to the 8 bytes boundary. The handlers
  * are named like .Ldiffoffset_
  */
-_GLOBAL(memcmp)
+_GLOBAL_TOC(memcmp)
cmpdi   cr1,r5,0

/* Use the short loop if the src/dst addresses are not
--

It means the memcmp() fun entry will have additional 2 instructions. Is there
any way to save these 2 instructions when the memcmp() is actually invoked
from kernel itself?


That will be the case. We will end up entering the function via the 
local entry point skipping the first two instructions. The Global entry 
point is only used for cross-module calls.


- Naveen




Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization

2018-06-06 Thread Simon Guo
Hi Michael,
On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:
> Hi Simon,
> 
> wei.guo.si...@gmail.com writes:
> > From: Simon Guo 
> >
> > There is some room to optimize memcmp() in powerpc 64 bits version for
> > following 2 cases:
> > (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
> > memcmp() can align them and go with .Llong comparision mode without
> > fallback to .Lshort comparision mode do compare buffer byte by byte.
> > (2) VMX instructions can be used to speed up for large size comparision,
> > currently the threshold is set for 4K bytes. Notes the VMX instructions
> > will lead to VMX regs save/load penalty. This patch set includes a
> > patch to add a 32 bytes pre-checking to minimize the penalty.
> >
> > It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp 
> > performance for POWER8). Thanks Cyril Bur's information.
> > This patch set also updates memcmp selftest case to make it compiled and
> > incorporate large size comparison case.
> 
> I'm seeing a few crashes with this applied, I haven't had time to look
> into what is happening yet, sorry.
> 

The bug is due to memcmp() invokes a C function enter_vmx_ops() who will load 
some PIC value based on r2.

memcmp() doesn't use r2 and if the memcmp() is invoked from kernel
itself, everything is fine. But if memcmp() is invoked from 
modules[test_user_copy], 
r2 will be required to be setup correctly. Otherwise the enter_vmx_ops() will 
refer 
to an incorrect/unexisting data location based on wrong r2 value.

Following patch will fix this issue:

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 5eba49744a5a..24d093fa89bb 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -102,7 +102,7 @@
  * 2) src/dst has different offset to the 8 bytes boundary. The handlers
  * are named like .Ldiffoffset_
  */
-_GLOBAL(memcmp)
+_GLOBAL_TOC(memcmp)
cmpdi   cr1,r5,0

/* Use the short loop if the src/dst addresses are not
--

It means the memcmp() fun entry will have additional 2 instructions. Is there
any way to save these 2 instructions when the memcmp() is actually invoked
from kernel itself?

Again thanks for finding this issue.

Thanks,
- Simon

> [ 2471.300595] kselftest: Running tests in user
> [ 2471.302785] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 44883
> [ 2471.302892] Unable to handle kernel paging request for data at address 
> 0xc00818553005
> [ 2471.303014] Faulting instruction address: 0xc001f29c
> [ 2471.303119] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 2471.303193] LE SMP NR_CPUS=2048 NUMA PowerNV
> [ 2471.303256] Modules linked in: test_user_copy(+) vxlan ip6_udp_tunnel 
> udp_tunnel 8021q bridge stp llc dummy test_printf test_firmware vmx_crypto 
> crct10dif_vpmsum crct10dif_common crc32c_vpmsum veth [last unloaded: 
> test_static_key_base]
> [ 2471.303532] CPU: 4 PID: 44883 Comm: modprobe Tainted: GW 
> 4.17.0-rc3-gcc7x-g7204012 #1
> [ 2471.303644] NIP:  c001f29c LR: c001f6e4 CTR: 
> 
> [ 2471.303754] REGS: c01fddc2b560 TRAP: 0300   Tainted: GW
>   (4.17.0-rc3-gcc7x-g7204012)
> [ 2471.303873] MSR:  92009033   CR: 
> 24222844  XER: 
> [ 2471.303996] CFAR: c001f6e0 DAR: c00818553005 DSISR: 4000 
> IRQMASK: 0 
> [ 2471.303996] GPR00: c001f6e4 c01fddc2b7e0 c00818529900 
> 0200 
> [ 2471.303996] GPR04: c01fe4b90020 ffe0  
> 03fe01b48000 
> [ 2471.303996] GPR08: 8000 c00818553005 c01fddc28000 
> c00818520df0 
> [ 2471.303996] GPR12: c009c430 c01fbc00 2000 
>  
> [ 2471.303996] GPR16: c01fddc2bc20 0030 c01f7ba0 
> 0001 
> [ 2471.303996] GPR20:  c0c772b0 c10b4018 
>  
> [ 2471.303996] GPR24:  c00818521c98  
> c01fe4b9 
> [ 2471.303996] GPR28: fff4 0200 92009033 
> 92009033 
> [ 2471.304930] NIP [c001f29c] msr_check_and_set+0x3c/0xc0
> [ 2471.305008] LR [c001f6e4] enable_kernel_altivec+0x44/0x100
> [ 2471.305084] Call Trace:
> [ 2471.305122] [c01fddc2b7e0] [c009baa8] 
> __copy_tofrom_user_base+0x9c/0x574 (unreliable)
> [ 2471.305240] [c01fddc2b860] [c001f6e4] 
> enable_kernel_altivec+0x44/0x100
> [ 2471.305336] [c01fddc2b890] [c009ce40] enter_vmx_ops+0x50/0x70
> [ 2471.305418] [c01fddc2b8b0] [c009c768] memcmp+0x338/0x680
> [ 2471.305501] [c01fddc2b9b0] [c00818520190] 
> test_user_copy_init+0x188/0xd14 [test_user_copy]
> [ 2471.305617] [c01fddc2ba60] [c000de20] 
> do_one_initcall+0x90/0x560
> [ 2471.305710] [c01fddc2bb30] [c0200630] do_init_module+0x90/0x260
> [