Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
On Thu, May 31, 2018 at 08:43:58PM +0300, Michael S. Tsirkin wrote: > Pls work on a long term solution. Short term needs can be served by > enabling the iommu platform in qemu. So, I spent some time looking at converting virtio to dma ops overrides, and the current virtio spec, and the sad through I have to tell is that both the spec and the Linux implementation are complete and utterly fucked up. Both in the flag naming and the implementation there is an implication of DMA API == IOMMU, which is fundamentally wrong. The DMA API does a few different things: a) address translation This does include IOMMUs. But it also includes random offsets between PCI bars and system memory that we see on various platforms. Worse so some of these offsets might be based on banks, e.g. on the broadcom bmips platform. It also deals with bitmask in physical addresses related to memory encryption like AMD SEV. I'd be really curious how for example the Intel virtio based NIC is going to work on any of those plaforms. b) coherency On many architectures DMA is not cache coherent, and we need to invalidate and/or write back cache lines before doing DMA. Again, I wonder how this is every going to work with hardware based virtio implementations. Even worse I think this is actually broken at least for VIVT event for virtualized implementations. E.g. a KVM guest is going to access memory using different virtual addresses than qemu, vhost might throw in another different address space. c) bounce buffering Many DMA implementations can not address all physical memory due to addressing limitations. In such cases we copy the DMA memory into a known addressable bounc buffer and DMA from there. d) flushing write combining buffers or similar On some hardware platforms we need workarounds to e.g. read from a certain mmio address to make sure DMA can actually see memory written by the host. All of this is bypassed by virtio by default despite generally being platform issues, not particular to a given device.
Re: [PATCH 08/11] macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver
On Wed, 6 Jun 2018, Geert Uytterhoeven wrote: > So 4.17 mac_defconfig won't work on PMU_68K_V1 machines? Perhaps that > should be fixed first. > AFAICT there's nothing to fix. $ grep PMU arch/m68k/configs/mac_defconfig CONFIG_ADB_PMU68K=y This works fine for PMU_68K_V2 machines. But if you try to boot a PMU_68K_V1 machine, you will have trouble starting the kernel. Sometimes you can boot by fiddling with ADB devices, but not always. This is not a new problem. The via-pmu68k driver has never worked on PMU_68K_V1 machines. So as far as PMU_68K_V1 is concerned, there's no need to bisect any of the changes under review. To answer your question, "Shouldn't that be a separate patch?", all I can say is "not to my knowledge". But I'm probably missing something. --
[v2, 10/10] dpaa_eth: add the get_ts_info interface for ethtool
Added the get_ts_info interface for ethtool to check the timestamping capability. Signed-off-by: Yangbo Lu --- Changes for v2: - Removed ifdef for hw timestamp. --- drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c | 39 1 files changed, 39 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c b/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c index 2f933b6..3184c8f 100644 --- a/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c +++ b/drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c @@ -32,6 +32,9 @@ #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt #include +#include +#include +#include #include "dpaa_eth.h" #include "mac.h" @@ -515,6 +518,41 @@ static int dpaa_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd) return ret; } +static int dpaa_get_ts_info(struct net_device *net_dev, + struct ethtool_ts_info *info) +{ + struct device *dev = net_dev->dev.parent; + struct device_node *mac_node = dev->of_node; + struct device_node *fman_node = NULL, *ptp_node = NULL; + struct platform_device *ptp_dev = NULL; + struct qoriq_ptp *ptp = NULL; + + info->phc_index = -1; + + fman_node = of_get_parent(mac_node); + if (fman_node) + ptp_node = of_parse_phandle(fman_node, "ptimer-handle", 0); + + if (ptp_node) + ptp_dev = of_find_device_by_node(ptp_node); + + if (ptp_dev) + ptp = platform_get_drvdata(ptp_dev); + + if (ptp) + info->phc_index = ptp->phc_index; + + info->so_timestamping = SOF_TIMESTAMPING_TX_HARDWARE | + SOF_TIMESTAMPING_RX_HARDWARE | + SOF_TIMESTAMPING_RAW_HARDWARE; + info->tx_types = (1 << HWTSTAMP_TX_OFF) | +(1 << HWTSTAMP_TX_ON); + info->rx_filters = (1 << HWTSTAMP_FILTER_NONE) | + (1 << HWTSTAMP_FILTER_ALL); + + return 0; +} + const struct ethtool_ops dpaa_ethtool_ops = { .get_drvinfo = dpaa_get_drvinfo, .get_msglevel = dpaa_get_msglevel, @@ -530,4 +568,5 @@ static int dpaa_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd) .set_link_ksettings = dpaa_set_link_ksettings, .get_rxnfc = dpaa_get_rxnfc, .set_rxnfc = dpaa_set_rxnfc, + .get_ts_info = dpaa_get_ts_info, }; -- 1.7.1
[v2, 09/10] dpaa_eth: add support for hardware timestamping
This patch is to add hardware timestamping support for dpaa_eth. On Rx, timestamping is enabled for all frames. On Tx, we only instruct the hardware to timestamp the frames marked accordingly by the stack. Signed-off-by: Yangbo Lu --- Changes for v2: - Removed ifdef for timestamp code. - Minor fixes for code style. --- drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 101 ++- drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |3 + 2 files changed, 99 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c index fd43f98..bd589ac 100644 --- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c +++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c @@ -1168,7 +1168,7 @@ static int dpaa_eth_init_tx_port(struct fman_port *port, struct dpaa_fq *errq, buf_prefix_content.priv_data_size = buf_layout->priv_data_size; buf_prefix_content.pass_prs_result = true; buf_prefix_content.pass_hash_result = true; - buf_prefix_content.pass_time_stamp = false; + buf_prefix_content.pass_time_stamp = true; buf_prefix_content.data_align = DPAA_FD_DATA_ALIGNMENT; params.specific_params.non_rx_params.err_fqid = errq->fqid; @@ -1210,7 +1210,7 @@ static int dpaa_eth_init_rx_port(struct fman_port *port, struct dpaa_bp **bps, buf_prefix_content.priv_data_size = buf_layout->priv_data_size; buf_prefix_content.pass_prs_result = true; buf_prefix_content.pass_hash_result = true; - buf_prefix_content.pass_time_stamp = false; + buf_prefix_content.pass_time_stamp = true; buf_prefix_content.data_align = DPAA_FD_DATA_ALIGNMENT; rx_p = _params.rx_params; @@ -1592,6 +1592,16 @@ static int dpaa_eth_refill_bpools(struct dpaa_priv *priv) return 0; } +static int dpaa_get_tstamp_ns(struct net_device *net_dev, u64 *ns, + struct fman_port *port, const void *data) +{ + if (!fman_port_get_tstamp_field(port, data, ns)) { + be64_to_cpus(ns); + return 0; + } + return -EINVAL; +} + /* Cleanup function for outgoing frame descriptors that were built on Tx path, * either contiguous frames or scatter/gather ones. * Skb freeing is not handled here. @@ -1607,14 +1617,29 @@ static int dpaa_eth_refill_bpools(struct dpaa_priv *priv) { const enum dma_data_direction dma_dir = DMA_TO_DEVICE; struct device *dev = priv->net_dev->dev.parent; + struct skb_shared_hwtstamps shhwtstamps; dma_addr_t addr = qm_fd_addr(fd); const struct qm_sg_entry *sgt; struct sk_buff **skbh, *skb; int nr_frags, i; + u64 ns; skbh = (struct sk_buff **)phys_to_virt(addr); skb = *skbh; + if (priv->tx_tstamp && skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) { + memset(, 0, sizeof(shhwtstamps)); + + if (!dpaa_get_tstamp_ns(priv->net_dev, , + priv->mac_dev->port[TX], + (void *)skbh)) { + shhwtstamps.hwtstamp = ns_to_ktime(ns); + skb_tstamp_tx(skb, ); + } else { + dev_warn(dev, "dpaa_get_tstamp_ns failed!\n"); + } + } + if (unlikely(qm_fd_get_format(fd) == qm_fd_sg)) { nr_frags = skb_shinfo(skb)->nr_frags; dma_unmap_single(dev, addr, qm_fd_get_offset(fd) + @@ -2086,6 +2111,11 @@ static int dpaa_start_xmit(struct sk_buff *skb, struct net_device *net_dev) if (unlikely(err < 0)) goto skb_to_fd_failed; + if (priv->tx_tstamp && skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) { + fd.cmd |= FM_FD_CMD_UPD; + skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; + } + if (likely(dpaa_xmit(priv, percpu_stats, queue_mapping, ) == 0)) return NETDEV_TX_OK; @@ -2227,6 +2257,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal, struct qman_fq *fq, const struct qm_dqrr_entry *dq) { + struct skb_shared_hwtstamps *shhwtstamps; struct rtnl_link_stats64 *percpu_stats; struct dpaa_percpu_priv *percpu_priv; const struct qm_fd *fd = >fd; @@ -2240,6 +2271,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal, struct sk_buff *skb; int *count_ptr; void *vaddr; + u64 ns; fd_status = be32_to_cpu(fd->status); fd_format = qm_fd_get_format(fd); @@ -2304,6 +2336,18 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal, if (!skb) return qman_cb_dqrr_consume; + if (priv->rx_tstamp) { + shhwtstamps = skb_hwtstamps(skb); +
[v2, 08/10] fsl/fman: define frame description command UPD
Defined frame description command FM_FD_CMD_UPD for prepended data updating. Signed-off-by: Yangbo Lu --- Changes for v2: - None. --- drivers/net/ethernet/freescale/fman/fman.h |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/freescale/fman/fman.h b/drivers/net/ethernet/freescale/fman/fman.h index bfa02e0..935c317 100644 --- a/drivers/net/ethernet/freescale/fman/fman.h +++ b/drivers/net/ethernet/freescale/fman/fman.h @@ -41,6 +41,7 @@ /* Frame queue Context Override */ #define FM_FD_CMD_FCO 0x8000 #define FM_FD_CMD_RPD 0x4000 /* Read Prepended Data */ +#define FM_FD_CMD_UPD 0x2000 /* Update Prepended Data */ #define FM_FD_CMD_DTC 0x1000 /* Do L4 Checksum */ /* TX-Port: Unsupported Format */ -- 1.7.1
[v2, 07/10] fsl/fman_port: support getting timestamp field
This patch is to add fman_port_get_tstamp_field() interface to get timestamp field data. Signed-off-by: Yangbo Lu --- Changes for v2: - None. --- drivers/net/ethernet/freescale/fman/fman_port.c | 12 drivers/net/ethernet/freescale/fman/fman_port.h |3 +++ 2 files changed, 15 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/freescale/fman/fman_port.c b/drivers/net/ethernet/freescale/fman/fman_port.c index ce6e24c..86f0094 100644 --- a/drivers/net/ethernet/freescale/fman/fman_port.c +++ b/drivers/net/ethernet/freescale/fman/fman_port.c @@ -1731,6 +1731,18 @@ int fman_port_get_hash_result_offset(struct fman_port *port, u32 *offset) } EXPORT_SYMBOL(fman_port_get_hash_result_offset); +int fman_port_get_tstamp_field(struct fman_port *port, const void *data, + u64 *tstamp) +{ + if (port->buffer_offsets.time_stamp_offset == ILLEGAL_BASE) + return -EINVAL; + + *tstamp = *(u64 *)(data + port->buffer_offsets.time_stamp_offset); + + return 0; +} +EXPORT_SYMBOL(fman_port_get_tstamp_field); + static int fman_port_probe(struct platform_device *of_dev) { struct fman_port *port; diff --git a/drivers/net/ethernet/freescale/fman/fman_port.h b/drivers/net/ethernet/freescale/fman/fman_port.h index e86ca6a..d10e48d 100644 --- a/drivers/net/ethernet/freescale/fman/fman_port.h +++ b/drivers/net/ethernet/freescale/fman/fman_port.h @@ -153,6 +153,9 @@ int fman_port_cfg_buf_prefix_content(struct fman_port *port, int fman_port_get_hash_result_offset(struct fman_port *port, u32 *offset); +int fman_port_get_tstamp_field(struct fman_port *port, const void *data, + u64 *tstamp); + struct fman_port *fman_port_bind(struct device *dev); #endif /* __FMAN_PORT_H */ -- 1.7.1
[v2, 06/10] fsl/fman: add set_tstamp interface
This patch is to add set_tstamp interface for memac, dtsec, and 10GEC controllers to configure HW timestamping. Signed-off-by: Yangbo Lu --- Changes for v2: - None. --- drivers/net/ethernet/freescale/fman/fman_dtsec.c | 27 ++ drivers/net/ethernet/freescale/fman/fman_dtsec.h |1 + drivers/net/ethernet/freescale/fman/fman_memac.c |5 drivers/net/ethernet/freescale/fman/fman_memac.h |1 + drivers/net/ethernet/freescale/fman/fman_tgec.c | 21 + drivers/net/ethernet/freescale/fman/fman_tgec.h |1 + drivers/net/ethernet/freescale/fman/mac.c|3 ++ drivers/net/ethernet/freescale/fman/mac.h|1 + 8 files changed, 60 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/freescale/fman/fman_dtsec.c b/drivers/net/ethernet/freescale/fman/fman_dtsec.c index 57b1e2b..1ca543a 100644 --- a/drivers/net/ethernet/freescale/fman/fman_dtsec.c +++ b/drivers/net/ethernet/freescale/fman/fman_dtsec.c @@ -123,11 +123,13 @@ #define DTSEC_ECNTRL_R100M 0x0008 #define DTSEC_ECNTRL_QSGMIIM 0x0001 +#define TCTRL_TTSE 0x0040 #define TCTRL_GTS 0x0020 #define RCTRL_PAL_MASK 0x001f #define RCTRL_PAL_SHIFT16 #define RCTRL_GHTX 0x0400 +#define RCTRL_RTSE 0x0040 #define RCTRL_GRS 0x0020 #define RCTRL_MPROM0x0008 #define RCTRL_RSF 0x0004 @@ -1136,6 +1138,31 @@ int dtsec_set_allmulti(struct fman_mac *dtsec, bool enable) return 0; } +int dtsec_set_tstamp(struct fman_mac *dtsec, bool enable) +{ + struct dtsec_regs __iomem *regs = dtsec->regs; + u32 rctrl, tctrl; + + if (!is_init_done(dtsec->dtsec_drv_param)) + return -EINVAL; + + rctrl = ioread32be(>rctrl); + tctrl = ioread32be(>tctrl); + + if (enable) { + rctrl |= RCTRL_RTSE; + tctrl |= TCTRL_TTSE; + } else { + rctrl &= ~RCTRL_RTSE; + tctrl &= ~TCTRL_TTSE; + } + + iowrite32be(rctrl, >rctrl); + iowrite32be(tctrl, >tctrl); + + return 0; +} + int dtsec_del_hash_mac_address(struct fman_mac *dtsec, enet_addr_t *eth_addr) { struct dtsec_regs __iomem *regs = dtsec->regs; diff --git a/drivers/net/ethernet/freescale/fman/fman_dtsec.h b/drivers/net/ethernet/freescale/fman/fman_dtsec.h index 1a689ad..5149d96 100644 --- a/drivers/net/ethernet/freescale/fman/fman_dtsec.h +++ b/drivers/net/ethernet/freescale/fman/fman_dtsec.h @@ -56,5 +56,6 @@ int dtsec_set_exception(struct fman_mac *dtsec, int dtsec_del_hash_mac_address(struct fman_mac *dtsec, enet_addr_t *eth_addr); int dtsec_get_version(struct fman_mac *dtsec, u32 *mac_version); int dtsec_set_allmulti(struct fman_mac *dtsec, bool enable); +int dtsec_set_tstamp(struct fman_mac *dtsec, bool enable); #endif /* __DTSEC_H */ diff --git a/drivers/net/ethernet/freescale/fman/fman_memac.c b/drivers/net/ethernet/freescale/fman/fman_memac.c index 446a97b..bc6eb30 100644 --- a/drivers/net/ethernet/freescale/fman/fman_memac.c +++ b/drivers/net/ethernet/freescale/fman/fman_memac.c @@ -964,6 +964,11 @@ int memac_set_allmulti(struct fman_mac *memac, bool enable) return 0; } +int memac_set_tstamp(struct fman_mac *memac, bool enable) +{ + return 0; /* Always enabled. */ +} + int memac_del_hash_mac_address(struct fman_mac *memac, enet_addr_t *eth_addr) { struct memac_regs __iomem *regs = memac->regs; diff --git a/drivers/net/ethernet/freescale/fman/fman_memac.h b/drivers/net/ethernet/freescale/fman/fman_memac.h index b5a5033..b2c671e 100644 --- a/drivers/net/ethernet/freescale/fman/fman_memac.h +++ b/drivers/net/ethernet/freescale/fman/fman_memac.h @@ -58,5 +58,6 @@ int memac_set_exception(struct fman_mac *memac, int memac_add_hash_mac_address(struct fman_mac *memac, enet_addr_t *eth_addr); int memac_del_hash_mac_address(struct fman_mac *memac, enet_addr_t *eth_addr); int memac_set_allmulti(struct fman_mac *memac, bool enable); +int memac_set_tstamp(struct fman_mac *memac, bool enable); #endif /* __MEMAC_H */ diff --git a/drivers/net/ethernet/freescale/fman/fman_tgec.c b/drivers/net/ethernet/freescale/fman/fman_tgec.c index 284735d..4070593 100644 --- a/drivers/net/ethernet/freescale/fman/fman_tgec.c +++ b/drivers/net/ethernet/freescale/fman/fman_tgec.c @@ -44,6 +44,7 @@ #define TGEC_TX_IPG_LENGTH_MASK0x03ff /* Command and Configuration Register (COMMAND_CONFIG) */ +#define CMD_CFG_EN_TIMESTAMP 0x0010 #define CMD_CFG_NO_LEN_CHK 0x0002 #define CMD_CFG_PAUSE_IGNORE 0x0100 #define CMF_CFG_CRC_FWD0x0040 @@ -588,6 +589,26 @@ int tgec_set_allmulti(struct fman_mac *tgec, bool enable) return 0; } +int tgec_set_tstamp(struct
[v2, 05/10] arm64: dts: fsl: move ptp timer out of fman
This patch is to move ptp timer node out of fman. Because ptp timer will be probed by ptp_qoriq driver, it should be an independent device in case of conflict memory mapping. Signed-off-by: Yangbo Lu --- Changes for v2: - Fixed address-cells for ptp-timer. --- arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi | 14 -- 1 files changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi b/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi index 4dd0676..a56a408 100644 --- a/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi +++ b/arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi @@ -11,13 +11,14 @@ fman0: fman@1a0 { #size-cells = <1>; cell-index = <0>; compatible = "fsl,fman"; - ranges = <0x0 0x0 0x1a0 0x10>; - reg = <0x0 0x1a0 0x0 0x10>; + ranges = <0x0 0x0 0x1a0 0xfe000>; + reg = <0x0 0x1a0 0x0 0xfe000>; interrupts = , ; clocks = < 3 0>; clock-names = "fmanclk"; fsl,qman-channel-range = <0x800 0x10>; + ptimer-handle = <_timer0>; muram@0 { compatible = "fsl,fman-muram"; @@ -73,9 +74,10 @@ fman0: fman@1a0 { compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio"; reg = <0xfd000 0x1000>; }; +}; - ptp_timer0: ptp-timer@fe000 { - compatible = "fsl,fman-ptp-timer"; - reg = <0xfe000 0x1000>; - }; +ptp_timer0: ptp-timer@1afe000 { + compatible = "fsl,fman-ptp-timer"; + reg = <0x0 0x1afe000 0x0 0x1000>; + interrupts = ; }; -- 1.7.1
[v2, 04/10] powerpc/mpc85xx: move ptp timer out of fman in dts
This patch is to move ptp timer node out of fman. Because ptp timer will be probed by ptp_qoriq driver, it should be an independent device in case of conflict memory mapping. Signed-off-by: Yangbo Lu --- Changes for v2: - None. --- arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi | 14 -- arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi | 14 -- arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi | 14 -- arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi | 14 -- arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi | 14 -- 5 files changed, 40 insertions(+), 30 deletions(-) diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi index abd01d4..6b124f7 100644 --- a/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi +++ b/arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi @@ -37,12 +37,13 @@ fman0: fman@40 { #size-cells = <1>; cell-index = <0>; compatible = "fsl,fman"; - ranges = <0 0x40 0x10>; - reg = <0x40 0x10>; + ranges = <0 0x40 0xfe000>; + reg = <0x40 0xfe000>; interrupts = <96 2 0 0>, <16 2 1 1>; clocks = < 3 0>; clock-names = "fmanclk"; fsl,qman-channel-range = <0x40 0xc>; + ptimer-handle = <_timer0>; muram@0 { compatible = "fsl,fman-muram"; @@ -93,9 +94,10 @@ fman0: fman@40 { reg = <0x87000 0x1000>; status = "disabled"; }; +}; - ptp_timer0: ptp-timer@fe000 { - compatible = "fsl,fman-ptp-timer"; - reg = <0xfe000 0x1000>; - }; +ptp_timer0: ptp-timer@4fe000 { + compatible = "fsl,fman-ptp-timer"; + reg = <0x4fe000 0x1000>; + interrupts = <96 2 0 0>; }; diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi index debea75..b80aaf5 100644 --- a/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi +++ b/arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi @@ -37,12 +37,13 @@ fman1: fman@50 { #size-cells = <1>; cell-index = <1>; compatible = "fsl,fman"; - ranges = <0 0x50 0x10>; - reg = <0x50 0x10>; + ranges = <0 0x50 0xfe000>; + reg = <0x50 0xfe000>; interrupts = <97 2 0 0>, <16 2 1 0>; clocks = < 3 1>; clock-names = "fmanclk"; fsl,qman-channel-range = <0x60 0xc>; + ptimer-handle = <_timer1>; muram@0 { compatible = "fsl,fman-muram"; @@ -93,9 +94,10 @@ fman1: fman@50 { reg = <0x87000 0x1000>; status = "disabled"; }; +}; - ptp_timer1: ptp-timer@fe000 { - compatible = "fsl,fman-ptp-timer"; - reg = <0xfe000 0x1000>; - }; +ptp_timer1: ptp-timer@5fe000 { + compatible = "fsl,fman-ptp-timer"; + reg = <0x5fe000 0x1000>; + interrupts = <97 2 0 0>; }; diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi index 3a20e0d..d3720fd 100644 --- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi +++ b/arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi @@ -37,12 +37,13 @@ fman0: fman@40 { #size-cells = <1>; cell-index = <0>; compatible = "fsl,fman"; - ranges = <0 0x40 0x10>; - reg = <0x40 0x10>; + ranges = <0 0x40 0xfe000>; + reg = <0x40 0xfe000>; interrupts = <96 2 0 0>, <16 2 1 1>; clocks = < 3 0>; clock-names = "fmanclk"; fsl,qman-channel-range = <0x800 0x10>; + ptimer-handle = <_timer0>; muram@0 { compatible = "fsl,fman-muram"; @@ -98,9 +99,10 @@ fman0: fman@40 { compatible = "fsl,fman-memac-mdio", "fsl,fman-xmdio"; reg = <0xfd000 0x1000>; }; +}; - ptp_timer0: ptp-timer@fe000 { - compatible = "fsl,fman-ptp-timer"; - reg = <0xfe000 0x1000>; - }; +ptp_timer0: ptp-timer@4fe000 { + compatible = "fsl,fman-ptp-timer"; + reg = <0x4fe000 0x1000>; + interrupts = <96 2 0 0>; }; diff --git a/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi b/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi index 82750ac..ae34c20 100644 --- a/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi +++ b/arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi @@ -37,12 +37,13 @@ fman1: fman@50 { #size-cells = <1>; cell-index = <1>; compatible = "fsl,fman"; - ranges = <0 0x50 0x10>; - reg = <0x50 0x10>; + ranges = <0 0x50 0xfe000>; + reg = <0x50 0xfe000>; interrupts = <97 2 0 0>, <16 2 1 0>; clocks = < 3 1>; clock-names = "fmanclk"; fsl,qman-channel-range = <0x820 0x10>; + ptimer-handle = <_timer1>; muram@0 { compatible = "fsl,fman-muram"; @@ -98,9 +99,10 @@
[v2, 03/10] dt-binding: ptp_qoriq: add DPAA FMan support
This patch is to add bindings description for DPAA FMan 1588 timer, and also remove its description in fsl-fman dt-bindings document. Signed-off-by: Yangbo Lu --- Changes for v2: - None. --- Documentation/devicetree/bindings/net/fsl-fman.txt | 25 +--- .../devicetree/bindings/ptp/ptp-qoriq.txt | 15 +-- 2 files changed, 13 insertions(+), 27 deletions(-) diff --git a/Documentation/devicetree/bindings/net/fsl-fman.txt b/Documentation/devicetree/bindings/net/fsl-fman.txt index df873d1..74603dd 100644 --- a/Documentation/devicetree/bindings/net/fsl-fman.txt +++ b/Documentation/devicetree/bindings/net/fsl-fman.txt @@ -356,30 +356,7 @@ ethernet@e { FMan IEEE 1588 Node -DESCRIPTION - -The FMan interface to support IEEE 1588 - - -PROPERTIES - -- compatible - Usage: required - Value type: - Definition: A standard property. - Must include "fsl,fman-ptp-timer". - -- reg - Usage: required - Value type: - Definition: A standard property. - -EXAMPLE - -ptp-timer@fe000 { - compatible = "fsl,fman-ptp-timer"; - reg = <0xfe000 0x1000>; -}; +Refer to Documentation/devicetree/bindings/ptp/ptp-qoriq.txt = FMan MDIO Node diff --git a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt index 0f569d8..c5d0e79 100644 --- a/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt +++ b/Documentation/devicetree/bindings/ptp/ptp-qoriq.txt @@ -2,7 +2,8 @@ General Properties: - - compatible Should be "fsl,etsec-ptp" + - compatible Should be "fsl,etsec-ptp" for eTSEC + Should be "fsl,fman-ptp-timer" for DPAA FMan - reg Offset and length of the register set for the device - interrupts There should be at least two interrupts. Some devices have as many as four PTP related interrupts. @@ -43,14 +44,22 @@ Clock Properties: value, which will be directly written in those bits, that is why, according to reference manual, the next clock sources can be used: + For eTSEC, <0> - external high precision timer reference clock (TSEC_TMR_CLK input is used for this purpose); <1> - eTSEC system clock; <2> - eTSEC1 transmit clock; <3> - RTC clock input. - When this attribute is not used, eTSEC system clock will serve as - IEEE 1588 timer reference clock. + For DPAA FMan, + <0> - external high precision timer reference clock (TMR_1588_CLK) + <1> - MAC system clock (1/2 FMan clock) + <2> - reserved + <3> - RTC clock oscillator + + When this attribute is not used, the IEEE 1588 timer reference clock + will use the eTSEC system clock (for Gianfar) or the MAC system + clock (for DPAA). Example: -- 1.7.1
[v2, 02/10] ptp: support DPAA FMan 1588 timer in ptp_qoriq
This patch is to support DPAA (Data Path Acceleration Architecture) 1588 timer by adding "fsl,fman-ptp-timer" compatible, sharing interrupt with FMan, adding FSL_DPAA_ETH dependency, and fixing up register offset. Signed-off-by: Yangbo Lu --- Changes for v2: - None. --- drivers/ptp/Kconfig |2 +- drivers/ptp/ptp_qoriq.c | 104 ++--- include/linux/fsl/ptp_qoriq.h | 38 --- 3 files changed, 98 insertions(+), 46 deletions(-) diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig index 474c988..d137c48 100644 --- a/drivers/ptp/Kconfig +++ b/drivers/ptp/Kconfig @@ -43,7 +43,7 @@ config PTP_1588_CLOCK_DTE config PTP_1588_CLOCK_QORIQ tristate "Freescale QorIQ 1588 timer as PTP clock" - depends on GIANFAR + depends on GIANFAR || FSL_DPAA_ETH depends on PTP_1588_CLOCK default y help diff --git a/drivers/ptp/ptp_qoriq.c b/drivers/ptp/ptp_qoriq.c index 1468a16..c4e3545 100644 --- a/drivers/ptp/ptp_qoriq.c +++ b/drivers/ptp/ptp_qoriq.c @@ -39,11 +39,12 @@ /* Caller must hold qoriq_ptp->lock. */ static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp) { + struct qoriq_ptp_registers *regs = _ptp->regs; u64 ns; u32 lo, hi; - lo = qoriq_read(_ptp->regs->tmr_cnt_l); - hi = qoriq_read(_ptp->regs->tmr_cnt_h); + lo = qoriq_read(>ctrl_regs->tmr_cnt_l); + hi = qoriq_read(>ctrl_regs->tmr_cnt_h); ns = ((u64) hi) << 32; ns |= lo; return ns; @@ -52,16 +53,18 @@ static u64 tmr_cnt_read(struct qoriq_ptp *qoriq_ptp) /* Caller must hold qoriq_ptp->lock. */ static void tmr_cnt_write(struct qoriq_ptp *qoriq_ptp, u64 ns) { + struct qoriq_ptp_registers *regs = _ptp->regs; u32 hi = ns >> 32; u32 lo = ns & 0x; - qoriq_write(_ptp->regs->tmr_cnt_l, lo); - qoriq_write(_ptp->regs->tmr_cnt_h, hi); + qoriq_write(>ctrl_regs->tmr_cnt_l, lo); + qoriq_write(>ctrl_regs->tmr_cnt_h, hi); } /* Caller must hold qoriq_ptp->lock. */ static void set_alarm(struct qoriq_ptp *qoriq_ptp) { + struct qoriq_ptp_registers *regs = _ptp->regs; u64 ns; u32 lo, hi; @@ -70,16 +73,18 @@ static void set_alarm(struct qoriq_ptp *qoriq_ptp) ns -= qoriq_ptp->tclk_period; hi = ns >> 32; lo = ns & 0x; - qoriq_write(_ptp->regs->tmr_alarm1_l, lo); - qoriq_write(_ptp->regs->tmr_alarm1_h, hi); + qoriq_write(>alarm_regs->tmr_alarm1_l, lo); + qoriq_write(>alarm_regs->tmr_alarm1_h, hi); } /* Caller must hold qoriq_ptp->lock. */ static void set_fipers(struct qoriq_ptp *qoriq_ptp) { + struct qoriq_ptp_registers *regs = _ptp->regs; + set_alarm(qoriq_ptp); - qoriq_write(_ptp->regs->tmr_fiper1, qoriq_ptp->tmr_fiper1); - qoriq_write(_ptp->regs->tmr_fiper2, qoriq_ptp->tmr_fiper2); + qoriq_write(>fiper_regs->tmr_fiper1, qoriq_ptp->tmr_fiper1); + qoriq_write(>fiper_regs->tmr_fiper2, qoriq_ptp->tmr_fiper2); } /* @@ -89,16 +94,17 @@ static void set_fipers(struct qoriq_ptp *qoriq_ptp) static irqreturn_t isr(int irq, void *priv) { struct qoriq_ptp *qoriq_ptp = priv; + struct qoriq_ptp_registers *regs = _ptp->regs; struct ptp_clock_event event; u64 ns; u32 ack = 0, lo, hi, mask, val; - val = qoriq_read(_ptp->regs->tmr_tevent); + val = qoriq_read(>ctrl_regs->tmr_tevent); if (val & ETS1) { ack |= ETS1; - hi = qoriq_read(_ptp->regs->tmr_etts1_h); - lo = qoriq_read(_ptp->regs->tmr_etts1_l); + hi = qoriq_read(>etts_regs->tmr_etts1_h); + lo = qoriq_read(>etts_regs->tmr_etts1_l); event.type = PTP_CLOCK_EXTTS; event.index = 0; event.timestamp = ((u64) hi) << 32; @@ -108,8 +114,8 @@ static irqreturn_t isr(int irq, void *priv) if (val & ETS2) { ack |= ETS2; - hi = qoriq_read(_ptp->regs->tmr_etts2_h); - lo = qoriq_read(_ptp->regs->tmr_etts2_l); + hi = qoriq_read(>etts_regs->tmr_etts2_h); + lo = qoriq_read(>etts_regs->tmr_etts2_l); event.type = PTP_CLOCK_EXTTS; event.index = 1; event.timestamp = ((u64) hi) << 32; @@ -130,16 +136,16 @@ static irqreturn_t isr(int irq, void *priv) hi = ns >> 32; lo = ns & 0x; spin_lock(_ptp->lock); - qoriq_write(_ptp->regs->tmr_alarm2_l, lo); - qoriq_write(_ptp->regs->tmr_alarm2_h, hi); + qoriq_write(>alarm_regs->tmr_alarm2_l, lo); + qoriq_write(>alarm_regs->tmr_alarm2_h, hi); spin_unlock(_ptp->lock); qoriq_ptp->alarm_value = ns; } else { -
[v2, 01/10] fsl/fman: share the event interrupt
This patch is to share fman event interrupt because the 1588 timer driver will also use this interrupt. Signed-off-by: Yangbo Lu --- Changes for v2: - None. --- drivers/net/ethernet/freescale/fman/fman.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/net/ethernet/freescale/fman/fman.c b/drivers/net/ethernet/freescale/fman/fman.c index 9530405..c415ac6 100644 --- a/drivers/net/ethernet/freescale/fman/fman.c +++ b/drivers/net/ethernet/freescale/fman/fman.c @@ -2801,7 +2801,8 @@ static irqreturn_t fman_irq(int irq, void *handle) of_node_put(muram_node); of_node_put(fm_node); - err = devm_request_irq(_dev->dev, irq, fman_irq, 0, "fman", fman); + err = devm_request_irq(_dev->dev, irq, fman_irq, IRQF_SHARED, + "fman", fman); if (err < 0) { dev_err(_dev->dev, "%s: irq %d allocation failed (error = %d)\n", __func__, irq, err); -- 1.7.1
[v2, 00/10] Support DPAA PTP clock and timestamping
This patchset is to support DPAA FMAN PTP clock and HW timestamping. - The patch #1 to patch #5 are to support DPAA FMAN 1588 timer in ptp_qoriq driver. - The patch #6 to patch #10 are to add HW timestamping support in DPAA ethernet driver. Yangbo Lu (10): fsl/fman: share the event interrupt ptp: support DPAA FMan 1588 timer in ptp_qoriq dt-binding: ptp_qoriq: add DPAA FMan support powerpc/mpc85xx: move ptp timer out of fman in dts arm64: dts: fsl: move ptp timer out of fman fsl/fman: add set_tstamp interface fsl/fman_port: support getting timestamp field fsl/fman: define frame description command UPD dpaa_eth: add support for hardware timestamping dpaa_eth: add the get_ts_info interface for ethtool Documentation/devicetree/bindings/net/fsl-fman.txt | 25 +- .../devicetree/bindings/ptp/ptp-qoriq.txt | 15 +++- arch/arm64/boot/dts/freescale/qoriq-fman3-0.dtsi | 14 ++- arch/powerpc/boot/dts/fsl/qoriq-fman-0.dtsi| 14 ++- arch/powerpc/boot/dts/fsl/qoriq-fman-1.dtsi| 14 ++- arch/powerpc/boot/dts/fsl/qoriq-fman3-0.dtsi | 14 ++- arch/powerpc/boot/dts/fsl/qoriq-fman3-1.dtsi | 14 ++- arch/powerpc/boot/dts/fsl/qoriq-fman3l-0.dtsi | 14 ++- drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 101 ++- drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |3 + drivers/net/ethernet/freescale/dpaa/dpaa_ethtool.c | 39 drivers/net/ethernet/freescale/fman/fman.c |3 +- drivers/net/ethernet/freescale/fman/fman.h |1 + drivers/net/ethernet/freescale/fman/fman_dtsec.c | 27 + drivers/net/ethernet/freescale/fman/fman_dtsec.h |1 + drivers/net/ethernet/freescale/fman/fman_memac.c |5 + drivers/net/ethernet/freescale/fman/fman_memac.h |1 + drivers/net/ethernet/freescale/fman/fman_port.c| 12 +++ drivers/net/ethernet/freescale/fman/fman_port.h|3 + drivers/net/ethernet/freescale/fman/fman_tgec.c| 21 drivers/net/ethernet/freescale/fman/fman_tgec.h|1 + drivers/net/ethernet/freescale/fman/mac.c |3 + drivers/net/ethernet/freescale/fman/mac.h |1 + drivers/ptp/Kconfig|2 +- drivers/ptp/ptp_qoriq.c| 104 --- include/linux/fsl/ptp_qoriq.h | 38 ++-- 26 files changed, 375 insertions(+), 115 deletions(-)
[PATCH v8 5/5] powerpc:selftest update memcmp_64 selftest for VMX implementation
From: Simon Guo This patch reworked selftest memcmp_64 so that memcmp selftest can cover more test cases. It adds testcases for: - memcmp over 4K bytes size. - s1/s2 with different/random offset on 16 bytes boundary. - enter/exit_vmx_ops pairness. Signed-off-by: Simon Guo --- .../selftests/powerpc/copyloops/asm/ppc_asm.h | 4 +- .../selftests/powerpc/stringloops/asm/ppc-opcode.h | 39 + .../selftests/powerpc/stringloops/asm/ppc_asm.h| 25 ++ .../testing/selftests/powerpc/stringloops/memcmp.c | 98 +- 4 files changed, 142 insertions(+), 24 deletions(-) create mode 100644 tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h diff --git a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h index 5ffe04d..dfce161 100644 --- a/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h +++ b/tools/testing/selftests/powerpc/copyloops/asm/ppc_asm.h @@ -36,11 +36,11 @@ li r3,0 blr -FUNC_START(enter_vmx_copy) +FUNC_START(enter_vmx_ops) li r3,1 blr -FUNC_START(exit_vmx_copy) +FUNC_START(exit_vmx_ops) blr FUNC_START(memcpy_power7) diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h new file mode 100644 index 000..9de413c --- /dev/null +++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h @@ -0,0 +1,39 @@ +/* + * Copyright 2009 Freescale Semiconductor, Inc. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * provides masks and opcode images for use by code generation, emulation + * and for instructions that older assemblers might not know about + */ +#ifndef _ASM_POWERPC_PPC_OPCODE_H +#define _ASM_POWERPC_PPC_OPCODE_H + + +# define stringify_in_c(...) __VA_ARGS__ +# define ASM_CONST(x) x + + +#define PPC_INST_VCMPEQUD_RC 0x10c7 +#define PPC_INST_VCMPEQUB_RC 0x1006 + +#define __PPC_RC21 (0x1 << 10) + +/* macros to insert fields into opcodes */ +#define ___PPC_RA(a) (((a) & 0x1f) << 16) +#define ___PPC_RB(b) (((b) & 0x1f) << 11) +#define ___PPC_RS(s) (((s) & 0x1f) << 21) +#define ___PPC_RT(t) ___PPC_RS(t) + +#define VCMPEQUD_RC(vrt, vra, vrb) stringify_in_c(.long PPC_INST_VCMPEQUD_RC | \ + ___PPC_RT(vrt) | ___PPC_RA(vra) | \ + ___PPC_RB(vrb) | __PPC_RC21) + +#define VCMPEQUB_RC(vrt, vra, vrb) stringify_in_c(.long PPC_INST_VCMPEQUB_RC | \ + ___PPC_RT(vrt) | ___PPC_RA(vra) | \ + ___PPC_RB(vrb) | __PPC_RC21) + +#endif /* _ASM_POWERPC_PPC_OPCODE_H */ diff --git a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h index 136242e..d2c0a91 100644 --- a/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h +++ b/tools/testing/selftests/powerpc/stringloops/asm/ppc_asm.h @@ -1,4 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _PPC_ASM_H +#define __PPC_ASM_H #include #ifndef r1 @@ -6,3 +8,26 @@ #endif #define _GLOBAL(A) FUNC_START(test_ ## A) +#define _GLOBAL_TOC(A) FUNC_START(test_ ## A) + +#define CONFIG_ALTIVEC + +#define R14 r14 +#define R15 r15 +#define R16 r16 +#define R17 r17 +#define R18 r18 +#define R19 r19 +#define R20 r20 +#define R21 r21 +#define R22 r22 +#define R29 r29 +#define R30 r30 +#define R31 r31 + +#define STACKFRAMESIZE 256 +#define STK_REG(i) (112 + ((i)-14)*8) + +#define BEGIN_FTR_SECTION +#define END_FTR_SECTION_IFSET(val) +#endif diff --git a/tools/testing/selftests/powerpc/stringloops/memcmp.c b/tools/testing/selftests/powerpc/stringloops/memcmp.c index 8250db2..b5cf717 100644 --- a/tools/testing/selftests/powerpc/stringloops/memcmp.c +++ b/tools/testing/selftests/powerpc/stringloops/memcmp.c @@ -2,20 +2,40 @@ #include #include #include +#include #include "utils.h" #define SIZE 256 #define ITERATIONS 1 +#define LARGE_SIZE (5 * 1024) +#define LARGE_ITERATIONS 1000 +#define LARGE_MAX_OFFSET 32 +#define LARGE_SIZE_START 4096 + +#define MAX_OFFSET_DIFF_S1_S2 48 + +int vmx_count; +int enter_vmx_ops(void) +{ + vmx_count++; + return 1; +} + +void exit_vmx_ops(void) +{ + vmx_count--; +} int test_memcmp(const void *s1, const void *s2, size_t n); /* test all offsets and lengths */ -static void test_one(char *s1, char *s2) +static void test_one(char *s1, char *s2, unsigned long max_offset, + unsigned long size_start, unsigned long max_size) { unsigned long offset, size; - for (offset = 0; offset < SIZE; offset++) { - for (size = 0; size < (SIZE-offset); size++) { + for (offset =
[PATCH v8 4/5] powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp()
From: Simon Guo This patch is based on the previous VMX patch on memcmp(). To optimize ppc64 memcmp() with VMX instruction, we need to think about the VMX penalty brought with: If kernel uses VMX instruction, it needs to save/restore current thread's VMX registers. There are 32 x 128 bits VMX registers in PPC, which means 32 x 16 = 512 bytes for load and store. The major concern regarding the memcmp() performance in kernel is KSM, who will use memcmp() frequently to merge identical pages. So it will make sense to take some measures/enhancement on KSM to see whether any improvement can be done here. Cyril Bur indicates that the memcmp() for KSM has a higher possibility to fail (unmatch) early in previous bytes in following mail. https://patchwork.ozlabs.org/patch/817322/#1773629 And I am taking a follow-up on this with this patch. Per some testing, it shows KSM memcmp() will fail early at previous 32 bytes. More specifically: - 76% cases will fail/unmatch before 16 bytes; - 83% cases will fail/unmatch before 32 bytes; - 84% cases will fail/unmatch before 64 bytes; So 32 bytes looks a better choice than other bytes for pre-checking. The early failure is also true for memcmp() for non-KSM case. With a non-typical call load, it shows ~73% cases fail before first 32 bytes. This patch adds a 32 bytes pre-checking firstly before jumping into VMX operations, to avoid the unnecessary VMX penalty. It is not limited to KSM case. And the testing shows ~20% improvement on memcmp() average execution time with this patch. And note the 32B pre-checking is only performed when the compare size is long enough (>=4K currently) to allow VMX operation. The detail data and analysis is at: https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md Signed-off-by: Simon Guo --- arch/powerpc/lib/memcmp_64.S | 57 +++- 1 file changed, 46 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S index be2f792..844d8e7 100644 --- a/arch/powerpc/lib/memcmp_64.S +++ b/arch/powerpc/lib/memcmp_64.S @@ -404,8 +404,27 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) #ifdef CONFIG_ALTIVEC .Lsameoffset_vmx_cmp: /* Enter with src/dst addrs has the same offset with 8 bytes -* align boundary +* align boundary. +* +* There is an optimization based on following fact: memcmp() +* prones to fail early at the first 32 bytes. +* Before applying VMX instructions which will lead to 32x128bits +* VMX regs load/restore penalty, we compare the first 32 bytes +* so that we can catch the ~80% fail cases. */ + + li r0,4 + mtctr r0 +.Lsameoffset_prechk_32B_loop: + LD rA,0,r3 + LD rB,0,r4 + cmpld cr0,rA,rB + addir3,r3,8 + addir4,r4,8 + bne cr0,.LcmpAB_lightweight + addir5,r5,-8 + bdnz.Lsameoffset_prechk_32B_loop + ENTER_VMX_OPS beq cr1,.Llong_novmx_cmp @@ -482,16 +501,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) #endif .Ldiffoffset_8bytes_make_align_start: -#ifdef CONFIG_ALTIVEC -BEGIN_FTR_SECTION - /* only do vmx ops when the size equal or greater than 4K bytes */ - cmpdi cr5,r5,VMX_THRESH - bge cr5,.Ldiffoffset_vmx_cmp -END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) - -.Ldiffoffset_novmx_cmp: -#endif - /* now try to align s1 with 8 bytes */ rlwinm r6,r3,3,26,28 beq .Ldiffoffset_align_s1_8bytes @@ -515,6 +524,17 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) .Ldiffoffset_align_s1_8bytes: /* now s1 is aligned with 8 bytes. */ +#ifdef CONFIG_ALTIVEC +BEGIN_FTR_SECTION + /* only do vmx ops when the size equal or greater than 4K bytes */ + cmpdi cr5,r5,VMX_THRESH + bge cr5,.Ldiffoffset_vmx_cmp +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) + +.Ldiffoffset_novmx_cmp: +#endif + + cmpdi cr5,r5,31 ble cr5,.Lcmp_lt32bytes @@ -526,6 +546,21 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S) #ifdef CONFIG_ALTIVEC .Ldiffoffset_vmx_cmp: + /* perform a 32 bytes pre-checking before +* enable VMX operations. +*/ + li r0,4 + mtctr r0 +.Ldiffoffset_prechk_32B_loop: + LD rA,0,r3 + LD rB,0,r4 + cmpld cr0,rA,rB + addir3,r3,8 + addir4,r4,8 + bne cr0,.LcmpAB_lightweight + addir5,r5,-8 + bdnz.Ldiffoffset_prechk_32B_loop + ENTER_VMX_OPS beq cr1,.Ldiffoffset_novmx_cmp -- 1.8.3.1
[PATCH v8 3/5] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
From: Simon Guo This patch add VMX primitives to do memcmp() in case the compare size is equal or greater than 4K bytes. KSM feature can benefit from this. Test result with following test program(replace the "^>" with ""): -- ># cat tools/testing/selftests/powerpc/stringloops/memcmp.c >#include >#include >#include >#include >#include "utils.h" >#define SIZE (1024 * 1024 * 900) >#define ITERATIONS 40 int test_memcmp(const void *s1, const void *s2, size_t n); static int testcase(void) { char *s1; char *s2; unsigned long i; s1 = memalign(128, SIZE); if (!s1) { perror("memalign"); exit(1); } s2 = memalign(128, SIZE); if (!s2) { perror("memalign"); exit(1); } for (i = 0; i < SIZE; i++) { s1[i] = i & 0xff; s2[i] = i & 0xff; } for (i = 0; i < ITERATIONS; i++) { int ret = test_memcmp(s1, s2, SIZE); if (ret) { printf("return %d at[%ld]! should have returned zero\n", ret, i); abort(); } } return 0; } int main(void) { return test_harness(testcase, "memcmp"); } -- Without this patch (but with the first patch "powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()." in the series): 4.726728762 seconds time elapsed ( +- 3.54%) With VMX patch: 4.234335473 seconds time elapsed ( +- 2.63%) There is ~+10% improvement. Testing with unaligned and different offset version (make s1 and s2 shift random offset within 16 bytes) can archieve higher improvement than 10%.. Signed-off-by: Simon Guo --- arch/powerpc/include/asm/asm-prototypes.h | 4 +- arch/powerpc/lib/copypage_power7.S| 4 +- arch/powerpc/lib/memcmp_64.S | 241 +- arch/powerpc/lib/memcpy_power7.S | 6 +- arch/powerpc/lib/vmx-helper.c | 4 +- 5 files changed, 248 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h index d9713ad..31fdcee 100644 --- a/arch/powerpc/include/asm/asm-prototypes.h +++ b/arch/powerpc/include/asm/asm-prototypes.h @@ -49,8 +49,8 @@ void __trace_hcall_exit(long opcode, unsigned long retval, /* VMX copying */ int enter_vmx_usercopy(void); int exit_vmx_usercopy(void); -int enter_vmx_copy(void); -void * exit_vmx_copy(void *dest); +int enter_vmx_ops(void); +void *exit_vmx_ops(void *dest); /* Traps */ long machine_check_early(struct pt_regs *regs); diff --git a/arch/powerpc/lib/copypage_power7.S b/arch/powerpc/lib/copypage_power7.S index 8fa73b7..e38f956 100644 --- a/arch/powerpc/lib/copypage_power7.S +++ b/arch/powerpc/lib/copypage_power7.S @@ -57,7 +57,7 @@ _GLOBAL(copypage_power7) std r4,-STACKFRAMESIZE+STK_REG(R30)(r1) std r0,16(r1) stdur1,-STACKFRAMESIZE(r1) - bl enter_vmx_copy + bl enter_vmx_ops cmpwi r3,0 ld r0,STACKFRAMESIZE+16(r1) ld r3,STK_REG(R31)(r1) @@ -100,7 +100,7 @@ _GLOBAL(copypage_power7) addir3,r3,128 bdnz1b - b exit_vmx_copy /* tail call optimise */ + b exit_vmx_ops/* tail call optimise */ #else li r0,(PAGE_SIZE/128) diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S index 5776f91..be2f792 100644 --- a/arch/powerpc/lib/memcmp_64.S +++ b/arch/powerpc/lib/memcmp_64.S @@ -9,6 +9,7 @@ */ #include #include +#include #define off8 r6 #define off16 r7 @@ -27,12 +28,73 @@ #define LH lhbrx #define LW lwbrx #define LD ldbrx +#define LVSlvsr +#define VPERM(_VRT,_VRA,_VRB,_VRC) \ + vperm _VRT,_VRB,_VRA,_VRC #else #define LH lhzx #define LW lwzx #define LD ldx +#define LVSlvsl +#define VPERM(_VRT,_VRA,_VRB,_VRC) \ + vperm _VRT,_VRA,_VRB,_VRC #endif +#define VMX_THRESH 4096 +#define ENTER_VMX_OPS \ + mflrr0; \ + std r3,-STACKFRAMESIZE+STK_REG(R31)(r1); \ + std r4,-STACKFRAMESIZE+STK_REG(R30)(r1); \ + std r5,-STACKFRAMESIZE+STK_REG(R29)(r1); \ + std r0,16(r1); \ + stdur1,-STACKFRAMESIZE(r1); \ + bl enter_vmx_ops; \ + cmpwi cr1,r3,0; \ + ld r0,STACKFRAMESIZE+16(r1); \ + ld r3,STK_REG(R31)(r1); \ + ld r4,STK_REG(R30)(r1); \ + ld r5,STK_REG(R29)(r1); \ + addir1,r1,STACKFRAMESIZE; \ + mtlrr0 + +#define EXIT_VMX_OPS \ + mflrr0; \ + std r3,-STACKFRAMESIZE+STK_REG(R31)(r1); \ + std r4,-STACKFRAMESIZE+STK_REG(R30)(r1); \ + std r5,-STACKFRAMESIZE+STK_REG(R29)(r1); \ +
[PATCH v8 2/5] powerpc: add vcmpequd/vcmpequb ppc instruction macro
From: Simon Guo Some old tool chains don't know about instructions like vcmpequd. This patch adds .long macro for vcmpequd and vcmpequb, which is a preparation to optimize ppc64 memcmp with VMX instructions. Signed-off-by: Simon Guo --- arch/powerpc/include/asm/ppc-opcode.h | 11 +++ 1 file changed, 11 insertions(+) diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h index 18883b8..1866a97 100644 --- a/arch/powerpc/include/asm/ppc-opcode.h +++ b/arch/powerpc/include/asm/ppc-opcode.h @@ -366,6 +366,8 @@ #define PPC_INST_STFDX 0x7c0005ae #define PPC_INST_LVX 0x7cce #define PPC_INST_STVX 0x7c0001ce +#define PPC_INST_VCMPEQUD 0x10c7 +#define PPC_INST_VCMPEQUB 0x1006 /* macros to insert fields into opcodes */ #define ___PPC_RA(a) (((a) & 0x1f) << 16) @@ -396,6 +398,7 @@ #define __PPC_BI(s)(((s) & 0x1f) << 16) #define __PPC_CT(t)(((t) & 0x0f) << 21) #define __PPC_SPR(r) r) & 0x1f) << 16) | r) >> 5) & 0x1f) << 11)) +#define __PPC_RC21 (0x1 << 10) /* * Only use the larx hint bit on 64bit CPUs. e500v1/v2 based CPUs will treat a @@ -567,4 +570,12 @@ ((IH & 0x7) << 21)) #define PPC_INVALIDATE_ERATPPC_SLBIA(7) +#define VCMPEQUD_RC(vrt, vra, vrb) stringify_in_c(.long PPC_INST_VCMPEQUD | \ + ___PPC_RT(vrt) | ___PPC_RA(vra) | \ + ___PPC_RB(vrb) | __PPC_RC21) + +#define VCMPEQUB_RC(vrt, vra, vrb) stringify_in_c(.long PPC_INST_VCMPEQUB | \ + ___PPC_RT(vrt) | ___PPC_RA(vra) | \ + ___PPC_RB(vrb) | __PPC_RC21) + #endif /* _ASM_POWERPC_PPC_OPCODE_H */ -- 1.8.3.1
[PATCH v8 1/5] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()
From: Simon Guo Currently memcmp() 64bytes version in powerpc will fall back to .Lshort (compare per byte mode) if either src or dst address is not 8 bytes aligned. It can be opmitized in 2 situations: 1) if both addresses are with the same offset with 8 bytes boundary: memcmp() can compare the unaligned bytes within 8 bytes boundary firstly and then compare the rest 8-bytes-aligned content with .Llong mode. 2) If src/dst addrs are not with the same offset of 8 bytes boundary: memcmp() can align src addr with 8 bytes, increment dst addr accordingly, then load src with aligned mode and load dst with unaligned mode. This patch optmizes memcmp() behavior in the above 2 situations. Tested with both little/big endian. Performance result below is based on little endian. Following is the test result with src/dst having the same offset case: (a similar result was observed when src/dst having different offset): (1) 256 bytes Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp: - without patch 29.773018302 seconds time elapsed ( +- 0.09% ) - with patch 16.485568173 seconds time elapsed ( +- 0.02% ) -> There is ~+80% percent improvement (2) 32 bytes To observe performance impact on < 32 bytes, modify tools/testing/selftests/powerpc/stringloops/memcmp.c with following: --- #include #include "utils.h" -#define SIZE 256 +#define SIZE 32 #define ITERATIONS 1 int test_memcmp(const void *s1, const void *s2, size_t n); - Without patch 0.244746482 seconds time elapsed ( +- 0.36%) - with patch 0.215069477 seconds time elapsed ( +- 0.51%) -> There is ~+13% improvement (3) 0~8 bytes To observe <8 bytes performance impact, modify tools/testing/selftests/powerpc/stringloops/memcmp.c with following: --- #include #include "utils.h" -#define SIZE 256 -#define ITERATIONS 1 +#define SIZE 8 +#define ITERATIONS 100 int test_memcmp(const void *s1, const void *s2, size_t n); --- - Without patch 1.845642503 seconds time elapsed ( +- 0.12% ) - With patch 1.849767135 seconds time elapsed ( +- 0.26% ) -> They are nearly the same. (-0.2%) Signed-off-by: Simon Guo --- arch/powerpc/lib/memcmp_64.S | 140 --- 1 file changed, 133 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S index d75d18b..5776f91 100644 --- a/arch/powerpc/lib/memcmp_64.S +++ b/arch/powerpc/lib/memcmp_64.S @@ -24,28 +24,41 @@ #define rH r31 #ifdef __LITTLE_ENDIAN__ +#define LH lhbrx +#define LW lwbrx #define LD ldbrx #else +#define LH lhzx +#define LW lwzx #define LD ldx #endif +/* + * There are 2 categories for memcmp: + * 1) src/dst has the same offset to the 8 bytes boundary. The handlers + * are named like .Lsameoffset_ + * 2) src/dst has different offset to the 8 bytes boundary. The handlers + * are named like .Ldiffoffset_ + */ _GLOBAL(memcmp) cmpdi cr1,r5,0 - /* Use the short loop if both strings are not 8B aligned */ - or r6,r3,r4 + /* Use the short loop if the src/dst addresses are not +* with the same offset of 8 bytes align boundary. +*/ + xor r6,r3,r4 andi. r6,r6,7 - /* Use the short loop if length is less than 32B */ - cmpdi cr6,r5,31 + /* Fall back to short loop if compare at aligned addrs +* with less than 8 bytes. +*/ + cmpdi cr6,r5,7 beq cr1,.Lzero - bne .Lshort - bgt cr6,.Llong + bgt cr6,.Lno_short .Lshort: mtctr r5 - 1: lbz rA,0(r3) lbz rB,0(r4) subf. rC,rB,rA @@ -78,11 +91,89 @@ _GLOBAL(memcmp) li r3,0 blr +.Lno_short: + dcbt0,r3 + dcbt0,r4 + bne .Ldiffoffset_8bytes_make_align_start + + +.Lsameoffset_8bytes_make_align_start: + /* attempt to compare bytes not aligned with 8 bytes so that +* rest comparison can run based on 8 bytes alignment. +*/ + andi. r6,r3,7 + + /* Try to compare the first double word which is not 8 bytes aligned: +* load the first double word at (src & ~7UL) and shift left appropriate +* bits before comparision. +*/ + rlwinm r6,r3,3,26,28 + beq .Lsameoffset_8bytes_aligned + clrrdi r3,r3,3 + clrrdi r4,r4,3 + LD rA,0,r3 + LD rB,0,r4 + sld rA,rA,r6 + sld rB,rB,r6 + cmpld cr0,rA,rB + srwir6,r6,3 + bne cr0,.LcmpAB_lightweight + subfic r6,r6,8 + subf.
[PATCH v8 0/5] powerpc/64: memcmp() optimization
From: Simon Guo There is some room to optimize memcmp() in powerpc 64 bits version for following 2 cases: (1) Even src/dst addresses are not aligned with 8 bytes at the beginning, memcmp() can align them and go with .Llong comparision mode without fallback to .Lshort comparision mode do compare buffer byte by byte. (2) VMX instructions can be used to speed up for large size comparision, currently the threshold is set for 4K bytes. Notes the VMX instructions will lead to VMX regs save/load penalty. This patch set includes a patch to add a 32 bytes pre-checking to minimize the penalty. It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp performance for POWER8). Thanks Cyril Bur's information. This patch set also updates memcmp selftest case to make it compiled and incorporate large size comparison case. v7 -> v8: - define memcmp with _GLOBAL_TOC() instead of _GLOBAL() to fix TOC issue. add _GLOBAL_TOC() definition into selftest so that it can be compiled. - use mfocrf/mtocrf instead of mcrf to save/restore CR0 v6 -> v7: - add vcmpequd/vcmpequdb .long macro - add CPU_FTR pair so that Power7 won't invoke Altivec instrs. - rework some instructions for higher performance or more readable. v5 -> v6: - correct some comments/commit messsage. - rename VMX_OPS_THRES to VMX_THRESH v4 -> v5: - Expand 32 bytes prechk to src/dst different offset case, and remove KSM specific label/comment. v3 -> v4: - Add 32 bytes pre-checking before using VMX instructions. v2 -> v3: - add optimization for src/dst with different offset against 8 bytes boundary. - renamed some label names. - reworked some comments from Cyril Bur, such as fill the pipeline, and use VMX when size == 4K. - fix a bug of enter/exit_vmx_ops pairness issue. And revised test case to test whether enter/exit_vmx_ops are paired. v1 -> v2: - update 8bytes unaligned bytes comparison method. - fix a VMX comparision bug. - enhanced the original memcmp() selftest. - add powerpc/64 to subject/commit message. Simon Guo (5): powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp() powerpc: add vcmpequd/vcmpequb ppc instruction macro powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision powerpc/64: add 32 bytes prechecking before using VMX optimization on memcmp() powerpc:selftest update memcmp_64 selftest for VMX implementation arch/powerpc/include/asm/asm-prototypes.h | 4 +- arch/powerpc/include/asm/ppc-opcode.h | 11 + arch/powerpc/lib/copypage_power7.S | 4 +- arch/powerpc/lib/memcmp_64.S | 414 - arch/powerpc/lib/memcpy_power7.S | 6 +- arch/powerpc/lib/vmx-helper.c | 4 +- .../selftests/powerpc/copyloops/asm/ppc_asm.h | 4 +- .../selftests/powerpc/stringloops/asm/ppc-opcode.h | 39 ++ .../selftests/powerpc/stringloops/asm/ppc_asm.h| 25 ++ .../testing/selftests/powerpc/stringloops/memcmp.c | 98 +++-- 10 files changed, 568 insertions(+), 41 deletions(-) create mode 100644 tools/testing/selftests/powerpc/stringloops/asm/ppc-opcode.h -- 1.8.3.1
Re: [PATCH v4 1/2] powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64
On Wed, 6 Jun 2018 14:21:08 + (UTC) Christophe Leroy wrote: > scaled cputime is only meaningfull when the processor has > SPURR and/or PURR, which means only on PPC64. > > Removing it on PPC32 significantly reduces the size of > vtime_account_system() and vtime_account_idle() on an 8xx: > > Before: > l F .text00a8 vtime_delta > 0280 g F .text010c vtime_account_system > 038c g F .text0048 vtime_account_idle > > After: > (vtime_delta gets inlined in the two functions) > 01d8 g F .text00a0 vtime_account_system > 0278 g F .text0038 vtime_account_idle > > In terms of performance, we also get approximatly 5% improvement on task > switch: > The following small benchmark app is run with perf stat: > > void *thread(void *arg) > { > int i; > > for (i = 0; i < atoi((char*)arg); i++) > pthread_yield(); > } > > int main(int argc, char **argv) > { > pthread_t th1, th2; > > pthread_create(, NULL, thread, argv[1]); > pthread_create(, NULL, thread, argv[1]); > pthread_join(th1, NULL); > pthread_join(th2, NULL); > > return 0; > } > > Before the patch: > > ~# perf stat chrt -f 98 ./sched 10 > > Performance counter stats for 'chrt -f 98 ./sched 10': > >8622.166272 task-clock (msec) #0.955 CPUs utilized > 200027 context-switches #0.023 M/sec > > After the patch: > > ~# perf stat chrt -f 98 ./sched 10 > > Performance counter stats for 'chrt -f 98 ./sched 10': > >8207.090048 task-clock (msec) #0.958 CPUs utilized > 200025 context-switches #0.024 M/sec > > Signed-off-by: Christophe Leroy This looks okay to me. Nice numbers. > --- > v4: > - Using the correct symbol CONFIG_ARCH_HAS_SCALED_CPUTIME instead of > ARCH_HAS_SCALED_CPUTIME > - Grouped CONFIG_ARCH_HAS_SCALED_CPUTIME related code in dedicated > functions to reduce the number of #ifdefs > - Integrated read_spurr() directly into the related function. > v3: Rebased following modifications in xmon.c > v2: added ifdefs in xmon to fix compilation error > > arch/powerpc/Kconfig | 2 +- > arch/powerpc/include/asm/accounting.h | 4 ++ > arch/powerpc/include/asm/cputime.h| 1 - > arch/powerpc/kernel/time.c| 111 > +- > arch/powerpc/xmon/xmon.c | 4 ++ > 5 files changed, 77 insertions(+), 45 deletions(-) > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > index b62a16e2c7cc..735398fd390d 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -142,7 +142,7 @@ config PPC > select ARCH_HAS_PHYS_TO_DMA > select ARCH_HAS_PMEM_APIif PPC64 > select ARCH_HAS_MEMBARRIER_CALLBACKS > - select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE > + select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE > && PPC64 I wonder if we could make this depend on PPC_PSERIES or even PPC_SPLPAR as well? (That would be for a later patch) Thanks, Nick
Re: powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
On Wed, 2018-06-06 at 01:40:08 UTC, Nicholas Piggin wrote: > There is a typo in f1cb8f9beb ("powerpc/64s/radix: avoid ptesync after > set_pte and ptep_set_access_flags") config ifdef, which results in the > necessary ptesync not being issued after vmalloc. > > This causes random kernel faults in module load, bpf load, anywhere > that vmalloc mappings are used. > > After correcting the code, this survives a guest kernel booting > hundreds of times where previously there would be a crash every few > boots (I haven't noticed the crash on host, perhaps due to different > TLB and page table walking behaviour in hardware). > > A memory clobber is also added to the flush, just to be sure it won't > be reordered with the pte set or the subsequent mapping access. > > Fixes: f1cb8f9beb ("powerpc/64s/radix: avoid ptesync after set_pte and > ptep_set_access_flags") > Signed-off-by: Nicholas Piggin Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/ff5bc793e47b537bf3e904fada585e cheers
Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
On Wed, Jun 06, 2018 at 02:42:27PM +0800, Simon Guo wrote: > I now felt unformatable to use mcrf like: > mcrf7,0 > > since I cannot 100% confident that compiler will not use CR7 or other > CR# in exit_vmx_ops(). It wasn't clear to me this macro boils down to a function call. You can use CR2,CR3,CR4, but you'll need to save and restore those at the start and end of function then, which is just as nasty. Better is to restructure some code so you don't need that CR field there anymore. > Can we switch back to mfocrf/mtocrf with correct CR0 value? >mfocrf r5,128 > ... >mtocrf 128,r5 Sure, I'm not your boss ;-) It seems a shame to me to have this 12 or whatever cycle delay here, since the whole point of the patch is to make things faster, that's all (but it still is faster, right, you tested it). Segher
[PATCH v4 1/2] powerpc/time: Only set CONFIG_ARCH_HAS_SCALED_CPUTIME on PPC64
scaled cputime is only meaningfull when the processor has SPURR and/or PURR, which means only on PPC64. Removing it on PPC32 significantly reduces the size of vtime_account_system() and vtime_account_idle() on an 8xx: Before: l F .text 00a8 vtime_delta 0280 g F .text 010c vtime_account_system 038c g F .text 0048 vtime_account_idle After: (vtime_delta gets inlined in the two functions) 01d8 g F .text 00a0 vtime_account_system 0278 g F .text 0038 vtime_account_idle In terms of performance, we also get approximatly 5% improvement on task switch: The following small benchmark app is run with perf stat: void *thread(void *arg) { int i; for (i = 0; i < atoi((char*)arg); i++) pthread_yield(); } int main(int argc, char **argv) { pthread_t th1, th2; pthread_create(, NULL, thread, argv[1]); pthread_create(, NULL, thread, argv[1]); pthread_join(th1, NULL); pthread_join(th2, NULL); return 0; } Before the patch: ~# perf stat chrt -f 98 ./sched 10 Performance counter stats for 'chrt -f 98 ./sched 10': 8622.166272 task-clock (msec) #0.955 CPUs utilized 200027 context-switches #0.023 M/sec After the patch: ~# perf stat chrt -f 98 ./sched 10 Performance counter stats for 'chrt -f 98 ./sched 10': 8207.090048 task-clock (msec) #0.958 CPUs utilized 200025 context-switches #0.024 M/sec Signed-off-by: Christophe Leroy --- v4: - Using the correct symbol CONFIG_ARCH_HAS_SCALED_CPUTIME instead of ARCH_HAS_SCALED_CPUTIME - Grouped CONFIG_ARCH_HAS_SCALED_CPUTIME related code in dedicated functions to reduce the number of #ifdefs - Integrated read_spurr() directly into the related function. v3: Rebased following modifications in xmon.c v2: added ifdefs in xmon to fix compilation error arch/powerpc/Kconfig | 2 +- arch/powerpc/include/asm/accounting.h | 4 ++ arch/powerpc/include/asm/cputime.h| 1 - arch/powerpc/kernel/time.c| 111 +- arch/powerpc/xmon/xmon.c | 4 ++ 5 files changed, 77 insertions(+), 45 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index b62a16e2c7cc..735398fd390d 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -142,7 +142,7 @@ config PPC select ARCH_HAS_PHYS_TO_DMA select ARCH_HAS_PMEM_APIif PPC64 select ARCH_HAS_MEMBARRIER_CALLBACKS - select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE + select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC64 select ARCH_HAS_SG_CHAIN select ARCH_HAS_STRICT_KERNEL_RWX if ((PPC_BOOK3S_64 || PPC32) && !RELOCATABLE && !HIBERNATION) select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST diff --git a/arch/powerpc/include/asm/accounting.h b/arch/powerpc/include/asm/accounting.h index 3abcf98ed2e0..c607c5d835cc 100644 --- a/arch/powerpc/include/asm/accounting.h +++ b/arch/powerpc/include/asm/accounting.h @@ -15,8 +15,10 @@ struct cpu_accounting_data { /* Accumulated cputime values to flush on ticks*/ unsigned long utime; unsigned long stime; +#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME unsigned long utime_scaled; unsigned long stime_scaled; +#endif unsigned long gtime; unsigned long hardirq_time; unsigned long softirq_time; @@ -25,8 +27,10 @@ struct cpu_accounting_data { /* Internal counters */ unsigned long starttime;/* TB value snapshot */ unsigned long starttime_user; /* TB value on exit to usermode */ +#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME unsigned long startspurr; /* SPURR value snapshot */ unsigned long utime_sspurr; /* ->user_time when ->startspurr set */ +#endif }; #endif diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h index bc4903badb3f..a48c7b5e5cf9 100644 --- a/arch/powerpc/include/asm/cputime.h +++ b/arch/powerpc/include/asm/cputime.h @@ -62,7 +62,6 @@ static inline void arch_vtime_task_switch(struct task_struct *prev) struct cpu_accounting_data *acct0 = get_accounting(prev); acct->starttime = acct0->starttime; - acct->startspurr = acct0->startspurr; } #endif diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 70f145e02487..7a9f4e2f22c8 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -171,19 +171,6 @@ static void calc_cputime_factors(void) __cputime_usec_factor = res.result_low; } -/* - * Read the SPURR on systems that have it, otherwise the PURR, - * or if that doesn't exist return the timebase value passed in. - */ -static unsigned long read_spurr(unsigned long
[PATCH v4 2/2] powerpc/time: no steal_time when CONFIG_PPC_SPLPAR is not selected
If CONFIG_PPC_SPLPAR is not selected, steal_time will always be NUL, so accounting it is pointless Signed-off-by: Christophe Leroy --- v4: removed the check in vtime_account_system(), the compiler removes the code regardless. v3: new arch/powerpc/kernel/time.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index 7a9f4e2f22c8..eda78b1ed7d3 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -412,9 +412,6 @@ void vtime_flush(struct task_struct *tsk) if (acct->gtime) account_guest_time(tsk, cputime_to_nsecs(acct->gtime)); - if (acct->steal_time) - account_steal_time(cputime_to_nsecs(acct->steal_time)); - if (acct->idle_time) account_idle_time(cputime_to_nsecs(acct->idle_time)); @@ -431,13 +428,17 @@ void vtime_flush(struct task_struct *tsk) acct->utime = 0; acct->gtime = 0; - acct->steal_time = 0; acct->idle_time = 0; acct->stime = 0; acct->hardirq_time = 0; acct->softirq_time = 0; vtime_flush_scaled(tsk, acct); + + if (IS_ENABLED(CONFIG_PPC_SPLPAR) && acct->steal_time) { + account_steal_time(cputime_to_nsecs(acct->steal_time)); + acct->steal_time = 0; + } } #else /* ! CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */ -- 2.13.3
RE: [PATCH 09/10] dpaa_eth: add support for hardware timestamping
Hi Richard, > -Original Message- > From: Richard Cochran [mailto:richardcoch...@gmail.com] > Sent: Tuesday, June 5, 2018 9:58 PM > To: Y.b. Lu > Cc: net...@vger.kernel.org; Madalin-cristian Bucur > ; Rob Herring ; Shawn Guo > ; David S . Miller ; > devicet...@vger.kernel.org; linuxppc-dev@lists.ozlabs.org; > linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org > Subject: Re: [PATCH 09/10] dpaa_eth: add support for hardware timestamping > > On Tue, Jun 05, 2018 at 03:35:28AM +, Y.b. Lu wrote: > > [Y.b. Lu] Actually these timestamping codes affected DPAA networking > performance in our previous performance test. > > That's why we used ifdef for it. > > How much does time stamping hurt performance? > > If the time stamping is compiled in but not enabled at run time, does it still > affect performace? [Y.b. Lu] I can't remember and find the old data since it had been a long time. I just did the iperf test today between two 10G ports. I didn’t see any performance changes with timestamping code So, let's me remove the ifdef in next version. Thanks a lot. > > Thanks, > Richard
[PATCH v2] powerpc: Add support for function error injection
We implement regs_set_return_value() and override_function_with_return() for this purpose. On powerpc, a return from a function (blr) just branches to the location contained in the link register. So, we can just update pt_regs rather than redirecting execution to a dummy function that returns. Signed-off-by: Naveen N. Rao --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/error-injection.h | 13 + arch/powerpc/include/asm/ptrace.h | 5 + arch/powerpc/lib/Makefile | 2 ++ arch/powerpc/lib/error-inject.c| 12 5 files changed, 33 insertions(+) create mode 100644 arch/powerpc/include/asm/error-injection.h create mode 100644 arch/powerpc/lib/error-inject.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index f674006dea2f..8e61bc0f25cb 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -186,6 +186,7 @@ config PPC select HAVE_EBPF_JITif PPC64 select HAVE_EFFICIENT_UNALIGNED_ACCESS if !(CPU_LITTLE_ENDIAN && POWER7_CPU) select HAVE_FTRACE_MCOUNT_RECORD + select HAVE_FUNCTION_ERROR_INJECTION select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACER select HAVE_GCC_PLUGINS diff --git a/arch/powerpc/include/asm/error-injection.h b/arch/powerpc/include/asm/error-injection.h new file mode 100644 index ..740c3075bdf4 --- /dev/null +++ b/arch/powerpc/include/asm/error-injection.h @@ -0,0 +1,13 @@ +// SPDX-License-Identifier: GPL-2.0+ + +#ifndef _ASM_ERROR_INJECTION_H +#define _ASM_ERROR_INJECTION_H + +#include +#include +#include +#include + +void override_function_with_return(struct pt_regs *regs); + +#endif /* _ASM_ERROR_INJECTION_H */ diff --git a/arch/powerpc/include/asm/ptrace.h b/arch/powerpc/include/asm/ptrace.h index e4923686e43a..c0705296c2f0 100644 --- a/arch/powerpc/include/asm/ptrace.h +++ b/arch/powerpc/include/asm/ptrace.h @@ -101,6 +101,11 @@ static inline long regs_return_value(struct pt_regs *regs) return -regs->gpr[3]; } +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc) +{ + regs->gpr[3] = rc; +} + #ifdef __powerpc64__ #define user_mode(regs) regs)->msr) >> MSR_PR_LG) & 0x1) #else diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile index 653901042ad7..e0ed195eaa4b 100644 --- a/arch/powerpc/lib/Makefile +++ b/arch/powerpc/lib/Makefile @@ -14,6 +14,8 @@ obj-y += string.o alloc.o code-patching.o feature-fixups.o obj-$(CONFIG_PPC32)+= div64.o copy_32.o crtsavres.o +obj-$(CONFIG_FUNCTION_ERROR_INJECTION) += error-inject.o + # See corresponding test in arch/powerpc/Makefile # 64-bit linker creates .sfpr on demand for final link (vmlinux), # so it is only needed for modules, and only for older linkers which diff --git a/arch/powerpc/lib/error-inject.c b/arch/powerpc/lib/error-inject.c new file mode 100644 index ..6d3d57f68024 --- /dev/null +++ b/arch/powerpc/lib/error-inject.c @@ -0,0 +1,12 @@ +// SPDX-License-Identifier: GPL-2.0+ + +#include +#include +#include + +void override_function_with_return(struct pt_regs *regs) +{ + /* Emulate 'blr' instruction */ + regs->nip = regs->link; +} +NOKPROBE_SYMBOL(override_function_with_return); -- 2.17.0
Re: [PATCH 2/2] powerpc: Add support for function error injection
Naveen N. Rao wrote: Michael Ellerman wrote: I guess if it doesn't already apply to tip you should rebase it. You've probably missed 4.18 anyway. Oh ok. I just tried and it seems to apply just fine. I'll post v2 after giving this a quick test. I didn't post a v2 since I have decided against using this approach. The reason for that is Masami's series to remove jprobes. The discussion there reminded me that we can't easily override functions with kprobes on powerpc. Though it works for this particular scenario, we would just be setting a bad example. As such, I won't be changing generic code, but will simply make the necessary changes in powerpc code. Sorry for the noise. - Naveen
Re: [PATCH 08/11] macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver
Hi Finn, On Wed, Jun 6, 2018 at 8:57 AM, Finn Thain wrote: > On Mon, 4 Jun 2018, Geert Uytterhoeven wrote: >> > Don't call pmu_shutdown() or pmu_restart() on early PowerBooks: the >> > PMU device found in these PowerBooks isn't supported. >> >> Shouldn't that be a separate patch? >> >> > --- a/arch/m68k/mac/misc.c >> > +++ b/arch/m68k/mac/misc.c >> >> > @@ -477,9 +445,8 @@ void mac_poweroff(void) >> >macintosh_config->adb_type == MAC_ADB_CUDA) { >> > cuda_shutdown(); >> > #endif >> > -#ifdef CONFIG_ADB_PMU68K >> > - } else if (macintosh_config->adb_type == MAC_ADB_PB1 >> > - || macintosh_config->adb_type == MAC_ADB_PB2) { >> > +#ifdef CONFIG_ADB_PMU >> > + } else if (macintosh_config->adb_type == MAC_ADB_PB2) { >> > pmu_shutdown(); >> > #endif >> > } >> > @@ -519,9 +486,8 @@ void mac_reset(void) >> >macintosh_config->adb_type == MAC_ADB_CUDA) { >> > cuda_restart(); >> > #endif >> > -#ifdef CONFIG_ADB_PMU68K >> > - } else if (macintosh_config->adb_type == MAC_ADB_PB1 >> > - || macintosh_config->adb_type == MAC_ADB_PB2) { >> > +#ifdef CONFIG_ADB_PMU >> > + } else if (macintosh_config->adb_type == MAC_ADB_PB2) { >> > pmu_restart(); >> > #endif >> > } else if (CPU_IS_030) { >> > > The stability problem here is bigger than just pmu_restart() and > pmu_shutdown(), so those hunks are insufficient. You need to prevent the > via-pmu68k driver from loading in the first place and to drop all the > PMU_68K_V1 cases. > > But splitting this patch in that way creates potential merge conflicts, > which is a hassle. E.g. this hunk: > > - > - switch (macintosh_config->adb_type) { > - case MAC_ADB_PB1: > - pmu_kind = PMU_68K_V1; > - break; > - case MAC_ADB_PB2: > - pmu_kind = PMU_68K_V2; > - break; > - default: > - pmu_kind = PMU_UNKNOWN; > - return -ENODEV; > - } > - ... > > would get split over two patches. > > The way I see it, having no PMU driver for PMU_68K_V1 machines is a bug. > And loading any of the existing PMU drivers on that hardware is also a > bug. These bugs are equivalent in that either one means you can't use the > keyboard, trackpad etc. So there's no value in cherry-picking the other > bug. > > If you are using v4.17 on a PMU_68K_V1 machine, you probably already have > CONFIG_ADB_PMU68K=n. With that config, here's nothing to be gained from > bisecting these changes. > > Does that make sense? Did I overlook other possible reason(s) to split up > this patch? So 4.17 mac_defconfig won't work on PMU_68K_V1 machines? Perhaps that should be fixed first. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Re: [PATCH 08/11] macintosh/via-pmu: Replace via-pmu68k driver with via-pmu driver
On Mon, 4 Jun 2018, Geert Uytterhoeven wrote: > > > Don't call pmu_shutdown() or pmu_restart() on early PowerBooks: the > > PMU device found in these PowerBooks isn't supported. > > Shouldn't that be a separate patch? > > > --- a/arch/m68k/mac/misc.c > > +++ b/arch/m68k/mac/misc.c > > > @@ -477,9 +445,8 @@ void mac_poweroff(void) > >macintosh_config->adb_type == MAC_ADB_CUDA) { > > cuda_shutdown(); > > #endif > > -#ifdef CONFIG_ADB_PMU68K > > - } else if (macintosh_config->adb_type == MAC_ADB_PB1 > > - || macintosh_config->adb_type == MAC_ADB_PB2) { > > +#ifdef CONFIG_ADB_PMU > > + } else if (macintosh_config->adb_type == MAC_ADB_PB2) { > > pmu_shutdown(); > > #endif > > } > > @@ -519,9 +486,8 @@ void mac_reset(void) > >macintosh_config->adb_type == MAC_ADB_CUDA) { > > cuda_restart(); > > #endif > > -#ifdef CONFIG_ADB_PMU68K > > - } else if (macintosh_config->adb_type == MAC_ADB_PB1 > > - || macintosh_config->adb_type == MAC_ADB_PB2) { > > +#ifdef CONFIG_ADB_PMU > > + } else if (macintosh_config->adb_type == MAC_ADB_PB2) { > > pmu_restart(); > > #endif > > } else if (CPU_IS_030) { > The stability problem here is bigger than just pmu_restart() and pmu_shutdown(), so those hunks are insufficient. You need to prevent the via-pmu68k driver from loading in the first place and to drop all the PMU_68K_V1 cases. But splitting this patch in that way creates potential merge conflicts, which is a hassle. E.g. this hunk: - - switch (macintosh_config->adb_type) { - case MAC_ADB_PB1: - pmu_kind = PMU_68K_V1; - break; - case MAC_ADB_PB2: - pmu_kind = PMU_68K_V2; - break; - default: - pmu_kind = PMU_UNKNOWN; - return -ENODEV; - } - ... would get split over two patches. The way I see it, having no PMU driver for PMU_68K_V1 machines is a bug. And loading any of the existing PMU drivers on that hardware is also a bug. These bugs are equivalent in that either one means you can't use the keyboard, trackpad etc. So there's no value in cherry-picking the other bug. If you are using v4.17 on a PMU_68K_V1 machine, you probably already have CONFIG_ADB_PMU68K=n. With that config, here's nothing to be gained from bisecting these changes. Does that make sense? Did I overlook other possible reason(s) to split up this patch? --
Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
Hi Naveen, On Wed, Jun 06, 2018 at 12:06:09PM +0530, Naveen N. Rao wrote: > Simon Guo wrote: > >Hi Michael, > >On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote: > >>Hi Simon, > >> > >>wei.guo.si...@gmail.com writes: > >>> From: Simon Guo > >>> > >>> There is some room to optimize memcmp() in powerpc 64 bits version for > >>> following 2 cases: > >>> (1) Even src/dst addresses are not aligned with 8 bytes at the beginning, > >>> memcmp() can align them and go with .Llong comparision mode without > >>> fallback to .Lshort comparision mode do compare buffer byte by byte. > >>> (2) VMX instructions can be used to speed up for large size comparision, > >>> currently the threshold is set for 4K bytes. Notes the VMX instructions > >>> will lead to VMX regs save/load penalty. This patch set includes a > >>> patch to add a 32 bytes pre-checking to minimize the penalty. > >>> > >>> It did the similar with glibc commit dec4a7105e (powerpc: > >>Improve memcmp > performance for POWER8). Thanks Cyril Bur's > >>information. > >>> This patch set also updates memcmp selftest case to make it compiled and > >>> incorporate large size comparison case. > >> > >>I'm seeing a few crashes with this applied, I haven't had time to look > >>into what is happening yet, sorry. > >> > > > >The bug is due to memcmp() invokes a C function enter_vmx_ops() > >who will load some PIC value based on r2. > > > >memcmp() doesn't use r2 and if the memcmp() is invoked from kernel > >itself, everything is fine. But if memcmp() is invoked from > >modules[test_user_copy], r2 will be required to be setup > >correctly. Otherwise the enter_vmx_ops() will refer to an > >incorrect/unexisting data location based on wrong r2 value. > > > >Following patch will fix this issue: > > > >diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S > >index 5eba49744a5a..24d093fa89bb 100644 > >--- a/arch/powerpc/lib/memcmp_64.S > >+++ b/arch/powerpc/lib/memcmp_64.S > >@@ -102,7 +102,7 @@ > > * 2) src/dst has different offset to the 8 bytes boundary. The handlers > > * are named like .Ldiffoffset_ > > */ > >-_GLOBAL(memcmp) > >+_GLOBAL_TOC(memcmp) > >cmpdi cr1,r5,0 > > > >/* Use the short loop if the src/dst addresses are not > >-- > > > >It means the memcmp() fun entry will have additional 2 instructions. Is there > >any way to save these 2 instructions when the memcmp() is actually invoked > >from kernel itself? > > That will be the case. We will end up entering the function via the > local entry point skipping the first two instructions. The Global > entry point is only used for cross-module calls. > Yes. Thanks :) - Simon
Re: Problems building ppc images in v4.14.y and v4.16.y using gcc 7.3.0 / 8.1.0 from kernel.org
Le 05/06/2018 à 21:47, Arnd Bergmann a écrit : On Tue, Jun 5, 2018 at 6:06 PM, Guenter Roeck wrote: On Tue, Jun 05, 2018 at 04:31:00PM +0200, Arnd Bergmann wrote: On Tue, Jun 5, 2018 at 3:52 PM, Guenter Roeck wrote: Hi Arnd, when using the ppc64 compiler from kernel.org, I see the following problems when trying to compile ppc:allnoconfig in v4.14.y or v4.16.y. gcc 7.3.0: Compilation of kernel.cpu.o hangs The problem goes away if I apply the following two patches (tested with 4.16.y) 17a2f1ced028 cpu/hotplug: Merge cpuhp_bp_states and cpuhp_ap_states fcb3029a8d89 cpu/hotplug: Fix unused function warning This is probably the same as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84038 I thought I had included the fix in my builds. Guess not. I probably had it in one build and then forgot about it when I did a rebuild of 7.3 :( I'm still planning to do a new set of gcc-7.3 binaries (or maybe 7.4 if that gets released soon) and should try to remember doing that. I think it may have cached the flags from the other compiler version. "make mrproper" prior to "make defconfig" took care of the issue. However, that doesn't really help - I get lots of error: 'sys_spu_create' alias between functions of incompatible types error: 'strncpy' output truncated before terminating nul if I try to use gcc 8.1.0. Oh well. I'll try gcc 6.4.0 next. On the upside, those two errors are just a result of arch/power/*/*.c getting built with -Werror, they are warnings that gcc-8 introduced that we should either shut up or fix. They are fixed in next: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=2479bfc9bc600dcce7f932d52dcfa8d677c41f93 https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=c95998811807d897ca112ea62d66716ed733d058 Christophe Arnd
Re: [PATCH v6 2/4] powerpc/64: enhance memcmp() with VMX instruction for long bytes comparision
Hi segher, On Wed, May 30, 2018 at 05:03:21PM +0800, Simon Guo wrote: > On Wed, May 30, 2018 at 03:35:40AM -0500, Segher Boessenkool wrote: > > On Wed, May 30, 2018 at 04:14:02PM +0800, Simon Guo wrote: > > > Hi Segher, > > > On Mon, May 28, 2018 at 06:05:59AM -0500, Segher Boessenkool wrote: > > > > On Fri, May 25, 2018 at 12:07:34PM +0800, wei.guo.si...@gmail.com wrote: > > > > > + /* save and restore cr0 */ > > > > > + mfocrf r5,64 > > > > > + EXIT_VMX_OPS > > > > > + mtocrf 64,r5 > > > > > + b .LcmpAB_lightweight > > > > > > > > That's cr1, not cr0. You can use mcrf instead, it is cheaper (esp. if > > > > you have it in a non-volatile CR field before so you need only one, if > > > > any). > > > > > > > You are right :) How about using mtcr/mfcr instead, I think they are > > > fast as well and more readable. > > > > Those are much worse than m[ft]ocrf. > > > > You probably should just shuffle things around so that EXIT_VMX_OPS > > does not clobber the CR field you need to keep. > Let me use mcrf then :) I now felt unformatable to use mcrf like: mcrf7,0 since I cannot 100% confident that compiler will not use CR7 or other CR# in exit_vmx_ops(). Can we switch back to mfocrf/mtocrf with correct CR0 value? mfocrf r5,128 ... mtocrf 128,r5 Thanks, - Simon
Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
Simon Guo wrote: Hi Michael, On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote: Hi Simon, wei.guo.si...@gmail.com writes: > From: Simon Guo > > There is some room to optimize memcmp() in powerpc 64 bits version for > following 2 cases: > (1) Even src/dst addresses are not aligned with 8 bytes at the beginning, > memcmp() can align them and go with .Llong comparision mode without > fallback to .Lshort comparision mode do compare buffer byte by byte. > (2) VMX instructions can be used to speed up for large size comparision, > currently the threshold is set for 4K bytes. Notes the VMX instructions > will lead to VMX regs save/load penalty. This patch set includes a > patch to add a 32 bytes pre-checking to minimize the penalty. > > It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp > performance for POWER8). Thanks Cyril Bur's information. > This patch set also updates memcmp selftest case to make it compiled and > incorporate large size comparison case. I'm seeing a few crashes with this applied, I haven't had time to look into what is happening yet, sorry. The bug is due to memcmp() invokes a C function enter_vmx_ops() who will load some PIC value based on r2. memcmp() doesn't use r2 and if the memcmp() is invoked from kernel itself, everything is fine. But if memcmp() is invoked from modules[test_user_copy], r2 will be required to be setup correctly. Otherwise the enter_vmx_ops() will refer to an incorrect/unexisting data location based on wrong r2 value. Following patch will fix this issue: diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S index 5eba49744a5a..24d093fa89bb 100644 --- a/arch/powerpc/lib/memcmp_64.S +++ b/arch/powerpc/lib/memcmp_64.S @@ -102,7 +102,7 @@ * 2) src/dst has different offset to the 8 bytes boundary. The handlers * are named like .Ldiffoffset_ */ -_GLOBAL(memcmp) +_GLOBAL_TOC(memcmp) cmpdi cr1,r5,0 /* Use the short loop if the src/dst addresses are not -- It means the memcmp() fun entry will have additional 2 instructions. Is there any way to save these 2 instructions when the memcmp() is actually invoked from kernel itself? That will be the case. We will end up entering the function via the local entry point skipping the first two instructions. The Global entry point is only used for cross-module calls. - Naveen
Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
Hi Michael, On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote: > Hi Simon, > > wei.guo.si...@gmail.com writes: > > From: Simon Guo > > > > There is some room to optimize memcmp() in powerpc 64 bits version for > > following 2 cases: > > (1) Even src/dst addresses are not aligned with 8 bytes at the beginning, > > memcmp() can align them and go with .Llong comparision mode without > > fallback to .Lshort comparision mode do compare buffer byte by byte. > > (2) VMX instructions can be used to speed up for large size comparision, > > currently the threshold is set for 4K bytes. Notes the VMX instructions > > will lead to VMX regs save/load penalty. This patch set includes a > > patch to add a 32 bytes pre-checking to minimize the penalty. > > > > It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp > > performance for POWER8). Thanks Cyril Bur's information. > > This patch set also updates memcmp selftest case to make it compiled and > > incorporate large size comparison case. > > I'm seeing a few crashes with this applied, I haven't had time to look > into what is happening yet, sorry. > The bug is due to memcmp() invokes a C function enter_vmx_ops() who will load some PIC value based on r2. memcmp() doesn't use r2 and if the memcmp() is invoked from kernel itself, everything is fine. But if memcmp() is invoked from modules[test_user_copy], r2 will be required to be setup correctly. Otherwise the enter_vmx_ops() will refer to an incorrect/unexisting data location based on wrong r2 value. Following patch will fix this issue: diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S index 5eba49744a5a..24d093fa89bb 100644 --- a/arch/powerpc/lib/memcmp_64.S +++ b/arch/powerpc/lib/memcmp_64.S @@ -102,7 +102,7 @@ * 2) src/dst has different offset to the 8 bytes boundary. The handlers * are named like .Ldiffoffset_ */ -_GLOBAL(memcmp) +_GLOBAL_TOC(memcmp) cmpdi cr1,r5,0 /* Use the short loop if the src/dst addresses are not -- It means the memcmp() fun entry will have additional 2 instructions. Is there any way to save these 2 instructions when the memcmp() is actually invoked from kernel itself? Again thanks for finding this issue. Thanks, - Simon > [ 2471.300595] kselftest: Running tests in user > [ 2471.302785] calling test_user_copy_init+0x0/0xd14 [test_user_copy] @ 44883 > [ 2471.302892] Unable to handle kernel paging request for data at address > 0xc00818553005 > [ 2471.303014] Faulting instruction address: 0xc001f29c > [ 2471.303119] Oops: Kernel access of bad area, sig: 11 [#1] > [ 2471.303193] LE SMP NR_CPUS=2048 NUMA PowerNV > [ 2471.303256] Modules linked in: test_user_copy(+) vxlan ip6_udp_tunnel > udp_tunnel 8021q bridge stp llc dummy test_printf test_firmware vmx_crypto > crct10dif_vpmsum crct10dif_common crc32c_vpmsum veth [last unloaded: > test_static_key_base] > [ 2471.303532] CPU: 4 PID: 44883 Comm: modprobe Tainted: GW > 4.17.0-rc3-gcc7x-g7204012 #1 > [ 2471.303644] NIP: c001f29c LR: c001f6e4 CTR: > > [ 2471.303754] REGS: c01fddc2b560 TRAP: 0300 Tainted: GW > (4.17.0-rc3-gcc7x-g7204012) > [ 2471.303873] MSR: 92009033 CR: > 24222844 XER: > [ 2471.303996] CFAR: c001f6e0 DAR: c00818553005 DSISR: 4000 > IRQMASK: 0 > [ 2471.303996] GPR00: c001f6e4 c01fddc2b7e0 c00818529900 > 0200 > [ 2471.303996] GPR04: c01fe4b90020 ffe0 > 03fe01b48000 > [ 2471.303996] GPR08: 8000 c00818553005 c01fddc28000 > c00818520df0 > [ 2471.303996] GPR12: c009c430 c01fbc00 2000 > > [ 2471.303996] GPR16: c01fddc2bc20 0030 c01f7ba0 > 0001 > [ 2471.303996] GPR20: c0c772b0 c10b4018 > > [ 2471.303996] GPR24: c00818521c98 > c01fe4b9 > [ 2471.303996] GPR28: fff4 0200 92009033 > 92009033 > [ 2471.304930] NIP [c001f29c] msr_check_and_set+0x3c/0xc0 > [ 2471.305008] LR [c001f6e4] enable_kernel_altivec+0x44/0x100 > [ 2471.305084] Call Trace: > [ 2471.305122] [c01fddc2b7e0] [c009baa8] > __copy_tofrom_user_base+0x9c/0x574 (unreliable) > [ 2471.305240] [c01fddc2b860] [c001f6e4] > enable_kernel_altivec+0x44/0x100 > [ 2471.305336] [c01fddc2b890] [c009ce40] enter_vmx_ops+0x50/0x70 > [ 2471.305418] [c01fddc2b8b0] [c009c768] memcmp+0x338/0x680 > [ 2471.305501] [c01fddc2b9b0] [c00818520190] > test_user_copy_init+0x188/0xd14 [test_user_copy] > [ 2471.305617] [c01fddc2ba60] [c000de20] > do_one_initcall+0x90/0x560 > [ 2471.305710] [c01fddc2bb30] [c0200630] do_init_module+0x90/0x260 > [