[RFC v02 1/5] PowerCap: Documentation
Added power cap framework documentation. This explains the use of the
power capping framework, its sysfs interface and its programming interface.
There are two documents:
Documentation/powercap/PowerCappingFramework.txt: Explains use case and
API in detail.
Documentation/ABI/testing/sysfs-class-powercap: Explains ABIs.

Reviewed-by: Len Brown
Signed-off-by: Srinivas Pandruvada
Signed-off-by: Jacob Pan
Signed-off-by: Arjan van de Ven
---
 Documentation/ABI/testing/sysfs-class-powercap   | 165 ++
 Documentation/powercap/PowerCappingFramework.txt | 686 +++
 2 files changed, 851 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-class-powercap
 create mode 100644 Documentation/powercap/PowerCappingFramework.txt

diff --git a/Documentation/ABI/testing/sysfs-class-powercap b/Documentation/ABI/testing/sysfs-class-powercap
new file mode 100644
index 000..0e5d6e4
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-powercap
@@ -0,0 +1,165 @@
+What:          /sys/class/power_cap/
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               The power_cap/ class sub directory belongs to the power cap
+               subsystem. Refer to
+               Documentation/powercap/PowerCappingFramework.txt for details.
+
+What:          /sys/class/power_cap/controller_name
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               The /sys/class/power_cap/controller_name directories correspond
+               to each controller under power_cap control. Here controller_name
+               is a unique name under /sys/class/power_cap. Each
+               controller_name directory contains one or more power zones.
+
+What:          /sys/class/power_cap/controller_name/type
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               For a controller, type is "controller". This allows user space
+               to differentiate a controller device from a power zone
+               device.
+
+What:          /sys/class/power_cap/controller_name/power_zone
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               A controller can have one or more power zones. A power zone is
+               an abstraction of devices, which can be independently monitored
+               and controlled.
+
+What:          /sys/class/power_cap/controller_name/power_zone/power_zone
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               A power zone can have one or more power zones as children.
+               A child power zone provides monitoring and control for
+               a subset of the devices under its parent. E.g. if there is a
+               parent power zone for a whole CPU package, each CPU core in it
+               can be a child power zone.
+
+What:          /sys/class/power_cap/controller_name/power_zone/name
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               Specifies the name of this power zone.
+
+
+What:          /sys/class/power_cap/controller_name/power_zone/type
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               For a power zone, type is "power-zone".
+
+
+What:          /sys/class/power_cap/controller_name/power_zone/energy_uj
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               Current energy counter in micro-joules. Write "0" to reset.
+               If the counter cannot be reset, then this attribute is
+               read-only.
+
+What:          /sys/class/power_cap/controller_name/power_zone/
+               max_energy_range_uj
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               Range of the above energy counter in micro-joules.
+
+
+What:          /sys/class/power_cap/controller_name/power_zone/power_uw
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               Current power in micro-watts. Write "0" to reset.
+               If the value cannot be reset, then the attribute is
+               read-only.
+
+What:          /sys/class/power_cap/controller_name/power_zone/
+               max_power_range_uw
+Date:          August 2013
+KernelVersion: 3.12
+Contact:       linux...@vger.kernel.org
+Description:
+               Range of the above power value in micro-watts.
+
+What:          /sys/class/power_cap/controller_name/power_zone/
+               constraint_X_name
+Date:          August 2013
+KernelVersion: 3.12
+Contact:
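A minimal user-space sketch of how the energy_uj and max_energy_range_uj
attributes described above might be consumed. The helper names
(read_u64_attr, energy_delta_uj) are hypothetical, and the wraparound
convention (the counter restarts from 0 once it passes the reported range,
and wraps at most once between samples) is an assumption, not something
the ABI text states.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Read one unsigned decimal value from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_u64_attr(const char *path, uint64_t *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%" SCNu64, out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Energy consumed between two samples of energy_uj.
 * ASSUMPTION: the counter wraps back to 0 after max_range and wraps
 * at most once between the two samples.
 */
uint64_t energy_delta_uj(uint64_t prev, uint64_t cur, uint64_t max_range)
{
	if (cur >= prev)
		return cur - prev;
	return (max_range - prev) + cur;
}
```

Dividing the returned micro-joules by the sampling interval in
microseconds gives average power in watts.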
Re: [PATCH v3 3/5] devicetree: create a separate binding description for sata_highbank
On Aug 7, 2013, at 10:52 AM, Mark Langsdorf wrote:

> The Calxeda sata_highbank driver has been adding its descriptions to the
> ahci driver. Separate them properly.
>
> Signed-off-by: Mark Langsdorf
> Acked-by: Rob Herring
> ---
> Changes from v2
>         Fixed some indenting.
> Changes from v1
>         None.
>
>  .../devicetree/bindings/ata/ahci-platform.txt | 18 +++-
>  .../devicetree/bindings/ata/sata_highbank.txt | 32 ++
>  2 files changed, 36 insertions(+), 14 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/ata/sata_highbank.txt
>
> diff --git a/Documentation/devicetree/bindings/ata/ahci-platform.txt
> b/Documentation/devicetree/bindings/ata/ahci-platform.txt
> index 3ec0c5c..89de156 100644
> --- a/Documentation/devicetree/bindings/ata/ahci-platform.txt
> +++ b/Documentation/devicetree/bindings/ata/ahci-platform.txt
> @@ -4,27 +4,17 @@ SATA nodes are defined to describe on-chip Serial ATA
> controllers.
>  Each SATA controller should have its own node.
>
>  Required properties:
> -- compatible: compatible list, contains "calxeda,hb-ahci" or
> "snps,spear-ahci"
> +- compatible: compatible list, contains "snps,spear-ahci"
>  - interrupts:
>  - reg :
>
>  Optional properties:
> -- calxeda,port-phys: phandle-combophy and lane assignment, which maps each
> -                     SATA port to a combophy and a lane within that
> -                     combophy
> -- calxeda,sgpio-gpio: phandle-gpio bank, bit offset, and default on or off,
> -                     which indicates that the driver supports SGPIO
> -                     indicator lights using the indicated GPIOs
> -- calxeda,led-order : a u32 array that map port numbers to offsets within the
> -                     SGPIO bitstream.
>  - dma-coherent : Present if dma operations are coherent
>
>  Example:
>          sata@ffe08000 {
> -                compatible = "calxeda,hb-ahci";
> -                reg = <0xffe08000 0x1000>;
> -                interrupts = <115>;
> -                calxeda,port-phys = < 0 0 1
> -                                      2 3>;
> +                compatible = "snps,spear-ahci";
> +                reg = <0xffe08000 0x1000>;
> +                interrupts = <115>;
>
>          };
> diff --git a/Documentation/devicetree/bindings/ata/sata_highbank.txt
> b/Documentation/devicetree/bindings/ata/sata_highbank.txt
> new file mode 100644
> index 000..1ac6d3d
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/ata/sata_highbank.txt
> @@ -0,0 +1,32 @@
> +* Calxeda AHCI SATA Controller
> +
> +SATA nodes are defined to describe on-chip Serial ATA controllers.
> +The Calxeda SATA controller mostly conforms to the AHCI interface
> +with some special extensions to add functionality.
> +Each SATA controller should have its own node.
> +
> +Required properties:
> +- compatible: compatible list, contains "calxeda,hb-ahci"
> +- interrupts:
> +- reg :
> +
> +Optional properties:
> +- dma-coherent : Present if dma operations are coherent
> +- calxeda,port-phys: phandle-combophy and lane assignment, which maps each
> +                     SATA port to a combophy and a lane within that
> +                     combophy
> +- calxeda,sgpio-gpio: phandle-gpio bank, bit offset, and default on or off,
> +                     which indicates that the driver supports SGPIO
> +                     indicator lights using the indicated GPIOs
> +- calxeda,led-order : a u32 array that map port numbers to offsets within the
> +                     SGPIO bitstream.

nit: whitespace after :

> +
> +Example:
> +        sata@ffe08000 {
> +                compatible = "calxeda,hb-ahci";
> +                reg = <0xffe08000 0x1000>;
> +                interrupts = <115>;
> +                calxeda,port-phys = < 0 0 1
> +                                      2 3>;
> +

It's probably good to show all optional props (dma-coherent,
calxeda,sgpio-gpios, & calxeda,led-order) in the example.

> +        };
> --
> 1.8.1.2
>
> --
> To unsubscribe from this list: send the line "unsubscribe devicetree" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] gcc feature request: Moving blocks into sections
On Wed, 2013-08-07 at 12:03 -0400, Mathieu Desnoyers wrote:

> You might want to try creating a global array of counters (accessible
> both from C for printout and assembly for update).
>
> Index the array from assembly using: (2f - 1f)
>
> 1:
> jmp ...;
> 2:
>
> And put an atomic increment of the counter. This increment instruction
> should be located prior to the jmp for obvious reasons.
>
> You'll end up with the sums you're looking for at indexes 2 and 5 of the
> array.

After I post the patches, feel free to knock yourself out.

-- Steve
Re: [PATCH] ARM: at91/dt: split sam9x5 peripheral definitions
Dear Boris BREZILLON,

On Wed, 7 Aug 2013 12:14:26 +0200, Boris BREZILLON wrote:
> This patch splits the sam9x5 peripheral definitions into:
> - a common base for all sam9x5 SoCs (at91sam9x5.dtsi)
> - several optional peripheral definitions which will be included by specific
>   sam9x5 SoCs (at91sam9x5_'periph name'.dtsi)
>
> This provides a better representation of the real hardware (drop unneeded
> dt nodes) and avoids future peripheral id conflict (lcdc and isi both use
> peripheral id 25).
>
> Signed-off-by: Boris BREZILLON
> ---
>  arch/arm/boot/dts/at91sam9g25.dtsi       |    2 +
>  arch/arm/boot/dts/at91sam9g35.dtsi       |    1 +
>  arch/arm/boot/dts/at91sam9x25.dtsi       |   24 ++-
>  arch/arm/boot/dts/at91sam9x35.dtsi       |    1 +
>  arch/arm/boot/dts/at91sam9x5.dtsi        |   67 --
>  arch/arm/boot/dts/at91sam9x5_macb0.dtsi  |   56 +
>  arch/arm/boot/dts/at91sam9x5_macb1.dtsi  |   44
>  arch/arm/boot/dts/at91sam9x5_usart3.dtsi |   51 +++
>  8 files changed, 158 insertions(+), 88 deletions(-)
>  create mode 100644 arch/arm/boot/dts/at91sam9x5_macb0.dtsi
>  create mode 100644 arch/arm/boot/dts/at91sam9x5_macb1.dtsi
>  create mode 100644 arch/arm/boot/dts/at91sam9x5_usart3.dtsi

Hum, do we really want to have .dtsi files per peripheral? I might have
overlooked this, but I think it's the first time we would have this in
arch/arm/boot/dts.

Thomas
--
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com
Re: linux-next: Tree for Aug 7
On Wed, Aug 07, 2013 at 10:29:18AM +0200, Sedat Dilek wrote:
> On Wed, Aug 7, 2013 at 7:54 AM, Stephen Rothwell wrote:
> > Hi all,
> >
> > Changes since 20130806:
> >
> > The ext4 tree lost its build failure.
> >
> > The mvebu tree gained a build failure so I used the version from
> > next-20130806.
> >
> > The akpm tree gained conflicts against the ext4 tree.
> >
> > 
> >
>
> [ CC some netdev and wireless folks ]
>
> Yesterday, I discovered an issue with net-next.
> The patch in [1] fixed the problems in my network/wifi environment.
> Hannes confirmed that virtio_net are solved, too.
> Today's next-20130807 still needs it for affected people.
>
> - Sedat -
>
> [1] http://marc.info/?l=linux-netdev&m=137582524017840&w=2
> [2] http://marc.info/?l=linux-netdev&m=137583048219416&w=2
> [3] http://marc.info/?t=13757971288&r=1&w=2

Could you please try the attached patch. It limits parsing the ethernet
header (by calling eth_type_trans()) to cases when the configured protocol
is ETH_P_ALL, so at least for 802.1X this should fix the problem.

The idea behind this patch is that users setting the protocol to something
else probably do know better and so should be left alone.
Best wishes,

Phil

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index bbe1ece..66bc79c 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1932,8 +1932,6 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 
 	ph.raw = frame;
 
-	skb->protocol = proto;
-	skb->dev = dev;
 	skb->priority = po->sk.sk_priority;
 	skb->mark = po->sk.sk_mark;
 	sock_tx_timestamp(&po->sk, &skb_shinfo(skb)->tx_flags);
@@ -2002,13 +2000,18 @@ static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 		if (unlikely(err))
 			return err;
 
-		if (dev->type == ARPHRD_ETHER)
-			skb->protocol = eth_type_trans(skb, dev);
-
 		data += dev->hard_header_len;
 		to_write -= dev->hard_header_len;
 	}
 
+	if (dev->type == ARPHRD_ETHER &&
+	    proto == htons(ETH_P_ALL)) {
+		skb->protocol = eth_type_trans(skb, dev);
+	} else {
+		skb->protocol = proto;
+		skb->dev = dev;
+	}
+
 	max_frame_len = dev->mtu + dev->hard_header_len;
 	if (skb->protocol == htons(ETH_P_8021Q))
 		max_frame_len += VLAN_HLEN;
@@ -2331,15 +2334,17 @@ static int packet_snd(struct socket *sock,
 
 	sock_tx_timestamp(sk, &skb_shinfo(skb)->tx_flags);
 
-	if (dev->type == ARPHRD_ETHER) {
+	if (dev->type == ARPHRD_ETHER &&
+	    proto == htons(ETH_P_ALL)) {
 		skb->protocol = eth_type_trans(skb, dev);
-		if (skb->protocol == htons(ETH_P_8021Q))
-			reserve += VLAN_HLEN;
 	} else {
 		skb->protocol = proto;
 		skb->dev = dev;
 	}
 
+	if (skb->protocol == htons(ETH_P_8021Q))
+		reserve += VLAN_HLEN;
+
 	if (!gso_type && (len > dev->mtu + reserve + extra_len)) {
 		err = -EMSGSIZE;
 		goto out_free;
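The decision the patch introduces can be modelled outside the kernel. The
following is only a user-space sketch: the function names
(should_parse_eth_header, max_frame_len) and the SK_ prefixed constants
are made up for illustration, with values copied from the Linux UAPI
headers so the code builds without them.

```c
#include <arpa/inet.h>	/* htons() */
#include <stdint.h>

/* Values as defined in the Linux headers, renamed to avoid clashes. */
#define SK_ARPHRD_ETHER	1	/* ARPHRD_ETHER */
#define SK_ETH_P_ALL	0x0003	/* ETH_P_ALL */
#define SK_ETH_P_8021Q	0x8100	/* ETH_P_8021Q */
#define SK_VLAN_HLEN	4	/* VLAN_HLEN */

/*
 * Nonzero if the ethernet header should be parsed to find the protocol,
 * i.e. the device is ethernet and the socket is bound to ETH_P_ALL.
 * proto_be is in network byte order, as in the patch.
 */
int should_parse_eth_header(unsigned short dev_type, uint16_t proto_be)
{
	return dev_type == SK_ARPHRD_ETHER &&
	       proto_be == htons(SK_ETH_P_ALL);
}

/* Largest frame allowed, with room for a VLAN tag on 802.1Q frames. */
unsigned int max_frame_len(unsigned int mtu, unsigned int hard_header_len,
			   uint16_t skb_protocol_be)
{
	unsigned int len = mtu + hard_header_len;

	if (skb_protocol_be == htons(SK_ETH_P_8021Q))
		len += SK_VLAN_HLEN;
	return len;
}
```

With this rule a socket bound to the 802.1X ethertype (0x888E) keeps the
protocol it configured, while an ETH_P_ALL socket still has the protocol
derived from the frame.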
Re: WARNING: CPU: 26 PID: 93793 at fs/ext4/inode.c:230 ext4_evict_inode+0x4c9/0x500 [ext4]() still in 3.11-rc3
On 08/07/2013 08:33 AM, Jan Kara wrote:
> On Wed 07-08-13 08:27:32, Guenter Roeck wrote:
>> On 08/07/2013 08:20 AM, Jan Kara wrote:
>>> On Thu 01-08-13 20:58:46, Davidlohr Bueso wrote:
>>>> On Thu, 2013-08-01 at 22:33 +0200, Jan Kara wrote:
>>>>> Hi,
>>>>>
>>>>> On Thu 01-08-13 13:14:19, Davidlohr Bueso wrote:
>>>>>> FYI I'm seeing loads of the following messages with Linus' latest
>>>>>> 3.11-rc3 (which includes 822dbba33458cd6ad)
>>>>> Thanks for notice. I see you are running reaim to trigger this. What
>>>>> workload?
>>>>
>>>> After re-running the workloads one by one, I finally hit the issue again
>>>> with 'dbase'. FWIW I'm using ramdisks + ext4.
>>> Hum, I'm not able to reproduce this with current Linus' kernel - commit
>>> e4ef108fcde0b97ed38923ba1ea06c7a152bab9e - I've tried with ramdisk but no
>>> luck. Are you using some special mount options?
>>>
>> I don't see this commit in the upstream kernel ?
> It is Linus's merge of Tejun's libata fix from Tuesday...
>
>> I tried reproducing the problem on the same system I had seen
>> 822dbba33458cd6ad on, with the same workload. It has now been running
>> since last Friday, but I have not seen any problems.
> Ah, OK, so it may be fixed after all. If you happen to see it again,
> please let me know. Thanks!
>
At least the problem I found, yes. The problem Davidlohr found may be
a different one.

Guenter
Re: [PATCH 1/2] hwmon: (lm90) Add power control
On 08/07/2013 03:35 AM, Wei Ni wrote: > On 08/07/2013 04:45 PM, Alexander Shiyan wrote: >>> On 08/07/2013 03:50 PM, Guenter Roeck wrote: On 08/07/2013 12:32 AM, Wei Ni wrote: > On 08/07/2013 03:27 PM, Alexander Shiyan wrote: >>> The device lm90 can be controlled by the vdd rail. >>> Adding the power control support to power on/off the vdd rail. >>> And make sure that power is enabled before accessing the device. >>> >>> Signed-off-by: Wei Ni >>> --- >>> drivers/hwmon/lm90.c | 52 >>> ++ >> [...] >>> + if (!data->lm90_reg) { >>> + data->lm90_reg = regulator_get(>dev, "vdd"); >>> + if (IS_ERR_OR_NULL(data->lm90_reg)) { >>> + if (PTR_ERR(data->lm90_reg) == -ENODEV) >>> + dev_info(>dev, >>> +"No regulator found for vdd. >>> Assuming vdd is always powered."); >>> + else >>> + dev_warn(>dev, >>> +"Error [%ld] in getting the >>> regulator handle for vdd.\n", >>> +PTR_ERR(data->lm90_reg)); >>> + data->lm90_reg = NULL; >>> + mutex_unlock(>update_lock); >>> + return -ENODEV; >>> + } >>> + } >>> + if (is_enable) { >>> + ret = regulator_enable(data->lm90_reg); >>> + msleep(POWER_ON_DELAY); >> >> Can this delay be handled directly from regulator? > > I think it should be handled in the device driver. > Because there have different delay time to wait devices stable. > Then why does no other caller of regulator_enable() need this ? I don't think lm90 is so much different to other users of regulator functionality. >>> >>> May be I'm wrong. I noticed that in lm90 SPEC, the max of "SMBus Clock >>> Low Time" is 25ms, so I supposed that it may need about 20ms to stable >>> after power on. >>> >>> Anyway, if I remove this delay, the driver also works fine, so I will >>> remove it in my next patch. >> >> I originally had in mind that regulator API contain own delay option. >> E.g. reg-fixed-voltage && gpio-regulator contains "startup-delay-us" >> property. > > As I know the "startup-delay-us" is used for the regulator device, not > the consumer devices. 
Yes, the regulator should encode its own startup delay. Each individual
device should handle its own requirements for delay after power is stable.

> In this patch, msleep(POWER_ON_DELAY) was used to wait the lm90 stable,
> but it seems it's unnecessary now :)

No, the driver needs to handle this properly. If the datasheet says a
delay is needed, it is. It's probably working because in your tests the
supply just happens to be on already.
Re: [RFC] gcc feature request: Moving blocks into sections
* Steven Rostedt (rost...@goodmis.org) wrote:
> On Wed, 2013-08-07 at 07:06 +0200, Ondřej Bílka wrote:
>
> > Add short_counter,long_counter and before increment counter before each
> > jump. That way we will know how many short/long jumps were taken.
>
> That's not trivial at all. The jump is a single location (in an asm
> goto() statement) that happens to be inlined through out the kernel. The
> assembler decides if it will be a short or long jump. How do you add a
> counter to count the difference?

You might want to try creating a global array of counters (accessible
both from C for printout and assembly for update).

Index the array from assembly using: (2f - 1f)

1:
jmp ...;
2:

And put an atomic increment of the counter. This increment instruction
should be located prior to the jmp for obvious reasons.

You'll end up with the sums you're looking for at indexes 2 and 5 of the
array.

Thanks,

Mathieu

>
> The output I gave is from the boot up code that converts the jmp back to
> a nop (or in this case, the default nop to the ideal nop). It knows the
> size by reading the op code. This is a static analysis, not a running
> one. It's no trivial task to have a counter for each jump.
>
> There is a way though. If we enable all the jumps (all tracepoints, and
> other users of jumplabel), record the trace and then compare the trace
> to the output that shows which ones were short jumps, and all others are
> long jumps.
>
> I'll post the patches soon and you can have fun doing the compare :-)
>
> Actually, I'm working on the 4 patches of the series that is more about
> clean ups and safety checks than the jmp conversion. That is not
> controversial, and I'll be posting them for 3.12 soon.
>
> After that, I'll post the updated patches that have the conversion as
> well as the counter, for RFC and for others to play with.
>
> -- Steve
>
>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
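The "indexes 2 and 5" above follow from the x86 encodings: a short jmp
(opcode 0xEB, 8-bit displacement) is 2 bytes long and a near jmp (opcode
0xE9, 32-bit displacement) is 5 bytes long. Below is a rough C model of a
counter array indexed by instruction length. It takes the static route of
reading the opcode rather than the run-time atomic increment proposed in
the mail, and every name in it is hypothetical.

```c
#include <stddef.h>

#define JMP_SHORT_OPCODE	0xEB	/* jmp rel8, 2 bytes */
#define JMP_NEAR_OPCODE		0xE9	/* jmp rel32, 5 bytes */
#define MAX_JMP_LEN		5

/* One counter per possible length; only slots 2 and 5 are ever used. */
static unsigned long jmp_len_counters[MAX_JMP_LEN + 1];

/* Length in bytes of the jmp at insn, or 0 if it is not a plain jmp. */
size_t jmp_insn_len(const unsigned char *insn)
{
	switch (insn[0]) {
	case JMP_SHORT_OPCODE:
		return 2;
	case JMP_NEAR_OPCODE:
		return 5;
	default:
		return 0;
	}
}

/* Account one jump site in the slot matching its length. */
void count_jmp(const unsigned char *insn)
{
	size_t len = jmp_insn_len(insn);

	if (len)
		jmp_len_counters[len]++;
}

/* Number of jump sites seen so far with the given length. */
unsigned long jmp_count(size_t len)
{
	return len <= MAX_JMP_LEN ? jmp_len_counters[len] : 0;
}
```

This model is not thread-safe; a run-time version would need the atomic
increment mentioned in the mail.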
Re: [PATCH 2/2] ARM: dt: t114 dalmore: add dt entry for nct1008
On 08/07/2013 12:52 AM, Wei Ni wrote:
> Enable thermal sensor nct1008 for t114 dalmore.

Wei, I assume this patch doesn't depend on any of the other LM90-related
patches you've sent; I can simply apply it right away?

Is the LM90 DT binding fully documented somewhere, including the
vdd-supply property?
Re: [PATCH] ARM: dts: Fix memory node in skeleton64.dtsi
On Wed, Aug 07, 2013 at 08:23:06AM +0200, Gregory CLEMENT wrote:
> On 07/08/2013 03:33, Stepan Moskovchenko wrote:
> > Update the reg property of the memory node in
> > skeleton64.dtsi to reflect the fact that the root node uses
> > address-cells=2 and size-cells=2.
>
> Good catch
>
> Acked-by: Gregory CLEMENT

Since we introduced the file, and I can't think of any other tree that
should take it, I'll go ahead and take it.

thx,

Jason.

> > Change-Id: Ie9b61166143969e020ceebc51e9a384405d8c0f2
> > Signed-off-by: Stepan Moskovchenko
> > ---
> >  arch/arm/boot/dts/skeleton64.dtsi |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/arch/arm/boot/dts/skeleton64.dtsi
> > b/arch/arm/boot/dts/skeleton64.dtsi
> > index 1599415..b5d7f36 100644
> > --- a/arch/arm/boot/dts/skeleton64.dtsi
> > +++ b/arch/arm/boot/dts/skeleton64.dtsi
> > @@ -9,5 +9,5 @@
> >  	#size-cells = <2>;
> >  	chosen { };
> >  	aliases { };
> > -	memory { device_type = "memory"; reg = <0 0>; };
> > +	memory { device_type = "memory"; reg = <0 0 0 0>; };
> >  };
> >
>
>
> --
> Gregory Clement, Free Electrons
> Kernel, drivers, real-time and embedded Linux
> development, consulting, training and support.
> http://free-electrons.com
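Why the old value was wrong can be shown with a small decoder. This is a
sketch only: decode_reg and cells_to_u64 are hypothetical helpers, not
kernel or libfdt functions, and the cells are assumed to have already been
converted from the big-endian form stored in a DTB to CPU byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Combine n 32-bit cells, most significant first, into one value. */
static uint64_t cells_to_u64(const uint32_t *cells, unsigned int n)
{
	uint64_t v = 0;

	while (n--)
		v = (v << 32) | *cells++;
	return v;
}

/*
 * Decode the first (address, size) pair of a reg property.
 * Returns 0 on success, -1 if the property is too short for the
 * parent's #address-cells/#size-cells or the cell counts exceed
 * what fits in 64 bits.
 */
int decode_reg(const uint32_t *cells, size_t ncells,
	       unsigned int addr_cells, unsigned int size_cells,
	       uint64_t *addr, uint64_t *size)
{
	if (addr_cells == 0 || addr_cells > 2 || size_cells > 2)
		return -1;
	if (ncells < (size_t)addr_cells + size_cells)
		return -1;
	*addr = cells_to_u64(cells, addr_cells);
	*size = cells_to_u64(cells + addr_cells, size_cells);
	return 0;
}
```

With #address-cells = 2 and #size-cells = 2, a two-cell property such as
<0 0> is rejected as too short, while the four-cell <0 0 0 0> decodes to
address 0 and size 0.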
Re: [PATCH] pinctrl: msm: Add support for MSM TLMM pinmux
On 08/06/2013 05:45 PM, Hanumant Singh wrote: > On 7/31/2013 5:17 PM, Hanumant Singh wrote: >> On 7/31/2013 2:06 PM, Stephen Warren wrote: >>> On 07/31/2013 01:46 PM, Hanumant Singh wrote: On 7/30/2013 8:59 PM, Stephen Warren wrote: > On 07/30/2013 06:13 PM, Hanumant Singh wrote: >> On 7/30/2013 5:08 PM, Stephen Warren wrote: >>> On 07/30/2013 06:01 PM, Hanumant Singh wrote: On 7/30/2013 2:22 PM, Stephen Warren wrote: > On 07/30/2013 03:10 PM, hanumant wrote: > ... >> We actually have the same TLMM pinmux used by several socs of a >> family. >> The number of pins on each soc may vary. >> Also a given soc gets used in a number of boards. >> The device tree for a given soc is split into the different >> boards >> that >> its in ie the boards inherit a common soc.dtsi but have separate >> dts. >> The boards for the same soc may use different pin groups for >> accomplishing a function, since we have multiple i2c, spi uart >> etc >> peripheral instances on a soc. A different instance of each of >> the >> above >> peripherals, can be used in different boards, utilizing different >> or subset of same pin groups. >> Thus I would need to have multiple C files for one soc, based >> on the >> boards that it goes into. > > The pinctrl driver should be exposing the raw capabilities of > the HW. > All the board-specific configuration should be expressed in DT. > So, the > driver shouldn't have to know anything about different boards at > compile-time. > I agree, so I wanted to keep the pin grouping information in DT, we already have a board based differentiation of dts files in DT, for the same soc. >>> >>> That's the opposite of what I was saying. Pin groups are a >>> feature of >>> the SoC design, not the board. >>> >> Sorry I guess I wasn't clear. >> Right now I have a soc-pinctrl.dtsi containing pin groupings. >> This will be "inherited" by soc-boardtype.dts. 
>> The pinctrl client device nodes in soc-boardtype.dts will point to >> pin >> groupings in soc-pinctrl.dtsi that are valid for that particular >> boardtype. >> Is this a valid design? > > OK, so you have two types of child node inside the pinctrl DT node; > some > define the pin groups the SoC has (in soc.dtsi) and some define > pinctrl > states that reference the pin group nodes and are referenced by the > client nodes. > > That's probably fine. However, I'd still question putting the pin > group > nodes in DT at all; I'm not convinced it's better than just putting > those into the driver itself. You end up with the same data tables > after > parsing the DT anyway. > Any feedback for the rest of the patch? >>> >>> I'm certainly waiting for this aspect of the patch to be resolved; I >>> think it will impact the rest of the patch so much that it's not worth >>> reviewing until we decide on where to represent the pin groups (some DT >>> parsing could would be removed if we put the pin group definitions into >>> the driver, hence wouldn't need to be reviewed, and likewise there's be >>> some new tables to review). >>> >> >> I am trying to look at examples of what you are suggesting. >> I was looking at the exynos implementation, and just from a brief glance >> it seems like there too the pin grouping is being specified in the >> device tree, using what looks like labels of the pins. >> The labels are matched to group structures in soc specific files? >> >> By having the pin groupings in DT I am able to reuse the driver without >> any SOC based code bloat. >> As I mentioned earlier, we have entire families of SOCs using the same >> TLMM hardware. >> Its not a guarantee that for a given TLMM version, >> the pin groupings on that hardware are the same for every SOC that its >> in. Its infact most likely that I wont be able to use the pin groupings >> from one SOC to the next even if they both use the same TLMM. 
>> It will very quickly lead to a bloat of >> pinctrl-.c (containing the pin groupings replicated for each >> soc) >> which use TLMM version specific register programming implementation >> pinctrl-tlmm-.c >> and the DT parsing and interface to framework (which remains unchanged). >> pinctrl-msm.c. >> >> Thanks >> Hanumant >> > > Any comments on this? No. As I said, I personally want to see all the pingroups defined in the pinctrl driver. But, if someone else acks/... the patches without it, I probably won't nack it. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at
[PATCH v3 3/5] devicetree: create a separate binding description for sata_highbank
The Calxeda sata_highbank driver has been adding its descriptions to the
ahci driver. Separate them properly.

Signed-off-by: Mark Langsdorf
Acked-by: Rob Herring
---
Changes from v2
        Fixed some indenting.
Changes from v1
        None.

 .../devicetree/bindings/ata/ahci-platform.txt | 18 +++-
 .../devicetree/bindings/ata/sata_highbank.txt | 32 ++
 2 files changed, 36 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/ata/sata_highbank.txt

diff --git a/Documentation/devicetree/bindings/ata/ahci-platform.txt b/Documentation/devicetree/bindings/ata/ahci-platform.txt
index 3ec0c5c..89de156 100644
--- a/Documentation/devicetree/bindings/ata/ahci-platform.txt
+++ b/Documentation/devicetree/bindings/ata/ahci-platform.txt
@@ -4,27 +4,17 @@ SATA nodes are defined to describe on-chip Serial ATA controllers.
 Each SATA controller should have its own node.

 Required properties:
-- compatible: compatible list, contains "calxeda,hb-ahci" or "snps,spear-ahci"
+- compatible: compatible list, contains "snps,spear-ahci"
 - interrupts:
 - reg :

 Optional properties:
-- calxeda,port-phys: phandle-combophy and lane assignment, which maps each
-                     SATA port to a combophy and a lane within that
-                     combophy
-- calxeda,sgpio-gpio: phandle-gpio bank, bit offset, and default on or off,
-                     which indicates that the driver supports SGPIO
-                     indicator lights using the indicated GPIOs
-- calxeda,led-order : a u32 array that map port numbers to offsets within the
-                     SGPIO bitstream.
 - dma-coherent : Present if dma operations are coherent

 Example:
         sata@ffe08000 {
-                compatible = "calxeda,hb-ahci";
-                reg = <0xffe08000 0x1000>;
-                interrupts = <115>;
-                calxeda,port-phys = < 0 0 1
-                                      2 3>;
+                compatible = "snps,spear-ahci";
+                reg = <0xffe08000 0x1000>;
+                interrupts = <115>;

         };
diff --git a/Documentation/devicetree/bindings/ata/sata_highbank.txt b/Documentation/devicetree/bindings/ata/sata_highbank.txt
new file mode 100644
index 000..1ac6d3d
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/sata_highbank.txt
@@ -0,0 +1,32 @@
+* Calxeda AHCI SATA Controller
+
+SATA nodes are defined to describe on-chip Serial ATA controllers.
+The Calxeda SATA controller mostly conforms to the AHCI interface
+with some special extensions to add functionality.
+Each SATA controller should have its own node.
+
+Required properties:
+- compatible: compatible list, contains "calxeda,hb-ahci"
+- interrupts:
+- reg :
+
+Optional properties:
+- dma-coherent : Present if dma operations are coherent
+- calxeda,port-phys: phandle-combophy and lane assignment, which maps each
+                     SATA port to a combophy and a lane within that
+                     combophy
+- calxeda,sgpio-gpio: phandle-gpio bank, bit offset, and default on or off,
+                     which indicates that the driver supports SGPIO
+                     indicator lights using the indicated GPIOs
+- calxeda,led-order : a u32 array that map port numbers to offsets within the
+                     SGPIO bitstream.
+
+Example:
+        sata@ffe08000 {
+                compatible = "calxeda,hb-ahci";
+                reg = <0xffe08000 0x1000>;
+                interrupts = <115>;
+                calxeda,port-phys = < 0 0 1
+                                      2 3>;
+
+        };
--
1.8.1.2
[PATCH v3 5/5] sata, highbank: send extra clock cycles in SGPIO patterns
Some SGPIO PICs don't follow the standard very well and expect a certain
number of clock cycles or port frames in each SGPIO pattern. Add two
optional parameters in the DTB that can provide the number of extra clock
cycles to be sent before and after SGPIO pattern.

Read those parameters from the DTB and send the extra clock cycles.

Signed-off-by: Mark Langsdorf
Acked-by: Rob Herring
---
Changes from v2
        None.
Changes from v1
        Added an example to the bindings.
        Forced the pre-clocks and post-clocks values to 0 if there is an
        error while reading them or the values aren't in the DTB.

 Documentation/devicetree/bindings/ata/sata_highbank.txt |  6 ++
 drivers/ata/sata_highbank.c                             | 13 +
 2 files changed, 19 insertions(+)

diff --git a/Documentation/devicetree/bindings/ata/sata_highbank.txt b/Documentation/devicetree/bindings/ata/sata_highbank.txt
index fdbd4476..6124a32 100644
--- a/Documentation/devicetree/bindings/ata/sata_highbank.txt
+++ b/Documentation/devicetree/bindings/ata/sata_highbank.txt
@@ -23,6 +23,10 @@ Optional properties:
 - calxeda,tx-atten  : a u32 array that contains TX attenuation override
                      codes, one per port. The upper 3 bytes are always
                      0 and thus ignored.
+- calxeda,pre-clocks : a u32 that indicates the number of additional clock
+                     cycles to transmit before sending an SGPIO pattern
+- calxeda,post-clocks: a u32 that indicates the number of additional clock
+                     cycles to transmit after sending an SGPIO pattern

 Example:
         sata@ffe08000 {
@@ -32,4 +36,6 @@ Example:
                 calxeda,port-phys = < 0 0 1
                                       2 3>;
                 calxeda,tx-atten = <0xff 22 0xff 0xff 23>;
+                calxeda,pre-clocks = <10>;
+                calxeda,post-clocks = <0>;
         };
diff --git a/drivers/ata/sata_highbank.c b/drivers/ata/sata_highbank.c
index a7c8038..7f5e5d9 100644
--- a/drivers/ata/sata_highbank.c
+++ b/drivers/ata/sata_highbank.c
@@ -84,6 +84,9 @@ static DEFINE_SPINLOCK(sgpio_lock);

 struct ecx_plat_data {
         u32             n_ports;
+        /* number of extra clocks that the SGPIO PIC controller expects */
+        u32             pre_clocks;
+        u32             post_clocks;
         unsigned        sgpio_gpio[SGPIO_PINS];
         u32             sgpio_pattern;
         u32             port_to_sgpio[SGPIO_PORTS];
@@ -160,6 +163,9 @@ static ssize_t ecx_transmit_led_message(struct ata_port *ap, u32 state,
         spin_lock_irqsave(&sgpio_lock, flags);
         ecx_parse_sgpio(pdata, ap->port_no, state);
         sgpio_out = pdata->sgpio_pattern;
+        for (i = 0; i < pdata->pre_clocks; i++)
+                ecx_led_cycle_clock(pdata);
+
         gpio_set_value(pdata->sgpio_gpio[SLOAD], 1);
         ecx_led_cycle_clock(pdata);
         gpio_set_value(pdata->sgpio_gpio[SLOAD], 0);
@@ -172,6 +178,8 @@ static ssize_t ecx_transmit_led_message(struct ata_port *ap, u32 state,
                 sgpio_out >>= 1;
                 ecx_led_cycle_clock(pdata);
         }
+        for (i = 0; i < pdata->post_clocks; i++)
+                ecx_led_cycle_clock(pdata);

         /* save off new led state for port/slot */
         emp->led_state = state;
@@ -206,6 +214,11 @@ static void highbank_set_em_messages(struct device *dev,
         of_property_read_u32_array(np, "calxeda,led-order",
                                    pdata->port_to_sgpio,
                                    pdata->n_ports);
+        if (of_property_read_u32(np, "calxeda,pre-clocks", &pdata->pre_clocks))
+                pdata->pre_clocks = 0;
+        if (of_property_read_u32(np, "calxeda,post-clocks",
+                                 &pdata->post_clocks))
+                pdata->post_clocks = 0;

         /* store em_loc */
         hpriv->em_loc = 0;
--
1.8.1.2
Re: unused swap offset / bad page map.
void __lru_cache_add(struct page *page)
{
	struct pagevec *pvec = &get_cpu_var(lru_add_pvec);

	page_cache_get(page);
	if (!pagevec_space(pvec))
		__pagevec_lru_add(pvec);
	pagevec_add(pvec, page);
	put_cpu_var(lru_add_pvec);
}

I added a printk, and found that pagevec_add frequently returns 0.
Is that ok? What happens to 'page' in this case?

	Dave
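A minimal userspace sketch of the bookkeeping being asked about, assuming pagevec_add() stores the page first and then returns the space left in the vector. Under that assumption a return value of 0 only means the vector has just become full: the page is still recorded, and the pagevec_space() check at the top of the next __lru_cache_add() call is what triggers the drain. The toy_ names and the size constant are made up for illustration and are not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capacity for this model; the kernel's constant may differ. */
#define TOY_PAGEVEC_SIZE 14

struct toy_pagevec {
	unsigned nr;                     /* number of slots in use */
	void *pages[TOY_PAGEVEC_SIZE];   /* stand-in for struct page pointers */
};

/* Slots still free in the vector. */
static unsigned toy_pagevec_space(const struct toy_pagevec *pvec)
{
	return TOY_PAGEVEC_SIZE - pvec->nr;
}

/*
 * Store the page, then report how much room is left.
 * A return of 0 means "now full", not "page was dropped".
 */
static unsigned toy_pagevec_add(struct toy_pagevec *pvec, void *page)
{
	assert(pvec->nr < TOY_PAGEVEC_SIZE);   /* caller must drain first */
	pvec->pages[pvec->nr++] = page;
	return toy_pagevec_space(pvec);
}
```

In this model, the add that fills the last slot returns 0 while the page sits safely in the final slot; a caller that checks for space before adding, as the function quoted above does, never overruns the array.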
[PATCH v3 1/5] sata, highbank: fix ordering of SGPIO signals
The ACTIVITY and ERROR signals were reversed in the original commit.
Fix that so that hard drive activity does not show up on the error
light, and attempts to indicate that the hard drive is failing do not
show up as hard drive activity.

This fixes a fairly serious functional bug in the driver, but failing
to apply this patch will not cause any stability issues on the system.

Signed-off-by: Mark Langsdorf
---
Changes from v2
	Further rewords of the commit message.
Changes from v1
	Expanded commit message explaining the problems with the
	unpatched code.

 drivers/ata/sata_highbank.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/ata/sata_highbank.c b/drivers/ata/sata_highbank.c
index d047d92..e9a4f46 100644
--- a/drivers/ata/sata_highbank.c
+++ b/drivers/ata/sata_highbank.c
@@ -86,11 +86,11 @@ struct ecx_plat_data {
 #define SGPIO_SIGNALS			3
 #define ECX_ACTIVITY_BITS		0x30
-#define ECX_ACTIVITY_SHIFT		2
+#define ECX_ACTIVITY_SHIFT		0
 #define ECX_LOCATE_BITS			0x8
 #define ECX_LOCATE_SHIFT		1
 #define ECX_FAULT_BITS			0x40
-#define ECX_FAULT_SHIFT			0
+#define ECX_FAULT_SHIFT			2
 
 static inline int sgpio_bit_shift(struct ecx_plat_data *pdata, u32 port,
 				u32 shift)
 {
-- 
1.8.1.2
[PATCH v3 4/5] sata, highbank: set tx_atten override bits
Some board designs do not drive the SATA transmit lines within the specification. The ECME can provide override settings, on a per board basis, to bring the transmit lines within spec. Read those settings from the DTB and program them in. Signed-off-by: Mark Langsdorf --- Changes from v2 None. Changes from v1 Clarified that the array is a u32 array. Added an example in the bindings. .../devicetree/bindings/ata/sata_highbank.txt | 5 +- drivers/ata/sata_highbank.c| 58 +- 2 files changed, 49 insertions(+), 14 deletions(-) diff --git a/Documentation/devicetree/bindings/ata/sata_highbank.txt b/Documentation/devicetree/bindings/ata/sata_highbank.txt index 1ac6d3d..fdbd4476 100644 --- a/Documentation/devicetree/bindings/ata/sata_highbank.txt +++ b/Documentation/devicetree/bindings/ata/sata_highbank.txt @@ -20,6 +20,9 @@ Optional properties: indicator lights using the indicated GPIOs - calxeda,led-order : a u32 array that map port numbers to offsets within the SGPIO bitstream. +- calxeda,tx-atten : a u32 array that contains TX attenuation override + codes, one per port. The upper 3 bytes are always + 0 and thus ignored. 
Example: sata@ffe08000 { @@ -28,5 +31,5 @@ Example: interrupts = <115>; calxeda,port-phys = < 0 0 1 2 3>; - + calxeda,tx-atten = <0xff 22 0xff 0xff 23>; }; diff --git a/drivers/ata/sata_highbank.c b/drivers/ata/sata_highbank.c index 8b40025..a7c8038 100644 --- a/drivers/ata/sata_highbank.c +++ b/drivers/ata/sata_highbank.c @@ -46,14 +46,19 @@ #define CR_BUSY0x0001 #define CR_START 0x0001 #define CR_WR_RDN 0x0002 +#define CPHY_TX_INPUT_STS 0x2001 #define CPHY_RX_INPUT_STS 0x2002 -#define CPHY_SATA_OVERRIDE 0x4000 -#define CPHY_OVERRIDE 0x2005 +#define CPHY_SATA_TX_OVERRIDE 0x8000 +#define CPHY_SATA_RX_OVERRIDE 0x4000 +#define CPHY_TX_OVERRIDE 0x2004 +#define CPHY_RX_OVERRIDE 0x2005 #define SPHY_LANE 0x100 #define SPHY_HALF_RATE 0x0001 #define CPHY_SATA_DPLL_MODE0x0700 #define CPHY_SATA_DPLL_SHIFT 8 #define CPHY_SATA_DPLL_RESET (1 << 11) +#define CPHY_SATA_TX_ATTEN 0x1c00 +#define CPHY_SATA_TX_ATTEN_SHIFT 10 #define CPHY_PHY_COUNT 6 #define CPHY_LANE_COUNT4 #define CPHY_PORT_COUNT(CPHY_PHY_COUNT * CPHY_LANE_COUNT) @@ -66,6 +71,7 @@ struct phy_lane_info { void __iomem *phy_base; u8 lane_mapping; u8 phy_devs; + u8 tx_atten; }; static struct phy_lane_info port_data[CPHY_PORT_COUNT]; @@ -76,7 +82,6 @@ static DEFINE_SPINLOCK(sgpio_lock); #define SGPIO_PINS 3 #define SGPIO_PORTS8 -/* can be cast as an ahci_host_priv for compatibility with most functions */ struct ecx_plat_data { u32 n_ports; unsignedsgpio_gpio[SGPIO_PINS]; @@ -259,8 +264,27 @@ static void highbank_cphy_disable_overrides(u8 sata_port) if (unlikely(port_data[sata_port].phy_base == NULL)) return; tmp = combo_phy_read(sata_port, CPHY_RX_INPUT_STS + lane * SPHY_LANE); - tmp &= ~CPHY_SATA_OVERRIDE; - combo_phy_write(sata_port, CPHY_OVERRIDE + lane * SPHY_LANE, tmp); + tmp &= ~CPHY_SATA_RX_OVERRIDE; + combo_phy_write(sata_port, CPHY_RX_OVERRIDE + lane * SPHY_LANE, tmp); +} + +static void cphy_override_tx_attenuation(u8 sata_port, u32 val) +{ + u8 lane = port_data[sata_port].lane_mapping; + u32 tmp; + + if (val & 
0x8) + return; + + tmp = combo_phy_read(sata_port, CPHY_TX_INPUT_STS + lane * SPHY_LANE); + tmp &= ~CPHY_SATA_TX_OVERRIDE; + combo_phy_write(sata_port, CPHY_TX_OVERRIDE + lane * SPHY_LANE, tmp); + + tmp |= CPHY_SATA_TX_OVERRIDE; + combo_phy_write(sata_port, CPHY_TX_OVERRIDE + lane * SPHY_LANE, tmp); + + tmp |= (val << CPHY_SATA_TX_ATTEN_SHIFT) & CPHY_SATA_TX_ATTEN; + combo_phy_write(sata_port, CPHY_TX_OVERRIDE + lane * SPHY_LANE, tmp); } static void cphy_override_rx_mode(u8 sata_port, u32 val) @@ -268,21 +292,21 @@ static void cphy_override_rx_mode(u8 sata_port, u32 val) u8 lane = port_data[sata_port].lane_mapping; u32 tmp; tmp = combo_phy_read(sata_port, CPHY_RX_INPUT_STS + lane * SPHY_LANE); - tmp &= ~CPHY_SATA_OVERRIDE; - combo_phy_write(sata_port, CPHY_OVERRIDE + lane * SPHY_LANE, tmp); + tmp &= ~CPHY_SATA_RX_OVERRIDE; + combo_phy_write(sata_port, CPHY_RX_OVERRIDE + lane * SPHY_LANE, tmp); - tmp |= CPHY_SATA_OVERRIDE; -
[PATCH v3 2/5] sata highbank: enable 64-bit DMA mask when using LPAE
From: Rob Herring

Signed-off-by: Rob Herring
Signed-off-by: Mark Langsdorf
---
Changes from v1, v2
	None.

 drivers/ata/sata_highbank.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/ata/sata_highbank.c b/drivers/ata/sata_highbank.c
index e9a4f46..8b40025 100644
--- a/drivers/ata/sata_highbank.c
+++ b/drivers/ata/sata_highbank.c
@@ -479,6 +479,9 @@ static int ahci_highbank_probe(struct platform_device *pdev)
 	if (hpriv->cap & HOST_CAP_PMP)
 		pi.flags |= ATA_FLAG_PMP;
 
+	if (hpriv->cap & HOST_CAP_64)
+		dma_set_coherent_mask(dev, DMA_BIT_MASK(64));
+
 	/* CAP.NP sometimes indicate the index of the last enabled
 	 * port, at other times, that of the last possible port, so
 	 * determining the maximum port number requires looking at
-- 
1.8.1.2
[PATCH v2] mm: make lru_add_drain_all() selective
This change makes lru_add_drain_all() only selectively interrupt the cpus that have per-cpu free pages that can be drained. This is important in nohz mode where calling mlockall(), for example, otherwise will interrupt every core unnecessarily. Signed-off-by: Chris Metcalf --- Oops! In the previous version of this change I had just blindly patched it forward from a slightly older version of mm/swap.c. This version is now properly against a version of mm/swap.c that includes all the latest changes to lru_add_drain_all(). include/linux/workqueue.h | 3 +++ kernel/workqueue.c| 35 ++- mm/swap.c | 37 - 3 files changed, 65 insertions(+), 10 deletions(-) diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h index a0ed78a..71a3fe7 100644 --- a/include/linux/workqueue.h +++ b/include/linux/workqueue.h @@ -13,6 +13,8 @@ #include #include +struct cpumask; + struct workqueue_struct; struct work_struct; @@ -470,6 +472,7 @@ extern void flush_workqueue(struct workqueue_struct *wq); extern void drain_workqueue(struct workqueue_struct *wq); extern void flush_scheduled_work(void); +extern int schedule_on_cpu_mask(work_func_t func, const struct cpumask *mask); extern int schedule_on_each_cpu(work_func_t func); int execute_in_process_context(work_func_t fn, struct execute_work *); diff --git a/kernel/workqueue.c b/kernel/workqueue.c index f02c4a4..a6d1809 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -2962,17 +2962,18 @@ bool cancel_delayed_work_sync(struct delayed_work *dwork) EXPORT_SYMBOL(cancel_delayed_work_sync); /** - * schedule_on_each_cpu - execute a function synchronously on each online CPU + * schedule_on_cpu_mask - execute a function synchronously on each listed CPU * @func: the function to call + * @mask: the cpumask to invoke the function on * - * schedule_on_each_cpu() executes @func on each online CPU using the + * schedule_on_cpu_mask() executes @func on each listed CPU using the * system workqueue and blocks until all CPUs have completed. 
- * schedule_on_each_cpu() is very slow. + * schedule_on_cpu_mask() is very slow. * * RETURNS: * 0 on success, -errno on failure. */ -int schedule_on_each_cpu(work_func_t func) +int schedule_on_cpu_mask(work_func_t func, const struct cpumask *mask) { int cpu; struct work_struct __percpu *works; @@ -2981,24 +2982,40 @@ int schedule_on_each_cpu(work_func_t func) if (!works) return -ENOMEM; - get_online_cpus(); - - for_each_online_cpu(cpu) { + for_each_cpu(cpu, mask) { struct work_struct *work = per_cpu_ptr(works, cpu); INIT_WORK(work, func); schedule_work_on(cpu, work); } - for_each_online_cpu(cpu) + for_each_cpu(cpu, mask) flush_work(per_cpu_ptr(works, cpu)); - put_online_cpus(); free_percpu(works); return 0; } /** + * schedule_on_each_cpu - execute a function synchronously on each online CPU + * @func: the function to call + * + * schedule_on_each_cpu() executes @func on each online CPU using the + * system workqueue and blocks until all CPUs have completed. + * schedule_on_each_cpu() is very slow. + * + * RETURNS: + * 0 on success, -errno on failure. + */ +int schedule_on_each_cpu(work_func_t func) +{ + get_online_cpus(); + schedule_on_cpu_mask(func, cpu_online_mask); + put_online_cpus(); + return 0; +} + +/** * flush_scheduled_work - ensure that any scheduled work has run to completion. 
* * Forces execution of the kernel-global workqueue and blocks until its diff --git a/mm/swap.c b/mm/swap.c index 4a1d0d2..d4a862b 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -405,6 +405,11 @@ static void activate_page_drain(int cpu) pagevec_lru_move_fn(pvec, __activate_page, NULL); } +static bool need_activate_page_drain(int cpu) +{ + return pagevec_count(&per_cpu(activate_page_pvecs, cpu)) != 0; +} + void activate_page(struct page *page) { if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) { @@ -422,6 +427,11 @@ static inline void activate_page_drain(int cpu) { } +static bool need_activate_page_drain(int cpu) +{ + return false; +} + void activate_page(struct page *page) { struct zone *zone = page_zone(page); @@ -683,7 +693,32 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy) */ int lru_add_drain_all(void) { - return schedule_on_each_cpu(lru_add_drain_per_cpu); + cpumask_var_t mask; + int cpu, rc; + + if (!alloc_cpumask_var(&mask, GFP_KERNEL)) + return -ENOMEM; + cpumask_clear(mask); + + /* +* Figure out which cpus need flushing. It's OK if we race +* with changes to the per-cpu lru pvecs, since it's no worse +* than if we flushed all cpus, since a cpu could still end +
Re: [PATCH v2] [SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal
On Wed, Aug 7, 2013 at 7:38 AM, David Milburn wrote: > I was able to successfully test this patch overnight, I had been experimenting > with the > sg driver setting the BIO_NULL_MAPPED flag in sg_rq_end_io_usercontext for an > orphan process > which prevented the corruption, but your solution seems much better. Very cool, thanks for the testing. I actually looked at using BIO_NULL_MAPPED as well, but it seemed a bit too fragile to me -- it had the right effect of skipping __bio_copy_iov(), and skipping the __free_pages() stuff in there is OK because sg owns its pages rather than the bio layer, but all that seemed vulnerable to being broken by an unrelated change. Out of curiosity, were you already working on this bug? Because if you had fixed it a few weeks earlier we might not have spent so long wondering WTF was stomping on the memory of one of our processes :) - R.
Re: OPP: rename functions? (was [PATCH] OPP: Export opp_add())
Rafael, offline question: On 08/06/2013 09:15 AM, Rafael J. Wysocki wrote: On Tuesday, August 06, 2013 08:08:20 AM Nishanth Menon wrote: change in subject to reflect new discussion. On 05:53-20130806, Randy Dunlap wrote: On 08/03/2013 02:25 AM, Viresh Kumar wrote: +EXPORT_SYMBOL_GPL(opp_add); Could it be renamed to pm_opp_add() or power_opp_add() ? The name is a bit too unspecific IMO. Though this has nothing specific with this patch, an interesting point. git grep -w opp . showed drivers/tty/n_tty.c, drivers/sbus/char/openprom.c and arch/powerpc/kvm/mpic.c using variables named opp to mean whatever they had in context. The rest (around 40-odd files) seem to use opp as in Documentation/power/opp.txt. We could go with a pm_ prefix or even dev_pm_opp_ prefix to be more specific, though I prefer just pm_. If Rafael and others are ok, I can post a series out. Yup, that would be useful. I'm for dev_pm_opp_ if that matters. Given that there would be quite a few conflicts, do you have a suggestion around what baseline I should submit this? -- Regards, Nishanth Menon
Re: [PATCH 4/4] fuse: drop dentry on failed revalidate
On Tue, Aug 6, 2013 at 10:06 PM, Anand Avati wrote: > On 8/6/13 7:30 AM, Miklos Szeredi wrote: >> >> From: Anand Avati >> >> Drop a subtree when we find that it has moved or been deleted. This can >> be >> done as long as there are no submounts under this location. >> >> If the directory was moved and we come across the same directory in a >> future lookup it will be reconnected by d_materialise_unique(). >> >> Signed-off-by: Anand Avati >> Signed-off-by: Miklos Szeredi >> --- >> fs/fuse/dir.c | 7 ++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c >> index 131d14b..4ba5893 100644 >> --- a/fs/fuse/dir.c >> +++ b/fs/fuse/dir.c >> @@ -226,8 +226,13 @@ static int fuse_dentry_revalidate(struct dentry >> *entry, unsigned int flags) >> if (!err) { >> struct fuse_inode *fi = get_fuse_inode(inode); >> if (outarg.nodeid != get_node_id(inode)) { >> + int ret = 0; >> + >> + if (check_submounts_and_drop(entry) != 0) >> + ret = 1; >> + >> fuse_queue_forget(fc, forget, >> outarg.nodeid, 1); >> - return 0; >> + return ret; > > > If outarg.nodeid != get_node_id(inode), then we have to return 0 no matter > what (whether we successfully dropped the entry or not), no? If we return 0 in that case (we failed to invalidate the dentry), then the VFS will call d_invalidate() which will fail. The result is the same... > Or are you > trying to forcefully keep the path to reach the submount alive? If so, we > still fail in inode_permission() .. -> getattr() of the dir inode, no? Yes. But the path to the mountpoint should still be reachable (for the purpose of unmounting for example). I'm including an interesting discussion between Al and Linus about this (mailing lists weren't CC-d, but I don't think they'd mind). BTW, the issue that non-directory mountpoints are dropped by NFS and friends is not addressed by my previous patchset. Updated patches coming up. 
Thanks, Miklos Subject: [heads-up] breakage with revalidate on NFS and elsewhere --- From: Al Viro 1) In NFS ->d_revalidate() we blindly evict non-directories from dcache. So does d_invalidate(). Which will leave anything bound on the file in question unreachable. It's not a complete leak (e.g. umount -l or death of namespace will still evict those), but it's certainly a bug and one with potential for rather unhappy admin. Note that there's no reason whatsoever to do that d_drop() in case of non-directories; the only possible caller (do_revalidate(); the other call site is for directories only) will call d_invalidate(), which will drop them itself. d_invalidate() is more interesting; the minimal fix is to have it check d_mounted and if it's non-zero - grab namespace_sem, find all vfsmounts with this ->mnt_root, umount_tree() for all of those, drop namespace_sem, then release all collected vfsmounts. What's more, we probably want to extend that to directories; the same thing could be done for all children with non-zero d_mounted, killing the "has submounts" logics in NFS revalidate. It's not even hard to implement - all we need is a secondary hash chains going through vfsmounts, keyed by ->mnt_mountpoint alone. That would be enough (alternative would be to put them on a cyclic list anchored in dentry, but that'd lead to much worse memory waste since for almost all dentries the list would be empty). _However_, there's a secondary issue with d_invalidate() callers. What happens to the "case-insensitive" crap? Suppose we have something mounted on /mnt/foo/bar, with /mnt/foo/bar being on VFAT. Somebody wants to open /mnt/foo/BaR; what should that do to mountpoint? Current behaviour is a) if it's a directory, have lookup return /mnt/foo/bar, case be damned. b) if it's a non-directory, leak the vfsmount(s), return dentry with new name. IMO we should _NOT_ make any vfsmounts unreachable in that case; too obvious abuse potential. 
The only question is whether to have invalidation simply fail (i.e. case (a) for everything) or to try and flip ->mnt_mountpoint in them to the "replacement" dentry. I think that the former is the right answer. In any case, this means splitting d_invalidate() in two variants (unmounting and non-unmounting). We also need to review other __d_drop()/d_drop() users - potentially they might need the same kind of treatment ;-/ 2) NFS4 ->d_revalidate() is too bloody eager to bypass everything bypassable; as the result, if you have a something bound on top of file and attempt to open it, the damn thing will blindly try to open _underlying_ file. You either get that file opened (and nameidata_to_filp() will return it, nevermind where
Re: [PATCH 00/26] STA2X11 devicetree support for amba/pci
On 08/07/2013 03:16 AM, Alessandro Rubini wrote: > > Some of the problems he found are: > > * Passing a dtb to the kernel: we use a modified kexec at present >because x86 boot loaders can't pass the DT blob, to our knowledge. > > * Passing correct irq numbers to the AMBA drivers, because PCI MSI >irq numbers are dynamically allocated (we solved this by using >of_update_property() at runtime). We also had to register a new >irq domain for msi irqs, otherwise of_irq_map_one() would complain >about irqs lacking a corresponding domain. > > * Switching to a new gpio driver with devicetree support (we took the >Nomadik gpio/pinctrl because our device apparently has more or less >the same gpio cell as the Nomadik chip). This requires implementation >of writel_relaxed() and IRQF_VALID on x86: we hacked them internally >but the patches are not part of this set. We're willing to solve >these incompatibilities first, if there's interest. > > * Writing a suitable dts: at present, a dts only exists for one >of the STA2X11 based boards (Intel Northville). This includes a >copy of all the physical addresses for the devices, as dts requires >that, even if such addresses are automatically assigned by PCI. >Clearly, with this approach we kill PCI autodetect: if you plug >to a different slot you need a different dts. > > This got us a more or less working kernel on the Northville board > (where the device is soldered on the motherboard and acts as main chipset). > The plug-in PCIe board cannot be supported by device tree, as far as > we know, which in our opinion is a strong downside of device tree in favor > of the platform data "shit". > OK, so we have a real corner case here... which is a plugin board beyond which sits a bus normally used by fixed devices. You are definitely correct that this is something that stresses current means of description to the breaking point. 
*Note there are some questions below that I would perfectly understand if you can't talk about publicly, if so, please contact me privately at my corporate address.* However, the plugin board is very different from it being the main chipset, in no small part because you can boot without it. I think this is the first time I have ever heard of a chip which can act both as a system chipset and a plugin card. The mainboard case is relatively straightforward -- we should use ACPI 5 (preferred for x86) or device tree to describe it. My understanding from what you describe so far is that the only existing case is the Northville which is a mainboard. For the plugin case, my thinking is that we probably do need a driver of some kind which at least contains the description of the board, as I assume one is not present in any kind of firmware on the board itself (*do any such boards or plans for them actually exist at this point?*) Ideally that driver should be (primarily?) a data object (an ACPI 5 SSDT or a DTB file) rather than open coded C. I believe ACPI 5 unlike device tree should be able to specify the dynamic properties that you are rightfully concerned with. Sorry if this feels like a wild goose chase to you. Some of this problem domain is not very well handled by the current code, but we really have to draw a hard line to make sure it doesn't descend into unmaintainable chaos. We have similar issues with MinnowBoard and are trying to use that as a platform to figure out how a lot of these things need to be handled. -hpa
Re: [PATCH 06/27] drivers/i2c/busses: don't check resource with devm_ioremap_resource
On Tue, Jul 23, 2013 at 08:01:39PM +0200, Wolfram Sang wrote: > devm_ioremap_resource does sanity checks on the given resource. No need to > duplicate this in the driver. > > Signed-off-by: Wolfram Sang Applied to for-next, thanks!
Re: [patch v2 3/3] mm: page_alloc: fair zone allocator policy
On Wed, Aug 07, 2013 at 03:58:28PM +0100, Mel Gorman wrote: > On Fri, Aug 02, 2013 at 11:37:26AM -0400, Johannes Weiner wrote: > > Each zone that holds userspace pages of one workload must be aged at a > > speed proportional to the zone size. Otherwise, the time an > > individual page gets to stay in memory depends on the zone it happened > > to be allocated in. Asymmetry in the zone aging creates rather > > unpredictable aging behavior and results in the wrong pages being > > reclaimed, activated etc. > > > > But exactly this happens right now because of the way the page > > allocator and kswapd interact. The page allocator uses per-node lists > > of all zones in the system, ordered by preference, when allocating a > > new page. When the first iteration does not yield any results, kswapd > > is woken up and the allocator retries. Due to the way kswapd reclaims > > zones below the high watermark while a zone can be allocated from when > > it is above the low watermark, the allocator may keep kswapd running > > while kswapd reclaim ensures that the page allocator can keep > > allocating from the first zone in the zonelist for extended periods of > > time. Meanwhile the other zones rarely see new allocations and thus > > get aged much slower in comparison. > > > > The result is that the occasional page placed in lower zones gets > > relatively more time in memory, even gets promoted to the active list > > after its peers have long been evicted. Meanwhile, the bulk of the > > working set may be thrashing on the preferred zone even though there > > may be significant amounts of memory available in the lower zones. > > > > Even the most basic test -- repeatedly reading a file slightly bigger > > than memory -- shows how broken the zone aging is. 
In this scenario, > > no single page should be able stay in memory long enough to get > > referenced twice and activated, but activation happens in spades: > > > > $ grep active_file /proc/zoneinfo > > nr_inactive_file 0 > > nr_active_file 0 > > nr_inactive_file 0 > > nr_active_file 8 > > nr_inactive_file 1582 > > nr_active_file 11994 > > $ cat data data data data >/dev/null > > $ grep active_file /proc/zoneinfo > > nr_inactive_file 0 > > nr_active_file 70 > > nr_inactive_file 258753 > > nr_active_file 443214 > > nr_inactive_file 149793 > > nr_active_file 12021 > > > > Fix this with a very simple round robin allocator. Each zone is > > allowed a batch of allocations that is proportional to the zone's > > size, after which it is treated as full. The batch counters are reset > > when all zones have been tried and the allocator enters the slowpath > > and kicks off kswapd reclaim. Allocation and reclaim is now fairly > > spread out to all available/allowable zones: > > > > $ grep active_file /proc/zoneinfo > > nr_inactive_file 0 > > nr_active_file 0 > > nr_inactive_file 174 > > nr_active_file 4865 > > nr_inactive_file 53 > > nr_active_file 860 > > $ cat data data data data >/dev/null > > $ grep active_file /proc/zoneinfo > > nr_inactive_file 0 > > nr_active_file 0 > > nr_inactive_file 22 > > nr_active_file 4988 > > nr_inactive_file 190969 > > nr_active_file 937 > > > > When zone_reclaim_mode is enabled, allocations will now spread out to > > all zones on the local node, not just the first preferred zone (which > > on a 4G node might be a tiny Normal zone). 
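The round-robin scheme described in the commit message above can be sketched in userspace roughly as follows. This is a toy model under stated assumptions, not the patch's code: the toy_ names, the zone sizes and the `divisor` used to derive a batch from the zone size are all hypothetical, chosen only to show that giving each zone a batch proportional to its size, and resetting the batches once every zone is exhausted, spreads allocations across zones in proportion to their size.

```c
#include <assert.h>
#include <stddef.h>

struct toy_zone {
	long size;         /* pages managed by the zone */
	long alloc_batch;  /* allocations left before the zone counts as full */
	long allocated;    /* pages handed out so far (for inspection only) */
};

/* Refill every zone's batch in proportion to its size. */
static void toy_reset_batches(struct toy_zone *zones, size_t n, long divisor)
{
	assert(divisor > 0);
	for (size_t i = 0; i < n; i++)
		zones[i].alloc_batch = zones[i].size / divisor;
}

/*
 * Walk the zones in preference order and take a page from the first one
 * that still has batch left. If all batches are used up, reset them once
 * (the stand-in for entering the slowpath) and retry.
 * Returns the index of the zone used, or -1 if no zone can ever allocate.
 */
static int toy_alloc_page(struct toy_zone *zones, size_t n, long divisor)
{
	for (int pass = 0; pass < 2; pass++) {
		for (size_t i = 0; i < n; i++) {
			if (zones[i].alloc_batch > 0) {
				zones[i].alloc_batch--;
				zones[i].allocated++;
				return (int)i;
			}
		}
		toy_reset_batches(zones, n, divisor);
	}
	return -1;
}
```

With two zones of 300 and 100 pages and a divisor of 100, every round hands out three pages from the first zone and one from the second, so both zones fill (and would age) at the same relative rate instead of the first zone absorbing everything.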
> > > > Signed-off-by: Johannes Weiner > > Tested-by: Zlatko Calusic > > --- > > include/linux/mmzone.h | 1 + > > mm/page_alloc.c| 69 > > ++ > > 2 files changed, 60 insertions(+), 10 deletions(-) > > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index af4a3b7..dcad2ab 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -352,6 +352,7 @@ struct zone { > > * free areas of different sizes > > */ > > spinlock_t lock; > > + int alloc_batch; > > int all_unreclaimable; /* All pages pinned */ > > #if defined CONFIG_COMPACTION || defined CONFIG_CMA > > /* Set to true when the PG_migrate_skip bits should be cleared */ > > This adds a dirty cache line that is updated on every allocation even if > it's from the per-cpu allocator. I am concerned that this will introduce > noticeable overhead in the allocator paths on large machines running > allocator intensive workloads. > > Would it be possible to move it into the per-cpu pageset? I understand > that the round-robin nature will then depend on what CPU is running and > the performance characteristics will be different. There might even be an > adverse workload that uses all the batches from all available CPUs until > it is essentially the same problem but that would be a very worst case. > I would hope that in general
Re: [PATCH 02/12] drivers/i2c/busses: don't use devm_pinctrl_get_select_default() in probe
On Wed, Jul 10, 2013 at 04:57:37PM +0100, Wolfram Sang wrote: > Since commit ab78029 (drivers/pinctrl: grab default handles from device core), > we can rely on device core for setting the default pins. Compile tested only. > > Acked-by: Linus Walleij (personally at LCE13) > Signed-off-by: Wolfram Sang Applied to for-next, thanks!
Re: WARNING: CPU: 26 PID: 93793 at fs/ext4/inode.c:230 ext4_evict_inode+0x4c9/0x500 [ext4]() still in 3.11-rc3
On Wed 07-08-13 08:27:32, Guenter Roeck wrote: > On 08/07/2013 08:20 AM, Jan Kara wrote: > >On Thu 01-08-13 20:58:46, Davidlohr Bueso wrote: > >>On Thu, 2013-08-01 at 22:33 +0200, Jan Kara wrote: > >>> Hi, > >>> > >>>On Thu 01-08-13 13:14:19, Davidlohr Bueso wrote: > FYI I'm seeing loads of the following messages with Linus' latest > 3.11-rc3 (which includes 822dbba33458cd6ad) > >>> Thanks for notice. I see you are running reaim to trigger this. What > >>>workload? > >> > >>After re-running the workloads one by one, I finally hit the issue again > >>with 'dbase'. FWIW I'm using ramdisks + ext4. > > Hum, I'm not able to reproduce this with current Linus' kernel - commit > >e4ef108fcde0b97ed38923ba1ea06c7a152bab9e - I've tried with ramdisk but no > >luck. Are you using some special mount options? > > > I don't see this commit in the upstream kernel ? It is Linus's merge of Tejun's libata fix from Tuesday... > I tried reproducing the problem on the same system I had seen > 822dbba33458cd6ad on, > with the same workload. It has now been running since last Friday, but I have > not seen any problems. Ah, OK, so it may be fixed after all. If you happen to see it again, please let me know. Thanks! Honza -- Jan Kara SUSE Labs, CR
Re: unused swap offset / bad page map.
On Wed, Aug 07, 2013 at 06:04:20PM +0800, Hillf Danton wrote:
> > There were a slew of these. same trace, different addr/anon_vma/index.
> > mapping always null.
> >
> Would you please run again with the debug info added?
> ---
> --- a/mm/swapfile.c Wed Aug 7 17:27:22 2013
> +++ b/mm/swapfile.c Wed Aug 7 17:57:20 2013
> @@ -509,6 +509,7 @@ static struct swap_info_struct *swap_inf
>  {
>  	struct swap_info_struct *p;
>  	unsigned long offset, type;
> +	int race = 0;
>
>  	if (!entry.val)
>  		goto out;
> @@ -524,10 +525,17 @@ static struct swap_info_struct *swap_inf
>  	if (!p->swap_map[offset])
>  		goto bad_free;
>  	spin_lock(&p->lock);
> +	if (!p->swap_map[offset]) {
> +		race = 1;
> +		spin_unlock(&p->lock);
> +		goto bad_free;
> +	}
>  	return p;
>
>  bad_free:
>  	printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val);
> +	if (race)
> +		printk(KERN_ERR "but due to race\n");
>  	goto out;
>  bad_offset:
>  	printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val);
> --

printk didn't trigger.

This time around the oom killer was going off at the same time.
I'm wondering if we have some allocations somewhere in the swap code
that don't handle failure correctly.

	Dave
Re: linux-next: build failure after merge of the ext4 tree
Stephen Rothwell writes:

> Hi Sedat,
>
> On Wed, 7 Aug 2013 07:16:57 +0200 Sedat Dilek wrote:
>>
>> On Mon, Jul 29, 2013 at 3:08 AM, Stephen Rothwell wrote:
>> >
>> > After merging the ext4 tree, today's linux-next build (powerpc
>> > ppc64_defconfig) failed like this:
>> >
>> > fs/ext4/ialloc.c: In function '__ext4_new_inode':
>> > fs/ext4/ialloc.c:817:1: warning: label 'next_ino' defined but not used [-Wunused-label]
>> >  next_ino:
>> >  ^
>> > fs/ext4/ialloc.c:792:4: error: label 'next_inode' used but not defined
>> >     goto next_inode;
>> >     ^
>> >
>> > Hmm ...
>> >
>> > Caused by commit 4a8603ef197a ("ext4: avoid reusing recently deleted
>> > inodes in no journal mode").
>> >
>> > I have used the ext4 tree from next-20130726 for today.
>>
>> Since this message ext4-tree was not updated.
>> The commit "ext4: avoid reusing recently deleted inodes in no journal
>> mode" was refreshed and has a different commit-id.
>> Did you test with this one? You still see the breakage?
>
> Today's linux-next does not have this build failure.

However, this same commit does introduce a new build failure (not
present in next-20130806) when ext4 is built as a module:

ERROR: "dirty_expire_interval" [fs/ext4/ext4.ko] undefined!
make[3]: *** [__modpost] Error 1
make[2]: *** [modules] Error 2

The change below fixes the problem. Found when building the
mv78xx0_defconfig on ARM.

Kevin

8<--
>From 8bd2e08124d9b298f42a0e0c3a7584ba285f Mon Sep 17 00:00:00 2001
From: Kevin Hilman
Date: Wed, 7 Aug 2013 08:17:43 -0700
Subject: [PATCH] mm: page-writeback: export dirty_expire_interval, used by ext4

commit 533ec0ed (ext4: avoid reusing recently deleted inodes in no
journal mode) started using dirty_expire_interval, which is not
available to modules. Make it available to modules.

Cc: "Theodore Ts'o"
Signed-off-by: Kevin Hilman
---
 mm/page-writeback.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d374b29..c8b61ef 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -104,6 +104,8 @@ EXPORT_SYMBOL_GPL(dirty_writeback_interval);
  */
 unsigned int dirty_expire_interval = 30 * 100; /* centiseconds */
 
+EXPORT_SYMBOL_GPL(dirty_expire_interval);
+
 /*
  * Flag that makes the machine dump writes/reads and block dirtyings.
  */
-- 
1.8.3
Re: [3.10.4] NFS locking panic, plus persisting NFS shutdown panic from 3.9.*
On Wed, 2013-08-07 at 11:18 +0100, Nix wrote:
> On 6 Aug 2013, Trond Myklebust verbalised:
> > True. How about something like the following instead. Note the change to
> > the original patch...
>
> Well, with those applied I could reboot without a panic for the first
> time since 3.8.x: looking good. I'll give it a reboot or two with a
> system that's not hot from booting though.
>

Could you please also try applying only the 1/2 patch, to see if that
suffices to quell the shutdown panic?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com
Re: WARNING: CPU: 26 PID: 93793 at fs/ext4/inode.c:230 ext4_evict_inode+0x4c9/0x500 [ext4]() still in 3.11-rc3
On 08/07/2013 08:20 AM, Jan Kara wrote: On Thu 01-08-13 20:58:46, Davidlohr Bueso wrote: On Thu, 2013-08-01 at 22:33 +0200, Jan Kara wrote: Hi, On Thu 01-08-13 13:14:19, Davidlohr Bueso wrote: FYI I'm seeing loads of the following messages with Linus' latest 3.11-rc3 (which includes 822dbba33458cd6ad) Thanks for notice. I see you are running reaim to trigger this. What workload? After re-running the workloads one by one, I finally hit the issue again with 'dbase'. FWIW I'm using ramdisks + ext4. Hum, I'm not able to reproduce this with current Linus' kernel - commit e4ef108fcde0b97ed38923ba1ea06c7a152bab9e - I've tried with ramdisk but no luck. Are you using some special mount options? I don't see this commit in the upstream kernel ? I tried reproducing the problem on the same system I had seen 822dbba33458cd6ad on, with the same workload. It has now been running since last Friday, but I have not seen any problems. Guenter Honza [ cut here ] WARNING: CPU: 26 PID: 93793 at fs/ext4/inode.c:230 ext4_evict_inode+0x4c9/0x500 [ext4]() Modules linked in: autofs4 cpufreq_ondemand freq_table sunrpc 8021q garp stp llc pcc_cpufreq ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log dm_mod uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr sg lpc_ich mfd_core hpilo hpwdt i7core_edac edac_core netxen_nic mperf ext4 jbd2 mbcache sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core [last unloaded: freq_table] CPU: 26 PID: 93793 Comm: reaim Tainted: GW3.11.0-rc3+ #1 Hardware name: HP ProLiant DL980 G7, BIOS P66 06/24/2011 00e6 8985db603d78 8153ce4d 00e6 8985db603db8 8104cf1c 8985db603dc8 8b05c485b8b0 8b05c485b9b8 8b05c485b800 ff9c Call Trace: [] dump_stack+0x49/0x5c [] 
warn_slowpath_common+0x8c/0xc0 [] warn_slowpath_null+0x1a/0x20 [] ext4_evict_inode+0x4c9/0x500 [ext4] [] evict+0xa7/0x1c0 [] iput_final+0xe3/0x170 [] iput+0x3e/0x50 [] do_unlinkat+0x1c6/0x280 [] ? task_work_run+0x94/0xf0 [] ? do_notify_resume+0x84/0x90 [] SyS_unlink+0x16/0x20 [] system_call_fastpath+0x16/0x1b ---[ end trace 15e812809616488b ]--- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 5/5] arm: omap: Proper cleanups for omap_device
On 07.08.2013 07:52, Greg Kroah-Hartman wrote:
> On Tue, Aug 06, 2013 at 03:37:13PM +0200, Alexander Holler wrote:
>> On 06.08.2013 12:14, Greg Kroah-Hartman wrote:
>>
>>> What exactly is a platform device anyway?
>>>
>>> Originally it was a "something that wasn't connected to a bus, but just
>>> had memory-mapped i/o." Like the PS2 keyboard controller.
>>>
>>> Embedded systems got ahold of this and went to town, and made everything
>>> a platform device because they could, and no one was paying attention.
>>> Then OF came along and used it as well, and you know the rest...
>>>
>>> I think we need to get the ACPI and OF people, and me, in a room
>>> together at the kernel summit and not let us out until we have this all
>>> worked out.
>>
>> MFD uses platform devices too.
>
> Ugh, I've been avoiding looking at mfd for a long time now, and really
> don't want to start now...

I've just mentioned it to suggest that platform devices seem to be used
all over the kernel as the generic (minimal) form of a device driver. At
least that is the impression I've got.

Regards,

Alexander Holler
Re: perf,arm -- oops in validate_event
On Wed, 7 Aug 2013, Vince Weaver wrote:

> On Wed, 7 Aug 2013, Will Deacon wrote:
>
> > Ok, so the following quick hack below should solve the issue (can you
> > confirm it please, since I don't have access to any hardware atm?)
> >
> > We should revisit this for 3.12 though, because I'm not sure that our
> > validation code even does the right thing when there are multiple PMUs
> > involved.
> >
> > --->8
> >
> > diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
> > index d9f5cd4..0500f10b 100644
> > --- a/arch/arm/kernel/perf_event.c
> > +++ b/arch/arm/kernel/perf_event.c
> > @@ -253,6 +253,9 @@ validate_event(struct pmu_hw_events *hw_events,
> >  	struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> >  	struct pmu *leader_pmu = event->group_leader->pmu;
> >
> > +	if (is_software_event(event))
> > +		return 1;
> > +
> >  	if (event->pmu != leader_pmu || event->state < PERF_EVENT_STATE_OFF)
> >  		return 1;
>
> this isn't enough. You can also trigger the oops by using
> tracepoint or breakpoint events as group leaders in addition to software
> events.

I take that back, it turns out tracepoint and breakpoint both have
task_ctx_nr set to perf_sw_context (although breakpoint has a comment
saying this may change in the future).

Let me compile and verify the fix. It may take some time for the compile
to finish as it's not a very fast machine.

Vince
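[Editor's sketch] To make the point about tracepoint and breakpoint leaders concrete, here is a small userspace model of the early-exit logic in the quoted hunk. It is not kernel code: the struct and enum names are hypothetical stand-ins, and the definition of the software-event test (PMU context equals the software context) is an assumption taken from the remark above about task_ctx_nr. The value -1 for the OFF state is likewise an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel definitions; only what the sketch needs. */
enum perf_ctx_model { MODEL_HW_CONTEXT, MODEL_SW_CONTEXT };
#define MODEL_STATE_OFF (-1)	/* assumed value, for illustration only */

struct pmu_model {
	enum perf_ctx_model task_ctx_nr;
};

struct event_model {
	struct pmu_model *pmu;
	struct event_model *group_leader;
	int state;
};

/* An event counts as "software" when its PMU uses the software context. */
bool model_is_software_event(const struct event_model *event)
{
	return event->pmu->task_ctx_nr == MODEL_SW_CONTEXT;
}

/*
 * Returns 1 when validation can stop early, 0 when the ARM-specific
 * path (which assumes an ARM hardware PMU) would be entered.
 */
int model_validate_early_exit(const struct event_model *event)
{
	const struct pmu_model *leader_pmu = event->group_leader->pmu;

	/* The added check: covers every PMU that uses the software context. */
	if (model_is_software_event(event))
		return 1;

	if (event->pmu != leader_pmu || event->state < MODEL_STATE_OFF)
		return 1;

	return 0;
}
```

In this model a group leader whose PMU uses the software context never reaches the hardware path, whatever kind of event it is. Without the first check, a leader always compares equal to its own leader PMU and would fall through.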
[PATCH] gpio: adnp: Fix segfault if request_threaded_irq fails
From: Lars Poeschel In case request_threaded_irq inside adnp_irq_setup fails, the driver segfaults. This is because irq_domain_remove is called twice with the same pointer. First time in adnp_irq_setup and then a second time after leaving adnp_irq_setup in the error path of adnp_i2c_probe inside adnp_teardown. This fixes this by removing the call to irq_domain_remove from adnp_irq_setup. Signed-off-by: Lars Poeschel --- drivers/gpio/gpio-adnp.c |6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/drivers/gpio/gpio-adnp.c b/drivers/gpio/gpio-adnp.c index e60567f..c0f3fc4 100644 --- a/drivers/gpio/gpio-adnp.c +++ b/drivers/gpio/gpio-adnp.c @@ -490,15 +490,11 @@ static int adnp_irq_setup(struct adnp *adnp) if (err != 0) { dev_err(chip->dev, "can't request IRQ#%d: %d\n", adnp->client->irq, err); - goto error; + return err; } chip->to_irq = adnp_gpio_to_irq; return 0; - -error: - irq_domain_remove(adnp->domain); - return err; } static void adnp_irq_teardown(struct adnp *adnp) -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
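[Editor's sketch] The ownership rule behind this fix can be modelled in userspace. The code below is only an illustration under stated assumptions: all names are hypothetical, the "domain" is a plain allocation rather than an IRQ domain, and the model assumes the setup function is the one that creates the domain. It shows the post-patch shape, where the setup function reports failure and the probe error path is the single place that removes the domain.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the driver state; not the real struct adnp. */
struct adnp_model {
	void *domain;		/* stands in for the IRQ domain */
	int remove_calls;	/* how many times the domain was removed */
};

/* Stands in for irq_domain_remove(): must run exactly once per domain. */
void model_domain_remove(struct adnp_model *a)
{
	free(a->domain);
	a->domain = NULL;
	a->remove_calls++;
}

/*
 * Stands in for adnp_irq_setup() after the patch: on failure it only
 * returns the error and leaves the domain for the caller to remove.
 */
int model_irq_setup(struct adnp_model *a, int irq_request_ok)
{
	a->domain = malloc(1);
	if (!a->domain)
		return -1;
	if (!irq_request_ok)
		return -1;
	return 0;
}

/* Stands in for the probe error path, the single owner of the cleanup. */
int model_probe(struct adnp_model *a, int irq_request_ok)
{
	int err = model_irq_setup(a, irq_request_ok);

	if (err) {
		model_domain_remove(a);
		return err;
	}
	return 0;
}
```

With the pre-patch shape, the setup function would also call the remove helper before returning, and remove_calls would reach 2 on the failure path. That second call is the double removal the commit message describes.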
Re: WARNING: CPU: 26 PID: 93793 at fs/ext4/inode.c:230 ext4_evict_inode+0x4c9/0x500 [ext4]() still in 3.11-rc3
On Thu 01-08-13 20:58:46, Davidlohr Bueso wrote: > On Thu, 2013-08-01 at 22:33 +0200, Jan Kara wrote: > > Hi, > > > > On Thu 01-08-13 13:14:19, Davidlohr Bueso wrote: > > > FYI I'm seeing loads of the following messages with Linus' latest > > > 3.11-rc3 (which includes 822dbba33458cd6ad) > > Thanks for notice. I see you are running reaim to trigger this. What > > workload? > > After re-running the workloads one by one, I finally hit the issue again > with 'dbase'. FWIW I'm using ramdisks + ext4. Hum, I'm not able to reproduce this with current Linus' kernel - commit e4ef108fcde0b97ed38923ba1ea06c7a152bab9e - I've tried with ramdisk but no luck. Are you using some special mount options? Honza > > > > > [ cut here ] > > > WARNING: CPU: 26 PID: 93793 at fs/ext4/inode.c:230 > > > ext4_evict_inode+0x4c9/0x500 [ext4]() > > > Modules linked in: autofs4 cpufreq_ondemand freq_table sunrpc 8021q garp > > > stp llc pcc_cpufreq ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 > > > iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 > > > xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror > > > dm_region_hash dm_log dm_mod uinput iTCO_wdt iTCO_vendor_support coretemp > > > kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode pcspkr sg > > > lpc_ich mfd_core hpilo hpwdt i7core_edac edac_core netxen_nic mperf ext4 > > > jbd2 mbcache sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw > > > gf128mul glue_helper aes_x86_64 hpsa radeon ttm drm_kms_helper drm > > > i2c_algo_bit i2c_core [last unloaded: freq_table] > > > CPU: 26 PID: 93793 Comm: reaim Tainted: GW3.11.0-rc3+ #1 > > > Hardware name: HP ProLiant DL980 G7, BIOS P66 06/24/2011 > > > 00e6 8985db603d78 8153ce4d 00e6 > > > 8985db603db8 8104cf1c 8985db603dc8 > > > 8b05c485b8b0 8b05c485b9b8 8b05c485b800 ff9c > > > Call Trace: > > > [] dump_stack+0x49/0x5c > > > [] warn_slowpath_common+0x8c/0xc0 > > > [] warn_slowpath_null+0x1a/0x20 > > > [] ext4_evict_inode+0x4c9/0x500 [ext4] > > > [] 
evict+0xa7/0x1c0 > > > [] iput_final+0xe3/0x170 > > > [] iput+0x3e/0x50 > > > [] do_unlinkat+0x1c6/0x280 > > > [] ? task_work_run+0x94/0xf0 > > > [] ? do_notify_resume+0x84/0x90 > > > [] SyS_unlink+0x16/0x20 > > > [] system_call_fastpath+0x16/0x1b > > > ---[ end trace 15e812809616488b ]--- > > > > > > > > -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: perf,arm -- oops in validate_event
On Wed, 7 Aug 2013, Will Deacon wrote: > Ok, so the following quick hack below should solve the issue (can you confirm > it please, since I don't have access to any hardware atm?) > > We should revisit this for 3.12 though, because I'm not sure that our > validation code even does the right thing when there are multiple PMUs > involved. > > --->8 > > diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c > index d9f5cd4..0500f10b 100644 > --- a/arch/arm/kernel/perf_event.c > +++ b/arch/arm/kernel/perf_event.c > @@ -253,6 +253,9 @@ validate_event(struct pmu_hw_events *hw_events, > struct arm_pmu *armpmu = to_arm_pmu(event->pmu); > struct pmu *leader_pmu = event->group_leader->pmu; > > + if (is_software_event(event)) > + return 1; > + > if (event->pmu != leader_pmu || event->state < PERF_EVENT_STATE_OFF) > return 1; this isn't enough. You can also trigger the oops by using tracepoint or breakpoint events as group leaders in addition to software events. Vince -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8] dmaengine: Add MOXA ART DMA engine driver
On Tue, Aug 06, 2013 at 01:38:31PM +0100, Jonas Jensen wrote: > The MOXA ART SoC has a DMA controller capable of offloading expensive > memory operations, such as large copies. This patch adds support for > the controller including four channels. Two of these are used to > handle MMC copy on the UC-7112-LX hardware. The remaining two can be > used in a future audio driver or client application. > > Signed-off-by: Jonas Jensen > --- > > Notes: > Add test dummy DMA channels to MMC, prove the controller > has support for interchangeable channel numbers [0]. > > Add new filter data struct, store dma_spec passed in xlate, > similar to proposed patch for omap/edma [1][2]. > > [0] > https://bitbucket.org/Kasreyn/linux-next/commits/2f17ac38c5d3af49bc0c559c429a351ddd40063d > [1] https://lkml.org/lkml/2013/8/1/750 "[PATCH] DMA: let filter > functions of of_dma_simple_xlate possible check of_node" > [2] https://lkml.org/lkml/2013/3/11/203 "A proposal to check the device > in generic way" > > Changes since v7: > > 1. remove unnecessary loop in moxart_alloc_chan_resources() > 2. remove unnecessary status check in moxart_tx_status() > 3. check/handle dma_async_device_register() return value > 4. check/handle devm_request_irq() return value > 5. add and use filter data struct > 6. check if channel device is the same as passed to >of_dma_controller_register() > 7. add check if chan->device->dev->of_node is the same as >dma_spec->np (xlate) > 8. support interchangeable channels, #dma-cells is now <1> > > device tree bindings document: > 9. 
update description and example, change "#dma-cells" to "<1>" > > Applies to next-20130806 > > .../devicetree/bindings/dma/moxa,moxart-dma.txt| 19 + > drivers/dma/Kconfig| 7 + > drivers/dma/Makefile | 1 + > drivers/dma/moxart-dma.c | 614 > + > 4 files changed, 641 insertions(+) > create mode 100644 Documentation/devicetree/bindings/dma/moxa,moxart-dma.txt > create mode 100644 drivers/dma/moxart-dma.c > > diff --git a/Documentation/devicetree/bindings/dma/moxa,moxart-dma.txt > b/Documentation/devicetree/bindings/dma/moxa,moxart-dma.txt > new file mode 100644 > index 000..69e7001 > --- /dev/null > +++ b/Documentation/devicetree/bindings/dma/moxa,moxart-dma.txt > @@ -0,0 +1,19 @@ > +MOXA ART DMA Controller > + > +See dma.txt first > + > +Required properties: > + > +- compatible : Must be "moxa,moxart-dma" > +- reg :Should contain registers location and length > +- interrupts : Should contain the interrupt number > +- #dma-cells : Should be 1, a single cell holding a line request number > + > +Example: > + > + dma: dma@9050 { > + compatible = "moxa,moxart-dma"; > + reg = <0x9050 0x1000>; > + interrupts = <24 0>; > + #dma-cells = <1>; > + }; The binding looks sensible to me now, but I have a couple of (hopefully final) questions on the probe failure path. [...] > + > + ret = dma_async_device_register(>dma_slave); > + if (ret) { > + dev_err(dev, "dma_async_device_register failed\n"); > + return ret; > + } > + > + ret = of_dma_controller_register(node, moxart_of_xlate, mdc); > + if (ret) { > + dev_err(dev, "of_dma_controller_register failed\n"); > + dma_async_device_unregister(>dma_slave); > + return ret; > + } > + > + platform_set_drvdata(pdev, mdc); > + > + tasklet_init(>tasklet, moxart_dma_tasklet, (unsigned long)mdc); > + > + ret = devm_request_irq(dev, irq, moxart_dma_interrupt, 0, > + "moxart-dma-engine", mdc); > + if (ret) { > + dev_err(dev, "devm_request_irq failed\n"); Do you not need calls to of_dma_controller_free and dma_async_device_unregister here? 
I'm not all that familiar with the DMA API, so maybe you don't.

> +		return ret;
> +	}
> +
> +	dev_dbg(dev, "%s: IRQ=%u\n", __func__, irq);
> +
> +	return 0;
> +}
> +
> +static int moxart_remove(struct platform_device *pdev)
> +{
> +	struct moxart_dma_container *m = dev_get_drvdata(&pdev->dev);

Similarly, do you not need to call of_dma_controller_free here?

> +	dma_async_device_unregister(&m->dma_slave);
> +	return 0;
> +}

Thanks,
Mark.
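[Editor's sketch] The review question is about unwinding earlier registrations when a later probe step fails. Below is a userspace model of the usual goto-based unwind, with the three steps reduced to flags. Everything here is hypothetical illustration: whether the real driver needs both calls on the IRQ failure path is exactly what the review asks, and the sketch does not answer that.

```c
#include <stdbool.h>

/* Hypothetical record of which setup steps are currently in effect. */
struct dma_probe_model {
	bool dma_registered;	/* dma_async_device_register() done */
	bool of_registered;	/* of_dma_controller_register() done */
	bool irq_requested;	/* devm_request_irq() done */
};

/*
 * fail_at selects the step that fails: 1 = DMA device registration,
 * 2 = OF controller registration, 3 = IRQ request, 0 = none.
 */
int dma_model_probe(struct dma_probe_model *s, int fail_at)
{
	int ret;

	ret = (fail_at == 1) ? -1 : 0;
	if (ret)
		return ret;
	s->dma_registered = true;

	ret = (fail_at == 2) ? -1 : 0;
	if (ret)
		goto err_unregister_dma;
	s->of_registered = true;

	ret = (fail_at == 3) ? -1 : 0;
	if (ret)
		goto err_free_of;
	s->irq_requested = true;

	return 0;

err_free_of:
	s->of_registered = false;	/* where of_dma_controller_free() would go */
err_unregister_dma:
	s->dma_registered = false;	/* where dma_async_device_unregister() would go */
	return ret;
}
```

The labels are ordered so that each failure point undoes exactly the steps completed before it, in reverse order. A remove function would run the same teardown from the top.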
Re: [PATCH v2 1/1 resend] i2c: rcar: modify I2C driver
On Mon, Aug 05, 2013 at 04:19:34PM +0900, Nguyen Viet Dung wrote:
> This patch modify I2C driver of rcar-H1 to usable on both rcar-H1 and rcar-H2.
>
> Signed-off-by: Nguyen Viet Dung

Isn't it possible to distinguish between H1 and H2 somewhere in hardware?
Then we could skip the 'flags' variable in pdata.

Thanks,

   Wolfram
Re: [PATCH] ARM: dts: am33xx: Correct gpio #interrupt-cells property
On Wednesday 07 August 2013 at 16:53:09, Mark Rutland wrote: > On Wed, Aug 07, 2013 at 12:06:32PM +0100, Lars Poeschel wrote: > > From: Lars Poeschel > > > > Following commit ff5c9059 and therefore other omap platforms using > > the gpio-omap driver correct the #interrupt-cells property on am33xx > > too. The omap gpio binding documentaion also states that > > the #interrupt-cells property should be 2. > > I take it there are no device nodes for which any of these nodes are an > interrupt parent (which would need to be updated)? As far as I know: No. Lars > If so: > > Acked-by: Mark Rutland > > Thanks, > Mark. > > > Signed-off-by: Lars Poeschel > > --- > > > > arch/arm/boot/dts/am33xx.dtsi |8 > > 1 file changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/arch/arm/boot/dts/am33xx.dtsi > > b/arch/arm/boot/dts/am33xx.dtsi index 38b446b..033c5dd 100644 > > --- a/arch/arm/boot/dts/am33xx.dtsi > > +++ b/arch/arm/boot/dts/am33xx.dtsi > > @@ -102,7 +102,7 @@ > > > > gpio-controller; > > #gpio-cells = <2>; > > interrupt-controller; > > > > - #interrupt-cells = <1>; > > + #interrupt-cells = <2>; > > > > reg = <0x44e07000 0x1000>; > > interrupts = <96>; > > > > }; > > > > @@ -113,7 +113,7 @@ > > > > gpio-controller; > > #gpio-cells = <2>; > > interrupt-controller; > > > > - #interrupt-cells = <1>; > > + #interrupt-cells = <2>; > > > > reg = <0x4804c000 0x1000>; > > interrupts = <98>; > > > > }; > > > > @@ -124,7 +124,7 @@ > > > > gpio-controller; > > #gpio-cells = <2>; > > interrupt-controller; > > > > - #interrupt-cells = <1>; > > + #interrupt-cells = <2>; > > > > reg = <0x481ac000 0x1000>; > > interrupts = <32>; > > > > }; > > > > @@ -135,7 +135,7 @@ > > > > gpio-controller; > > #gpio-cells = <2>; > > interrupt-controller; > > > > - #interrupt-cells = <1>; > > + #interrupt-cells = <2>; > > > > reg = <0x481ae000 0x1000>; > > interrupts = <62>; > > > > }; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to 
majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: List corruption in hidraw_release in 3.11-rc4
Hi Peter, The patch I posted was solving slab memory corruption issue which was occurring because of the race in device disconnect and device release. We found the some of the device data structure being used after free. Later we figure out the patch which was reverted earlier was solving our issue but there was still some slab memory corruption. That was due to reason that list delete of the device was called after freeing the hidraw. I protect drop_ref by mutex lock and also delete the list before calling drop_ref that solve the issue. If you are seeing memory corruption then the patch could solve your issue. Regards -Manoj -Original Message- From: Jiri Kosina [mailto:jkos...@suse.cz] Sent: Wednesday, August 07, 2013 7:04 PM To: Peter Wu Cc: linux-in...@vger.kernel.org; Manoj Chourasia; linux-kernel@vger.kernel.org; alno...@suse.cz Subject: Re: List corruption in hidraw_release in 3.11-rc4 On Wed, 7 Aug 2013, Peter Wu wrote: > > does the patch below fix the problem you are seeing? > That one is already in 3.11-rc4 as far as I can see. Also, that code > can probably simplified by moving the mutex_unlock after the out > label, removing the need to duplicate the mutex_unlock. > > Remember what I said about "no Oopses"? Well, it turned out that > several memory structures were damaged which causes a general > protection fault in sock_alloc_inode and other places. > > I managed to create a program that can reproduce this bug 100% in a > QEMU virtual machine with a Logitech USB receiver passed to it. > > qemu-system-x86_64 -enable-kvm -m 1G -usb -usbdevice host:046d:c52b > (pass -kernel, -initrd, -append as needed) > > Copy hidraw-test to initrd, boot QEMU and run `hidraw-test`. Result: > instant (= +/- 2 seconds) crash. > > I have applied Manoj's patch[1] on top of 3.11-rc4 which seem to fix the > issue. > One observation is that the new device is named /dev/hidraw1 instead > of /dev/hidraw0. 
Example: > > f(){ hidraw-test /dev/hidraw$1 usb1;} > # needed for 3.11-rc4 > f 1; f 1 # crash > # needed for 3.11-rc4 + patch > f 1; f 2 # ok > > Regards, > Peter > > [1]: http://lkml.org/lkml/2013/7/22/248 That one I am still reviewing ... can I add your Tested-by: to it when I'll be applying it and pushing to Linus? Thanks. > -- > /* cc hidraw-test.c -o hidraw-test > * hidraw-test /dev/hidraw0 usb1; hidraw-test /dev/hidraw0 usb1; */ > #include #include #include #include > #include #include > > int open_and_write(const char *path, const char *data) { > int sfd, r; > > sfd = open(path, O_WRONLY); > if (sfd < 0) { > perror(path); > return 1; > } > > r = write(sfd, data, strlen(data)); > if (r < 0) { > fprintf(stderr, "write(%s, %s): %s\n", > path, data, strerror(errno)); > return 1; > } > close(sfd); > return 0; > } > > int dork(const char *hiddev, const char *name) { > int fd; > char c; > > fd = open(hiddev, O_RDWR | O_NONBLOCK); > if (fd < 0) { > perror("open"); > return 1; > } > > if (open_and_write("/sys/bus/usb/drivers/usb/unbind", name)) > return 1; > > // does not make a difference > //sleep(1); > > if (open_and_write("/sys/bus/usb/drivers/usb/bind", name)) > return 1; > > // allow devices to get discovered > sleep(1); > > printf("read() = %zi\n", read(fd, , 1)); perror("read"); > close(fd); > return 0; > } > > int main(int argc, char **argv) { > if (argc < 3) { > fprintf(stderr, "Usage: %s /dev/hidrawN usbN\n", *argv); > return 1; > } > > system("modprobe -v usbhid"); > system("modprobe -v hid-logitech-dj"); > > dork(argv[1], argv[2]); > > return 0; > } > -- Jiri Kosina SUSE Labs -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] i2c: add sanity check to i2c_put_adapter
On Thu, Aug 01, 2013 at 02:10:46PM +0200, Sebastian Hesselbarth wrote:
> i2c_put_adapter dereferences i2c_adapter pointer passed without check
> for NULL. This adds a check for non-NULL pointer to allow i2c_put_adapter
> called with NULL and behave the same way i2c_release_client does already.
>
> Signed-off-by: Sebastian Hesselbarth

Applied to for-next, thanks!

Please describe the use case next time in the patch description. The
current text describes more what is changed, not why. You did that later
("easier probing").
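[Editor's sketch] The change described in the quoted patch is a NULL-tolerant release helper. A minimal userspace model follows, with hypothetical types standing in for the adapter and its owner. The real function body is not shown in this message, so the reference-count decrement is only an assumed placeholder for whatever the helper releases.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the adapter and the owner it pins. */
struct owner_model {
	int refcount;
};

struct adapter_model {
	struct owner_model *owner;
};

/* Drops the reference taken at lookup time; a NULL adapter is a no-op. */
void model_put_adapter(struct adapter_model *adap)
{
	if (!adap)
		return;

	adap->owner->refcount--;
}
```

The benefit is on the caller side: a probe error path can call the put helper unconditionally, whether or not the earlier lookup succeeded.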
Re: [RFC] gcc feature request: Moving blocks into sections
On Wed, 2013-08-07 at 07:06 +0200, Ondřej Bílka wrote:

> Add short_counter,long_counter and before increment counter before each
> jump. That way we will know how many short/long jumps were taken.

That's not trivial at all. The jump is a single location (in an asm
goto() statement) that happens to be inlined throughout the kernel. The
assembler decides if it will be a short or long jump. How do you add a
counter to count the difference?

The output I gave is from the boot up code that converts the jmp back to
a nop (or in this case, the default nop to the ideal nop). It knows the
size by reading the op code. This is a static analysis, not a running
one. It's no trivial task to have a counter for each jump.

There is a way though. If we enable all the jumps (all tracepoints, and
other users of jumplabel), record the trace and then compare the trace
to the output that shows which ones were short jumps, and all others are
long jumps.

I'll post the patches soon and you can have fun doing the compare :-)

Actually, I'm working on the 4 patches of the series that are more about
clean ups and safety checks than the jmp conversion. That is not
controversial, and I'll be posting them for 3.12 soon.

After that, I'll post the updated patches that have the conversion as
well as the counter, for RFC and for others to play with.

-- Steve
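[Editor's sketch] "It knows the size by reading the op code" can be illustrated with a static pass over instruction bytes. The sketch assumes x86, where a direct jmp with an 8-bit displacement starts with opcode 0xeb and is 2 bytes long, and one with a 32-bit displacement starts with 0xe9 and is 5 bytes long. The function and struct names are hypothetical and this is not the code from the patch series.

```c
#include <stddef.h>
#include <stdint.h>

struct jmp_counts {
	unsigned int short_jmps;	/* 2-byte jmp rel8 */
	unsigned int long_jmps;		/* 5-byte jmp rel32 */
	unsigned int other;		/* anything else, e.g. a nop */
};

/* Length in bytes of an x86 direct jmp at insn, or 0 if it is not one. */
size_t jmp_insn_len(const uint8_t *insn)
{
	switch (insn[0]) {
	case 0xeb:	/* jmp rel8 */
		return 2;
	case 0xe9:	/* jmp rel32 (32-bit and 64-bit mode) */
		return 5;
	default:
		return 0;
	}
}

/* Static pass over n patch sites; no runtime counters are involved. */
void count_jmp_sites(const uint8_t *const *sites, size_t n,
		     struct jmp_counts *out)
{
	size_t i;

	out->short_jmps = out->long_jmps = out->other = 0;
	for (i = 0; i < n; i++) {
		switch (jmp_insn_len(sites[i])) {
		case 2:
			out->short_jmps++;
			break;
		case 5:
			out->long_jmps++;
			break;
		default:
			out->other++;
			break;
		}
	}
}
```

This counts how many sites were assembled as short or long jumps. It says nothing about how often each one is taken at run time, which is the distinction the message draws between static and running analysis.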
[PATCH] tile: remove unnecessary backslashes in asm-offsets.c
Pointed out by checkpatch. A few of the DEFINE() lines were properly written without backslash continuation; fix the rest. Signed-off-by: Chris Metcalf --- arch/tile/kernel/asm-offsets.c | 28 ++-- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/arch/tile/kernel/asm-offsets.c b/arch/tile/kernel/asm-offsets.c index 01ddf19..8fff475 100644 --- a/arch/tile/kernel/asm-offsets.c +++ b/arch/tile/kernel/asm-offsets.c @@ -37,28 +37,28 @@ void foo(void) { - DEFINE(SINGLESTEP_STATE_BUFFER_OFFSET, \ + DEFINE(SINGLESTEP_STATE_BUFFER_OFFSET, offsetof(struct single_step_state, buffer)); - DEFINE(SINGLESTEP_STATE_FLAGS_OFFSET, \ + DEFINE(SINGLESTEP_STATE_FLAGS_OFFSET, offsetof(struct single_step_state, flags)); - DEFINE(SINGLESTEP_STATE_ORIG_PC_OFFSET, \ + DEFINE(SINGLESTEP_STATE_ORIG_PC_OFFSET, offsetof(struct single_step_state, orig_pc)); - DEFINE(SINGLESTEP_STATE_NEXT_PC_OFFSET, \ + DEFINE(SINGLESTEP_STATE_NEXT_PC_OFFSET, offsetof(struct single_step_state, next_pc)); - DEFINE(SINGLESTEP_STATE_BRANCH_NEXT_PC_OFFSET, \ + DEFINE(SINGLESTEP_STATE_BRANCH_NEXT_PC_OFFSET, offsetof(struct single_step_state, branch_next_pc)); - DEFINE(SINGLESTEP_STATE_UPDATE_VALUE_OFFSET, \ + DEFINE(SINGLESTEP_STATE_UPDATE_VALUE_OFFSET, offsetof(struct single_step_state, update_value)); - DEFINE(THREAD_INFO_TASK_OFFSET, \ + DEFINE(THREAD_INFO_TASK_OFFSET, offsetof(struct thread_info, task)); - DEFINE(THREAD_INFO_FLAGS_OFFSET, \ + DEFINE(THREAD_INFO_FLAGS_OFFSET, offsetof(struct thread_info, flags)); - DEFINE(THREAD_INFO_STATUS_OFFSET, \ + DEFINE(THREAD_INFO_STATUS_OFFSET, offsetof(struct thread_info, status)); - DEFINE(THREAD_INFO_HOMECACHE_CPU_OFFSET, \ + DEFINE(THREAD_INFO_HOMECACHE_CPU_OFFSET, offsetof(struct thread_info, homecache_cpu)); - DEFINE(THREAD_INFO_STEP_STATE_OFFSET, \ + DEFINE(THREAD_INFO_STEP_STATE_OFFSET, offsetof(struct thread_info, step_state)); DEFINE(TASK_STRUCT_THREAD_KSP_OFFSET, @@ -66,11 +66,11 @@ void foo(void) DEFINE(TASK_STRUCT_THREAD_PC_OFFSET, offsetof(struct 
task_struct, thread.pc)); - DEFINE(HV_TOPOLOGY_WIDTH_OFFSET, \ + DEFINE(HV_TOPOLOGY_WIDTH_OFFSET, offsetof(HV_Topology, width)); - DEFINE(HV_TOPOLOGY_HEIGHT_OFFSET, \ + DEFINE(HV_TOPOLOGY_HEIGHT_OFFSET, offsetof(HV_Topology, height)); - DEFINE(IRQ_CPUSTAT_SYSCALL_COUNT_OFFSET, \ + DEFINE(IRQ_CPUSTAT_SYSCALL_COUNT_OFFSET, offsetof(irq_cpustat_t, irq_syscall_count)); } -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] tile: various console improvements
This change improves and cleans up the tile console. - We enable HVC_IRQ support on tilegx, with the addition of a new Tilera hypervisor API for tilegx to allow a console IPI. If IPI support is not available we fall back to the previous polling mode. - We simplify the earlyprintk code to use CON_BOOT and eliminate some of the other supporting earlyprintk code. - A new tile_console_write() primitive is used to send output to the console and is factored out of the hvc_tile driver. This lets us support a "sim_console" boot argument to allow using simulator hooks to send output to the "console" as a slightly faster alternative to emulating the hardware more directly. Signed-off-by: Chris Metcalf --- arch/tile/Kconfig | 1 + arch/tile/include/asm/setup.h | 3 +- arch/tile/include/hv/hypervisor.h | 29 +++- arch/tile/kernel/early_printk.c | 47 +++- arch/tile/kernel/hvglue.lds | 3 +- arch/tile/kernel/reboot.c | 2 - drivers/tty/hvc/hvc_tile.c| 149 -- 7 files changed, 186 insertions(+), 48 deletions(-) diff --git a/arch/tile/Kconfig b/arch/tile/Kconfig index e41a381..0576e1d 100644 --- a/arch/tile/Kconfig +++ b/arch/tile/Kconfig @@ -112,6 +112,7 @@ config SMP config HVC_TILE depends on TTY select HVC_DRIVER + select HVC_IRQ if TILEGX def_bool y config TILEGX diff --git a/arch/tile/include/asm/setup.h b/arch/tile/include/asm/setup.h index d04..e989090 100644 --- a/arch/tile/include/asm/setup.h +++ b/arch/tile/include/asm/setup.h @@ -24,9 +24,8 @@ */ #define MAXMEM_PFN PFN_DOWN(MAXMEM) +int tile_console_write(const char *buf, int count); void early_panic(const char *fmt, ...); -void warn_early_printk(void); -void __init disable_early_printk(void); /* Init-time routine to do tile-specific per-cpu setup. 
*/ void setup_cpu(int boot); diff --git a/arch/tile/include/hv/hypervisor.h b/arch/tile/include/hv/hypervisor.h index 837dca5..f882ebc 100644 --- a/arch/tile/include/hv/hypervisor.h +++ b/arch/tile/include/hv/hypervisor.h @@ -318,8 +318,11 @@ /** hv_set_pte_super_shift */ #define HV_DISPATCH_SET_PTE_SUPER_SHIFT 57 +/** hv_console_set_ipi */ +#define HV_DISPATCH_CONSOLE_SET_IPI 63 + /** One more than the largest dispatch value */ -#define _HV_DISPATCH_END 58 +#define _HV_DISPATCH_END 64 #ifndef __ASSEMBLER__ @@ -585,6 +588,30 @@ typedef struct */ int hv_get_ipi_pte(HV_Coord tile, int pl, HV_PTE* pte); +/** Configure the console interrupt. + * + * When the console client interrupt is enabled, the hypervisor will + * deliver the specified IPI to the client in the following situations: + * + * - The console has at least one character available for input. + * + * - The console can accept new characters for output, and the last call + * to hv_console_write() did not write all of the characters requested + * by the client. + * + * Note that in some system configurations, console interrupt will not + * be available; clients should be prepared for this routine to fail and + * to fall back to periodic console polling in that case. + * + * @param ipi Index of the IPI register which will receive the interrupt. + * @param event IPI event number for console interrupt. If less than 0, + *disable the console IPI interrupt. + * @param coord Tile to be targeted for console interrupt. + * @return 0 on success, otherwise, HV_EINVAL if illegal parameter, + * HV_ENOTSUP if console interrupt are not available. + */ +int hv_console_set_ipi(int ipi, int event, HV_Coord coord); + #else /* !CHIP_HAS_IPI() */ /** A set of interrupts. 
*/ diff --git a/arch/tile/kernel/early_printk.c b/arch/tile/kernel/early_printk.c index 34d72a1..b608e00 100644 --- a/arch/tile/kernel/early_printk.c +++ b/arch/tile/kernel/early_printk.c @@ -23,19 +23,24 @@ static void early_hv_write(struct console *con, const char *s, unsigned n) { - hv_console_write((HV_VirtAddr) s, n); + tile_console_write(s, n); + + /* +* Convert NL to NLCR (close enough to CRNL) during early boot. +* We assume newlines are at the ends of strings, which turns out +* to be good enough for early boot console output. +*/ + if (n && s[n-1] == '\n') + tile_console_write("\r", 1); } static struct console early_hv_console = { .name = "earlyhv", .write =early_hv_write, - .flags =CON_PRINTBUFFER, + .flags =CON_PRINTBUFFER | CON_BOOT, .index =-1, }; -/* Direct interface for emergencies */ -static int early_console_complete; - void early_panic(const char *fmt, ...) { va_list ap; @@ -43,51 +48,21 @@ void early_panic(const char *fmt, ...) va_start(ap, fmt); early_printk("Kernel panic - not syncing: "); early_vprintk(fmt, ap); -
[PATCH] tile: support "memmap" boot parameter
This change adds support for the "memmap" boot parameter similar to
what x86 provides.  The tile version supports "memmap=1G$5G", for
example, as a way to reserve a 1 GB range starting at PA 5GB.  The
memory is reserved via bootmem during startup, and we create a
suitable "struct resource" marked as "Reserved" so you can see the
range reported by /proc/iomem.  Up to 64 such regions can currently
be reserved on the boot command line.

We do not support the x86 options "memmap=nn@ss" (force some memory
to be available at the given address) since it's pointless to try to
have Linux use memory the Tilera hypervisor hasn't given it.  We do
not support "memmap=nn#ss" to add an ACPI range for later processing,
since we don't support ACPI.  We do not support "memmap=exactmap"
since we don't support reading the e820 information from the BIOS
like x86 does.  I did add support for "memmap=nn" (and the synonym
"mem=nn") which caps the highest PA value at "nn"; these are both
just synonyms for the existing tile boot option "maxmem".

Signed-off-by: Chris Metcalf
---
 arch/tile/kernel/setup.c | 80 +---
 1 file changed, 76 insertions(+), 4 deletions(-)

diff --git a/arch/tile/kernel/setup.c b/arch/tile/kernel/setup.c
index 676e155..b00e156 100644
--- a/arch/tile/kernel/setup.c
+++ b/arch/tile/kernel/setup.c
@@ -154,6 +154,65 @@ static int __init setup_maxnodemem(char *str)
 }
 early_param("maxnodemem", setup_maxnodemem);

+struct memmap_entry {
+	u64 addr;	/* start of memory segment */
+	u64 size;	/* size of memory segment */
+};
+static struct memmap_entry memmap_map[64];
+static int memmap_nr;
+
+static void add_memmap_region(u64 addr, u64 size)
+{
+	if (memmap_nr >= ARRAY_SIZE(memmap_map)) {
+		pr_err("Ooops! Too many entries in the memory map!\n");
+		return;
+	}
+	memmap_map[memmap_nr].addr = addr;
+	memmap_map[memmap_nr].size = size;
+	memmap_nr++;
+}
+
+static int __init setup_memmap(char *p)
+{
+	char *oldp;
+	u64 start_at, mem_size;
+
+	if (!p)
+		return -EINVAL;
+
+	if (!strncmp(p, "exactmap", 8)) {
+		pr_err("\"memmap=exactmap\" not valid on tile\n");
+		return 0;
+	}
+
+	oldp = p;
+	mem_size = memparse(p, &p);
+	if (p == oldp)
+		return -EINVAL;
+
+	if (*p == '@') {
+		pr_err("\"memmap=nn@ss\" (force RAM) invalid on tile\n");
+	} else if (*p == '#') {
+		pr_err("\"memmap=nn#ss\" (force ACPI data) invalid on tile\n");
+	} else if (*p == '$') {
+		start_at = memparse(p+1, &p);
+		add_memmap_region(start_at, mem_size);
+	} else {
+		if (mem_size == 0)
+			return -EINVAL;
+		maxmem_pfn = (mem_size >> HPAGE_SHIFT) <<
+			(HPAGE_SHIFT - PAGE_SHIFT);
+	}
+	return *p == '\0' ? 0 : -EINVAL;
+}
+early_param("memmap", setup_memmap);
+
+static int __init setup_mem(char *str)
+{
+	return setup_maxmem(str);
+}
+early_param("mem", setup_mem); /* compatibility with x86 */
+
 static int __init setup_isolnodes(char *str)
 {
 	char buf[MAX_NUMNODES * 5];
@@ -629,6 +688,12 @@ static void __init setup_bootmem_allocator(void)
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		setup_bootmem_allocator_node(i);

+	/* Reserve any memory excluded by "memmap" arguments. */
+	for (i = 0; i < memmap_nr; ++i) {
+		struct memmap_entry *m = &memmap_map[i];
+		reserve_bootmem(m->addr, m->size, 0);
+	}
+
 #ifdef CONFIG_KEXEC
 	if (crashk_res.start != crashk_res.end)
 		reserve_bootmem(crashk_res.start, resource_size(&crashk_res), 0);
@@ -1562,11 +1627,11 @@ insert_non_bus_resource(void)
 #endif

 static struct resource* __init
-insert_ram_resource(u64 start_pfn, u64 end_pfn)
+insert_ram_resource(u64 start_pfn, u64 end_pfn, bool reserved)
 {
 	struct resource *res =
 		kzalloc(sizeof(struct resource), GFP_ATOMIC);
-	res->name = "System RAM";
+	res->name = reserved ? "Reserved" : "System RAM";
 	res->start = start_pfn << PAGE_SHIFT;
 	res->end = (end_pfn << PAGE_SHIFT) - 1;
 	res->flags = IORESOURCE_BUSY | IORESOURCE_MEM;
@@ -1601,11 +1666,11 @@ static int __init request_standard_resources(void)
 		    end_pfn > pci_reserve_start_pfn) {
 			if (end_pfn > pci_reserve_end_pfn)
 				insert_ram_resource(pci_reserve_end_pfn,
-						    end_pfn);
+						    end_pfn, 0);
 			end_pfn = pci_reserve_start_pfn;
 		}
 #endif
-		insert_ram_resource(start_pfn, end_pfn);
+		insert_ram_resource(start_pfn, end_pfn, 0);
 	}

 	code_resource.start = __pa(_text -
Re: [PATCH] i2c: mv64xxx: Document the newly introduced allwinner compatible
On Wed, Jul 24, 2013 at 09:14:35AM +0200, Maxime Ripard wrote:
> Signed-off-by: Maxime Ripard

Applied to for-current, thanks! And please, always send to the I2C
list. I work heavily with patchwork monitoring the I2C list;
everything not there will easily be forgotten!
[PATCH] tile: fix comment bug in sys_cmpxchg description
Signed-off-by: Chris Metcalf
---
 arch/tile/kernel/intvec_32.S | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/tile/kernel/intvec_32.S b/arch/tile/kernel/intvec_32.S
index cb52d66..25966af 100644
--- a/arch/tile/kernel/intvec_32.S
+++ b/arch/tile/kernel/intvec_32.S
@@ -1609,7 +1609,7 @@ ENTRY(sys_cmpxchg)
 	 * Because of C pointer arithmetic, we want to compute this:
 	 *
 	 * ((char*)atomic_locks +
-	 *  (((r0 >> 3) & (1 << (ATOMIC_HASH_SIZE - 1))) << 2))
+	 *  (((r0 >> 3) & ((1 << ATOMIC_HASH_SHIFT) - 1)) << 2))
 	 *
 	 * Instead of two shifts we just ">> 1", and use 'mm'
 	 * to ignore the low and high bits we don't want.
--
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
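
The corrected comment above describes an index mask of ATOMIC_HASH_SHIFT bits, followed by scaling by 4, and notes that the two shifts can be folded into a single ">> 1" with a wider mask. The following userspace sketch checks that equivalence in plain C. The value of ATOMIC_HASH_SHIFT is an assumption made for the example, and the function names are hypothetical; the real code is tile assembly using the 'mm' instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, for illustration only; the real constant lives in the tile headers. */
#define ATOMIC_HASH_SHIFT 10

/* Byte offset into atomic_locks[], written as in the corrected comment. */
static uint32_t lock_offset_two_shifts(uint32_t r0)
{
	uint32_t index = (r0 >> 3) & ((1u << ATOMIC_HASH_SHIFT) - 1);

	return index << 2;	/* scale the index by sizeof(int) */
}

/*
 * The same offset with one shift: ">> 1" moves bit 3 of r0 to bit 2,
 * and the mask drops the two low bits and everything above the field.
 */
static uint32_t lock_offset_one_shift(uint32_t r0)
{
	uint32_t mask = ((1u << ATOMIC_HASH_SHIFT) - 1) << 2;

	return (r0 >> 1) & mask;
}

/* Compare both forms over a range of addresses. */
static void check_equivalence(void)
{
	for (uint32_t r0 = 0; r0 < (1u << 16); r0++)
		assert(lock_offset_two_shifts(r0) == lock_offset_one_shift(r0));
}
```

By contrast, the expression in the old comment, (1 << (ATOMIC_HASH_SIZE - 1)), is at best a single bit rather than a field mask, which is presumably why the comment was considered a bug.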
[PATCH] mm: make lru_add_drain_all() selective
This change makes lru_add_drain_all() only selectively interrupt
the cpus that have per-cpu free pages that can be drained.

This is important in nohz mode where calling mlockall(), for
example, otherwise will interrupt every core unnecessarily.

Signed-off-by: Chris Metcalf
---
 include/linux/workqueue.h |  3 +++
 kernel/workqueue.c        | 35 ++-
 mm/swap.c                 | 38 +-
 3 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a0ed78a..71a3fe7 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -13,6 +13,8 @@
 #include
 #include

+struct cpumask;
+
 struct workqueue_struct;

 struct work_struct;
@@ -470,6 +472,7 @@ extern void flush_workqueue(struct workqueue_struct *wq);
 extern void drain_workqueue(struct workqueue_struct *wq);
 extern void flush_scheduled_work(void);

+extern int schedule_on_cpu_mask(work_func_t func, const struct cpumask *mask);
 extern int schedule_on_each_cpu(work_func_t func);

 int execute_in_process_context(work_func_t fn, struct execute_work *);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f02c4a4..a6d1809 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2962,17 +2962,18 @@ bool cancel_delayed_work_sync(struct delayed_work *dwork)
 EXPORT_SYMBOL(cancel_delayed_work_sync);

 /**
- * schedule_on_each_cpu - execute a function synchronously on each online CPU
+ * schedule_on_cpu_mask - execute a function synchronously on each listed CPU
  * @func: the function to call
+ * @mask: the cpumask to invoke the function on
  *
- * schedule_on_each_cpu() executes @func on each online CPU using the
+ * schedule_on_cpu_mask() executes @func on each listed CPU using the
  * system workqueue and blocks until all CPUs have completed.
- * schedule_on_each_cpu() is very slow.
+ * schedule_on_cpu_mask() is very slow.
  *
  * RETURNS:
  * 0 on success, -errno on failure.
  */
-int schedule_on_each_cpu(work_func_t func)
+int schedule_on_cpu_mask(work_func_t func, const struct cpumask *mask)
 {
 	int cpu;
 	struct work_struct __percpu *works;
@@ -2981,24 +2982,40 @@ int schedule_on_each_cpu(work_func_t func)
 	if (!works)
 		return -ENOMEM;

-	get_online_cpus();
-
-	for_each_online_cpu(cpu) {
+	for_each_cpu(cpu, mask) {
 		struct work_struct *work = per_cpu_ptr(works, cpu);

 		INIT_WORK(work, func);
 		schedule_work_on(cpu, work);
 	}

-	for_each_online_cpu(cpu)
+	for_each_cpu(cpu, mask)
 		flush_work(per_cpu_ptr(works, cpu));

-	put_online_cpus();
 	free_percpu(works);
 	return 0;
 }

 /**
+ * schedule_on_each_cpu - execute a function synchronously on each online CPU
+ * @func: the function to call
+ *
+ * schedule_on_each_cpu() executes @func on each online CPU using the
+ * system workqueue and blocks until all CPUs have completed.
+ * schedule_on_each_cpu() is very slow.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int schedule_on_each_cpu(work_func_t func)
+{
+	get_online_cpus();
+	schedule_on_cpu_mask(func, cpu_online_mask);
+	put_online_cpus();
+	return 0;
+}
+
+/**
  * flush_scheduled_work - ensure that any scheduled work has run to completion.
  *
  * Forces execution of the kernel-global workqueue and blocks until its
diff --git a/mm/swap.c b/mm/swap.c
index 4a1d0d2..981b1d9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -683,7 +683,43 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
  */
 int lru_add_drain_all(void)
 {
-	return schedule_on_each_cpu(lru_add_drain_per_cpu);
+	cpumask_var_t mask;
+	int cpu, rc;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+	cpumask_clear(mask);
+
+	/*
+	 * Figure out which cpus need flushing.  It's OK if we race
+	 * with changes to the per-cpu lru pvecs, since it's no worse
+	 * than if we flushed all cpus, since a cpu could still end
+	 * up putting pages back on its pvec before we returned.
+	 * And this avoids interrupting other cpus unnecessarily.
+	 */
+	for_each_online_cpu(cpu) {
+		struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+		struct pagevec *pvec = &per_cpu(lru_rotate_pvecs, cpu);
+		int count = pagevec_count(pvec);
+		int lru;
+
+		if (!count) {
+			for_each_lru(lru) {
+				pvec = &pvecs[lru - LRU_BASE];
+				count = pagevec_count(pvec);
+				if (count)
+					break;
+			}
+		}
+
+		if (count)
+			cpumask_set_cpu(cpu, mask);
+	}
+
+	rc = schedule_on_cpu_mask(lru_add_drain_per_cpu,
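
The mask-building loop in lru_add_drain_all() can be modelled in userspace to see which CPUs would actually be interrupted. The sketch below is only an illustration under stated assumptions: struct cpu_pvecs, NR_CPUS, NR_LISTS and cpus_needing_drain() are hypothetical stand-ins for the per-cpu pagevecs and the cpumask API, and a plain 32-bit word replaces cpumask_var_t.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS  8	/* illustrative machine size */
#define NR_LISTS 4	/* illustrative number of LRU lists */

/* Hypothetical per-cpu state: pending page counts, mirroring pagevec_count(). */
struct cpu_pvecs {
	int online;
	unsigned int rotate_count;		/* like lru_rotate_pvecs */
	unsigned int add_count[NR_LISTS];	/* like lru_add_pvecs[] */
};

/* Return a bitmask of online CPUs that have at least one pending page. */
static uint32_t cpus_needing_drain(const struct cpu_pvecs *cpus, int ncpus)
{
	uint32_t mask = 0;

	assert(ncpus <= 32);
	for (int cpu = 0; cpu < ncpus; cpu++) {
		unsigned int count = cpus[cpu].rotate_count;

		if (!cpus[cpu].online)
			continue;
		/* Only look at the add lists if the rotate list is empty. */
		for (int i = 0; !count && i < NR_LISTS; i++)
			count = cpus[cpu].add_count[i];
		if (count)
			mask |= 1u << cpu;
	}
	return mask;
}
```

As in the patch, the counts are read without synchronization in spirit: a CPU may queue pages right after being skipped, which is no worse than the page arriving just after a full drain had completed.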
[PATCH] tile: avoid recursive backtrace faults
This change adds support for avoiding recursive backtracer crashes;
we haven't seen this in practice other than when things are seriously
corrupt, but it may help avoid losing the root cause of a crash.

Also, don't abort kernel backtracers for invalid userspace PC's.
If we do, we lose the ability to backtrace through a userspace
call to a bad address above PAGE_OFFSET, even though it can
be perfectly reasonable to continue the backtrace in such a case.

Signed-off-by: Chris Metcalf
---
 arch/tile/include/asm/processor.h |  2 ++
 arch/tile/kernel/stack.c          | 30 --
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/tile/include/asm/processor.h b/arch/tile/include/asm/processor.h
index cda2724..fed1c04 100644
--- a/arch/tile/include/asm/processor.h
+++ b/arch/tile/include/asm/processor.h
@@ -110,6 +110,8 @@ struct thread_struct {
 	unsigned long long interrupt_mask;
 	/* User interrupt-control 0 state */
 	unsigned long intctrl_0;
+	/* Is this task currently doing a backtrace? */
+	bool in_backtrace;
 #if CHIP_HAS_PROC_STATUS_SPR()
 	/* Any other miscellaneous processor state bits */
 	unsigned long proc_status;
diff --git a/arch/tile/kernel/stack.c b/arch/tile/kernel/stack.c
index af8dfc9..c972689 100644
--- a/arch/tile/kernel/stack.c
+++ b/arch/tile/kernel/stack.c
@@ -103,8 +103,7 @@ static struct pt_regs *valid_fault_handler(struct KBacktraceIterator* kbt)
 		if (kbt->verbose)
 			pr_err(" <%s while in kernel mode>\n", fault);
 	} else if (EX1_PL(p->ex1) == USER_PL &&
-	    p->pc < PAGE_OFFSET &&
-	    p->sp < PAGE_OFFSET) {
+		   p->sp < PAGE_OFFSET && p->sp != 0) {
 		if (kbt->verbose)
 			pr_err(" <%s while in user mode>\n", fault);
 	} else if (kbt->verbose) {
@@ -352,6 +351,26 @@ static void describe_addr(struct KBacktraceIterator *kbt,
 }

 /*
+ * Avoid possible crash recursion during backtrace.  If it happens, it
+ * makes it easy to lose the actual root cause of the failure, so we
+ * put a simple guard on all the backtrace loops.
+ */
+static bool start_backtrace(void)
+{
+	if (current->thread.in_backtrace) {
+		pr_err("Backtrace requested while in backtrace!\n");
+		return false;
+	}
+	current->thread.in_backtrace = true;
+	return true;
+}
+
+static void end_backtrace(void)
+{
+	current->thread.in_backtrace = false;
+}
+
+/*
  * This method wraps the backtracer's more generic support.
  * It is only invoked from the architecture-specific code; show_stack()
  * and dump_stack() (in entry.S) are architecture-independent entry points.
@@ -361,6 +380,8 @@ void tile_show_stack(struct KBacktraceIterator *kbt, int headers)
 	int i;
 	int have_mmap_sem = 0;

+	if (!start_backtrace())
+		return;
 	if (headers) {
 		/*
 		 * Add a blank line since if we are called from panic(),
@@ -402,6 +423,7 @@ void tile_show_stack(struct KBacktraceIterator *kbt, int headers)
 	pr_err("Stack dump complete\n");
 	if (have_mmap_sem)
 		up_read(&kbt->task->mm->mmap_sem);
+	end_backtrace();
 }
 EXPORT_SYMBOL(tile_show_stack);

@@ -463,6 +485,8 @@ void save_stack_trace_tsk(struct task_struct *task, struct stack_trace *trace)
 	int skip = trace->skip;
 	int i = 0;

+	if (!start_backtrace())
+		goto done;
 	if (task == NULL || task == current)
 		KBacktraceIterator_init_current(&kbt);
 	else
@@ -476,6 +500,8 @@ void save_stack_trace_tsk(struct task_struct *task, struct stack_trace *trace)
 			break;
 		trace->entries[i++] = kbt.it.pc;
 	}
+	end_backtrace();
+done:
 	trace->nr_entries = i;
 }
 EXPORT_SYMBOL(save_stack_trace_tsk);
--
1.8.3.1
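
The guard added by this patch is a per-task re-entrancy flag. A minimal userspace sketch of the same pattern follows; struct thread_state and dump_stack_sim() are hypothetical names for this illustration, and the "fault" is simulated by a nested call rather than a real exception.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the in_backtrace field of thread_struct. */
struct thread_state {
	bool in_backtrace;
};

static bool start_backtrace(struct thread_state *t)
{
	if (t->in_backtrace)
		return false;	/* already dumping: refuse to recurse */
	t->in_backtrace = true;
	return true;
}

static void end_backtrace(struct thread_state *t)
{
	t->in_backtrace = false;
}

/*
 * Walk 'depth' frames.  At frame 'fault_at' the walk "faults" and the
 * fault handler tries to dump the stack again.  Returns the number of
 * frames reported, or -1 if the dump was refused by the guard.
 */
static int dump_stack_sim(struct thread_state *t, int depth, int fault_at)
{
	int reported = 0;

	if (!start_backtrace(t))
		return -1;
	for (int i = 0; i < depth; i++) {
		if (i == fault_at) {
			/* The nested request is rejected instead of recursing. */
			int nested = dump_stack_sim(t, depth, -1);

			assert(nested == -1);
			break;
		}
		reported++;
	}
	end_backtrace(t);
	return reported;
}
```

With the guard in place, the outer dump keeps whatever frames it had already reported, which is the point made in the commit message about not losing the root cause.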
[PATCH] tile: fix tilegx vmalloc_sync_all BUG_ON
As specified, the test wasn't correct, and in any case it should be a
BUILD_BUG_ON.

Signed-off-by: Chris Metcalf
---
 arch/tile/mm/fault.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index f7f99f9..6152819 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -870,7 +870,8 @@ void vmalloc_sync_all(void)
 {
 #ifdef __tilegx__
 	/* Currently all L1 kernel pmd's are static and shared. */
-	BUG_ON(pgd_index(VMALLOC_END) != pgd_index(VMALLOC_START));
+	BUILD_BUG_ON(pgd_index(VMALLOC_END - PAGE_SIZE) !=
+		     pgd_index(VMALLOC_START));
 #else
 	/*
 	 * Note that races in the updates of insync and start aren't
--
1.8.3.1
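
One plausible reading of "the test wasn't correct" is that VMALLOC_END is an exclusive bound, so when the vmalloc range ends exactly on a top-level page-table boundary, pgd_index(VMALLOC_END) already names the next entry. The sketch below illustrates that reading with a compile-time check. All constants are invented for the example and are not the tilegx values, and pgd_index() is simplified to a bare shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, chosen so the range ends on a pgd boundary. */
#define PAGE_SIZE	0x10000ULL
#define PGDIR_SHIFT	32	/* each top-level entry maps 4 GB here */
#define VMALLOC_START	0x300000000ULL
#define VMALLOC_END	0x400000000ULL	/* exclusive end of the range */

/* Simplified: the real macro also masks with PTRS_PER_PGD - 1. */
#define pgd_index(va)	((uint64_t)(va) >> PGDIR_SHIFT)

/* Compile-time form of the corrected test (what BUILD_BUG_ON provides). */
static_assert(pgd_index(VMALLOC_END - PAGE_SIZE) == pgd_index(VMALLOC_START),
	      "vmalloc range must fit in one top-level entry");

/* The original test compares against the first address past the range. */
static int old_check_passes(void)
{
	return pgd_index(VMALLOC_END) == pgd_index(VMALLOC_START);
}

/* The corrected test compares against the last page inside the range. */
static int new_check_passes(void)
{
	return pgd_index(VMALLOC_END - PAGE_SIZE) == pgd_index(VMALLOC_START);
}
```

With this layout the old comparison fails even though the whole range lives in a single top-level entry, while the corrected comparison holds and can be evaluated at build time.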
Re: [patch v2 3/3] mm: page_alloc: fair zone allocator policy
On Fri, Aug 02, 2013 at 11:37:26AM -0400, Johannes Weiner wrote:
> Each zone that holds userspace pages of one workload must be aged at a
> speed proportional to the zone size.  Otherwise, the time an
> individual page gets to stay in memory depends on the zone it happened
> to be allocated in.  Asymmetry in the zone aging creates rather
> unpredictable aging behavior and results in the wrong pages being
> reclaimed, activated etc.
>
> But exactly this happens right now because of the way the page
> allocator and kswapd interact.  The page allocator uses per-node lists
> of all zones in the system, ordered by preference, when allocating a
> new page.  When the first iteration does not yield any results, kswapd
> is woken up and the allocator retries.  Due to the way kswapd reclaims
> zones below the high watermark while a zone can be allocated from when
> it is above the low watermark, the allocator may keep kswapd running
> while kswapd reclaim ensures that the page allocator can keep
> allocating from the first zone in the zonelist for extended periods of
> time.  Meanwhile the other zones rarely see new allocations and thus
> get aged much slower in comparison.
>
> The result is that the occasional page placed in lower zones gets
> relatively more time in memory, even gets promoted to the active list
> after its peers have long been evicted.  Meanwhile, the bulk of the
> working set may be thrashing on the preferred zone even though there
> may be significant amounts of memory available in the lower zones.
>
> Even the most basic test -- repeatedly reading a file slightly bigger
> than memory -- shows how broken the zone aging is.  In this scenario,
> no single page should be able to stay in memory long enough to get
> referenced twice and activated, but activation happens in spades:
>
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_inactive_file 0
>       nr_active_file 8
>       nr_inactive_file 1582
>       nr_active_file 11994
>   $ cat data data data data >/dev/null
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 70
>       nr_inactive_file 258753
>       nr_active_file 443214
>       nr_inactive_file 149793
>       nr_active_file 12021
>
> Fix this with a very simple round robin allocator.  Each zone is
> allowed a batch of allocations that is proportional to the zone's
> size, after which it is treated as full.  The batch counters are reset
> when all zones have been tried and the allocator enters the slowpath
> and kicks off kswapd reclaim.  Allocation and reclaim is now fairly
> spread out to all available/allowable zones:
>
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_inactive_file 174
>       nr_active_file 4865
>       nr_inactive_file 53
>       nr_active_file 860
>   $ cat data data data data >/dev/null
>   $ grep active_file /proc/zoneinfo
>       nr_inactive_file 0
>       nr_active_file 0
>       nr_inactive_file 22
>       nr_active_file 4988
>       nr_inactive_file 190969
>       nr_active_file 937
>
> When zone_reclaim_mode is enabled, allocations will now spread out to
> all zones on the local node, not just the first preferred zone (which
> on a 4G node might be a tiny Normal zone).
>
> Signed-off-by: Johannes Weiner
> Tested-by: Zlatko Calusic
> ---
>  include/linux/mmzone.h |  1 +
>  mm/page_alloc.c        | 69 ++
>  2 files changed, 60 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index af4a3b7..dcad2ab 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -352,6 +352,7 @@ struct zone {
>  	 * free areas of different sizes
>  	 */
>  	spinlock_t		lock;
> +	int			alloc_batch;
>  	int                     all_unreclaimable; /* All pages pinned */
>  #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>  	/* Set to true when the PG_migrate_skip bits should be cleared */

This adds a dirty cache line that is updated on every allocation even if
it's from the per-cpu allocator. I am concerned that this will introduce
noticeable overhead in the allocator paths on large machines running
allocator intensive workloads.

Would it be possible to move it into the per-cpu pageset? I understand
that the round-robin nature will then depend on what CPU is running and
the performance characteristics will be different. There might even be an
adverse workload that uses all the batches from all available CPUs until
it is essentially the same problem but that would be a very worst case.
I would hope that in general it would work without adding a big source of
dirty cache line bouncing in the middle of the allocator.

What I do not know offhand is how much space there is in that pageset
thing before it grows by another cache line.

I should note that the page allocator
Re: [PATCH] ARM: dts: am33xx: Correct gpio #interrupt-cells property
On Wed, Aug 07, 2013 at 12:06:32PM +0100, Lars Poeschel wrote:
> From: Lars Poeschel
>
> Following commit ff5c9059 and therefore other omap platforms using
> the gpio-omap driver correct the #interrupt-cells property on am33xx
> too. The omap gpio binding documentation also states that
> the #interrupt-cells property should be 2.

I take it there are no device nodes for which any of these nodes are an
interrupt parent (which would need to be updated)?

If so:

Acked-by: Mark Rutland

Thanks,
Mark.

>
> Signed-off-by: Lars Poeschel
> ---
>  arch/arm/boot/dts/am33xx.dtsi | 8
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm/boot/dts/am33xx.dtsi b/arch/arm/boot/dts/am33xx.dtsi
> index 38b446b..033c5dd 100644
> --- a/arch/arm/boot/dts/am33xx.dtsi
> +++ b/arch/arm/boot/dts/am33xx.dtsi
> @@ -102,7 +102,7 @@
>  			gpio-controller;
>  			#gpio-cells = <2>;
>  			interrupt-controller;
> -			#interrupt-cells = <1>;
> +			#interrupt-cells = <2>;
>  			reg = <0x44e07000 0x1000>;
>  			interrupts = <96>;
>  		};
> @@ -113,7 +113,7 @@
>  			gpio-controller;
>  			#gpio-cells = <2>;
>  			interrupt-controller;
> -			#interrupt-cells = <1>;
> +			#interrupt-cells = <2>;
>  			reg = <0x4804c000 0x1000>;
>  			interrupts = <98>;
>  		};
> @@ -124,7 +124,7 @@
>  			gpio-controller;
>  			#gpio-cells = <2>;
>  			interrupt-controller;
> -			#interrupt-cells = <1>;
> +			#interrupt-cells = <2>;
>  			reg = <0x481ac000 0x1000>;
>  			interrupts = <32>;
>  		};
> @@ -135,7 +135,7 @@
>  			gpio-controller;
>  			#gpio-cells = <2>;
>  			interrupt-controller;
> -			#interrupt-cells = <1>;
> +			#interrupt-cells = <2>;
>  			reg = <0x481ae000 0x1000>;
>  			interrupts = <62>;
>  		};
> --
> 1.7.10.4
Re: [PATCH 2/3] memcg: Limit the number of events registered on oom_control
On Wed 07-08-13 15:57:34, Michal Hocko wrote:
[...]
> Hmm, OK so you think that the fd limit is sufficient already?

Hmm, that would need to touch the code as well (the register callback
would need to make sure only one event is registered per cfile). But yes
this way would be better. I will send a new patch once I have an idle
moment.
--
Michal Hocko
SUSE Labs
[PATCH 2/6] ARM: Tegra: Add CPU's OPPs for using cpufreq-cpu0 driver
cpufreq-cpu0 driver needs OPPs to be present in DT which can be probed by it to get frequency table. This patch adds OPPs and clock-latency to tegra cpu0 node for multiple SoCs. Voltage levels aren't used until now for tegra and so a flat value which would eventually be ignored is used to represent voltage. Signed-off-by: Viresh Kumar --- arch/arm/boot/dts/tegra114.dtsi | 12 arch/arm/boot/dts/tegra20.dtsi | 12 arch/arm/boot/dts/tegra30.dtsi | 12 3 files changed, 36 insertions(+) diff --git a/arch/arm/boot/dts/tegra114.dtsi b/arch/arm/boot/dts/tegra114.dtsi index abf6c40..730e0d9 100644 --- a/arch/arm/boot/dts/tegra114.dtsi +++ b/arch/arm/boot/dts/tegra114.dtsi @@ -438,6 +438,18 @@ device_type = "cpu"; compatible = "arm,cortex-a15"; reg = <0>; + operating-points = < + /* kHzignored */ +216000 100 +312000 100 +456000 100 +608000 100 +76 100 +816000 100 +912000 100 +100 100 + >; + clock-latency = <30>; }; cpu@1 { diff --git a/arch/arm/boot/dts/tegra20.dtsi b/arch/arm/boot/dts/tegra20.dtsi index 9653fd8..5696f98 100644 --- a/arch/arm/boot/dts/tegra20.dtsi +++ b/arch/arm/boot/dts/tegra20.dtsi @@ -577,6 +577,18 @@ device_type = "cpu"; compatible = "arm,cortex-a9"; reg = <0>; + operating-points = < + /* kHzignored */ +216000 100 +312000 100 +456000 100 +608000 100 +76 100 +816000 100 +912000 100 +100 100 + >; + clock-latency = <30>; }; cpu@1 { diff --git a/arch/arm/boot/dts/tegra30.dtsi b/arch/arm/boot/dts/tegra30.dtsi index d8783f0..5930290 100644 --- a/arch/arm/boot/dts/tegra30.dtsi +++ b/arch/arm/boot/dts/tegra30.dtsi @@ -569,6 +569,18 @@ device_type = "cpu"; compatible = "arm,cortex-a9"; reg = <0>; + operating-points = < + /* kHzignored */ +216000 100 +312000 100 +456000 100 +608000 100 +76 100 +816000 100 +912000 100 +100 100 + >; + clock-latency = <30>; }; cpu@1 { -- 1.7.12.rc2.18.g61b472e -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 6/6] cpufreq: Tegra: Remove tegra-cpufreq driver
We are using generic cpufreq-cpu0 driver, so lets get rid of platform specific tegra-cpufreq.c driver. Signed-off-by: Viresh Kumar --- drivers/cpufreq/Makefile| 1 - drivers/cpufreq/tegra-cpufreq.c | 291 2 files changed, 292 deletions(-) delete mode 100644 drivers/cpufreq/tegra-cpufreq.c diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile index ad5866c..e74b3ee 100644 --- a/drivers/cpufreq/Makefile +++ b/drivers/cpufreq/Makefile @@ -76,7 +76,6 @@ obj-$(CONFIG_ARM_S5PV210_CPUFREQ) += s5pv210-cpufreq.o obj-$(CONFIG_ARM_SA1100_CPUFREQ) += sa1100-cpufreq.o obj-$(CONFIG_ARM_SA1110_CPUFREQ) += sa1110-cpufreq.o obj-$(CONFIG_ARM_SPEAR_CPUFREQ)+= spear-cpufreq.o -obj-$(CONFIG_ARM_TEGRA_CPUFREQ)+= tegra-cpufreq.o ## # PowerPC platform drivers diff --git a/drivers/cpufreq/tegra-cpufreq.c b/drivers/cpufreq/tegra-cpufreq.c deleted file mode 100644 index cd66b85..000 --- a/drivers/cpufreq/tegra-cpufreq.c +++ /dev/null @@ -1,291 +0,0 @@ -/* - * Copyright (C) 2010 Google, Inc. - * - * Author: - * Colin Cross - * Based on arch/arm/plat-omap/cpu-omap.c, (C) 2005 Nokia Corporation - * - * This software is licensed under the terms of the GNU General Public - * License version 2, as published by the Free Software Foundation, and - * may be copied, distributed, and modified under those terms. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. 
- * - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -static struct cpufreq_frequency_table freq_table[] = { - { .frequency = 216000 }, - { .frequency = 312000 }, - { .frequency = 456000 }, - { .frequency = 608000 }, - { .frequency = 76 }, - { .frequency = 816000 }, - { .frequency = 912000 }, - { .frequency = 100 }, - { .frequency = CPUFREQ_TABLE_END }, -}; - -#define NUM_CPUS 2 - -static struct clk *cpu_clk; -static struct clk *pll_x_clk; -static struct clk *pll_p_clk; -static struct clk *emc_clk; - -static unsigned long target_cpu_speed[NUM_CPUS]; -static DEFINE_MUTEX(tegra_cpu_lock); -static bool is_suspended; - -static int tegra_verify_speed(struct cpufreq_policy *policy) -{ - return cpufreq_frequency_table_verify(policy, freq_table); -} - -static unsigned int tegra_getspeed(unsigned int cpu) -{ - unsigned long rate; - - if (cpu >= NUM_CPUS) - return 0; - - rate = clk_get_rate(cpu_clk) / 1000; - return rate; -} - -static int tegra_cpu_clk_set_rate(unsigned long rate) -{ - int ret; - - /* -* Take an extra reference to the main pll so it doesn't turn -* off when we move the cpu off of it -*/ - clk_prepare_enable(pll_x_clk); - - ret = clk_set_parent(cpu_clk, pll_p_clk); - if (ret) { - pr_err("Failed to switch cpu to clock pll_p\n"); - goto out; - } - - if (rate == clk_get_rate(pll_p_clk)) - goto out; - - ret = clk_set_rate(pll_x_clk, rate); - if (ret) { - pr_err("Failed to change pll_x to %lu\n", rate); - goto out; - } - - ret = clk_set_parent(cpu_clk, pll_x_clk); - if (ret) { - pr_err("Failed to switch cpu to clock pll_x\n"); - goto out; - } - -out: - clk_disable_unprepare(pll_x_clk); - return ret; -} - -static int tegra_update_cpu_speed(struct cpufreq_policy *policy, - unsigned long rate) -{ - int ret = 0; - struct cpufreq_freqs freqs; - - freqs.old = tegra_getspeed(0); - freqs.new = rate; - - if (freqs.old == freqs.new) - return ret; - - /* -* Vote on memory bus frequency based on cpu 
frequency -* This sets the minimum frequency, display or avp may request higher -*/ - if (rate >= 816000) - clk_set_rate(emc_clk, 6); /* cpu 816 MHz, emc max */ - else if (rate >= 456000) - clk_set_rate(emc_clk, 3); /* cpu 456 MHz, emc 150Mhz */ - else - clk_set_rate(emc_clk, 1); /* emc 50Mhz */ - - cpufreq_notify_transition(policy, , CPUFREQ_PRECHANGE); - -#ifdef CONFIG_CPU_FREQ_DEBUG - printk(KERN_DEBUG "cpufreq-tegra: transition: %u --> %u\n", - freqs.old, freqs.new); -#endif - - ret = tegra_cpu_clk_set_rate(freqs.new * 1000); - if (ret) { - pr_err("cpu-tegra: Failed to set cpu frequency to %d kHz\n", - freqs.new); - freqs.new =
[PATCH 4/6] ARM: Tegra: defconfig: select cpufreq-cpu0 driver
Tegra requires cpufreq-cpu0 driver to be compiled in and hence we select it from
the defconfig.

Signed-off-by: Viresh Kumar
---
 arch/arm/configs/tegra_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/configs/tegra_defconfig b/arch/arm/configs/tegra_defconfig
index 1effb43..3fcec8f 100644
--- a/arch/arm/configs/tegra_defconfig
+++ b/arch/arm/configs/tegra_defconfig
@@ -38,6 +38,7 @@ CONFIG_ZBOOT_ROM_BSS=0x0
 CONFIG_KEXEC=y
 CONFIG_CPU_FREQ=y
 CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
+CONFIG_GENERIC_CPUFREQ_CPU0=y
 CONFIG_CPU_IDLE=y
 CONFIG_VFP=y
 CONFIG_PM_RUNTIME=y
--
1.7.12.rc2.18.g61b472e
[PATCH 5/6] ARM: Tegra: start using cpufreq-cpu0 driver
cpufreq-cpu0 driver can be probed over DT only if a corresponding device node is
created for the SoC which wants to use it. Let's create a platform device for
cpufreq-cpu0 driver for Tegra.

Also it removes the Kconfig entry responsible for compiling tegra-cpufreq driver
and hence there will not be any conflicts between two cpufreq drivers.

Signed-off-by: Viresh Kumar
---
 arch/arm/mach-tegra/tegra.c | 2 ++
 drivers/cpufreq/Kconfig.arm | 8
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/arm/mach-tegra/tegra.c b/arch/arm/mach-tegra/tegra.c
index 0d1e412..6ab3f69 100644
--- a/arch/arm/mach-tegra/tegra.c
+++ b/arch/arm/mach-tegra/tegra.c
@@ -82,11 +82,13 @@ static struct of_dev_auxdata tegra20_auxdata_lookup[] __initdata = {

 static void __init tegra_dt_init(void)
 {
+	struct platform_device_info devinfo = { .name = "cpufreq-cpu0", };
 	struct soc_device_attribute *soc_dev_attr;
 	struct soc_device *soc_dev;
 	struct device *parent = NULL;

 	tegra_clocks_apply_init_table();
+	platform_device_register_full(&devinfo);

 	soc_dev_attr = kzalloc(sizeof(*soc_dev_attr), GFP_KERNEL);
 	if (!soc_dev_attr)
diff --git a/drivers/cpufreq/Kconfig.arm b/drivers/cpufreq/Kconfig.arm
index de4d5d9..9472160 100644
--- a/drivers/cpufreq/Kconfig.arm
+++ b/drivers/cpufreq/Kconfig.arm
@@ -215,11 +215,3 @@ config ARM_SPEAR_CPUFREQ
 	default y
 	help
 	  This adds the CPUFreq driver support for SPEAr SOCs.
-
-config ARM_TEGRA_CPUFREQ
-	bool "TEGRA CPUFreq support"
-	depends on ARCH_TEGRA
-	select CPU_FREQ_TABLE
-	default y
-	help
-	  This adds the CPUFreq driver support for TEGRA SOCs.
--
1.7.12.rc2.18.g61b472e
[PATCH 3/6] ARM: Tegra: Enable OPP library
cpufreq-cpu0 driver is dependent on OPP library and hence we need to enable it
for Tegra as we are going to use cpufreq-cpu0.

Signed-off-by: Viresh Kumar
---
 arch/arm/mach-tegra/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm/mach-tegra/Kconfig b/arch/arm/mach-tegra/Kconfig
index ef3a8da..63875c5 100644
--- a/arch/arm/mach-tegra/Kconfig
+++ b/arch/arm/mach-tegra/Kconfig
@@ -1,6 +1,8 @@
 config ARCH_TEGRA
 	bool "NVIDIA Tegra" if ARCH_MULTI_V7
 	select ARCH_HAS_CPUFREQ
+	select ARCH_HAS_OPP
+	select PM_OPP if PM
 	select ARCH_REQUIRE_GPIOLIB
 	select CLKDEV_LOOKUP
 	select CLKSRC_MMIO
--
1.7.12.rc2.18.g61b472e
[PATCH 1/6] clk: Tegra: Add CPU0 clock driver
This patch adds CPU0's clk driver for Tegra. It will be used by the generic cpufreq-cpu0 driver to get/set cpu clk. Most of the platform specific bits are picked from tegra-cpufreq.c.

Signed-off-by: Viresh Kumar
---
 drivers/clk/tegra/Makefile      |   1 +
 drivers/clk/tegra/clk-cpu.c     | 164
 drivers/clk/tegra/clk-tegra30.c |   4 +
 include/linux/clk/tegra.h       |   1 +
 4 files changed, 170 insertions(+)
 create mode 100644 drivers/clk/tegra/clk-cpu.c

diff --git a/drivers/clk/tegra/Makefile b/drivers/clk/tegra/Makefile
index f49fac2..0e818c0 100644
--- a/drivers/clk/tegra/Makefile
+++ b/drivers/clk/tegra/Makefile
@@ -10,3 +10,4 @@ obj-y += clk-super.o
 obj-$(CONFIG_ARCH_TEGRA_2x_SOC) += clk-tegra20.o
 obj-$(CONFIG_ARCH_TEGRA_3x_SOC) += clk-tegra30.o
 obj-$(CONFIG_ARCH_TEGRA_114_SOC) += clk-tegra114.o
+obj-$(CONFIG_GENERIC_CPUFREQ_CPU0) += clk-cpu.o
diff --git a/drivers/clk/tegra/clk-cpu.c b/drivers/clk/tegra/clk-cpu.c
new file mode 100644
index 000..01716d6
--- /dev/null
+++ b/drivers/clk/tegra/clk-cpu.c
@@ -0,0 +1,164 @@
+/*
+ * Copyright (C) 2013 Linaro
+ *
+ * Author: Viresh Kumar
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2. This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+/*
+ * Responsible for setting cpu0 clk as requested by cpufreq-cpu0 driver
+ *
+ * All platform specific bits are taken from tegra-cpufreq driver.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include
+#include
+#include
+
+#define to_clk_cpu0(_hw) container_of(_hw, struct clk_cpu0, hw)
+
+struct clk_cpu0 {
+	struct clk_hw hw;
+	spinlock_t *lock;
+};
+
+static struct clk *cpu_clk;
+static struct clk *pll_x_clk;
+static struct clk *pll_p_clk;
+static struct clk *emc_clk;
+
+static unsigned long cpu0_recalc_rate(struct clk_hw *hw,
+		unsigned long parent_rate)
+{
+	return clk_get_rate(cpu_clk);
+}
+
+static long cpu0_round_rate(struct clk_hw *hw, unsigned long drate,
+		unsigned long *parent_rate)
+{
+	return clk_round_rate(cpu_clk, drate);
+}
+
+static int cpu0_set_rate(struct clk_hw *hw, unsigned long rate,
+		unsigned long parent_rate)
+{
+	int ret;
+
+	/*
+	 * Vote on memory bus frequency based on cpu frequency
+	 * This sets the minimum frequency, display or avp may request higher
+	 */
+	if (rate >= 81600)
+		clk_set_rate(emc_clk, 6); /* cpu 816 MHz, emc max */
+	else if (rate >= 45600)
+		clk_set_rate(emc_clk, 3); /* cpu 456 MHz, emc 150Mhz */
+	else
+		clk_set_rate(emc_clk, 1); /* emc 50Mhz */
+
+	/*
+	 * Take an extra reference to the main pll so it doesn't turn
+	 * off when we move the cpu off of it
+	 */
+	clk_prepare_enable(pll_x_clk);
+
+	ret = clk_set_parent(cpu_clk, pll_p_clk);
+	if (ret) {
+		pr_err("%s: Failed to switch cpu to clock pll_p\n", __func__);
+		goto out;
+	}
+
+	if (rate == clk_get_rate(pll_p_clk))
+		goto out;
+
+	ret = clk_set_rate(pll_x_clk, rate);
+	if (ret) {
+		pr_err("Failed to change pll_x to %lu\n", rate);
+		goto out;
+	}
+
+	ret = clk_set_parent(cpu_clk, pll_x_clk);
+	if (ret) {
+		pr_err("Failed to switch cpu to clock pll_x\n");
+		goto out;
+	}
+
+out:
+	clk_disable_unprepare(pll_x_clk);
+	return ret;
+}
+
+static struct clk_ops clk_cpu0_ops = {
+	.recalc_rate = cpu0_recalc_rate,
+	.round_rate = cpu0_round_rate,
+	.set_rate = cpu0_set_rate,
+};
+
+struct clk *tegra_clk_register_cpu0(void)
+{
+	struct clk_init_data init;
+	struct clk_cpu0 *cpu0;
+	struct clk *clk;
+
+	cpu0 = kzalloc(sizeof(*cpu0), GFP_KERNEL);
+	if (!cpu0) {
+		pr_err("%s: could not allocate cpu0 clk\n", __func__);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	cpu_clk = clk_get_sys(NULL, "cpu");
+	if (IS_ERR(cpu_clk)) {
+		clk = cpu_clk;
+		goto free_mem;
+	}
+
+	pll_x_clk = clk_get_sys(NULL, "pll_x");
+	if (IS_ERR(pll_x_clk)) {
+		clk = pll_x_clk;
+		goto put_cpu_clk;
+	}
+
+	pll_p_clk = clk_get_sys(NULL, "pll_p_cclk");
+	if (IS_ERR(pll_p_clk)) {
+		clk = pll_p_clk;
+		goto put_pll_x_clk;
+	}
+
+	emc_clk = clk_get_sys("cpu", "emc");
+	if (IS_ERR(emc_clk)) {
+		clk = emc_clk;
+		goto put_pll_p_clk;
+	}
+
+	cpu0->hw.init = &init;
+
+	init.name = "cpu0";
+	init.ops = &clk_cpu0_ops;
+	init.flags = CLK_IS_ROOT | CLK_GET_RATE_NOCACHE;
+	init.num_parents = 0;
+
+	clk
[PATCH 0/6] Tegra: Use cpufreq-cpu0 driver
Hi Stephen,

This is the first attempt to get rid of the tegra-cpufreq driver. This patchset tries to add the supporting infrastructure for Tegra to use the cpufreq-cpu0 driver.

I don't have hardware to test it, so it is compile-tested only. A few bits may be missing, as I couldn't think of all aspects, so I may need your help getting them fixed.

Once this is tested by you, I would like to take it through my ARM cpufreq tree if nobody else has a problem with it.

Thanks

--
Viresh.

Viresh Kumar (6):
  clk: Tegra: Add CPU0 clock driver
  ARM: Tegra: Add CPU's OPPs for using cpufreq-cpu0 driver
  ARM: Tegra: Enable OPP library
  ARM: Tegra: defconfig: select cpufreq-cpu0 driver
  ARM: Tegra: start using cpufreq-cpu0 driver
  cpufreq: Tegra: Remove tegra-cpufreq driver

 arch/arm/boot/dts/tegra114.dtsi  |  12 ++
 arch/arm/boot/dts/tegra20.dtsi   |  12 ++
 arch/arm/boot/dts/tegra30.dtsi   |  12 ++
 arch/arm/configs/tegra_defconfig |   1 +
 arch/arm/mach-tegra/Kconfig      |   2 +
 arch/arm/mach-tegra/tegra.c      |   2 +
 drivers/clk/tegra/Makefile       |   1 +
 drivers/clk/tegra/clk-cpu.c      | 164 ++
 drivers/clk/tegra/clk-tegra30.c  |   4 +
 drivers/cpufreq/Kconfig.arm      |   8 --
 drivers/cpufreq/Makefile         |   1 -
 drivers/cpufreq/tegra-cpufreq.c  | 291 ---
 include/linux/clk/tegra.h        |   1 +
 13 files changed, 211 insertions(+), 300 deletions(-)
 create mode 100644 drivers/clk/tegra/clk-cpu.c
 delete mode 100644 drivers/cpufreq/tegra-cpufreq.c

-- 
1.7.12.rc2.18.g61b472e
Re: [PATCH v2] [SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal
Roland Dreier wrote:

From: Roland Dreier

There is a nasty bug in the SCSI SG_IO ioctl that in some circumstances leads to one process writing data into the address space of some other random unrelated process if the ioctl is interrupted by a signal. What happens is the following:

 - A process issues an SG_IO ioctl with direction DXFER_FROM_DEV (ie the
   underlying SCSI command will transfer data from the SCSI device to
   the buffer provided in the ioctl)

 - Before the command finishes, a signal is sent to the process waiting
   in the ioctl. This will end up waking up the sg_ioctl() code:

	result = wait_event_interruptible(sfp->read_wait,
			(srp_done(sfp, srp) || sdp->detached));

   but neither srp_done() nor sdp->detached is true, so we end up just
   setting srp->orphan and returning to userspace:

	srp->orphan = 1;
	write_unlock_irq(&sfp->rq_list_lock);
	return result;	/* -ERESTARTSYS because signal hit process */

   At this point the original process is done with the ioctl and
   blithely goes ahead handling the signal, reissuing the ioctl, etc.

 - Eventually, the SCSI command issued by the first ioctl finishes and
   ends up in sg_rq_end_io(). At the end of that function, we run
   through:

	write_lock_irqsave(&sfp->rq_list_lock, iflags);
	if (unlikely(srp->orphan)) {
		if (sfp->keep_orphan)
			srp->sg_io_owned = 0;
		else
			done = 0;
	}
	srp->done = done;
	write_unlock_irqrestore(&sfp->rq_list_lock, iflags);

	if (likely(done)) {
		/* Now wake up any sg_read() that is waiting for this
		 * packet.
		 */
		wake_up_interruptible(&sfp->read_wait);
		kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN);
		kref_put(&sfp->f_ref, sg_remove_sfp);
	} else {
		INIT_WORK(&srp->ew.work, sg_rq_end_io_usercontext);
		schedule_work(&srp->ew.work);
	}

   Since srp->orphan *is* set, we set done to 0 (assuming the
   userspace app has not set keep_orphan via an SG_SET_KEEP_ORPHAN
   ioctl), and therefore we end up scheduling
   sg_rq_end_io_usercontext() to run in a workqueue.

 - In workqueue context we go through sg_rq_end_io_usercontext() ->
   sg_finish_rem_req() -> blk_rq_unmap_user() -> ... ->
   bio_uncopy_user() -> __bio_copy_iov() -> copy_to_user().

   The key point here is that we are doing copy_to_user() on a
   workqueue -- that is, we're on a kernel thread with current->mm
   equal to whatever random previous user process was scheduled before
   this kernel thread. So we end up copying whatever data the SCSI
   command returned to the virtual address of the buffer passed into
   the original ioctl, but it's quite likely we do this copying into a
   different address space!

As suggested by James Bottomley, add a check for current->mm (which is NULL if we're on a kernel thread without a real userspace address space) in bio_uncopy_user(), and skip the copy if we're on a kernel thread.

There's no reason that I can think of for any caller of bio_uncopy_user() to want to do copying on a kernel thread with a random active userspace address space.

Huge thanks to Costa Sapuntzakis for the original pointer to this bug in the sg code.

Signed-off-by: Roland Dreier
Cc:
---
 fs/bio.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/fs/bio.c b/fs/bio.c
index 94bbc04..c5eae72 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1045,12 +1045,22 @@ static int __bio_copy_iov(struct bio *bio, struct bio_vec *iovecs,
 int bio_uncopy_user(struct bio *bio)
 {
 	struct bio_map_data *bmd = bio->bi_private;
-	int ret = 0;
+	struct bio_vec *bvec;
+	int ret = 0, i;

-	if (!bio_flagged(bio, BIO_NULL_MAPPED))
-		ret = __bio_copy_iov(bio, bmd->iovecs, bmd->sgvecs,
-				     bmd->nr_sgvecs, bio_data_dir(bio) == READ,
-				     0, bmd->is_our_pages);
+	if (!bio_flagged(bio, BIO_NULL_MAPPED)) {
+		/*
+		 * if we're in a workqueue, the request is orphaned, so
+		 * don't copy into a random user address space, just free.
+		 */
+		if (current->mm)
+			ret = __bio_copy_iov(bio, bmd->iovecs, bmd->sgvecs,
+					     bmd->nr_sgvecs, bio_data_dir(bio) == READ,
+					     0, bmd->is_our_pages);
+		else if (bmd->is_our_pages)
+			bio_for_each_segment_all(bvec, bio, i)
+				__free_page(bvec->bv_page);
+	}
 	bio_free_map_data(bmd);
 	bio_put(bio);
 	return ret;

Hi Roland,

I was able to successfully test this patch overnight, I had been experimenting with the sg driver setting the
Re: [PATCH 1/3] memcg: limit the number of thresholds per-memcg
On Wed 07-08-13 09:58:18, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 07, 2013 at 03:46:54PM +0200, Michal Hocko wrote:
> > OK, I have obviously misunderstood your concern mentioned in the other
> > email. Could you be more specific what is the DoS scenario which was
> > your concern, then?
>
> So, let's say the file is write-accessible to !priv user which is
> under reasonable resource limits. Normally this shouldn't affect priv
> system tools which are monitoring the same event as it shouldn't be
> able to deplete resources as long as the resource control mechanisms
> are configured and functioning properly; however, the memory usage
> event puts all event listeners into a single contiguous table which a
> !priv user can easily expand to a size where the table can no longer
> be enlarged and if a priv system tool or another user tries to
> register event afterwards, it'll fail. IOW, it creates a shared
> resource which isn't properly provisioned and can be trivially filled
> up making it an easy DoS target.

OK, got your point. You are right and I haven't considered the size of the table and the size restrictions of kmalloc. Thanks for pointing this out!

---
From cde8a296eddd288780e78803610127401b6a Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Wed, 7 Aug 2013 11:11:22 +0200
Subject: [PATCH] memcg: limit the number of thresholds per-memcg

There is no limit for the maximum number of threshold events registered per memcg. It is even worse that all the events are stored in a per-memcg table which is enlarged when a new event is registered. This can lead to the following issue mentioned by Tejun:
"
So, let's say the file is write-accessible to !priv user which is under reasonable resource limits. Normally this shouldn't affect priv system tools which are monitoring the same event as it shouldn't be able to deplete resources as long as the resource control mechanisms are configured and functioning properly; however, the memory usage event puts all event listeners into a single contiguous table which a !priv user can easily expand to a size where the table can no longer be enlarged and if a priv system tool or another user tries to register event afterwards, it'll fail. IOW, it creates a shared resource which isn't properly provisioned and can be trivially filled up making it an easy DoS target.
"

Let's be more strict and cap the number of events that might be registered. The MAX_THRESHOLD_EVENTS value is more or less random. The expectation is that it should be high enough to cover reasonable use cases while not so high as to allow excessive resource consumption. 1024 events consume something like 16KB, which shouldn't be a big deal, and it should be good enough.

Reported-by: Tejun Heo
Signed-off-by: Michal Hocko
---
 mm/memcontrol.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4330cd..8247db3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5401,6 +5401,9 @@ static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
 		mem_cgroup_oom_notify_cb(iter);
 }

+/* Maximum number of threshold events registered per memcg. */
+#define MAX_THRESHOLD_EVENTS	1024
+
 static int mem_cgroup_usage_register_event(struct cgroup *cgrp,
 	struct cftype *cft, struct eventfd_ctx *eventfd, const char *args)
 {
@@ -5424,6 +5427,11 @@ static int mem_cgroup_usage_register_event(struct cgroup *cgrp,
 	else
 		BUG();

+	if (thresholds->primary->size == MAX_THRESHOLD_EVENTS) {
+		ret = -ENOSPC;
+		goto unlock;
+	}
+
 	usage = mem_cgroup_usage(memcg, type == _MEMSWAP);

 	/* Check if a threshold crossed before adding a new one */
-- 
1.7.10.4

-- 
Michal Hocko
SUSE Labs
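To make the idea behind the cap concrete, here is a minimal user-space sketch, not the memcg code itself. It assumes a hypothetical `struct threshold_table` that, like the per-memcg table discussed above, keeps every registered event in one contiguous array that is regrown on each registration. `MAX_EVENTS` and `register_threshold()` are names invented for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical cap, playing the role of MAX_THRESHOLD_EVENTS in the patch. */
#define MAX_EVENTS 1024

/* Hypothetical stand-in for the per-memcg threshold table. */
struct threshold_table {
	size_t size;             /* number of registered thresholds */
	unsigned long *entries;  /* one contiguous array, regrown on register */
};

/* Register one threshold; returns 0 on success or a negative errno value. */
static int register_threshold(struct threshold_table *t, unsigned long value)
{
	unsigned long *grown;

	assert(t != NULL);

	/* Refuse to grow past the cap rather than fail later in the allocator. */
	if (t->size >= MAX_EVENTS)
		return -ENOSPC;

	grown = realloc(t->entries, (t->size + 1) * sizeof(*grown));
	if (!grown)
		return -ENOMEM;

	grown[t->size] = value;
	t->entries = grown;
	t->size++;
	return 0;
}
```

With a cap in place, a caller that fills the table gets a well-defined -ENOSPC. The table is still a shared resource, as noted in the thread, so the cap bounds the damage but does not stop one user from using up all the slots.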
Re: [PATCH] i2c-designware: Manually set RESTART bit between messages
On Fri, Jun 21, 2013 at 03:05:28PM +0800, Chew Chiau Ee wrote:
> From: Chew, Chiau Ee
>
> If both IC_EMPTYFIFO_HOLD_MASTER_EN and IC_RESTART_EN are set to 1, the
> Designware I2C controller doesn't generate RESTART unless user specifically
> requests it by setting RESTART bit in IC_DATA_CMD register.
>
> Since IC_EMPTYFIFO_HOLD_MASTER_EN setting can't be detected from hardware
> register, we must always manually set the restart bit between messages.
>
> Signed-off-by: Chew, Chiau Ee

Applied to for-next, thanks!
Re: [patch v2 2/3] mm: page_alloc: rearrange watermark checking in get_page_from_freelist
On Fri, Aug 02, 2013 at 11:37:25AM -0400, Johannes Weiner wrote:
> Allocations that do not have to respect the watermarks are rare
> high-priority events. Reorder the code such that per-zone dirty
> limits and future checks important only to regular page allocations
> are ignored in these extraordinary situations.
>
> Signed-off-by: Johannes Weiner
> Reviewed-by: Rik van Riel

Acked-by: Mel Gorman

-- 
Mel Gorman
SUSE Labs
Re: [patch v2 1/3] mm: vmscan: fix numa reclaim balance problem in kswapd
On Fri, Aug 02, 2013 at 11:37:24AM -0400, Johannes Weiner wrote:
> When the page allocator fails to get a page from all zones in its
> given zonelist, it wakes up the per-node kswapds for all zones that
> are at their low watermark.
>
> However, with a system under load the free pages in a zone can
> fluctuate enough that the allocation fails but the kswapd wakeup is
> also skipped while the zone is still really close to the low
> watermark.
>
> When one node misses a wakeup like this, it won't be aged before all
> the other node's zones are down to their low watermarks again. And
> skipping a full aging cycle is an obvious fairness problem.
>
> Kswapd runs until the high watermarks are restored, so it should also
> be woken when the high watermarks are not met. This ages nodes more
> equally and creates a safety margin for the page counter fluctuation.
>
> By using zone_balanced(), it will now check, in addition to the
> watermark, if compaction requires more order-0 pages to create a
> higher order page.
>
> Signed-off-by: Johannes Weiner
> Reviewed-by: Rik van Riel

Acked-by: Mel Gorman

-- 
Mel Gorman
SUSE Labs
Re: [PATCH RFC v3 2/5] dma: mpc512x: add support for peripheral transfers
2013/8/3 Gerhard Sittig :
> On Wed, Jul 31, 2013 at 11:21 +0400, Alexander Popov wrote:
>>
> You don't provide a lot of information to those you want to
> receive feedback from. You should keep a history and list the
> changes between versions. And you may want to somehow link this
> v3 to its predecessor -- especially when you only send part of
> the series and assume that reviewers may know where to find the
> remainder.
>
> Please help those persons you want to get help from.

Thanks. Now I see how to collaborate via mailing lists properly.

> I think it's unfortunate to attribute the "will access
> peripheral" to the channel instead of the transfer job, and to
> set the flag from within the device control callback, and to
> never clear the flag (what will happen if a channel gets freed
> and reallocated by some other client?).
>
> I think that the peripheral access is an attribute of the
> transfer job, and should be setup in the prep routines (both set
> and cleared, depending on what gets setup). This would be more
> robust and more readable (read: maintainable) in my eyes.

Yes. I agree, I will implement it and offer differences from RFC v2 in the initial topic.

Best regards,
Alexander.
[PATCH] ARM: fix wrong address when loading PRM_FRAC_INCREMENTOR_DENUMERATOR_RELOAD
The denominator should be loaded from INCREMENTER_DENUMERATOR_RELOAD_OFFSET rather than INCREMENTER_NUMERATOR_OFFSET. This is most likely a typo, since INCREMENTER_DENUMERATOR_RELOAD[23:17] is reserved. It seems that it won't cause much trouble without this fix, because the useful [11:0] bits are masked and then set to the right value. Anyway, reading from the right address is the better choice.

Signed-off-by: Chen Baozi
---
 arch/arm/mach-omap2/timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/mach-omap2/timer.c b/arch/arm/mach-omap2/timer.c
index 1e77f11..ccc5c72 100644
--- a/arch/arm/mach-omap2/timer.c
+++ b/arch/arm/mach-omap2/timer.c
@@ -537,7 +537,7 @@ static void __init realtime_counter_init(void)
 	reg |= num;
 	__raw_writel(reg, base + INCREMENTER_NUMERATOR_OFFSET);

-	reg = __raw_readl(base + INCREMENTER_NUMERATOR_OFFSET) &
+	reg = __raw_readl(base + INCREMENTER_DENUMERATOR_RELOAD_OFFSET) &
 			NUMERATOR_DENUMERATOR_MASK;
 	reg |= den;
 	__raw_writel(reg, base + INCREMENTER_DENUMERATOR_RELOAD_OFFSET);
-- 
1.8.1.4
Re: [PATCH RFC v3 1/5] dma: mpc512x: reorder mpc8308 specific instructions
2013/8/3 Gerhard Sittig :
> On Wed, Jul 31, 2013 at 11:20 +0400, Alexander Popov wrote:
>>
> Please make sure to either cite
> properly or to properly mark changes as such. Don't spread false
> information, please. You are free to change what I submitted,
> but you should not pretend that I wrote what has become of the
> code after you have modified it. Please fix the attribution.

Excuse me for the confusion. I'll be careful with "From:" notes.

> Just to clarify: The defines here appear to be more appropriate
> than the initial enums, after it turned out that we need not
> handle individual channels in special ways, and really only need
> these three numbers (one of them being the maximum of the
> others). But regardless of what you have changed, you should
> clearly state the fact.

Ok, I'll do so.

Best regards,
Alexander.
Re: [PATCH] KVM: MMU: fix check the reserved bits on the gpte of L2
On 08/05/2013 06:59 AM, Xiao Guangrong wrote:

Current code always uses arch.mmu to check the reserved bits on the guest gpte, which is valid only for the L1 guest; we should use arch.nested_mmu instead when we translate gva to gpa for the L2 guest.

Fix it by using @mmu instead, since it is adapted to the current mmu mode automatically.

The bug can be triggered when nested npt is used and the L1 guest and L2 guest use different mmu modes.

Reported-by: Jan Kiszka
Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/paging_tmpl.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 7769699..3a75828 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -218,8 +218,7 @@ retry_walk:
 		if (unlikely(!is_present_gpte(pte)))
 			goto error;

-		if (unlikely(is_rsvd_bits_set(&vcpu->arch.mmu, pte,
-					      walker->level))) {
+		if (unlikely(is_rsvd_bits_set(mmu, pte, walker->level))) {
 			errcode |= PFERR_RSVD_MASK | PFERR_PRESENT_MASK;
 			goto error;
 		}

Applied, thanks.

Paolo
Re: [PATCH v3] xfs: introduce object readahead to log recovery
Hi, xfs maintainers, any comments?

On Wed, Jul 31, 2013 at 4:42 PM, wrote:
> From: Zhi Yong Wu
>
> It can take a long time to run log recovery operation because it is
> single threaded and is bound by read latency. We can find that it took
> most of the time to wait for the read IO to occur, so if one object
> readahead is introduced to log recovery, it will obviously reduce the
> log recovery time.
>
> Log recovery time stat:
>
>           w/o this patch        w/ this patch
>
> real:        0m15.023s             0m7.802s
> user:        0m0.001s              0m0.001s
> sys:         0m0.246s              0m0.107s
>
> Signed-off-by: Zhi Yong Wu
> ---
>  fs/xfs/xfs_log_recover.c | 159 +--
>  1 file changed, 153 insertions(+), 6 deletions(-)
>
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 7681b19..ebb00bc 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3116,6 +3116,106 @@ xlog_recover_free_trans(
>  	kmem_free(trans);
>  }
>
> +STATIC void
> +xlog_recover_buffer_ra_pass2(
> +	struct xlog			*log,
> +	struct xlog_recover_item	*item)
> +{
> +	struct xfs_buf_log_format	*buf_f = item->ri_buf[0].i_addr;
> +	struct xfs_mount		*mp = log->l_mp;
> +
> +	if (xlog_check_buffer_cancelled(log, buf_f->blf_blkno,
> +			buf_f->blf_len, buf_f->blf_flags)) {
> +		return;
> +	}
> +
> +	xfs_buf_readahead(mp->m_ddev_targp, buf_f->blf_blkno,
> +				buf_f->blf_len, NULL);
> +}
> +
> +STATIC void
> +xlog_recover_inode_ra_pass2(
> +	struct xlog			*log,
> +	struct xlog_recover_item	*item)
> +{
> +	struct xfs_inode_log_format	ilf_buf;
> +	struct xfs_inode_log_format	*ilfp;
> +	struct xfs_mount		*mp = log->l_mp;
> +	int				error;
> +
> +	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> +		ilfp = item->ri_buf[0].i_addr;
> +	} else {
> +		ilfp = &ilf_buf;
> +		memset(ilfp, 0, sizeof(*ilfp));
> +		error = xfs_inode_item_format_convert(&item->ri_buf[0], ilfp);
> +		if (error)
> +			return;
> +	}
> +
> +	if (xlog_check_buffer_cancelled(log, ilfp->ilf_blkno, ilfp->ilf_len, 0))
> +		return;
> +
> +	xfs_buf_readahead(mp->m_ddev_targp, ilfp->ilf_blkno,
> +				ilfp->ilf_len, &xfs_inode_buf_ops);
> +}
> +
> +STATIC void
> +xlog_recover_dquot_ra_pass2(
> +	struct xlog			*log,
> +	struct xlog_recover_item	*item)
> +{
> +	struct xfs_mount	*mp = log->l_mp;
> +	struct xfs_disk_dquot	*recddq;
> +	struct xfs_dq_logformat	*dq_f;
> +	uint			type;
> +
> +
> +	if (mp->m_qflags == 0)
> +		return;
> +
> +	recddq = item->ri_buf[1].i_addr;
> +	if (recddq == NULL)
> +		return;
> +	if (item->ri_buf[1].i_len < sizeof(struct xfs_disk_dquot))
> +		return;
> +
> +	type = recddq->d_flags & (XFS_DQ_USER | XFS_DQ_PROJ | XFS_DQ_GROUP);
> +	ASSERT(type);
> +	if (log->l_quotaoffs_flag & type)
> +		return;
> +
> +	dq_f = item->ri_buf[0].i_addr;
> +	ASSERT(dq_f);
> +	ASSERT(dq_f->qlf_len == 1);
> +
> +	xfs_buf_readahead(mp->m_ddev_targp, dq_f->qlf_blkno,
> +				dq_f->qlf_len, NULL);
> +}
> +
> +STATIC void
> +xlog_recover_ra_pass2(
> +	struct xlog			*log,
> +	struct xlog_recover_item	*item)
> +{
> +	switch (ITEM_TYPE(item)) {
> +	case XFS_LI_BUF:
> +		xlog_recover_buffer_ra_pass2(log, item);
> +		break;
> +	case XFS_LI_INODE:
> +		xlog_recover_inode_ra_pass2(log, item);
> +		break;
> +	case XFS_LI_DQUOT:
> +		xlog_recover_dquot_ra_pass2(log, item);
> +		break;
> +	case XFS_LI_EFI:
> +	case XFS_LI_EFD:
> +	case XFS_LI_QUOTAOFF:
> +	default:
> +		break;
> +	}
> +}
> +
>  STATIC int
>  xlog_recover_commit_pass1(
>  	struct xlog			*log,
> @@ -3177,6 +3277,26 @@ xlog_recover_commit_pass2(
>  	}
>  }
>
> +STATIC int
> +xlog_recover_items_pass2(
> +	struct xlog			*log,
> +	struct xlog_recover		*trans,
> +	struct list_head		*buffer_list,
> +	struct list_head		*item_list)
> +{
> +	struct xlog_recover_item	*item;
> +	int				error = 0;
> +
> +	list_for_each_entry(item, item_list, ri_list) {
> +		error = xlog_recover_commit_pass2(log, trans,
> +
Re: [PATCH 2/3] memcg: Limit the number of events registered on oom_control
On Wed, Aug 07, 2013 at 03:57:34PM +0200, Michal Hocko wrote:
> On Wed 07-08-13 09:47:41, Tejun Heo wrote:
> > Hello,
> >
> > On Wed, Aug 07, 2013 at 03:37:46PM +0200, Michal Hocko wrote:
> > > > It isn't different from listening from epoll, for example.
> > >
> > > epoll limits the number of watchers, no?
> >
> > Not that I know of. It'll be limited by max open fds but I don't
> > think there are other limits.
>
> max_user_watches seems to be a limit (4% of lowmem in maximum).

That's per *user* not per event source. The problem here is creating a global (across security domains) resource shared by all users.

-- 
tejun
Re: [PATCH 01/22] ARM: ux500: Remove '0x's from HREF v60+ DTS file
On Wed, 07 Aug 2013, Linus Walleij wrote:

> On Mon, Jul 22, 2013 at 12:52 PM, Lee Jones wrote:
>
> > Signed-off-by: Lee Jones
>
> None of these patches apply since I applied your other patch series
> that rename all the files ... can you respin the ux500 "0x"-strip patches
> on top of the rename set? My ux500-devicetree branch can be used
> as a baseline.

I can do that.

Although, would you prefer that I fixed-up my renaming patches, then applied the 0x patches on top instead?

-- 
Lee Jones
Linaro ST-Ericsson Landing Team Lead
Linaro.org │ Open source software for ARM SoCs
Re: [PATCH 1/3] memcg: limit the number of thresholds per-memcg
Hello,

On Wed, Aug 07, 2013 at 03:46:54PM +0200, Michal Hocko wrote:
> OK, I have obviously misunderstood your concern mentioned in the other
> email. Could you be more specific what is the DoS scenario which was
> your concern, then?

So, let's say the file is write-accessible to !priv user which is under reasonable resource limits. Normally this shouldn't affect priv system tools which are monitoring the same event as it shouldn't be able to deplete resources as long as the resource control mechanisms are configured and functioning properly; however, the memory usage event puts all event listeners into a single contiguous table which a !priv user can easily expand to a size where the table can no longer be enlarged and if a priv system tool or another user tries to register event afterwards, it'll fail. IOW, it creates a shared resource which isn't properly provisioned and can be trivially filled up making it an easy DoS target.

Putting an extra limit on it isn't an actual solution but could be better, I think. It at least makes it clear that this is a limited global resource.

Thanks.

-- 
tejun
Re: [PATCH 2/3] memcg: Limit the number of events registered on oom_control
On Wed 07-08-13 09:47:41, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 07, 2013 at 03:37:46PM +0200, Michal Hocko wrote:
> > > It isn't different from listening from epoll, for example.
> >
> > epoll limits the number of watchers, no?
>
> Not that I know of. It'll be limited by max open fds but I don't
> think there are other limits.

max_user_watches seems to be a limit (4% of lowmem in maximum).

> Why would there be?

Because userspace shouldn't hog kernel resources without any limit.

> > > If there needs to be kernel memory limit, shouldn't that be handled by
> > > kmemcg?
> >
> > kmemcg would surely help but turning it on just because of potential
> > abuse of the event registration API sounds like an overkill.
> >
> > I think having a cap for user trigable kernel resources is a good thing
> > in general.
>
> I don't know. It's just very arbitrary because listening to events
> itself isn't (and shouldn't) be something which consumes resource
> which isn't attributed to the listener and this artificially creates a
> global resource. The problem with memory usage event is breaching
> that rule with shared kmalloc() so putting well-defined limit on it is
> fine but the latter two create additional artificial restrictions
> which are both unnecessary and unconventional. No?

Hmm, OK so you think that the fd limit is sufficient already?

-- 
Michal Hocko
SUSE Labs
Re: perf,arm -- oops in validate_event
On Wed, Aug 07, 2013 at 02:00:27PM +0100, Will Deacon wrote:
> On Tue, Aug 06, 2013 at 02:08:15PM +0100, Mark Rutland wrote:
> > On Tue, Aug 06, 2013 at 12:59:21PM +0100, Will Deacon wrote:
> > > But we already check `event->pmu != leader_pmu' in validate_event, so we
> > > shouldn't get anywhere nearer calling get_event_idx in the case you
> > > describe. It sounds more like we have an inconsistency with one of the
> > > events.
> >
> > Note in my example that the software event was the group leader (so in
> > fact we'd *only* be checking those events which we can't actually
> > handle...).
> >
> > I was also under the impression that in the case of mixed hardware and
> > software events, a hardware event must be the group leader. That
> > doesn't seem to be the case. If a hardware event is added to a software
> > group, the group is moved to hardware context but the original software
> > event stays as the group leader.
>
> Ok, so the following quick hack below should solve the issue (can you confirm
> it please, since I don't have access to any hardware atm?)

It works for me when running Vince's test case.

Tested-by: Mark Rutland

>
> We should revisit this for 3.12 though, because I'm not sure that our
> validation code even does the right thing when there are multiple PMUs
> involved.

Certainly. I suspect we're not alone there.

Thanks,
Mark.

>
> Will
>
> --->8
>
> diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
> index d9f5cd4..0500f10b 100644
> --- a/arch/arm/kernel/perf_event.c
> +++ b/arch/arm/kernel/perf_event.c
> @@ -253,6 +253,9 @@ validate_event(struct pmu_hw_events *hw_events,
>  	struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
>  	struct pmu *leader_pmu = event->group_leader->pmu;
>
> +	if (is_software_event(event))
> +		return 1;
> +
>  	if (event->pmu != leader_pmu || event->state < PERF_EVENT_STATE_OFF)
>  		return 1;
Re: [PATCH 05/18] ARM: integrator: Switch to sched_clock_register()
On Thu, Aug 1, 2013 at 12:31 AM, Stephen Boyd wrote:

> The 32 bit sched_clock interface now supports 64 bits. Upgrade to
> the 64 bit function to allow us to remove the 32 bit registration
> interface.
>
> Cc: Linus Walleij
> Signed-off-by: Stephen Boyd

For this patch (given the idea is accepted)
Acked-by: Linus Walleij

Yours,
Linus Walleij
RE: [PATCH] Tools: hv: use full nlmsghdr in netlink_send
> -----Original Message-----
> From: Olaf Hering [mailto:o...@aepfle.de]
> Sent: Wednesday, August 07, 2013 9:45 AM
> To: KY Srinivasan; gre...@linuxfoundation.org
> Cc: linux-kernel@vger.kernel.org; Olaf Hering
> Subject: [PATCH] Tools: hv: use full nlmsghdr in netlink_send
>
> There is no need to have a nlmsghdr pointer to another temporary buffer.
> Instead use a full struct nlmsghdr.
>
> Signed-off-by: Olaf Hering

Signed-off-by: K. Y. Srinivasan

> ---
>  tools/hv/hv_kvp_daemon.c | 15 +++++----------
>  tools/hv/hv_vss_daemon.c | 15 +++++----------
>  2 files changed, 10 insertions(+), 20 deletions(-)
>
> diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
> index 1bd1ad1..7c05f55 100644
> --- a/tools/hv/hv_kvp_daemon.c
> +++ b/tools/hv/hv_kvp_daemon.c
> @@ -1393,23 +1393,18 @@ kvp_get_domain_name(char *buffer, int length)
>  static int
>  netlink_send(int fd, struct cn_msg *msg)
>  {
> -	struct nlmsghdr *nlh;
> +	struct nlmsghdr nlh = { .nlmsg_type = NLMSG_DONE };
>  	unsigned int size;
>  	struct msghdr message;
> -	char buffer[64];
>  	struct iovec iov[2];
>
>  	size = sizeof(struct cn_msg) + msg->len;
>
> -	nlh = (struct nlmsghdr *)buffer;
> -	nlh->nlmsg_seq = 0;
> -	nlh->nlmsg_pid = getpid();
> -	nlh->nlmsg_type = NLMSG_DONE;
> -	nlh->nlmsg_len = NLMSG_LENGTH(size - sizeof(*nlh));
> -	nlh->nlmsg_flags = 0;
> +	nlh.nlmsg_pid = getpid();
> +	nlh.nlmsg_len = NLMSG_LENGTH(size);
>
> -	iov[0].iov_base = nlh;
> -	iov[0].iov_len = sizeof(*nlh);
> +	iov[0].iov_base = &nlh;
> +	iov[0].iov_len = sizeof(nlh);
>
>  	iov[1].iov_base = msg;
>  	iov[1].iov_len = size;
> diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
> index 2f1f53f..8ac0ee7 100644
> --- a/tools/hv/hv_vss_daemon.c
> +++ b/tools/hv/hv_vss_daemon.c
> @@ -105,23 +105,18 @@ static int vss_operate(int operation)
>
>  static int netlink_send(int fd, struct cn_msg *msg)
>  {
> -	struct nlmsghdr *nlh;
> +	struct nlmsghdr nlh = { .nlmsg_type = NLMSG_DONE };
>  	unsigned int size;
>  	struct msghdr message;
> -	char buffer[64];
>  	struct iovec iov[2];
>
>  	size = sizeof(struct cn_msg) + msg->len;
>
> -	nlh = (struct nlmsghdr *)buffer;
> -	nlh->nlmsg_seq = 0;
> -	nlh->nlmsg_pid = getpid();
> -	nlh->nlmsg_type = NLMSG_DONE;
> -	nlh->nlmsg_len = NLMSG_LENGTH(size - sizeof(*nlh));
> -	nlh->nlmsg_flags = 0;
> +	nlh.nlmsg_pid = getpid();
> +	nlh.nlmsg_len = NLMSG_LENGTH(size);
>
> -	iov[0].iov_base = nlh;
> -	iov[0].iov_len = sizeof(*nlh);
> +	iov[0].iov_base = &nlh;
> +	iov[0].iov_len = sizeof(nlh);
>
>  	iov[1].iov_base = msg;
>  	iov[1].iov_len = size;
Re: [PATCH 01/22] ARM: ux500: Remove '0x's from HREF v60+ DTS file
On Mon, Jul 22, 2013 at 12:52 PM, Lee Jones wrote:

> Signed-off-by: Lee Jones

None of these patches apply since I applied your other patch
series that renames all the files ... can you respin the ux500
"0x"-strip patches on top of the rename set?

My ux500-devicetree branch can be used as a baseline.

Yours,
Linus Walleij
Re: [PATCH v3 00/11] Add namespace support for syslog
Quoting Eric W. Biederman (ebied...@xmission.com):
>
> Since this still has not been addressed. I am going to repeat Andrews
> objection again.
>
> Isn't there a better way to get iptables information out than to use
> syslog. I did not have time to follow up on that but it did appear that

Bruno suggested NFLOG target + ulogd. That's not ideal, but doable.
At least each container should be able to do that for itself. What
it won't do is let a host admin make sure that he doesn't get corrupted
syslog entries when partial-lines get sent from several containers and
the kernel and randomly spliced together. It also would simply be
better if the information was *always* sent to userspace instead of
syslog.

> someone did have a better way to get the information out.
>
> Essentially the argument against this goes. The kernel logging facility
> is really not a particularly good tool to be using for anything other
> than kernel debugging information, and there appear to be no substantial
> uses for a separate syslog that should not be done in other ways.
>
> That design objection must be addressed before merging this code can be
> given serious consideration.
>
> Eric
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
Re: [PATCH 13/13] ARM: ux500: Remove u9540.dts as it's been replaced
On Fri, Jul 19, 2013 at 4:13 PM, Lee Jones wrote:

> This must have been a merge error. There was a patch which renamed the
> u9540.dts to ccu9540.dts, however the u9540.dts was reincarnated with
> the same patches which created it in the first place. Let's kill it
> once and for all.
>
> Signed-off-by: Lee Jones

I applied all the rename patches but it appears they were never
really tested, so I made this patch fixing all the bugs
they introduced. (Quicker than iterating the patch set.)

Yours,
Linus Walleij
Re: [PATCH 2/3] memcg: Limit the number of events registered on oom_control
Hello,

On Wed, Aug 07, 2013 at 03:37:46PM +0200, Michal Hocko wrote:
> > It isn't different from listening from epoll, for example.
>
> epoll limits the number of watchers, no?

Not that I know of. It'll be limited by max open fds but I don't
think there are other limits. Why would there be?

> > If there needs to be kernel memory limit, shouldn't that be handled by
> > kmemcg?
>
> kmemcg would surely help but turning it on just because of potential
> abuse of the event registration API sounds like an overkill.
>
> I think having a cap for user triggerable kernel resources is a good thing
> in general.

I don't know. It's just very arbitrary because listening to events
itself isn't (and shouldn't) be something which consumes resource
which isn't attributed to the listener and this artificially creates a
global resource. The problem with memory usage event is breaching
that rule with shared kmalloc() so putting well-defined limit on it is
fine but the latter two create additional artificial restrictions
which are both unnecessary and unconventional. No?

Thanks.

--
tejun
Re: [PATCH 1/3] memcg: limit the number of thresholds per-memcg
On Wed 07-08-13 09:22:10, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 07, 2013 at 01:28:25PM +0200, Michal Hocko wrote:
> > There is no limit for the maximum number of threshold events registered
> > per memcg. This might lead to an user triggered memory depletion if a
> > regular user is allowed to register on memory.[memsw.]usage_in_bytes
> > eventfd interface.
> >
> > Let's be more strict and cap the number of events that might be
> > registered. MAX_THRESHOLD_EVENTS value is more or less random. The
> > expectation is that it should be high enough to cover reasonable
> > usecases while not too high to allow excessive resources consumption.
> > 1024 events consume something like 16KB which shouldn't be a big deal
> > and it should be good enough.
>
> I don't think the memory consumption per-se is the issue to be handled
> here (as kernel memory consumption is a different generic problem) but
> rather that all listeners, regardless of their priv level, cgroup
> membership and so on, end up contributing to this single shared
> contiguous table,

The table is per-memcg but you are right that everybody who has file
write access to the particular group's usage file can register to it.

> which makes it quite easy to do DoS attack on it if
> the event control is actually delegated to untrusted security domain,

OK, I have obviously misunderstood your concern mentioned in the other
email. Could you be more specific what is the DoS scenario which was your
concern, then?

[...]
> Can you please update the patch description to reflect the actual
> problem?

As soon as I understand what is your concern ;)
--
Michal Hocko
SUSE Labs
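The cap discussed in this thread can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the memcg code: struct threshold_table and register_threshold() are hypothetical names, the table is a plain array rather than the kernel's structure, and returning -ENOSPC once the cap is hit is an assumption (the thread does not say which error the patch returns). Only the MAX_THRESHOLD_EVENTS name and the value 1024 come from the patch description.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Cap named in the patch description; 1024 is the value quoted there. */
#define MAX_THRESHOLD_EVENTS 1024

/* Hypothetical stand-in for a per-memcg threshold table. */
struct threshold_table {
	size_t count;	/* thresholds currently registered */
	unsigned long long thresholds[MAX_THRESHOLD_EVENTS];
};

/*
 * Register one threshold. Once the table holds MAX_THRESHOLD_EVENTS
 * entries, further registrations are refused (error code is an assumption).
 */
int register_threshold(struct threshold_table *t, unsigned long long bytes)
{
	assert(t != NULL);

	if (t->count >= MAX_THRESHOLD_EVENTS)
		return -ENOSPC;

	t->thresholds[t->count++] = bytes;
	return 0;
}
```

A fixed cap like this bounds the size of the shared table no matter who registers, which is the property Tejun's reply is concerned with; it does not attribute the memory to the registering task.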
[PATCH] Tools: hv: use full nlmsghdr in netlink_send
There is no need to have a nlmsghdr pointer to another temporary buffer.
Instead use a full struct nlmsghdr.

Signed-off-by: Olaf Hering
---
 tools/hv/hv_kvp_daemon.c | 15 +++++----------
 tools/hv/hv_vss_daemon.c | 15 +++++----------
 2 files changed, 10 insertions(+), 20 deletions(-)

diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
index 1bd1ad1..7c05f55 100644
--- a/tools/hv/hv_kvp_daemon.c
+++ b/tools/hv/hv_kvp_daemon.c
@@ -1393,23 +1393,18 @@ kvp_get_domain_name(char *buffer, int length)
 static int
 netlink_send(int fd, struct cn_msg *msg)
 {
-	struct nlmsghdr *nlh;
+	struct nlmsghdr nlh = { .nlmsg_type = NLMSG_DONE };
 	unsigned int size;
 	struct msghdr message;
-	char buffer[64];
 	struct iovec iov[2];
 
 	size = sizeof(struct cn_msg) + msg->len;
 
-	nlh = (struct nlmsghdr *)buffer;
-	nlh->nlmsg_seq = 0;
-	nlh->nlmsg_pid = getpid();
-	nlh->nlmsg_type = NLMSG_DONE;
-	nlh->nlmsg_len = NLMSG_LENGTH(size - sizeof(*nlh));
-	nlh->nlmsg_flags = 0;
+	nlh.nlmsg_pid = getpid();
+	nlh.nlmsg_len = NLMSG_LENGTH(size);
 
-	iov[0].iov_base = nlh;
-	iov[0].iov_len = sizeof(*nlh);
+	iov[0].iov_base = &nlh;
+	iov[0].iov_len = sizeof(nlh);
 
 	iov[1].iov_base = msg;
 	iov[1].iov_len = size;
diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 2f1f53f..8ac0ee7 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -105,23 +105,18 @@ static int vss_operate(int operation)
 
 static int netlink_send(int fd, struct cn_msg *msg)
 {
-	struct nlmsghdr *nlh;
+	struct nlmsghdr nlh = { .nlmsg_type = NLMSG_DONE };
 	unsigned int size;
 	struct msghdr message;
-	char buffer[64];
 	struct iovec iov[2];
 
 	size = sizeof(struct cn_msg) + msg->len;
 
-	nlh = (struct nlmsghdr *)buffer;
-	nlh->nlmsg_seq = 0;
-	nlh->nlmsg_pid = getpid();
-	nlh->nlmsg_type = NLMSG_DONE;
-	nlh->nlmsg_len = NLMSG_LENGTH(size - sizeof(*nlh));
-	nlh->nlmsg_flags = 0;
+	nlh.nlmsg_pid = getpid();
+	nlh.nlmsg_len = NLMSG_LENGTH(size);
 
-	iov[0].iov_base = nlh;
-	iov[0].iov_len = sizeof(*nlh);
+	iov[0].iov_base = &nlh;
+	iov[0].iov_len = sizeof(nlh);
 
 	iov[1].iov_base = msg;
 	iov[1].iov_len = size;
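The header setup that the patch above moves to can be sketched as a standalone userspace helper. This is a minimal sketch and not the daemon code: build_netlink_iov() is a hypothetical name, and the payload is an opaque buffer standing in for struct cn_msg plus its data. It only shows a by-value struct nlmsghdr being filled and described by a two-element iovec; it does not send anything.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/netlink.h>

/*
 * Fill *nlh (owned by the caller, no temporary buffer needed) and point
 * iov[0] at the header and iov[1] at the payload.
 * Returns the total number of bytes the iovec describes.
 */
size_t build_netlink_iov(struct nlmsghdr *nlh, struct iovec iov[2],
			 void *payload, size_t payload_len)
{
	assert(nlh != NULL && iov != NULL);

	memset(nlh, 0, sizeof(*nlh));		/* seq and flags stay 0 */
	nlh->nlmsg_type = NLMSG_DONE;
	nlh->nlmsg_pid = getpid();
	nlh->nlmsg_len = NLMSG_LENGTH(payload_len); /* aligned header + payload */

	iov[0].iov_base = nlh;
	iov[0].iov_len = sizeof(*nlh);
	iov[1].iov_base = payload;
	iov[1].iov_len = payload_len;

	return iov[0].iov_len + iov[1].iov_len;
}
```

In the patch the header lives on the stack of netlink_send() and the resulting iovec is handed to sendmsg(); the sketch takes the header as a parameter only so that the result can be inspected by the caller.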
[PATCH 2/2] cris: fix return type of ffs()
The return type of ffs() is 'int' on all architectures except cris and
hexagon. This unifies the return type to 'int'.

The problem I'm seeing is that the following line generates a warning on
cris and hexagon because of the mismatch between format '%u' and return
type of ffs().

	printk("bits in OOB size: %u\n",ffs(ns->geom.oobsz) - 1);

But removing this warning by casting to 'int' looks odd, so I suggest
unifying the return type of ffs() on all architectures.

Signed-off-by: Akinobu Mita
Reported-by: Fengguang Wu
Cc: Mikael Starvik
Cc: Jesper Nilsson
Cc: linux-cris-ker...@axis.com
Cc: Richard Kuo
Cc: linux-hexa...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
---
 arch/cris/include/arch-v10/arch/bitops.h | 2 +-
 arch/cris/include/arch-v32/arch/bitops.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/cris/include/arch-v10/arch/bitops.h b/arch/cris/include/arch-v10/arch/bitops.h
index 03d9cfd..cc37a22 100644
--- a/arch/cris/include/arch-v10/arch/bitops.h
+++ b/arch/cris/include/arch-v10/arch/bitops.h
@@ -65,7 +65,7 @@ static inline unsigned long __ffs(unsigned long word)
  * differs in spirit from the above ffz (man ffs).
  */
 
-static inline unsigned long kernel_ffs(unsigned long w)
+static inline int kernel_ffs(unsigned long w)
 {
 	return w ? cris_swapwbrlz (w) + 1 : 0;
 }
diff --git a/arch/cris/include/arch-v32/arch/bitops.h b/arch/cris/include/arch-v32/arch/bitops.h
index 147689d6..a5d0963 100644
--- a/arch/cris/include/arch-v32/arch/bitops.h
+++ b/arch/cris/include/arch-v32/arch/bitops.h
@@ -55,7 +55,7 @@ __ffs(unsigned long w)
 /*
  * Find First Bit that is set.
  */
-static inline unsigned long
+static inline int
 kernel_ffs(unsigned long w)
 {
 	return w ? cris_swapwbrlz (w) + 1 : 0;
--
1.8.3.1
[PATCH 1/2] hexagon: fix return type of ffs()
The return type of ffs() is 'int' on all architectures except cris and
hexagon. This unifies the return type to 'int'.

The problem I'm seeing is that the following line generates a warning on
cris and hexagon because of the mismatch between format '%u' and return
type of ffs().

	printk("bits in OOB size: %u\n",ffs(ns->geom.oobsz) - 1);

But removing this warning by casting to 'int' looks odd, so I suggest
unifying the return type of ffs() on all architectures.

Signed-off-by: Akinobu Mita
Reported-by: Fengguang Wu
Cc: Mikael Starvik
Cc: Jesper Nilsson
Cc: linux-cris-ker...@axis.com
Cc: Richard Kuo
Cc: linux-hexa...@vger.kernel.org
Cc: linux-a...@vger.kernel.org
---
This patch is not compile tested yet, because I couldn't find cross
compiler for hexagon.

 arch/hexagon/include/asm/bitops.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/hexagon/include/asm/bitops.h b/arch/hexagon/include/asm/bitops.h
index 9b1e4af..80e34a6 100644
--- a/arch/hexagon/include/asm/bitops.h
+++ b/arch/hexagon/include/asm/bitops.h
@@ -234,7 +234,7 @@ static inline long fls(int x)
  * the libc and compiler builtin ffs routines, therefore
  * differs in spirit from the above ffz (man ffs).
  */
-static inline long ffs(int x)
+static inline int ffs(int x)
 {
 	int r;
 
--
1.8.3.1
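The ffs() contract that both patches rely on (1-based index of the least significant set bit, 0 when no bit is set, returned as 'int') can be shown with a portable stand-in. This is a minimal sketch, not the cris or hexagon implementation: sketch_ffs() and format_oob_bits() are hypothetical names, the loop replaces the architecture-specific instructions, and the format string uses '%d' here simply because the value being printed is an int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Portable stand-in for ffs(): returns the 1-based position of the
 * least significant set bit, or 0 if x is 0. The return type is int,
 * matching what the patches unify on.
 */
int sketch_ffs(unsigned int x)
{
	int pos = 1;

	if (x == 0)
		return 0;

	while ((x & 1u) == 0) {
		x >>= 1;
		pos++;
	}
	return pos;
}

/*
 * Format the message from the patch description into buf.
 * With an int-returning ffs() the argument is a plain int expression,
 * so no cast is needed at the call site.
 */
int format_oob_bits(char *buf, size_t len, unsigned int oobsz)
{
	assert(buf != NULL);
	return snprintf(buf, len, "bits in OOB size: %d\n",
			sketch_ffs(oobsz) - 1);
}
```

For example, an OOB size of 64 has its lowest set bit at position 7, so the formatted value is 6.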
Re: [PATCH 01/17] perf util: Save pid-cmdline mapping into tracing header
On 8/5/13 3:17 AM, Namhyung Kim wrote:
>> I don't think this is a problem, it's in line with Ingo's suggestion of
>> a new perf ioctl to ask the kernel to generate PERF_RECORD_MMAP events
>> for existing threads.
>
> Hmm.. could you please give me a link of the thread?

I believe this is the thread being referred to:
https://lkml.org/lkml/2013/6/25/180

David
Re: [RFC 0/3] Add madvise(..., MADV_WILLWRITE)
On Mon 05-08-13 12:43:58, Andy Lutomirski wrote:
> My application fallocates and mmaps (shared, writable) a lot (several
> GB) of data at startup. Those mappings are mlocked, and they live on
> ext4. The first write to any given page is slow because
> ext4_da_get_block_prep can block. This means that, to get decent
> performance, I need to write something to all of these pages at
> startup. This, in turn, causes a giant IO storm as several GB of
> zeros get pointlessly written to disk.
>
> This series is an attempt to add madvise(..., MADV_WILLWRITE) to
> signal to the kernel that I will eventually write to the referenced
> pages. It should cause any expensive operations that happen on the
> first write to happen immediately, but it should not result in
> dirtying the pages.
>
> madvise(addr, len, MADV_WILLWRITE) returns the number of bytes that
> the operation succeeded on or a negative error code if there was an
> actual failure. A return value of zero signifies that the kernel
> doesn't know how to "willwrite" the range and that userspace should
> implement a fallback.
>
> For now, it only works on shared writable ext4 mappings. Eventually
> it should support other filesystems as well as private pages (it
> should COW the pages but not cause swap IO) and anonymous pages (it
> should COW the zero page if applicable).
>
> The implementation leaves much to be desired. In particular, it
> generates dirty buffer heads on a clean page, and this scares me.
>
> Thoughts?

One question before I look at the patches: Why don't you use fallocate()
in your application? The functionality you require seems to be pretty
similar to it - writing to an already allocated block is usually quick.

								Honza

> Andy Lutomirski (3):
>   mm: Add MADV_WILLWRITE to indicate that a range will be written to
>   fs: Add block_willwrite
>   ext4: Implement willwrite for the delalloc case
>
>  fs/buffer.c                            | 57 ++
>  fs/ext4/ext4.h                         |  2 ++
>  fs/ext4/file.c                         |  1 +
>  fs/ext4/inode.c                        | 22 +
>  include/linux/buffer_head.h            |  3 ++
>  include/linux/mm.h                     | 12 +++
>  include/uapi/asm-generic/mman-common.h |  3 ++
>  mm/madvise.c                           | 28 +++--
>  8 files changed, 126 insertions(+), 2 deletions(-)
>
> --
> 1.8.3.1

--
Jan Kara
SUSE Labs, CR
Re: [PATCH 2/3] memcg: Limit the number of events registered on oom_control
On Wed 07-08-13 09:08:36, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Aug 07, 2013 at 01:28:26PM +0200, Michal Hocko wrote:
> > There is no limit for the maximum number of oom_control events
> > registered per memcg. This might lead to an user triggered memory
> > depletion if a regular user is allowed to register events.
> >
> > Let's be more strict and cap the number of events that might be
> > registered. MAX_OOM_NOTIFY_EVENTS value is more or less random. The
> > expectation is that it should be high enough to cover reasonable
> > usecases while not too high to allow excessive resources consumption.
> > 1024 events consume something like 24KB which shouldn't be a big deal
> > and it should be good enough (even 1024 oom notification events sounds
> > crazy).
>
> I think putting restriction on usage_event makes sense as that builds
> a shared contiguous table from all events which can't be attributed
> correctly and makes it easy to trigger allocation failures due to
> large order allocation but is this necessary for oom and vmpressure,
> both of which allocate only for the listening task?

Once I was there I made them consistent in that regard.

> It isn't different from listening from epoll, for example.

epoll limits the number of watchers, no?

> If there needs to be kernel memory limit, shouldn't that be handled by
> kmemcg?

kmemcg would surely help but turning it on just because of potential
abuse of the event registration API sounds like an overkill.

I think having a cap for user triggerable kernel resources is a good thing
in general.
--
Michal Hocko
SUSE Labs
Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
Hello, Michal.

On Wed, Aug 07, 2013 at 03:26:13PM +0200, Michal Hocko wrote:
> I would rather see it not changed unless it really is a big win in the
> cgroup core. So far I do not see anything like that (just look at
> __cgroup_from_dentry which needs to be exported to allow for the move).

The end goal is cleaning up cftype so that it becomes a thin wrapper
around seq_file and I'd really like to keep the interface minimal so
that it's difficult to misunderstand.

> You reduce the amount of code in cgroup.c, alright, but the code
> doesn't go away really. It just moves out of your sight and moves the
> same burden on somebody else without providing a new generic interface.

If the implementation details are all that you're objecting to, I'll be
happy to restructure it. I just didn't pay too much attention to it
because I considered it to be mostly deprecated. I don't think it'll
be too much work and strongly think it'll be worth the effort. Our
code base is extremely nasty as is and I'll try to get any ounce of
cleanup I can get.

> If somebody needs a notification interface (and there is no one available
> right now) then you cannot prevent from such a pointless work anyway...

I'm gonna add one for freezer state transitions. It'll be a simple
"this file changed" thing and will probably apply that to at least oom
and vmpressure. I'm relatively confident that it's gonna be pretty
simple and that's gonna be the cgroup event mechanism.

> cgroup_event_* don't sound memcg specific at all. They are playing with
> cgroup dentry reference counting and do a generic functionality which
> memcg doesn't need to know about.

Sure, I'll try to clean it up so that it doesn't meddle with cgroup
internals directly.

> I wouldn't object to having non-cgroup internals playing variant. I just
> do not think it makes sense to invest time to something that should go
> away long term.

I suppose it's a priority thing. To me, cleaning up cgroup core API is
quite important and I'd be happy to sink time and effort into it and
it's not like we can drop the event thing in a release cycle or two.
We'd have to carry it for years, so I think the effort is justified.

Thanks.

--
tejun
Re: [PATCH] ARM: fix wrong address when loading PRM_FRAC_INCREMENTOR_DENUMERATOR_RELOAD
On Aug 7, 2013, at 7:09 PM, Tony Lindgren wrote:

> * Chen Baozi [130805 08:33]:
>> ping?
>>
>> On Aug 1, 2013, at 7:27 PM, Chen Baozi wrote:
>>
>>> The denominator should be loaded from INCREMENTER_DENUMERATOR_RELOAD_OFFSET
>>> rather than INCREMENTER_NUMERATOR_OFFSET.
>
> Maybe describe what exactly happens without this fix?

I think it is more likely a typo, since INCREMENTER_DENUMERATOR_RELOAD[23:17]
is reserved. It seems that it won't cause much trouble without this fix,
because the useful [11:0] bits are masked and set to the right value later.

Cheers,

Baozi

>
> Also we should get few acks for this for the -rc series.
>
> Regards,
>
> Tony
>
>>> Signed-off-by: Chen Baozi
>>> ---
>>>  arch/arm/mach-omap2/timer.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm/mach-omap2/timer.c b/arch/arm/mach-omap2/timer.c
>>> index b37e1fc..9265e03 100644
>>> --- a/arch/arm/mach-omap2/timer.c
>>> +++ b/arch/arm/mach-omap2/timer.c
>>> @@ -537,7 +537,7 @@ static void __init realtime_counter_init(void)
>>>  	reg |= num;
>>>  	__raw_writel(reg, base + INCREMENTER_NUMERATOR_OFFSET);
>>>
>>> -	reg = __raw_readl(base + INCREMENTER_NUMERATOR_OFFSET) &
>>> +	reg = __raw_readl(base + INCREMENTER_DENUMERATOR_RELOAD_OFFSET) &
>>>  		NUMERATOR_DENUMERATOR_MASK;
>>>  	reg |= den;
>>>  	__raw_writel(reg, base + INCREMENTER_DENUMERATOR_RELOAD_OFFSET);
>>> --
>>> 1.8.1.4
>>>
>>
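The read-modify-write pattern at issue in this patch can be modelled on a plain array. This is a sketch under stated assumptions, not the OMAP code: the offsets, the FIELD_MASK value and set_denominator() are made up for illustration, and the "registers" are ordinary memory. The mask is assumed to preserve the upper bits and clear the low [11:0] field, which is how the reply above describes it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register layout, for illustration only. */
#define NUMERATOR_OFFSET		0x0
#define DENOMINATOR_RELOAD_OFFSET	0x4
#define FIELD_MASK			0xfffff000u /* bits kept across the update */

/*
 * Read the register at read_off, keep only the bits outside the low
 * field, merge in den, and write the result to the denominator register.
 * Passing NUMERATOR_OFFSET as read_off reproduces the pre-patch
 * behaviour; passing DENOMINATOR_RELOAD_OFFSET is the patched one.
 */
void set_denominator(uint32_t *regs, uint32_t read_off, uint32_t den)
{
	uint32_t reg;

	assert((den & FIELD_MASK) == 0);	/* den must fit the low field */

	reg = regs[read_off / 4] & FIELD_MASK;
	reg |= den;
	regs[DENOMINATOR_RELOAD_OFFSET / 4] = reg;
}
```

In this model the low field ends up correct either way, which matches Baozi's remark; the difference is that the pre-patch variant copies the numerator register's upper bits into the denominator register instead of preserving the denominator's own.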
Re: List corruption in hidraw_release in 3.11-rc4
On Wed, 7 Aug 2013, Peter Wu wrote:

> > does the patch below fix the problem you are seeing?
> That one is already in 3.11-rc4 as far as I can see. Also, that code can
> probably be simplified by moving the mutex_unlock after the out label, removing
> the need to duplicate the mutex_unlock.
>
> Remember what I said about "no Oopses"? Well, it turned out that several
> memory structures were damaged, which causes a general protection fault in
> sock_alloc_inode and other places.
>
> I managed to create a program that can reproduce this bug 100% in a QEMU
> virtual machine with a Logitech USB receiver passed to it.
>
>     qemu-system-x86_64 -enable-kvm -m 1G -usb -usbdevice host:046d:c52b
>     (pass -kernel, -initrd, -append as needed)
>
> Copy hidraw-test to initrd, boot QEMU and run `hidraw-test`. Result: instant
> (= +/- 2 seconds) crash.
>
> I have applied Manoj's patch[1] on top of 3.11-rc4, which seems to fix the
> issue.
> One observation is that the new device is named /dev/hidraw1 instead of
> /dev/hidraw0. Example:
>
>     f(){ hidraw-test /dev/hidraw$1 usb1;}
>     # needed for 3.11-rc4
>     f 1; f 1 # crash
>     # needed for 3.11-rc4 + patch
>     f 1; f 2 # ok
>
> Regards,
> Peter
>
>  [1]: http://lkml.org/lkml/2013/7/22/248

That one I am still reviewing ... can I add your Tested-by: to it when
I'll be applying it and pushing to Linus?

Thanks.

> --
> /* cc hidraw-test.c -o hidraw-test
>  * hidraw-test /dev/hidraw0 usb1; hidraw-test /dev/hidraw0 usb1;
>  */
> #include <errno.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> int open_and_write(const char *path, const char *data) {
> 	int sfd, r;
>
> 	sfd = open(path, O_WRONLY);
> 	if (sfd < 0) {
> 		perror(path);
> 		return 1;
> 	}
>
> 	r = write(sfd, data, strlen(data));
> 	if (r < 0) {
> 		fprintf(stderr, "write(%s, %s): %s\n",
> 			path, data, strerror(errno));
> 		return 1;
> 	}
> 	close(sfd);
> 	return 0;
> }
>
> int dork(const char *hiddev, const char *name) {
> 	int fd;
> 	char c;
>
> 	fd = open(hiddev, O_RDWR | O_NONBLOCK);
> 	if (fd < 0) {
> 		perror("open");
> 		return 1;
> 	}
>
> 	if (open_and_write("/sys/bus/usb/drivers/usb/unbind", name))
> 		return 1;
>
> 	// does not make a difference
> 	//sleep(1);
>
> 	if (open_and_write("/sys/bus/usb/drivers/usb/bind", name))
> 		return 1;
>
> 	// allow devices to get discovered
> 	sleep(1);
>
> 	printf("read() = %zi\n", read(fd, &c, 1)); perror("read");
> 	close(fd);
> 	return 0;
> }
>
> int main(int argc, char **argv) {
> 	if (argc < 3) {
> 		fprintf(stderr, "Usage: %s /dev/hidrawN usbN\n", *argv);
> 		return 1;
> 	}
>
> 	system("modprobe -v usbhid");
> 	system("modprobe -v hid-logitech-dj");
>
> 	dork(argv[1], argv[2]);
>
> 	return 0;
> }
>

--
Jiri Kosina
SUSE Labs
Re: List corruption in hidraw_release in 3.11-rc4
On Wednesday 07 August 2013 03:01:26 Jiri Kosina wrote:
> On Tue, 6 Aug 2013, Peter Wu wrote:
> > While debugging upowerd (with Logitech Unifying receiver via hidraw),
> > I came across this list corruption warning.
>
> Peter,
>
> does the patch below fix the problem you are seeing?

That one is already in 3.11-rc4 as far as I can see. Also, that code can
probably be simplified by moving the mutex_unlock after the out label, removing
the need to duplicate the mutex_unlock.

Remember what I said about "no Oopses"? Well, it turned out that several
memory structures were damaged, which causes a general protection fault in
sock_alloc_inode and other places.

I managed to create a program that can reproduce this bug 100% in a QEMU
virtual machine with a Logitech USB receiver passed to it.

    qemu-system-x86_64 -enable-kvm -m 1G -usb -usbdevice host:046d:c52b
    (pass -kernel, -initrd, -append as needed)

Copy hidraw-test to initrd, boot QEMU and run `hidraw-test`. Result: instant
(= +/- 2 seconds) crash.

I have applied Manoj's patch[1] on top of 3.11-rc4, which seems to fix the
issue.
One observation is that the new device is named /dev/hidraw1 instead of
/dev/hidraw0. Example:

    f(){ hidraw-test /dev/hidraw$1 usb1;}
    # needed for 3.11-rc4
    f 1; f 1 # crash
    # needed for 3.11-rc4 + patch
    f 1; f 2 # ok

Regards,
Peter

 [1]: http://lkml.org/lkml/2013/7/22/248
--
/* cc hidraw-test.c -o hidraw-test
 * hidraw-test /dev/hidraw0 usb1; hidraw-test /dev/hidraw0 usb1;
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int open_and_write(const char *path, const char *data) {
	int sfd, r;

	sfd = open(path, O_WRONLY);
	if (sfd < 0) {
		perror(path);
		return 1;
	}

	r = write(sfd, data, strlen(data));
	if (r < 0) {
		fprintf(stderr, "write(%s, %s): %s\n",
			path, data, strerror(errno));
		return 1;
	}
	close(sfd);
	return 0;
}

int dork(const char *hiddev, const char *name) {
	int fd;
	char c;

	fd = open(hiddev, O_RDWR | O_NONBLOCK);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (open_and_write("/sys/bus/usb/drivers/usb/unbind", name))
		return 1;

	// does not make a difference
	//sleep(1);

	if (open_and_write("/sys/bus/usb/drivers/usb/bind", name))
		return 1;

	// allow devices to get discovered
	sleep(1);

	printf("read() = %zi\n", read(fd, &c, 1)); perror("read");
	close(fd);
	return 0;
}

int main(int argc, char **argv) {
	if (argc < 3) {
		fprintf(stderr, "Usage: %s /dev/hidrawN usbN\n", *argv);
		return 1;
	}

	system("modprobe -v usbhid");
	system("modprobe -v hid-logitech-dj");

	dork(argv[1], argv[2]);

	return 0;
}
Re: [PATCH 04/13] ARM: ux500: Remove Snowball DTS entry for ROHM BH1780GLI ambient light sensor
On Fri, Jul 19, 2013 at 4:13 PM, Lee Jones wrote:

> It doesn't exist on the Snowball development board.
>
> Signed-off-by: Lee Jones

Patch applied.

Yours,
Linus Walleij
Re: [PATCH 03/13] ARM: ux500: Remove Snowball DTS entry for TPS61052 chip
On Fri, Jul 19, 2013 at 4:13 PM, Lee Jones wrote: > TPS61052 is a; boost converter, LED driver, LED flash driver and > simple GPIO pin chip. It has no use here however, as it is not > found on the Snowball development board. > > Signed-off-by: Lee Jones Patch applied. Yours, Linus Walleij
Re: [PATCH 02/13] ARM: ux500: Remove Snowball DTS entry for National Semiconductor LP5521 LED chip
On Fri, Jul 19, 2013 at 4:13 PM, Lee Jones wrote: > It doesn't exist on the Snowball development board. > > Signed-off-by: Lee Jones Patch applied. Yours, Linus Walleij
Re: [PATCH 01/13] ARM: ux500: Remove Toshiba TC35892 I/O Expander's DT entry from Snowball's DTS
On Fri, Jul 19, 2013 at 4:13 PM, Lee Jones wrote: > It doesn't exist on this development board. > > Signed-off-by: Lee Jones Patch applied to my ux500-dt branch. Yours, Linus Walleij
Re: [RFC 0/1] drm/pl111: Initial drm/kms driver for pl111
On Wed, Aug 7, 2013 at 12:23 AM, John Stultz wrote:
> On Tue, Aug 6, 2013 at 5:15 AM, Rob Clark wrote:
>> well, let's divide things up into two categories:
>>
>> 1) the arrangement and format of pixels.. ie. what userspace would
>> need to know if it mmap's a buffer. This includes pixel format,
>> stride, etc. This should be negotiated in userspace, it would be
>> crazy to try to do this in the kernel.
>>
>> 2) the physical placement of the pages. Ie. whether it is contiguous
>> or not. Which bank the pages in the buffer are placed in, etc. This
>> is not visible to userspace. This is the purpose of the attach step,
>> so you know all the devices involved in sharing up front before
>> allocating the backing pages. (Or in the worst case, if you have a
>> "late attacher" you at least know when no device is doing dma access
>> to a buffer and can reallocate and move the buffer.) A long time
>
> One concern I know the Android folks have expressed previously (and
> correct me if it's no longer an objection), is that this attach time
> in-kernel constraint solving / moving or reallocating buffers is
> likely to hurt determinism. If I understood, their perspective was
> that userland knows the device path the buffers will travel through,
> so why not leverage that knowledge, rather than having the kernel have
> to sort it out for itself after the fact.

If you know the device path, then attach the buffer at all the devices
before you start using it. Problem solved.. kernel knows all devices
before pages need be allocated ;-)

BR,
-R
RE: [PATCH] Tools: hv: correct payload size in netlink_send
> -----Original Message-----
> From: Olaf Hering [mailto:o...@aepfle.de]
> Sent: Wednesday, August 07, 2013 9:07 AM
> To: KY Srinivasan; gre...@linuxfoundation.org
> Cc: linux-kernel@vger.kernel.org; Olaf Hering
> Subject: [PATCH] Tools: hv: correct payload size in netlink_send
>
> netlink_send is supposed to send just the cn_msg+hv_kvp_msg via netlink.
> Currently it sets an incorrect iovec size, as reported by valgrind.
>
> In the case of registering with the kernel the allocated buffer is large
> enough to hold nlmsghdr+cn_msg+hv_kvp_msg, no overrun happens. In the
> case of responding to the kernel the cn_msg is located in the middle of
> recv_buffer, after the nlmsghdr. Currently the code in netlink_send adds
> also the size of nlmsghdr to the payload. But nlmsghdr is a separate
> iovec. This leads to an (harmless) out-of-bounds access when the kernel
> processes the iovec. Correct the iovec size of the cn_msg to be just
> cn_msg + its payload.

Thanks Olaf.

>
> Signed-off-by: Olaf Hering

Signed-off-by: K. Y.
Srinivasan

> ---
>  tools/hv/hv_kvp_daemon.c | 2 +-
>  tools/hv/hv_vss_daemon.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/hv/hv_kvp_daemon.c b/tools/hv/hv_kvp_daemon.c
> index d3bcb84..1bd1ad1 100644
> --- a/tools/hv/hv_kvp_daemon.c
> +++ b/tools/hv/hv_kvp_daemon.c
> @@ -1399,7 +1399,7 @@ netlink_send(int fd, struct cn_msg *msg)
>  	char buffer[64];
>  	struct iovec iov[2];
>
> -	size = NLMSG_SPACE(sizeof(struct cn_msg) + msg->len);
> +	size = sizeof(struct cn_msg) + msg->len;
>
>  	nlh = (struct nlmsghdr *)buffer;
>  	nlh->nlmsg_seq = 0;
> diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
> index 6b4f2fa..2f1f53f 100644
> --- a/tools/hv/hv_vss_daemon.c
> +++ b/tools/hv/hv_vss_daemon.c
> @@ -111,7 +111,7 @@ static int netlink_send(int fd, struct cn_msg *msg)
>  	char buffer[64];
>  	struct iovec iov[2];
>
> -	size = NLMSG_SPACE(sizeof(struct cn_msg) + msg->len);
> +	size = sizeof(struct cn_msg) + msg->len;
>
>  	nlh = (struct nlmsghdr *)buffer;
>  	nlh->nlmsg_seq = 0;
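The size arithmetic behind the patch above can be illustrated with a small C sketch. This is not the daemon's code: struct demo_nlmsghdr, struct demo_cn_msg and the DEMO_* macros are hypothetical stand-ins that mirror the layout of struct nlmsghdr, struct cn_msg and the NLMSG_* macros, so that the example builds without kernel headers. The way nlmsg_len is derived from the payload size is also an assumption, since the quoted hunks do not show that line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical stand-ins; real code would use <linux/netlink.h> and
 * <linux/connector.h>. The field layout mirrors those headers. */
struct demo_nlmsghdr {
	uint32_t nlmsg_len;
	uint16_t nlmsg_type;
	uint16_t nlmsg_flags;
	uint32_t nlmsg_seq;
	uint32_t nlmsg_pid;
};

struct demo_cn_msg {
	uint32_t id_idx;
	uint32_t id_val;
	uint32_t seq;
	uint32_t ack;
	uint16_t len;	/* number of payload bytes following the struct */
	uint16_t flags;
};

/* Same arithmetic as NLMSG_ALIGN/NLMSG_HDRLEN/NLMSG_LENGTH/NLMSG_SPACE. */
#define DEMO_ALIGN(n)	(((size_t)(n) + 3) & ~(size_t)3)
#define DEMO_HDRLEN	DEMO_ALIGN(sizeof(struct demo_nlmsghdr))
#define DEMO_LENGTH(n)	((size_t)(n) + DEMO_HDRLEN)
#define DEMO_SPACE(n)	DEMO_ALIGN(DEMO_LENGTH(n))

/* Payload iovec length before the patch: it also counts the header. */
static size_t payload_len_old(const struct demo_cn_msg *msg)
{
	return DEMO_SPACE(sizeof(*msg) + msg->len);
}

/* Payload iovec length after the patch: cn_msg plus its data only. */
static size_t payload_len_new(const struct demo_cn_msg *msg)
{
	return sizeof(*msg) + msg->len;
}

/* Lay out the two iovecs: iov[0] is the netlink header, iov[1] is the
 * cn_msg with its data. Returns the total bytes the iovecs describe. */
static size_t build_iov(struct iovec iov[2], struct demo_nlmsghdr *nlh,
			struct demo_cn_msg *msg)
{
	size_t size = payload_len_new(msg);

	nlh->nlmsg_len = (uint32_t)DEMO_LENGTH(size);	/* assumed derivation */
	iov[0].iov_base = nlh;
	iov[0].iov_len = sizeof(*nlh);
	iov[1].iov_base = msg;
	iov[1].iov_len = size;

	/* Header and payload together must add up to nlmsg_len. */
	assert(iov[0].iov_len + iov[1].iov_len == nlh->nlmsg_len);
	return iov[0].iov_len + iov[1].iov_len;
}
```

With the old computation the second iovec would claim the header length (plus alignment padding) on top of the real payload, which is the out-of-bounds read valgrind complained about; with the new one the two iovec lengths add up exactly to the header's nlmsg_len.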
Re: [PATCHSET cgroup/for-3.12] cgroup: make cgroup_event specific to memcg
On Wed 07-08-13 08:43:21, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Aug 07, 2013 at 02:18:36PM +0200, Michal Hocko wrote:
> > How is it specific to memcg? The fact only memcg uses the interface
> > doesn't imply it is memcg specific.
>
> I don't follow. It's only for memcg. That is *by definition* memcg
> specific. It's the verbatim meaning of the word.

My understanding of "memcg specific" is that it uses memcg specific
code/data structures. But let's not play with words.

> Now, I do
> understand that it can be a concern the implementation details as-is
> could be a bit too invasive into cgroup core to be moved to memcg, but
> that's something we can work on, right?

Does it really make sense to work on this interface if it is planned to
be replaced by something different? Isn't that just a waste of time?

> Can you at least agree that the feature is memcg specific and it'd be
> better to be located in memcg if possible? That really isn't much
> to ask and is a logical thing to do.

I would rather see it not changed unless it really is a big win in the
cgroup core. So far I do not see anything like that (just look at
__cgroup_from_dentry which needs to be exported to allow for the move).

You reduce the amount of code in cgroup.c, alright, but the code doesn't
really go away. It just moves out of your sight and puts the same burden
on somebody else without providing a new generic interface.

> > There are other ways to achieve the same. E.g. not ack new usage of
> > register callback users. We have done similar with other things like
> > use_hierarchy...
>
> Yes, but those are all inferior to actually moving the code where it
> belongs. Those make the code harder to follow and people
> misunderstand and waste time working on stuff (either in the core or
> controllers) which eventually end up getting nacked. Why do that when
> we can easily do better? What's the rationale behind that?

If somebody needs a notification interface (and there is none available
right now) then you cannot prevent such pointless work anyway...

> > The cleanup is removing 2 callbacks with a cost of moving non-memcg
> > specific code inside memcg. That is what I am objecting to.
>
> I don't really get your "non-memcg" specific code assertion when it is
> by definition memcg-specific. What are you talking about?

cgroup_event_* don't sound memcg specific at all. They are playing with
cgroup dentry reference counting and provide generic functionality which
memcg doesn't need to know about.

> > I will not repeat myself. We seem to disagree on where the code belongs.
> > As I've said I will not ack this code, try to find somebody else who
> > think it is a good idea. I do not see any added value.
>
> Nacking is part of your authority as maintainer but you should still
> provide plausible rationale for that.

I didn't say I Nack it. I said I won't Ack it. If Johannes or Kamezawa
think this is OK and another bloat in memcg is not a big deal I will not
block it. I won't be happy, but such is life.

> Are you saying that even if the
> code is restructured so that it's not invasive into cgroup core, you
> are still gonna disagree with it because it's still somehow not
> memcg-specific?

I wouldn't object to a variant that does not play with cgroup internals.
I just do not think it makes sense to invest time in something that
should go away long term.

> Please don't repeat yourself but do explain your rationale. That's
> part of your duty as a maintainer too.

I think I am clear about what I do not like about this move.
--
Michal Hocko
SUSE Labs
[PATCH 3/3] perf tools: add 'keep tracking' test
Add a test for the newly added PERF_COUNT_SW_DUMMY event. The test
checks that tracking events continue when an event is disabled but a
dummy software event is not disabled.

Signed-off-by: Adrian Hunter
---
 tools/perf/Makefile              |   1 +
 tools/perf/tests/builtin-test.c  |   4 ++
 tools/perf/tests/keep-tracking.c | 150 +++
 tools/perf/tests/tests.h         |   1 +
 tools/perf/util/evlist.c         |  42 ++-
 tools/perf/util/evlist.h         |   5 ++
 6 files changed, 201 insertions(+), 2 deletions(-)
 create mode 100644 tools/perf/tests/keep-tracking.c

diff --git a/tools/perf/Makefile b/tools/perf/Makefile
index bfd12d0..0193e7c 100644
--- a/tools/perf/Makefile
+++ b/tools/perf/Makefile
@@ -392,6 +392,7 @@ LIB_OBJS += $(OUTPUT)tests/sw-clock.o
 ifeq ($(ARCH),x86)
 LIB_OBJS += $(OUTPUT)tests/perf-time-to-tsc.o
 endif
+LIB_OBJS += $(OUTPUT)tests/keep-tracking.o
 
 BUILTIN_OBJS += $(OUTPUT)builtin-annotate.o
 BUILTIN_OBJS += $(OUTPUT)builtin-bench.o
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index b7b4049..2a468a1 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -100,6 +100,10 @@ static struct test {
 	},
 #endif
 	{
+		.desc = "Test using a dummy software event to keep tracking",
+		.func = test__keep_tracking,
+	},
+	{
 		.func = NULL,
 	},
 };
diff --git a/tools/perf/tests/keep-tracking.c b/tools/perf/tests/keep-tracking.c
new file mode 100644
index 000..74abe00
--- /dev/null
+++ b/tools/perf/tests/keep-tracking.c
@@ -0,0 +1,150 @@
+#include <sys/types.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+
+#include "parse-events.h"
+#include "evlist.h"
+#include "evsel.h"
+#include "thread_map.h"
+#include "cpumap.h"
+#include "tests.h"
+
+#define CHECK__(x) {				\
+	while ((x) < 0) {			\
+		pr_debug(#x " failed!\n");	\
+		goto out_err;			\
+	}					\
+}
+
+#define CHECK_NOT_NULL__(x) {			\
+	while ((x) == NULL) {			\
+		pr_debug(#x " failed!\n");	\
+		goto out_err;			\
+	}					\
+}
+
+static int find_comm(struct perf_evlist *evlist, const char *comm)
+{
+	union perf_event *event;
+	int i, found;
+
+	found = 0;
+	for (i = 0; i < evlist->nr_mmaps; i++) {
+		while ((event = perf_evlist__mmap_read(evlist, i)) != NULL) {
+			if (event->header.type == PERF_RECORD_COMM &&
+			    (pid_t)event->comm.pid == getpid() &&
+			    (pid_t)event->comm.tid == getpid() &&
+			    strcmp(event->comm.comm, comm) == 0)
+				found += 1;
+		}
+	}
+	return found;
+}
+
+/**
+ * test__keep_tracking - test using a dummy software event to keep tracking.
+ *
+ * This function implements a test that checks that tracking events continue
+ * when an event is disabled but a dummy software event is not disabled. If the
+ * test passes %0 is returned, otherwise %-1 is returned.
+ */
+int test__keep_tracking(void)
+{
+	struct perf_record_opts opts = {
+		.mmap_pages	= UINT_MAX,
+		.user_freq	= UINT_MAX,
+		.user_interval	= ULLONG_MAX,
+		.freq		= 4000,
+		.target		= {
+			.uses_mmap = true,
+		},
+	};
+	struct thread_map *threads = NULL;
+	struct cpu_map *cpus = NULL;
+	struct perf_evlist *evlist = NULL;
+	struct perf_evsel *evsel = NULL;
+	int found, err = -1;
+	const char *comm;
+
+	threads = thread_map__new(-1, getpid(), UINT_MAX);
+	CHECK_NOT_NULL__(threads);
+
+	cpus = cpu_map__new(NULL);
+	CHECK_NOT_NULL__(cpus);
+
+	evlist = perf_evlist__new();
+	CHECK_NOT_NULL__(evlist);
+
+	perf_evlist__set_maps(evlist, cpus, threads);
+
+	CHECK__(parse_events(evlist, "dummy:u"));
+	CHECK__(parse_events(evlist, "cycles:u"));
+
+	perf_evlist__config(evlist, &opts);
+
+	evsel = perf_evlist__first(evlist);
+
+	evsel->attr.comm = 1;
+	evsel->attr.disabled = 1;
+	evsel->attr.enable_on_exec = 0;
+
+	CHECK__(perf_evlist__open(evlist));
+
+	CHECK__(perf_evlist__mmap(evlist, UINT_MAX, false));
+
+	/*
+	 * First, test that a 'comm' event can be found when the event is
+	 * enabled.
+	 */
+
+	perf_evlist__enable(evlist);
+
+	comm = "Test COMM 1";
+	CHECK__(prctl(PR_SET_NAME, (unsigned long)comm, 0, 0, 0));
+
+	perf_evlist__disable(evlist);
+
+	found = find_comm(evlist, comm);
+	if (found != 1) {
+		pr_debug("First
[PATCH 2/3] perf tools: add support for PERF_COUNT_SW_DUMMY
Add support for the new dummy software event PERF_COUNT_SW_DUMMY.

Signed-off-by: Adrian Hunter
---
 tools/perf/util/parse-events.c | 4 ++++
 tools/perf/util/parse-events.l | 1 +
 tools/perf/util/python.c       | 1 +
 3 files changed, 6 insertions(+)

diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c
index dba877d..1ef81ea 100644
--- a/tools/perf/util/parse-events.c
+++ b/tools/perf/util/parse-events.c
@@ -108,6 +108,10 @@ static struct event_symbol event_symbols_sw[PERF_COUNT_SW_MAX] = {
 		.symbol = "emulation-faults",
 		.alias  = "",
 	},
+	[PERF_COUNT_SW_DUMMY] = {
+		.symbol = "dummy",
+		.alias  = "",
+	},
 };
 
 #define __PERF_EVENT_FIELD(config, name) \
diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l
index b36115f..29c5d24 100644
--- a/tools/perf/util/parse-events.l
+++ b/tools/perf/util/parse-events.l
@@ -144,6 +144,7 @@ context-switches|cs { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW
 cpu-migrations|migrations { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_CPU_MIGRATIONS); }
 alignment-faults { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_ALIGNMENT_FAULTS); }
 emulation-faults { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EMULATION_FAULTS); }
+dummy { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); }
 
 L1-dcache|l1-d|l1d|L1-data |
 L1-icache|l1-i|l1i|L1-instruction |
diff --git a/tools/perf/util/python.c b/tools/perf/util/python.c
index 925e0c3..2fa83c0 100644
--- a/tools/perf/util/python.c
+++ b/tools/perf/util/python.c
@@ -967,6 +967,7 @@ static struct {
 	{ "COUNT_SW_PAGE_FAULTS_MAJ", PERF_COUNT_SW_PAGE_FAULTS_MAJ },
 	{ "COUNT_SW_ALIGNMENT_FAULTS", PERF_COUNT_SW_ALIGNMENT_FAULTS },
 	{ "COUNT_SW_EMULATION_FAULTS", PERF_COUNT_SW_EMULATION_FAULTS },
+	{ "COUNT_SW_DUMMY", PERF_COUNT_SW_DUMMY },
 
 	{ "SAMPLE_IP", PERF_SAMPLE_IP },
 	{ "SAMPLE_TID", PERF_SAMPLE_TID },
--
1.7.11.7
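For readers who want to see what the "dummy" symbol resolves to outside the perf tool, here is a minimal C sketch that fills in a struct perf_event_attr for the dummy software event. It assumes kernel headers new enough to define PERF_COUNT_SW_DUMMY. The helper name init_dummy_tracking_attr() is hypothetical and not part of the patch, and the flag choices (comm set, event created disabled, kernel and hypervisor excluded to approximate the ":u" modifier) are illustrative assumptions rather than what the perf tool necessarily sets.

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical helper: prepare an attr for a dummy software event that
 * counts nothing itself but can still carry side-band COMM records. */
static void init_dummy_tracking_attr(struct perf_event_attr *attr)
{
	assert(attr != NULL);

	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_SOFTWARE;
	attr->config = PERF_COUNT_SW_DUMMY;
	attr->comm = 1;			/* request PERF_RECORD_COMM records */
	attr->exclude_kernel = 1;	/* roughly what ":u" asks for */
	attr->exclude_hv = 1;
	attr->disabled = 1;		/* caller enables it explicitly */
}
```

The filled-in attr would then be handed to the perf_event_open() system call; that step is left out here because it depends on the caller's privileges and perf_event_paranoid setting.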