Re: [PATCH v7 22/24] [media] rtl2832: change the i2c gate to be mux-locked

2016-04-28 Thread Wolfram Sang
> So, I think all is ok, or do you need more than Tested-by?

I think this will do. Thanks!





Re: [PATCH v2 1/6] mm/page_alloc: recalculate some of zone threshold when on/offline memory

2016-04-28 Thread Joonsoo Kim
On Thu, Apr 28, 2016 at 03:46:33PM +0800, Rui Teng wrote:
> On 4/25/16 1:21 PM, js1...@gmail.com wrote:
> >From: Joonsoo Kim 
> >
> >Some of the zone thresholds depend on the number of managed pages in
> >the zone. When memory goes on/offline, that number changes and we need
> >to adjust the thresholds.
> >
> >This patch adds the recalculation at the appropriate places and cleans
> >up the related functions for better maintenance.
> >
> >Signed-off-by: Joonsoo Kim 
> >---
> > mm/page_alloc.c | 36 +---
> > 1 file changed, 29 insertions(+), 7 deletions(-)
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 71fa015..ffa93e0 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -4633,6 +4633,8 @@ int local_memory_node(int node)
> > }
> > #endif
> >
> >+static void setup_min_unmapped_ratio(struct zone *zone);
> >+static void setup_min_slab_ratio(struct zone *zone);
> > #else   /* CONFIG_NUMA */
> >
> > static void set_zonelist_order(void)
> >@@ -5747,9 +5749,8 @@ static void __paginginit free_area_init_core(struct 
> >pglist_data *pgdat)
> > zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
> > #ifdef CONFIG_NUMA
> > zone->node = nid;
> >-zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
> >-/ 100;
> >-zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
> >+setup_min_unmapped_ratio(zone);
> >+setup_min_slab_ratio(zone);
> 
> The original logic uses freesize to calculate
> zone->min_unmapped_pages and zone->min_slab_pages here,
> but the new functions will use zone->managed_pages.
> Do you mean the original logic is wrong, or that managed_pages will
> always equal freesize when CONFIG_NUMA is defined?

managed_pages will always equal freesize here, so there is no problem.
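
For reference, a sketch of what the two helpers presumably look like after
the patch (an assumption based on the diff context above, not the literal
patch body):

static void setup_min_unmapped_ratio(struct zone *zone)
{
	/* same formula as before, but based on zone->managed_pages,
	 * which equals freesize at this point */
	zone->min_unmapped_pages = (zone->managed_pages *
				    sysctl_min_unmapped_ratio) / 100;
}

static void setup_min_slab_ratio(struct zone *zone)
{
	zone->min_slab_pages = (zone->managed_pages *
				sysctl_min_slab_ratio) / 100;
}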

Thanks.


Re: [PATCH v4] x86/boot: Warn on future overlapping memcpy() use

2016-04-28 Thread Kees Cook
On Thu, Apr 28, 2016 at 11:43 PM, Ingo Molnar  wrote:
>
> * Kees Cook  wrote:
>
>> If an overlapping memcpy() is ever attempted, we should at least report
>> it, in case it might lead to problems, so it could be changed to a
>> memmove() call instead.
>>
>> Suggested-by: Ingo Molnar 
>> Signed-off-by: Kees Cook 
>> ---
>> v4:
>> - use __memcpy not memcpy since we've already done the check.
>> v3:
>> - call memmove in addition to doing the warning
>> v2:
>> - warn about overlapping region
>> ---
>>  arch/x86/boot/compressed/string.c | 16 +---
>>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> Applied, thanks Kees!
>
> Btw., can we now also remove the memmove() hack from lib/decompress_unxz.c?

I'll let Lasse answer for sure, but I don't think so. The original commit says:

The XZ decompressor needs memmove(), memeq() (memcmp() == 0), and
memzero() (memset(ptr, 0, size)), which aren't available in all
arch-specific pre-boot environments.  I'm including simple versions in
decompress_unxz.c, but a cleaner solution would naturally be nicer.
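
For reference, a rough sketch of what such minimal pre-boot helpers look like
(simplified, for illustration only; this is not the actual decompress_unxz.c
code, and it assumes size_t/uint8_t are available in the pre-boot environment):

static void *memmove(void *dest, const void *src, size_t size)
{
	uint8_t *d = dest;
	const uint8_t *s = src;
	size_t i;

	if (d < s) {
		for (i = 0; i < size; i++)	/* copy forwards */
			d[i] = s[i];
	} else if (d > s) {
		i = size;
		while (i-- > 0)			/* copy backwards */
			d[i] = s[i];
	}

	return dest;
}

static int memeq(const void *a, const void *b, size_t size)	/* memcmp() == 0 */
{
	const uint8_t *x = a, *y = b;
	size_t i;

	for (i = 0; i < size; i++)
		if (x[i] != y[i])
			return 0;

	return 1;
}

static void memzero(void *buf, size_t size)	/* memset(buf, 0, size) */
{
	uint8_t *b = buf;

	while (size-- > 0)
		*b++ = 0;
}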

-Kees


-- 
Kees Cook
Chrome OS & Brillo Security


Re: linux-next: build warning after merge of the akpm-current tree

2016-04-28 Thread Stephen Rothwell
Hi All,

On Fri, 29 Apr 2016 16:45:43 +1000 Stephen Rothwell  
wrote:
>
> After merging the akpm-current tree, today's linux-next build (x86_64
> allmodconfig) produced this warning:
> 
> drivers/scsi/ipr.c: In function 'ipr_show_device_id':
> drivers/scsi/ipr.c:4462:34: warning: format '%llx' expects argument of type 
> 'long long unsigned int', but argument 4 has type 'long unsigned int' 
> [-Wformat=]
>len = snprintf(buf, PAGE_SIZE, "0x%llx\n", be64_to_cpu(res->dev_id));
>   ^
> 
> Lots and lots like this :-(
> 
> Introduced by commit
> 
>   eef17fb79096 ("byteswap: try to avoid __builtin_constant_p gcc bug")
> 
> I guess __builtin_bswap64() has type "unsigned long int" :-(

So, I have reverted that commit for today ... it produces too many
warnings :-(

-- 
Cheers,
Stephen Rothwell


[PATCH] staging: sm750fb: Comparison to NULL fixed

2016-04-28 Thread Manav Batra
Signed-off-by: Manav Batra 

Improved comparison to NULL and removed some blank lines.
---
 drivers/staging/sm750fb/ddk750_dvi.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/sm750fb/ddk750_dvi.c 
b/drivers/staging/sm750fb/ddk750_dvi.c
index a4a2550..bac432b 100644
--- a/drivers/staging/sm750fb/ddk750_dvi.c
+++ b/drivers/staging/sm750fb/ddk750_dvi.c
@@ -8,7 +8,8 @@
 
 /* This global variable contains all the supported driver and its corresponding
function API. Please set the function pointer to NULL whenever the function
-   is not supported. */
+   is not supported.
+*/
 static dvi_ctrl_device_t g_dcftSupportedDviController[] = {
 #ifdef DVI_CTRL_SII164
{
@@ -28,7 +29,6 @@ static dvi_ctrl_device_t g_dcftSupportedDviController[] = {
 #endif
 };
 
-
 int dviInit(
unsigned char edgeSelect,
unsigned char busSelect,
@@ -45,7 +45,7 @@ int dviInit(
dvi_ctrl_device_t *pCurrentDviCtrl;
 
pCurrentDviCtrl = g_dcftSupportedDviController;
-   if (pCurrentDviCtrl->pfnInit != NULL) {
+   if (pCurrentDviCtrl->pfnInit) {
return pCurrentDviCtrl->pfnInit(edgeSelect, busSelect, 
dualEdgeClkSelect, hsyncEnable,
vsyncEnable, deskewEnable, 
deskewSetting, continuousSyncEnable,
pllFilterEnable, 
pllFilterValue);
@@ -55,4 +55,3 @@ int dviInit(
 
 #endif
 
-
-- 
1.9.1



Re: [PATCH v2 0/6] Introduce ZONE_CMA

2016-04-28 Thread Joonsoo Kim
Hello, Mel.

IIUC, you may have missed that alloc_contig_range() currently does linear
reclaim/migration. Your comment is largely based on this misunderstanding,
so please keep it in mind when reading the reply.

On Thu, Apr 28, 2016 at 11:39:27AM +0100, Mel Gorman wrote:
> On Mon, Apr 25, 2016 at 02:36:54PM +0900, Joonsoo Kim wrote:
> > > Hello,
> > > 
> > > Changes from v1
> > > o Separate some patches which deserve to submit independently
> > > o Modify description to reflect current kernel state
> > > (e.g. high-order watermark problem disappeared by Mel's work)
> > > o Don't increase SECTION_SIZE_BITS to make a room in page flags
> > > (detailed reason is on the patch that adds ZONE_CMA)
> > > o Adjust ZONE_CMA population code
> > > 
> > > This series tries to solve problems of the current CMA implementation.
> > > 
> > > CMA was introduced to provide physically contiguous pages at runtime
> > > without an exclusively reserved memory area. But the current
> > > implementation works much like the previous reserved-memory approach,
> > > because freepages in the CMA region are used only if there is no other
> > > movable freepage; in other words, freepages in the CMA region are only
> > > used as a fallback. In that situation, kswapd is woken up easily since
> > > there are no unmovable and reclaimable freepages either. Once kswapd
> > > starts to reclaim memory, fallback allocation to MIGRATE_CMA doesn't
> > > occur any more since movable freepages have already been refilled by
> > > kswapd, and most freepages in the CMA region are left unused. This
> > > looks just like the exclusive reserved memory case.
> > > 
> 
> My understanding is that this was intentional. One of the original design
> requirements was that CMA have a high likelihood of allocation success for
> devices if it was necessary as an allocation failure was very visible to
> the user. It does not *have* to be treated as a reserve because Movable
> allocations could try CMA first but it increases allocation latency for
> devices that require it and it gets worse if those pages are pinned.

I know that this was the design decision at a time when CMA wasn't
actively used. It came from a lack of experience, and the situation is now
quite different. Most embedded systems use CMA with their own adaptations
because utilization is too low; that makes the system much slower, and it
happens more often than the case where device memory is actually required.
Given that they already adapt their logic to utilize CMA much more and
sacrifice latency, I think the previous design decision is wrong and we
should go another way.

> 
> > > In my experiment, I found that if the system has 1024 MB of memory and
> > > 512 MB is reserved for CMA, kswapd is mostly woken up when roughly 512 MB
> > > of free memory is left. The detailed reason is that, to keep enough free
> > > memory for unmovable and reclaimable allocations, kswapd uses the equation
> > > below when calculating free memory, and this easily goes under the
> > > watermark.
> > > 
> > > Free memory for unmovable and reclaimable = Free total - Free CMA pages
> > > 
> > > This is derived from the property of CMA freepages that they
> > > can't be used for unmovable and reclaimable allocations.
> > > 
> 
> Yes and also keeping it lightly utilised to reduce CMA allocation
> latency and probability of failure.

In my experience with CMA, most unacceptable failures (taking more
than 3 seconds) come from blockdev pagecache, and it's not simple to
check what is going on there when a failure happens. ZONE_CMA uses a
different approach: it only takes requests with GFP_HIGHUSER_MOVABLE,
so blockdev pagecache cannot get in and the probability of failure is
much reduced.

> > > Anyway, in this case, kswapd is woken up when (FreeTotal - FreeCMA)
> > > is lower than the low watermark and tries to free memory until
> > > (FreeTotal - FreeCMA) is higher than the high watermark. The result is
> > > that FreeTotal consistently hovers around the 512 MB boundary, which
> > > means we can't utilize the full memory capacity.
> > > 
> > > To fix this problem, I submitted some patches [1] about 10 months ago,
> > > but found some more problems that had to be fixed first. That approach
> > > requires many hooks in the allocator hotpath, so some developers don't
> > > like it. Instead, some of them suggested a different approach [2] to fix
> > > all the problems related to CMA, that is, introducing a new zone to deal
> > > with free CMA pages. I agree that this is the best way to go, so it is
> > > implemented here. Although the properties of ZONE_MOVABLE and ZONE_CMA are similar,
> 
> One of the issues I mentioned at LSF/MM is that I consider ZONE_MOVABLE
> to be a mistake. Zones are meant to be about addressing limitations and
> both ZONE_MOVABLE and ZONE_CMA violate that. When ZONE_MOVABLE was
> introduced, it was intended for use with dynamically resizing the
> hugetlbfs pool. It was competing with fragmentation avoidance at the
> time and the community could no

linux-next: build warning after merge of the akpm-current tree

2016-04-28 Thread Stephen Rothwell
Hi Andrew,

After merging the akpm-current tree, today's linux-next build (x86_64
allmodconfig) produced this warning:

drivers/scsi/ipr.c: In function 'ipr_show_device_id':
drivers/scsi/ipr.c:4462:34: warning: format '%llx' expects argument of type 
'long long unsigned int', but argument 4 has type 'long unsigned int' 
[-Wformat=]
   len = snprintf(buf, PAGE_SIZE, "0x%llx\n", be64_to_cpu(res->dev_id));
  ^

Lots and lots like this :-(

Probably introduced by commit

  eef17fb79096 ("byteswap: try to avoid __builtin_constant_p gcc bug")

I guess __builtin_bswap64() has type "unsigned long int" :-(
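
For illustration, the mismatch the warning is complaining about and the
conventional cast that would silence it (shown only to clarify the warning,
not as a proposed fix):

	/* per the warning above, the argument ends up typed 'unsigned long'
	 * on x86-64 once be64_to_cpu() expands to __builtin_bswap64() */
	len = snprintf(buf, PAGE_SIZE, "0x%llx\n", be64_to_cpu(res->dev_id));

	/* an explicit cast (or a matching format) would quiet -Wformat */
	len = snprintf(buf, PAGE_SIZE, "0x%llx\n",
		       (unsigned long long)be64_to_cpu(res->dev_id));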

-- 
Cheers,
Stephen Rothwell


Re: [PATCH] scripts/dtc: dt_to_config - report kernel config options for a devicetree

2016-04-28 Thread Geert Uytterhoeven
On Fri, Apr 29, 2016 at 8:39 AM, Gaurav Minocha
 wrote:
> On Thu, Apr 28, 2016 at 3:32 PM, Rob Herring  wrote:
>> On Thu, Apr 28, 2016 at 4:46 PM, Frank Rowand  wrote:
>>> From: Frank Rowand 
>>>
>>> Determining which kernel config options need to be enabled for a
>>> given devicetree can be a painful process.  Create a new tool to
>>> find the drivers that may match a devicetree node compatible,
>>> find the kernel config options that enable the driver, and
>>> optionally report whether the kernel config option is enabled.
>>
>> I would find this more useful to output a config fragment with all the
>> options enabled. The hard part there is enabling the options a given
>> option is dependent on which I don't think kbuild takes care of.
>
> Do you mean to generate something like .config? If yes, then IMO it would
> not be a correct configuration file.

A fragment to be appended to your current .config.

After that, an additional run of "make oldconfig" should (hopefully) bring
everything into good shape.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v4] x86/boot: Warn on future overlapping memcpy() use

2016-04-28 Thread Ingo Molnar

* Kees Cook  wrote:

> If an overlapping memcpy() is ever attempted, we should at least report
> it, in case it might lead to problems, so it could be changed to a
> memmove() call instead.
> 
> Suggested-by: Ingo Molnar 
> Signed-off-by: Kees Cook 
> ---
> v4:
> - use __memcpy not memcpy since we've already done the check.
> v3:
> - call memmove in addition to doing the warning
> v2:
> - warn about overlapping region
> ---
>  arch/x86/boot/compressed/string.c | 16 +---
>  1 file changed, 13 insertions(+), 3 deletions(-)

Applied, thanks Kees!

Btw., can we now also remove the memmove() hack from lib/decompress_unxz.c?

Thanks,

Ingo
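
For readers without the patch body in front of them, a minimal sketch of the
kind of check described in the quoted changelog (simplified, not the literal
patch; warn(), memmove() and __memcpy() are assumed to be the boot stub's own
helpers in arch/x86/boot/compressed/):

void *memcpy(void *dest, const void *src, size_t n)
{
	unsigned long d = (unsigned long)dest;
	unsigned long s = (unsigned long)src;

	if (d < s + n && s < d + n) {
		/* overlapping regions: report it and fall back to memmove() */
		warn("Avoiding potentially unsafe overlapping memcpy()!");
		return memmove(dest, src, n);
	}

	/* no overlap: the check already ran, so call __memcpy() directly */
	return __memcpy(dest, src, n);
}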


Re: efi_enabled(EFI_PARAVIRT) use

2016-04-28 Thread Ingo Molnar

* Stephen Rothwell  wrote:

> Hi all,
> 
> Today's linux-next merge of the xen-tip tree got a conflict in:
> 
>   drivers/firmware/efi/arm-runtime.c
> 
> between commit:
> 
>   14c43be60166 ("efi/arm*: Drop writable mapping of the UEFI System table")
> 
> from the tip tree and commit:
> 
>   21c8dfaa2327 ("Xen: EFI: Parse DT parameters for Xen specific UEFI")
> 
> from the xen-tip tree.

(I've attached 21c8dfaa2327 below, for reference.)

Argh:

With considerable pain we just got rid of paravirt_enabled() in the x86 tree,
and Xen is now reintroducing it in the EFI code. Please don't: if you have to
add capability flags then name the flag according to what it does; don't use
some generic catch-all naming that will inevitably cause the kind of problems
paravirt_enabled() caused...

So: NAKed-by: Ingo Molnar 

Also, it would be nice to have all things EFI in a single tree, the conflicts
are going to be painful! There's very little reason not to carry this kind of
commit:

 arch/arm/xen/enlighten.c   |  6 +
 drivers/firmware/efi/arm-runtime.c | 17 +-
 drivers/firmware/efi/efi.c | 45 --
 3 files changed, 56 insertions(+), 12 deletions(-)

in the EFI tree.

Thanks,

Ingo

===>
From 21c8dfaa23276be2ae6d580331d8d252cc41e8d9 Mon Sep 17 00:00:00 2001
From: Shannon Zhao 
Date: Thu, 7 Apr 2016 20:03:34 +0800
Subject: [PATCH] Xen: EFI: Parse DT parameters for Xen specific UEFI

Add a new function to parse DT parameters for Xen specific UEFI just
like the way for normal UEFI. Then it could reuse the existing codes.

If Xen supports EFI, initialize runtime services.

CC: Matt Fleming 
Signed-off-by: Shannon Zhao 
Reviewed-by: Matt Fleming 
Reviewed-by: Stefano Stabellini 
Tested-by: Julien Grall 
---
 arch/arm/xen/enlighten.c   |  6 +
 drivers/firmware/efi/arm-runtime.c | 17 +-
 drivers/firmware/efi/efi.c | 45 --
 3 files changed, 56 insertions(+), 12 deletions(-)

diff --git a/arch/arm/xen/enlighten.c b/arch/arm/xen/enlighten.c
index 13e3e9f9b094..e130562d3283 100644
--- a/arch/arm/xen/enlighten.c
+++ b/arch/arm/xen/enlighten.c
@@ -261,6 +261,12 @@ static int __init fdt_find_hyper_node(unsigned long node, 
const char *uname,
!strncmp(hyper_node.prefix, s, strlen(hyper_node.prefix)))
hyper_node.version = s + strlen(hyper_node.prefix);
 
+   if (IS_ENABLED(CONFIG_XEN_EFI)) {
+   /* Check if Xen supports EFI */
+   if (of_get_flat_dt_subnode_by_name(node, "uefi") > 0)
+   set_bit(EFI_PARAVIRT, &efi.flags);
+   }
+
return 0;
 }
 
diff --git a/drivers/firmware/efi/arm-runtime.c 
b/drivers/firmware/efi/arm-runtime.c
index 6ae21e41a429..ac609b9f0b99 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 
 extern u64 efi_system_table;
 
@@ -107,13 +108,19 @@ static int __init arm_enable_runtime_services(void)
}
set_bit(EFI_SYSTEM_TABLES, &efi.flags);
 
-   if (!efi_virtmap_init()) {
-   pr_err("No UEFI virtual mapping was installed -- runtime 
services will not be available\n");
-   return -ENOMEM;
+   if (IS_ENABLED(CONFIG_XEN_EFI) && efi_enabled(EFI_PARAVIRT)) {
+   /* Set up runtime services function pointers for Xen Dom0 */
+   xen_efi_runtime_setup();
+   } else {
+   if (!efi_virtmap_init()) {
+   pr_err("No UEFI virtual mapping was installed -- 
runtime services will not be available\n");
+   return -ENOMEM;
+   }
+
+   /* Set up runtime services function pointers */
+   efi_native_runtime_setup();
}
 
-   /* Set up runtime services function pointers */
-   efi_native_runtime_setup();
set_bit(EFI_RUNTIME_SERVICES, &efi.flags);
 
efi.runtime_version = efi.systab->hdr.revision;
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 3a69ed5ecfcb..519c096a7c33 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -469,12 +469,14 @@ device_initcall(efi_load_efivars);
FIELD_SIZEOF(struct efi_fdt_params, field) \
}
 
-static __initdata struct {
+struct params {
const char name[32];
const char propname[32];
int offset;
int size;
-} dt_params[] = {
+};
+
+static struct params fdt_params[] __initdata = {
UEFI_PARAM("System Table", "linux,uefi-system-table", system_table),
UEFI_PARAM("MemMap Address", "linux,uefi-mmap-start", mmap),
UEFI_PARAM("MemMap Size", "linux,uefi-mmap-size", mmap_size),
@@ -482,24 +484,45 @@ static __initdata struct {
UEFI_PARAM("MemMap Desc. Version", "linux,uefi-mmap-desc-ver", desc_ver)
 };
 
+static struct params xen_fdt_params[] _

Re: [PATCH] scripts/dtc: dt_to_config - report kernel config options for a devicetree

2016-04-28 Thread Gaurav Minocha
On Thu, Apr 28, 2016 at 3:32 PM, Rob Herring  wrote:
> On Thu, Apr 28, 2016 at 4:46 PM, Frank Rowand  wrote:
>> From: Frank Rowand 
>>
>> Determining which kernel config options need to be enabled for a
>> given devicetree can be a painful process.  Create a new tool to
>> find the drivers that may match a devicetree node compatible,
>> find the kernel config options that enable the driver, and
>> optionally report whether the kernel config option is enabled.
>
> I would find this more useful to output a config fragment with all the
> options enabled. The hard part there is enabling the options a given
> option is dependent on which I don't think kbuild takes care of.

Do you mean to generate something like .config? If yes, then IMO it would
not be a correct configuration file.

>
>> Signed-off-by: Gaurav Minocha 
>> Signed-off-by: Frank Rowand 
>>
>> ---
>>  scripts/dtc/dt_to_config | 1061 
>> +++
>>  1 file changed, 1061 insertions(+)
>>
>> Index: b/scripts/dtc/dt_to_config
>> ===
>> --- /dev/null
>> +++ b/scripts/dtc/dt_to_config
>> @@ -0,0 +1,1061 @@
>> +#!/usr/bin/perl
>
> I don't do perl...
>
>> +
>> +#   Copyright 2016 by Frank Rowand
>> +# Š Copyright 2016 by Gaurav Minocha
>  ^
> Is this supposed to be a copyright symbol?
>
>> +#
>> +# This file is subject to the terms and conditions of the GNU General Public
>> +# License v2.
>
> [...]
>
>> +# - magic compatibles, do not have a driver
>> +#
>> +# Will not search for drivers for these compatibles.
>> +
>> +%compat_white_list = (
>> +   'fixed-partitions'  => '1',
>
> Enabling CONFIG_MTD would be useful.
>
>> +   'none'  => '1',
>
> Is this an actual string used somewhere?
>
>> +   'pci'   => '1',
>
> ditto?
>
>> +   'simple-bus'=> '1',
>> +);
>> +
>> +# magic compatibles, have a driver
>> +#
>> +# Will not search for drivers for these compatibles.
>> +# Will instead use the drivers and config options listed here.
>> +#
>> +# If you add an entry to this hash, add the corresponding entry
>> +# to %driver_config_hard_code_list.
>> +#
>> +# These compatibles have a very large number of false positives.
>
> What does this mean?
>
>> +#
>> +# 'hardcoded_no_driver' is a magic value.  Other code knows this
>> +# magic value.  Do not use 'no_driver' here!
>> +#
>> +# TODO: Revisit each 'hardcoded_no_driver' to see how the compatible
>> +#   is used.  Are there drivers that can be provided?
>> +
>> +%driver_hard_code_list = (
>> +   'cache' => ['hardcoded_no_driver'],
>> +   'eeprom'=> ['hardcoded_no_driver'],
>> +   'gpio'  => ['hardcoded_no_driver'],
>> +   'gpios' => ['drivers/leds/leds-tca6507.c'],
>> +   'gpio-keys' => ['drivers/input/keyboard/gpio_keys.c'],
>> +   'i2c-gpio'  => ['drivers/i2c/busses/i2c-gpio.c'],
>> +   'isa'   => ['arch/mips/mti-malta/malta-dt.c',
>> +'arch/x86/kernel/devicetree.c'],
>> +   'led'   => ['hardcoded_no_driver'],
>> +   'm25p32'=> ['hardcoded_no_driver'],
>> +   'm25p64'=> ['hardcoded_no_driver'],
>> +   'm25p80'=> ['hardcoded_no_driver'],
>> +   'mtd_ram'   => ['drivers/mtd/maps/physmap_of.c'],
>> +   'pwm-backlight' => ['drivers/video/backlight/pwm_bl.c'],
>> +   'spidev'=> ['hardcoded_no_driver'],
>> +   'syscon'=> ['drivers/mfd/syscon.c'],
>> +   'tlv320aic23'   => ['hardcoded_no_driver'],
>> +   'wm8731'=> ['hardcoded_no_driver'],
>> +);
>> +
>> +%driver_config_hard_code_list = (
>> +
>> +   # this one needed even if %driver_hard_code_list is empty
>> +   'no_driver' => ['no_config'],
>> +   'hardcoded_no_driver'   => ['no_config'],
>> +
>> +   'drivers/leds/leds-tca6507.c'   => ['CONFIG_LEDS_TCA6507'],
>> +   'drivers/input/keyboard/gpio_keys.c'=> ['CONFIG_KEYBOARD_GPIO'],
>> +   'drivers/i2c/busses/i2c-gpio.c' => ['CONFIG_I2C_GPIO'],
>> +   'arch/mips/mti-malta/malta-dt.c'=> ['obj-y'],
>> +   'arch/x86/kernel/devicetree.c'  => ['CONFIG_OF'],
>> +   'drivers/mtd/maps/physmap_of.c' => ['CONFIG_MTD_PHYSMAP_OF'],
>> +   'drivers/video/backlight/pwm_bl.c'  => ['CONFIG_BACKLIGHT_PWM'],
>> +   'drivers/mfd/syscon.c'  => ['CONFIG_MFD_SYSCON'],
>
> I don't understand why some of these are not searchable by compatible strings.

do you mean - pwm-backlight, gpio-keys, i2c-gpio and isa?

these are being filtered by:
my $drivers = `git grep -l '"$compat"' -- $files`;

not,

git grep -l '\.compatible\s*=\s*"$compat"' -- $files

Frank, please advise!

>
>> +

[PATCH v3 4/5] crypto: LRNG - enable compile

2016-04-28 Thread Stephan Mueller
Add LRNG compilation support.

Signed-off-by: Stephan Mueller 
---
 crypto/Kconfig  | 10 ++
 crypto/Makefile |  1 +
 2 files changed, 11 insertions(+)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 93a1fdc..938f2dc 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -1587,6 +1587,16 @@ config CRYPTO_JITTERENTROPY
  random numbers. This Jitterentropy RNG registers with
  the kernel crypto API and can be used by any caller.
 
+config CRYPTO_LRNG
+   bool "Linux Random Number Generator"
+   select CRYPTO_DRBG_MENU
+   help
+ The Linux Random Number Generator (LRNG) is the replacement
+ of the legacy /dev/random provided with drivers/char/random.c.
+ It generates entropy from different noise sources and
+ delivers significant entropy during boot. The LRNG only
+ works with the presence of a high-resolution timer.
+
 config CRYPTO_USER_API
tristate
 
diff --git a/crypto/Makefile b/crypto/Makefile
index 4f4ef7e..7f91c8e 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -114,6 +114,7 @@ obj-$(CONFIG_CRYPTO_DRBG) += drbg.o
 obj-$(CONFIG_CRYPTO_JITTERENTROPY) += jitterentropy_rng.o
 CFLAGS_jitterentropy.o = -O0
 jitterentropy_rng-y := jitterentropy.o jitterentropy-kcapi.o
+obj-$(CONFIG_CRYPTO_LRNG) += lrng.o
 obj-$(CONFIG_CRYPTO_TEST) += tcrypt.o
 obj-$(CONFIG_CRYPTO_GHASH) += ghash-generic.o
 obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
-- 
2.5.5




[PATCH v7 6/7] usb: pci-quirks: add Intel USB drcfg mux device

2016-04-28 Thread Lu Baolu
On some Intel platforms, a single USB port is shared between the USB host
and device controllers. The shared port is under the control of a switch
which is defined in the Intel vendor-defined extended capability for
xHCI.

This patch adds support for detecting the switch and creating the
platform device for the port mux.

Signed-off-by: Lu Baolu 
Reviewed-by: Felipe Balbi 
---
 drivers/usb/host/pci-quirks.c| 45 ++--
 drivers/usb/host/xhci-ext-caps.h |  2 ++
 2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/host/pci-quirks.c b/drivers/usb/host/pci-quirks.c
index 35af362..9bb7aa1 100644
--- a/drivers/usb/host/pci-quirks.c
+++ b/drivers/usb/host/pci-quirks.c
@@ -16,10 +16,11 @@
 #include 
 #include 
 #include 
+#include 
+
 #include "pci-quirks.h"
 #include "xhci-ext-caps.h"
 
-
 #define UHCI_USBLEGSUP 0xc0/* legacy support */
 #define UHCI_USBCMD0   /* command register */
 #define UHCI_USBINTR   4   /* interrupt register */
@@ -78,6 +79,8 @@
 #define USB_INTEL_USB3_PSSEN   0xD8
 #define USB_INTEL_USB3PRM  0xDC
 
+#define DEVICE_ID_INTEL_BROXTON_P_XHCI 0x5aa8
+
 /*
  * amd_chipset_gen values represent AMD different chipset generations
  */
@@ -956,6 +959,41 @@ void usb_disable_xhci_ports(struct pci_dev *xhci_pdev)
 }
 EXPORT_SYMBOL_GPL(usb_disable_xhci_ports);
 
+static void create_intel_usb_mux_device(struct pci_dev *xhci_pdev,
+   void __iomem *base)
+{
+   struct platform_device *plat_dev;
+   struct property_set pset;
+   int ret;
+
+   struct property_entry pentry[] = {
+   PROPERTY_ENTRY_U64("reg-start",
+  pci_resource_start(xhci_pdev, 0) + 0x80d8),
+   PROPERTY_ENTRY_U64("reg-size", 8),
+   { },
+   };
+
+   if (!xhci_find_next_ext_cap(base, 0, XHCI_EXT_CAPS_INTEL_USB_MUX))
+   return;
+
+   plat_dev = platform_device_alloc("intel-mux-drcfg",
+PLATFORM_DEVID_NONE);
+   if (!plat_dev)
+   return;
+
+   plat_dev->dev.parent = &xhci_pdev->dev;
+   pset.properties = pentry;
+   platform_device_add_properties(plat_dev, &pset);
+
+   ret = platform_device_add(plat_dev);
+   if (ret) {
+   dev_warn(&xhci_pdev->dev,
+"failed to create mux device with error %d",
+   ret);
+   platform_device_put(plat_dev);
+   }
+}
+
 /**
  * PCI Quirks for xHCI.
  *
@@ -1022,8 +1060,11 @@ static void quirk_usb_handoff_xhci(struct pci_dev *pdev)
writel(val, base + ext_cap_offset + XHCI_LEGACY_CONTROL_OFFSET);
 
 hc_init:
-   if (pdev->vendor == PCI_VENDOR_ID_INTEL)
+   if (pdev->vendor == PCI_VENDOR_ID_INTEL) {
usb_enable_intel_xhci_ports(pdev);
+   if (pdev->device == DEVICE_ID_INTEL_BROXTON_P_XHCI)
+   create_intel_usb_mux_device(pdev, base);
+   }
 
op_reg_base = base + XHCI_HC_LENGTH(readl(base));
 
diff --git a/drivers/usb/host/xhci-ext-caps.h b/drivers/usb/host/xhci-ext-caps.h
index e0244fb..e368ccb 100644
--- a/drivers/usb/host/xhci-ext-caps.h
+++ b/drivers/usb/host/xhci-ext-caps.h
@@ -51,6 +51,8 @@
 #define XHCI_EXT_CAPS_ROUTE5
 /* IDs 6-9 reserved */
 #define XHCI_EXT_CAPS_DEBUG10
+/* Vendor defined 192-255 */
+#define XHCI_EXT_CAPS_INTEL_USB_MUX192
 /* USB Legacy Support Capability - section 7.1.1 */
 #define XHCI_HC_BIOS_OWNED (1 << 16)
 #define XHCI_HC_OS_OWNED   (1 << 24)
-- 
2.1.4



[PATCH v7 4/7] usb: mux: add driver for Intel drcfg controlled port mux

2016-04-28 Thread Lu Baolu
Several Intel PCHs and SoCs have an internal mux that is used to
share one USB port between the device controller and the host
controller. The mux is controlled through the Dual Role Configuration
Register.

Signed-off-by: Heikki Krogerus 
Signed-off-by: Lu Baolu 
Signed-off-by: Wu Hao 
Reviewed-by: Felipe Balbi 
---
 drivers/usb/mux/Kconfig   |   8 ++
 drivers/usb/mux/Makefile  |   1 +
 drivers/usb/mux/portmux-intel-drcfg.c | 171 ++
 3 files changed, 180 insertions(+)
 create mode 100644 drivers/usb/mux/portmux-intel-drcfg.c

diff --git a/drivers/usb/mux/Kconfig b/drivers/usb/mux/Kconfig
index 1dc1f33..ae3f746 100644
--- a/drivers/usb/mux/Kconfig
+++ b/drivers/usb/mux/Kconfig
@@ -19,4 +19,12 @@ config INTEL_MUX_GPIO
  Say Y here to enable support for Intel dual role port mux
  controlled by GPIOs.
 
+config INTEL_MUX_DRCFG
+   tristate "Intel dual role port mux controlled by register"
+   depends on X86
+   select USB_PORTMUX
+   help
+ Say Y here to enable support for Intel dual role port mux
+ controlled by the Dual Role Configuration Register.
+
 endmenu
diff --git a/drivers/usb/mux/Makefile b/drivers/usb/mux/Makefile
index 4eb5582..0f102b5 100644
--- a/drivers/usb/mux/Makefile
+++ b/drivers/usb/mux/Makefile
@@ -3,3 +3,4 @@
 #
 obj-$(CONFIG_USB_PORTMUX)  += portmux-core.o
 obj-$(CONFIG_INTEL_MUX_GPIO)   += portmux-intel-gpio.o
+obj-$(CONFIG_INTEL_MUX_DRCFG)  += portmux-intel-drcfg.o
diff --git a/drivers/usb/mux/portmux-intel-drcfg.c 
b/drivers/usb/mux/portmux-intel-drcfg.c
new file mode 100644
index 000..0bb6b08
--- /dev/null
+++ b/drivers/usb/mux/portmux-intel-drcfg.c
@@ -0,0 +1,171 @@
+/**
+ * intel-mux-drcfg.c - Driver for Intel USB mux via register
+ *
+ * Copyright (C) 2016 Intel Corporation
+ * Author: Heikki Krogerus 
+ * Author: Lu Baolu 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define INTEL_MUX_CFG0 0x00
+#define INTEL_MUX_CFG1 0x04
+#define CFG0_SW_IDPIN  BIT(20)
+#define CFG0_SW_IDPIN_EN   BIT(21)
+#define CFG0_SW_VBUS_VALID BIT(24)
+#define CFG1_MODE  BIT(29)
+
+struct intel_mux_drcfg {
+   struct portmux_desc desc;
+   struct device *dev;
+   void __iomem *regs;
+   struct portmux_dev *pdev;
+};
+
+static inline int intel_mux_drcfg_switch(struct device *dev, bool host)
+{
+   u32 data;
+   struct intel_mux_drcfg *mux;
+
+   mux = dev_get_drvdata(dev);
+
+   /* Check and set mux to SW controlled mode */
+   data = readl(mux->regs + INTEL_MUX_CFG0);
+   if (!(data & CFG0_SW_IDPIN_EN)) {
+   data |= CFG0_SW_IDPIN_EN;
+   writel(data, mux->regs + INTEL_MUX_CFG0);
+   }
+
+   /*
+* Configure CFG0 to switch the mux and VBUS_VALID bit is
+* required for device mode.
+*/
+   data = readl(mux->regs + INTEL_MUX_CFG0);
+   if (host)
+   data &= ~(CFG0_SW_IDPIN | CFG0_SW_VBUS_VALID);
+   else
+   data |= (CFG0_SW_IDPIN | CFG0_SW_VBUS_VALID);
+   writel(data, mux->regs + INTEL_MUX_CFG0);
+
+   return 0;
+}
+
+static int intel_mux_drcfg_cable_set(struct device *dev)
+{
+   dev_dbg(dev, "drcfg mux switch to HOST\n");
+
+   return intel_mux_drcfg_switch(dev, true);
+}
+
+static int intel_mux_drcfg_cable_unset(struct device *dev)
+{
+   dev_dbg(dev, "drcfg mux switch to DEVICE\n");
+
+   return intel_mux_drcfg_switch(dev, false);
+}
+
+static const struct portmux_ops drcfg_ops = {
+   .cable_set_cb = intel_mux_drcfg_cable_set,
+   .cable_unset_cb = intel_mux_drcfg_cable_unset,
+};
+
+static int intel_mux_drcfg_probe(struct platform_device *pdev)
+{
+   struct intel_mux_drcfg *mux;
+   struct device *dev = &pdev->dev;
+   const char *extcon_name = NULL;
+   u64 start, size;
+   int ret;
+
+   mux = devm_kzalloc(dev, sizeof(*mux), GFP_KERNEL);
+   if (!mux)
+   return -ENOMEM;
+
+   ret = device_property_read_u64(dev, "reg-start", &start);
+   ret |= device_property_read_u64(dev, "reg-size", &size);
+   if (ret)
+   return -ENODEV;
+
+   ret = device_property_read_string(dev, "extcon-name", &extcon_name);
+   if (!ret)
+   mux->desc.extcon_name = extcon_name;
+
+   mux->regs = devm_ioremap_nocache(dev, start, size);
+   if (!mux->regs)
+   return -ENOMEM;
+
+   mux->desc.dev = dev;
+   mux->desc.name = "intel-mux-drcfg";
+   mux->desc.ops = &drcfg_ops;
+   mux->desc.initial_state =
+   !!(readl(mux->regs + INTEL_MUX_CFG1) & CFG1_MODE);
+   dev_set_drvdata(dev, mux);
+   mux->pdev = portmux_register(&mux->desc);

[PATCH v7 0/7] usb: add support for Intel dual role port mux

2016-04-28 Thread Lu Baolu
Intel SoC chips feature USB dual role. The host role is provided by
the Intel xHCI IP, and the gadget role is provided by IP from
DesignWare. Tablet platform designs always share a single port for
both host and gadget controllers.  There is a mux to switch the port
to the right controller according to the cable type. The OS needs to
provide a callback to control the mux when a plug-in event is raised.
The method to control the mux is platform dependent; at least three
types of implementation can be found across current devices:
1) GPIO pins; 2) a unit controlled through memory-mapped registers;
3) ACPI ASL code.

This patch series adds support for the Intel dual role port mux.
It includes:
(1) A helper layer on top of extcon for the individual mux drivers.
It listens to the USB-HOST extcon cable and calls the switch
callback when the cable state changes (a minimal usage sketch
follows this list).
(2) Drivers for the GPIO-controlled port mux which can be found
on Baytrail devices. An MFD driver is used to split the GPIOs
into a USB GPIO extcon device, a fixed regulator for the
GPIO-controlled USB VCC, and a USB mux device. The driver for
the USB GPIO extcon device is already in upstream Linux; this
patch series includes a driver for the GPIO USB mux.
(3) Drivers for the USB port mux controlled through memory-mapped
registers, and the logic to create the mux device. This type
of dual role port mux can be found in Cherry Trail and
Broxton devices.
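
To make the split concrete, here is a minimal sketch of how an individual mux
driver sits on top of the helper layer (names taken from the patches below;
error handling and the actual hardware switching are omitted):

static int example_cable_set(struct device *dev)	/* USB-HOST attached */
{
	/* switch the shared port to the host controller */
	return 0;
}

static int example_cable_unset(struct device *dev)	/* USB-HOST detached */
{
	/* switch the shared port to the device controller */
	return 0;
}

static const struct portmux_ops example_ops = {
	.cable_set_cb	= example_cable_set,
	.cable_unset_cb	= example_cable_unset,
};

static int example_mux_probe(struct platform_device *pdev)
{
	struct portmux_desc *desc;
	struct portmux_dev *mux;

	desc = devm_kzalloc(&pdev->dev, sizeof(*desc), GFP_KERNEL);
	if (!desc)
		return -ENOMEM;

	desc->dev = &pdev->dev;
	desc->name = "example-mux";
	desc->extcon_name = "extcon-usb-gpio";	/* USB-HOST cable provider */
	desc->ops = &example_ops;
	desc->initial_state = -1;		/* unknown at probe time */

	mux = portmux_register(desc);

	return PTR_ERR_OR_ZERO(mux);
}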

Lu Baolu (7):
  regulator: fixed: add support for ACPI interface
  usb: mux: add generic code for dual role port mux
  usb: mux: add driver for Intel gpio controlled port mux
  usb: mux: add driver for Intel drcfg controlled port mux
  mfd: intel_vuport: Add Intel virtual USB port MFD Driver
  usb: pci-quirks: add Intel USB drcfg mux device
  MAINTAINERS: add maintainer entry for Intel USB dual role mux drivers

Change log:

v6->v7:
 - Two patches have been picked up by the extcon maintainer, so they
   are removed from this version:
   - extcon: usb-gpio: add device binding for platform device
   - extcon: usb-gpio: add support for ACPI gpio interface
 - In patch "regulator: fixed: add support for ACPI interface",
   a static gpio name is used to get the regulator gpio.
 - In patch "mfd: intel_vuport: Add Intel virtual USB port MFD Driver",
   unnecessary "gpio-name" string property has been removed.

v5->v6:
 Work internally with Felipe to improve the whole patch series.
 Below changes have been made since last version.
 - rework the common code to make it a generic interface for mux devices;
 - split the vbus gpio handling to a fixed regulator device;
 - removed unnecessary filtering for state change;
 - removed unnecessary WARN statement;
 - removed globals in mux drivers;
 - removed unnecessary register polling and waiting in drcfg driver;

v4->v5:
 - Change the extcon interfaces with the new ones suggested by
   2a9de9c0f08d6 (extcon: Use the unique id for external connector
   instead of string)
 - remove patch "usb: pci-quirks: add Intel USB drcfg mux device"
   from this series since it's not driver stuff. Will be
   submitted separately.

v3->v4:
 - Check all patches with "checkpatch.pl --strict", and fix all
   CHECKs;
 - Change sysfs node from "intel_mux" to "port_mux";
 - Refines below confusing functions:
   intel_usb_mux_register() -> intel_usb_mux_bind_cable()
   intel_usb_mux_unregister() -> intel_usb_mux_unbind_cable();
 - Remove unnecessary struct intel_mux_dev.

v2->v3:
 - uvport mfd driver got reviewed by Lee Jones, the following
   changes were made accordingly.
 - separate uvport driver from the mux drivers in MAINTAINERS file
 - refine the description in Kconfig
 - refine the mfd_cell structure data

v1->v2:
 - move mux driver from drivers/usb/misc to drivers/usb/mux;
 - replace debugfs with sysfs for user level mux control;
 - remove unnecessary register restore if mux registration failed;
 - Add "Acked-by: Chanwoo Choi " to extcon changes;
 - Make the file names and exported function names more specific;
 - Remove the usb_mux_get_dev() interface;
 - Move "struct intel_usb_mux" from .h to .c file;
 - Fix various kbuild robot warnings.

 Documentation/ABI/testing/sysfs-bus-platform |  17 +++
 MAINTAINERS  |  10 ++
 drivers/mfd/Kconfig  |   8 +
 drivers/mfd/Makefile |   1 +
 drivers/mfd/intel-vuport.c   |  89 +++
 drivers/regulator/fixed.c|  46 ++
 drivers/usb/Kconfig  |   2 +
 drivers/usb/Makefile |   1 +
 drivers/usb/host/pci-quirks.c|  45 +-
 drivers/usb/host/xhci-ext-caps.h |   2 +
 drivers/usb/mux/Kconfig  |  30 
 drivers/usb/mux/Makefile |   6 +
 drivers/usb/mux/portmux-core.c   | 217 +++
 drivers/usb/mux/portmux-intel-drcfg.c| 171 +
 drivers/usb/

[PATCH v7 3/7] usb: mux: add driver for Intel gpio controlled port mux

2016-04-28 Thread Lu Baolu
On some Intel platforms, a single USB port is shared between the USB host
and device controllers. The shared port is under the control of GPIO pins.

This patch adds support for the USB GPIO-controlled port mux.

[baolu: removed .owner per platform_no_drv_owner.cocci]
Signed-off-by: David Cohen 
Signed-off-by: Lu Baolu 
Reviewed-by: Heikki Krogerus 
Reviewed-by: Felipe Balbi 
---
 drivers/usb/mux/Kconfig  |  11 +++
 drivers/usb/mux/Makefile |   1 +
 drivers/usb/mux/portmux-intel-gpio.c | 149 +++
 3 files changed, 161 insertions(+)
 create mode 100644 drivers/usb/mux/portmux-intel-gpio.c

diff --git a/drivers/usb/mux/Kconfig b/drivers/usb/mux/Kconfig
index d91909f..1dc1f33 100644
--- a/drivers/usb/mux/Kconfig
+++ b/drivers/usb/mux/Kconfig
@@ -8,4 +8,15 @@ config USB_PORTMUX
def_bool n
help
  Generic USB dual role port mux support.
+
+config INTEL_MUX_GPIO
+   tristate "Intel dual role port mux controlled by GPIOs"
+   depends on GPIOLIB
+   depends on REGULATOR
+   depends on X86 && ACPI
+   select USB_PORTMUX
+   help
+ Say Y here to enable support for Intel dual role port mux
+ controlled by GPIOs.
+
 endmenu
diff --git a/drivers/usb/mux/Makefile b/drivers/usb/mux/Makefile
index f85df92..4eb5582 100644
--- a/drivers/usb/mux/Makefile
+++ b/drivers/usb/mux/Makefile
@@ -2,3 +2,4 @@
 # Makefile for USB port mux drivers
 #
 obj-$(CONFIG_USB_PORTMUX)  += portmux-core.o
+obj-$(CONFIG_INTEL_MUX_GPIO)   += portmux-intel-gpio.o
diff --git a/drivers/usb/mux/portmux-intel-gpio.c 
b/drivers/usb/mux/portmux-intel-gpio.c
new file mode 100644
index 000..b07ae2c
--- /dev/null
+++ b/drivers/usb/mux/portmux-intel-gpio.c
@@ -0,0 +1,149 @@
+/*
+ * USB Dual Role Port Mux driver controlled by gpios
+ *
+ * Copyright (c) 2016, Intel Corporation.
+ * Author: David Cohen 
+ * Author: Lu Baolu 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct vuport {
+   struct portmux_desc desc;
+   struct portmux_dev *pdev;
+   struct regulator *regulator;
+   struct gpio_desc *gpio_usb_mux;
+};
+
+/*
+ * id == 0, HOST connected, USB port should be set to peripheral
+ * id == 1, HOST disconnected, USB port should be set to host
+ *
+ * Peripheral: set USB mux to peripheral and disable VBUS
+ * Host: set USB mux to host and enable VBUS
+ */
+static inline int vuport_set_port(struct device *dev, int id)
+{
+   struct vuport *vup;
+
+   dev_dbg(dev, "USB PORT ID: %s\n", id ? "HOST" : "PERIPHERAL");
+
+   vup = dev_get_drvdata(dev);
+
+   gpiod_set_value_cansleep(vup->gpio_usb_mux, !id);
+
+   if (!id ^ regulator_is_enabled(vup->regulator))
+   return id ? regulator_disable(vup->regulator) :
+   regulator_enable(vup->regulator);
+
+   return 0;
+}
+
+static int vuport_cable_set(struct device *dev)
+{
+   return vuport_set_port(dev, 1);
+}
+
+static int vuport_cable_unset(struct device *dev)
+{
+   return vuport_set_port(dev, 0);
+}
+
+static const struct portmux_ops vuport_ops = {
+   .cable_set_cb = vuport_cable_set,
+   .cable_unset_cb = vuport_cable_unset,
+};
+
+static int vuport_probe(struct platform_device *pdev)
+{
+   struct device *dev = &pdev->dev;
+   struct vuport *vup;
+
+   vup = devm_kzalloc(dev, sizeof(*vup), GFP_KERNEL);
+   if (!vup)
+   return -ENOMEM;
+
+   vup->regulator = devm_regulator_get_exclusive(dev,
+ "regulator-usb-gpio");
+   if (IS_ERR(vup->regulator))
+   return -EPROBE_DEFER;
+
+   vup->gpio_usb_mux = devm_gpiod_get_optional(dev,
+   "usb_mux", GPIOD_ASIS);
+   if (IS_ERR(vup->gpio_usb_mux))
+   return PTR_ERR(vup->gpio_usb_mux);
+
+   vup->desc.dev = dev;
+   vup->desc.name = "intel-mux-gpio";
+   vup->desc.extcon_name = "extcon-usb-gpio";
+   vup->desc.ops = &vuport_ops;
+   vup->desc.initial_state = -1;
+   dev_set_drvdata(dev, vup);
+   vup->pdev = portmux_register(&vup->desc);
+
+   return PTR_ERR_OR_ZERO(vup->pdev);
+}
+
+static int vuport_remove(struct platform_device *pdev)
+{
+   struct vuport *vup;
+
+   vup = platform_get_drvdata(pdev);
+   portmux_unregister(vup->pdev);
+
+   return 0;
+}
+
+#ifdef CONFIG_PM_SLEEP
+/*
+ * In case a micro A cable was plugged in while device was sleeping,
+ * we missed the interrupt. We need to poll usb id gpio when waking the
+ * driver to detect the missed event.
+ * We use 'complete' callback to give time to all extcon listeners to
+ * resume before we send new events.
+ */
+static void vuport_complete(struct device *dev)
+{
+   struct

[PATCH v7 7/7] MAINTAINERS: add maintainer entry for Intel USB dual role mux drivers

2016-04-28 Thread Lu Baolu
Add a maintainer entry for Intel USB dual role mux drivers and
add myself as a maintainer.

Signed-off-by: Lu Baolu 
---
 MAINTAINERS | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1d5b4be..682c8a5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5879,6 +5879,16 @@ S:   Maintained
 F: arch/x86/include/asm/intel_telemetry.h
 F: drivers/platform/x86/intel_telemetry*
 
+INTEL USB DUAL ROLE PORT MUX DRIVERS
+M: Lu Baolu 
+L: linux-...@vger.kernel.org
+S: Supported
+F: include/linux/usb/intel-mux.h
+F: drivers/usb/mux/intel-mux.c
+F: drivers/usb/mux/intel-mux-gpio.c
+F: drivers/usb/mux/intel-mux-drcfg.c
+F: drivers/mfd/intel-vuport.c
+
 IOC3 ETHERNET DRIVER
 M: Ralf Baechle 
 L: linux-m...@linux-mips.org
-- 
2.1.4



[PATCH v3 1/5] crypto: DRBG - externalize DRBG functions for LRNG

2016-04-28 Thread Stephan Mueller
From 443dd61dcf2cf5a7a516c552da463ee2d8aea949 Mon Sep 17 00:00:00 2001
From: Stephan Mueller 
Date: Mon, 18 Apr 2016 10:04:33 +0200
Subject: 

This patch allows several DRBG functions to be called by the LRNG kernel
code paths outside the drbg.c file.

Signed-off-by: Stephan Mueller 
---
 crypto/drbg.c | 11 +--
 include/crypto/drbg.h |  7 +++
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/crypto/drbg.c b/crypto/drbg.c
index 0a3538f..c339a2e 100644
--- a/crypto/drbg.c
+++ b/crypto/drbg.c
@@ -113,7 +113,7 @@
  * the SHA256 / AES 256 over other ciphers. Thus, the favored
  * DRBGs are the latest entries in this array.
  */
-static const struct drbg_core drbg_cores[] = {
+struct drbg_core drbg_cores[] = {
 #ifdef CONFIG_CRYPTO_DRBG_CTR
{
.flags = DRBG_CTR | DRBG_STRENGTH128,
@@ -205,7 +205,7 @@ static int drbg_uninstantiate(struct drbg_state *drbg);
  * Return: normalized strength in *bytes* value or 32 as default
  *to counter programming errors
  */
-static inline unsigned short drbg_sec_strength(drbg_flag_t flags)
+unsigned short drbg_sec_strength(drbg_flag_t flags)
 {
switch (flags & DRBG_STRENGTH_MASK) {
case DRBG_STRENGTH128:
@@ -1140,7 +1140,7 @@ static int drbg_seed(struct drbg_state *drbg, struct 
drbg_string *pers,
 }
 
 /* Free all substructures in a DRBG state without the DRBG state structure */
-static inline void drbg_dealloc_state(struct drbg_state *drbg)
+void drbg_dealloc_state(struct drbg_state *drbg)
 {
if (!drbg)
return;
@@ -1159,7 +1159,7 @@ static inline void drbg_dealloc_state(struct drbg_state 
*drbg)
  * Allocate all sub-structures for a DRBG state.
  * The DRBG state structure must already be allocated.
  */
-static inline int drbg_alloc_state(struct drbg_state *drbg)
+int drbg_alloc_state(struct drbg_state *drbg)
 {
int ret = -ENOMEM;
unsigned int sb_size = 0;
@@ -1682,8 +1682,7 @@ static int drbg_kcapi_sym(struct drbg_state *drbg, const 
unsigned char *key,
  *
  * return: flags
  */
-static inline void drbg_convert_tfm_core(const char *cra_driver_name,
-int *coreref, bool *pr)
+void drbg_convert_tfm_core(const char *cra_driver_name, int *coreref, bool *pr)
 {
int i = 0;
size_t start = 0;
diff --git a/include/crypto/drbg.h b/include/crypto/drbg.h
index d961b2b..d24ec22 100644
--- a/include/crypto/drbg.h
+++ b/include/crypto/drbg.h
@@ -268,4 +268,11 @@ enum drbg_prefixes {
DRBG_PREFIX3
 };
 
+extern int drbg_alloc_state(struct drbg_state *drbg);
+extern void drbg_dealloc_state(struct drbg_state *drbg);
+extern void drbg_convert_tfm_core(const char *cra_driver_name, int *coreref,
+ bool *pr);
+extern struct drbg_core drbg_cores[];
+extern unsigned short drbg_sec_strength(drbg_flag_t flags);
+
 #endif /* _DRBG_H */
-- 
2.5.5




[PATCH v7 2/7] usb: mux: add generic code for dual role port mux

2016-04-28 Thread Lu Baolu
Several Intel platforms implement USB dual role by having completely
separate xHCI and dwc3 IPs in the PCH or SoC silicon. These two IPs share
a single USB port, and an external port mux controls where the data lines
go. While the USB controllers are part of the silicon, the port mux
designs are platform specific.

This patch adds the generic code to handle such a USB port mux. It listens
to the USB-HOST extcon cable and switches the port by calling the port
switch ops provided by the individual port mux driver. It also registers
the mux device with sysfs, so that users can control the port mux from
user space.
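
For readers skimming the series, a minimal sketch of that notification path
(simplified and hypothetical; the struct layout and the extcon calls used by
the real portmux-core.c below may differ):

static int usb_mux_notifier(struct notifier_block *nb,
			    unsigned long event, void *ptr)
{
	struct portmux_dev *pdev = container_of(nb, struct portmux_dev, nb);

	if (event)	/* USB-HOST cable attached: hand the port to the host */
		pdev->desc->ops->cable_set_cb(pdev->desc->dev);
	else		/* cable detached: hand the port to the device controller */
		pdev->desc->ops->cable_unset_cb(pdev->desc->dev);

	return NOTIFY_DONE;
}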

Some other architectures (e.g. Renesas R-Car Gen2 SoCs) need an external
mux to swap USB roles as well, and this code could be leveraged for them
too.

Signed-off-by: Lu Baolu 
Reviewed-by: Heikki Krogerus 
Reviewed-by: Felipe Balbi 
Reviewed-by: Chanwoo Choi 
[baolu: extcon usage reviewed by Chanwoo Choi]
---
 Documentation/ABI/testing/sysfs-bus-platform |  17 +++
 drivers/usb/Kconfig  |   2 +
 drivers/usb/Makefile |   1 +
 drivers/usb/mux/Kconfig  |  11 ++
 drivers/usb/mux/Makefile |   4 +
 drivers/usb/mux/portmux-core.c   | 217 +++
 include/linux/usb/portmux.h  |  78 ++
 7 files changed, 330 insertions(+)
 create mode 100644 drivers/usb/mux/Kconfig
 create mode 100644 drivers/usb/mux/Makefile
 create mode 100644 drivers/usb/mux/portmux-core.c
 create mode 100644 include/linux/usb/portmux.h

diff --git a/Documentation/ABI/testing/sysfs-bus-platform 
b/Documentation/ABI/testing/sysfs-bus-platform
index 5172a61..f33f0a5 100644
--- a/Documentation/ABI/testing/sysfs-bus-platform
+++ b/Documentation/ABI/testing/sysfs-bus-platform
@@ -18,3 +18,20 @@ Description:
devices to opt-out of driver binding using a driver_override
name such as "none".  Only a single driver may be specified in
the override, there is no support for parsing delimiters.
+
+What:  /sys/bus/platform/devices/.../portmux.N/name
+   /sys/bus/platform/devices/.../portmux.N/state
+Date:  April 2016
+Contact:   Lu Baolu 
+Description:
+   In some platforms, a single USB port is shared between a USB 
host
+   controller and a device controller. A USB mux driver is needed 
to
+   handle the port mux. Read-only attribute "name" shows the name 
of
+   the port mux device. "state" attribute shows and stores the mux
+   state.
+   For read:
+   'peripheral' - mux switched to PERIPHERAL controller;
+   'host'   - mux switched to HOST controller.
+   For write:
+   'peripheral' - mux will be switched to PERIPHERAL controller;
+   'host'   - mux will be switched to HOST controller.
diff --git a/drivers/usb/Kconfig b/drivers/usb/Kconfig
index 8689dcb..328916e 100644
--- a/drivers/usb/Kconfig
+++ b/drivers/usb/Kconfig
@@ -148,6 +148,8 @@ endif # USB
 
 source "drivers/usb/phy/Kconfig"
 
+source "drivers/usb/mux/Kconfig"
+
 source "drivers/usb/gadget/Kconfig"
 
 config USB_LED_TRIG
diff --git a/drivers/usb/Makefile b/drivers/usb/Makefile
index dca7856..9a92338 100644
--- a/drivers/usb/Makefile
+++ b/drivers/usb/Makefile
@@ -6,6 +6,7 @@
 
 obj-$(CONFIG_USB)  += core/
 obj-$(CONFIG_USB_SUPPORT)  += phy/
+obj-$(CONFIG_USB_SUPPORT)  += mux/
 
 obj-$(CONFIG_USB_DWC3) += dwc3/
 obj-$(CONFIG_USB_DWC2) += dwc2/
diff --git a/drivers/usb/mux/Kconfig b/drivers/usb/mux/Kconfig
new file mode 100644
index 000..d91909f
--- /dev/null
+++ b/drivers/usb/mux/Kconfig
@@ -0,0 +1,11 @@
+#
+# USB port mux driver configuration
+#
+
+menu "USB Port MUX drivers"
+config USB_PORTMUX
+   select EXTCON
+   def_bool n
+   help
+ Generic USB dual role port mux support.
+endmenu
diff --git a/drivers/usb/mux/Makefile b/drivers/usb/mux/Makefile
new file mode 100644
index 000..f85df92
--- /dev/null
+++ b/drivers/usb/mux/Makefile
@@ -0,0 +1,4 @@
+#
+# Makefile for USB port mux drivers
+#
+obj-$(CONFIG_USB_PORTMUX)  += portmux-core.o
diff --git a/drivers/usb/mux/portmux-core.c b/drivers/usb/mux/portmux-core.c
new file mode 100644
index 000..0e3548b
--- /dev/null
+++ b/drivers/usb/mux/portmux-core.c
@@ -0,0 +1,217 @@
+/**
+ * intel_mux.c - USB Port Mux support
+ *
+ * Copyright (C) 2016 Intel Corporation
+ *
+ * Author: Lu Baolu 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int usb_mux_change_state(struct portmux_dev *pdev, int state)
+{
+   int ret;
+   struct device *dev = &pdev->dev;
+
+   dev_WARN_ONCE(d

[PATCH v3 3/5] crypto: Linux Random Number Generator

2016-04-28 Thread Stephan Mueller
The LRNG with all its properties is documented in [1]. This
documentation covers the functional discussion as well as testing of all
aspects of entropy processing. In addition, the documentation explains
the conducted regression tests to verify that the LRNG is API and ABI
compatible with the legacy /dev/random implementation.

[1] http://www.chronox.de/lrng.html

Signed-off-by: Stephan Mueller 
---
 crypto/lrng.c | 1914 +
 1 file changed, 1914 insertions(+)
 create mode 100644 crypto/lrng.c

diff --git a/crypto/lrng.c b/crypto/lrng.c
new file mode 100644
index 000..40fbdc0
--- /dev/null
+++ b/crypto/lrng.c
@@ -0,0 +1,1914 @@
+/*
+ * Linux Random Number Generator (LRNG)
+ *
+ * Documentation and test code: http://www.chronox.de/lrng.html
+ *
+ * Copyright (C) 2016, Stephan Mueller 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, and the entire permission notice in its entirety,
+ *including the disclaimer of warranties.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. The name of the author may not be used to endorse or promote
+ *products derived from this software without specific prior
+ *written permission.
+ *
+ * ALTERNATIVELY, this product may be distributed under the terms of
+ * the GNU General Public License, in which case the provisions of the GPL2
+ * are required INSTEAD OF the above restrictions.  (This clause is
+ * necessary due to a potential bad interaction between the GPL and
+ * the restrictions contained in a BSD-style copyright.)
+ *
+ * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESS OR IMPLIED
+ * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ALL OF
+ * WHICH ARE HEREBY DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT
+ * OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+ * BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE
+ * USE OF THIS SOFTWARE, EVEN IF NOT ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+/*
+ * Define one DRBG out of each type with 256 bits of security strength.
+ *
+ * This definition is allowed to be changed.
+ */
+#ifdef CONFIG_CRYPTO_DRBG_HMAC
+# if 0
+#  define LRNG_DRBG_BLOCKLEN_BYTES 64
+#  define LRNG_DRBG_SECURITY_STRENGTH_BYTES 32
+#  define LRNG_DRBG_CORE "drbg_nopr_hmac_sha512"   /* HMAC DRBG SHA-512 */
+# else
+#  define LRNG_DRBG_BLOCKLEN_BYTES 32
+#  define LRNG_DRBG_SECURITY_STRENGTH_BYTES 32
+#  define LRNG_DRBG_CORE "drbg_nopr_hmac_sha256"   /* HMAC DRBG SHA-256 */
+# endif
+#elif defined CONFIG_CRYPTO_DRBG_HASH
+# if 0
+#  define LRNG_DRBG_BLOCKLEN_BYTES 64
+#  define LRNG_DRBG_SECURITY_STRENGTH_BYTES 32
+#  define LRNG_DRBG_CORE "drbg_nopr_sha512"/* Hash DRBG SHA-512 */
+# else
+#  define LRNG_DRBG_BLOCKLEN_BYTES 32
+#  define LRNG_DRBG_SECURITY_STRENGTH_BYTES 32
+#  define LRNG_DRBG_CORE "drbg_nopr_sha256"/* Hash DRBG SHA-256 */
+# endif
+#elif defined CONFIG_CRYPTO_DRBG_CTR
+# define LRNG_DRBG_BLOCKLEN_BYTES 16
+# define LRNG_DRBG_SECURITY_STRENGTH_BYTES 32
+# define LRNG_DRBG_CORE "drbg_nopr_ctr_aes256" /* CTR DRBG AES-256 */
+#else
+# error "LRNG requires the presence of a DRBG"
+#endif
+
+/* Primary DRBG state handle */
+struct lrng_pdrbg {
+   struct drbg_state *pdrbg;   /* DRBG handle */
+   bool pdrbg_fully_seeded;/* Is DRBG fully seeded? */
+   bool pdrbg_min_seeded;  /* Is DRBG minimally seeded? */
+   u32 pdrbg_entropy_bits; /* Is DRBG entropy level */
+   struct work_struct lrng_seed_work;  /* (re)seed work queue */
+   spinlock_t lock;
+};
+
+/* Secondary DRBG state handle */
+struct lrng_sdrbg {
+   struct drbg_state *sdrbg;   /* DRBG handle */
+   atomic_t requests;  /* Number of DRBG requests */
+   unsigned long last_seeded;  /* Last time it was seeded */
+   bool fully_seeded;  /* Is DRBG fully seeded? */
+   spinlock_t lock;
+};
+
+#define LRNG_DRBG_BLOCKLEN_BITS (LRNG_DRBG_BLOCKLEN_BYTES * 8)
+#define LRNG_DRBG_SECURITY_STRENGTH_BITS (LRNG_DRBG_SECURITY_STRENGTH_BYT

[PATCH] staging: sm750fb: Braces {} on all arms

2016-04-28 Thread Manav Batra
Signed-off-by: Manav Batra 

Applied braces on all arms of the statement.
---
 drivers/staging/sm750fb/ddk750_chip.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/sm750fb/ddk750_chip.c 
b/drivers/staging/sm750fb/ddk750_chip.c
index f80ee77..7748862 100644
--- a/drivers/staging/sm750fb/ddk750_chip.c
+++ b/drivers/staging/sm750fb/ddk750_chip.c
@@ -19,15 +19,16 @@ logical_chip_type_t getChipType(void)
physicalID = devId750; /* either 0x718 or 0x750 */
physicalRev = revId750;
 
-   if (physicalID == 0x718)
+   if (physicalID == 0x718) {
chip = SM718;
-   else if (physicalID == 0x750) {
+   } else if (physicalID == 0x750) {
chip = SM750;
/* SM750 and SM750LE are different in their revision ID only. */
if (physicalRev == SM750LE_REVISION_ID)
chip = SM750LE;
-   } else
+   } else {
chip = SM_UNKNOWN;
+   }
 
return chip;
 }
-- 
1.9.1



[PATCH v7 1/7] regulator: fixed: add support for ACPI interface

2016-04-28 Thread Lu Baolu
Add support for retrieving fixed voltage configuration information through
the ACPI interface. This is needed for Intel Bay Trail devices, where a
GPIO is used to control the USB VBUS.

Signed-off-by: Lu Baolu 
---
 drivers/regulator/fixed.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/drivers/regulator/fixed.c b/drivers/regulator/fixed.c
index ff62d69..68057dc 100644
--- a/drivers/regulator/fixed.c
+++ b/drivers/regulator/fixed.c
@@ -30,6 +30,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 struct fixed_voltage_data {
struct regulator_desc desc;
@@ -104,6 +107,44 @@ of_get_fixed_voltage_config(struct device *dev,
return config;
 }
 
+/**
+ * acpi_get_fixed_voltage_config - extract fixed_voltage_config structure info
+ * @dev: device requesting for fixed_voltage_config
+ * @desc: regulator description
+ *
+ * Populates fixed_voltage_config structure by extracting data through ACPI
+ * interface, returns a pointer to the populated structure or NULL if memory
+ * alloc fails.
+ */
+static struct fixed_voltage_config *
+acpi_get_fixed_voltage_config(struct device *dev,
+ const struct regulator_desc *desc)
+{
+   struct fixed_voltage_config *config;
+   const char *supply_name, *gpio_name;
+   struct gpio_desc *gpiod;
+   int ret;
+
+   config = devm_kzalloc(dev, sizeof(*config), GFP_KERNEL);
+   if (!config)
+   return ERR_PTR(-ENOMEM);
+
+   ret = device_property_read_string(dev, "supply-name", &supply_name);
+   if (!ret)
+   config->supply_name = supply_name;
+
+   gpiod = gpiod_get(dev, "vbus_en", GPIOD_ASIS);
+   if (IS_ERR(gpiod))
+   return PTR_ERR(gpiod);
+
+   config->gpio = desc_to_gpio(gpiod);
+   config->enable_high = device_property_read_bool(dev,
+   "enable-active-high");
+   gpiod_put(gpiod);
+
+   return config;
+}
+
 static struct regulator_ops fixed_voltage_ops = {
 };
 
@@ -124,6 +165,11 @@ static int reg_fixed_voltage_probe(struct platform_device 
*pdev)
 &drvdata->desc);
if (IS_ERR(config))
return PTR_ERR(config);
+   } else if (ACPI_HANDLE(&pdev->dev)) {
+   config = acpi_get_fixed_voltage_config(&pdev->dev,
+  &drvdata->desc);
+   if (IS_ERR(config))
+   return PTR_ERR(config);
} else {
config = dev_get_platdata(&pdev->dev);
}
-- 
2.1.4



[PATCH v7 5/7] mfd: intel_vuport: Add Intel virtual USB port MFD Driver

2016-04-28 Thread Lu Baolu
Some Intel platforms have a USB port mux controlled by GPIOs.
There's a single ACPI platform device that provides 1) USB ID
extcon device; 2) USB vbus regulator device; and 3) USB port
switch device. This MFD driver will split these 3 devices for
their respective drivers.

[baolu: removed .owner per platform_no_drv_owner.cocci]
Suggested-by: David Cohen 
Signed-off-by: Lu Baolu 
Reviewed-by: Felipe Balbi 
---
 drivers/mfd/Kconfig|  8 +
 drivers/mfd/Makefile   |  1 +
 drivers/mfd/intel-vuport.c | 89 ++
 3 files changed, 98 insertions(+)
 create mode 100644 drivers/mfd/intel-vuport.c

diff --git a/drivers/mfd/Kconfig b/drivers/mfd/Kconfig
index eea61e3..7e115ab 100644
--- a/drivers/mfd/Kconfig
+++ b/drivers/mfd/Kconfig
@@ -1578,5 +1578,13 @@ config MFD_VEXPRESS_SYSREG
  System Registers are the platform configuration block
  on the ARM Ltd. Versatile Express board.
 
+config MFD_INTEL_VUPORT
+   tristate "Intel virtual USB port controller"
+   select MFD_CORE
+   depends on X86 && ACPI
+   help
+ Say Y here to enable support for Intel's dual role port mux
+ controlled by GPIOs.
+
 endmenu
 endif
diff --git a/drivers/mfd/Makefile b/drivers/mfd/Makefile
index 5eaa6465d..65b0518 100644
--- a/drivers/mfd/Makefile
+++ b/drivers/mfd/Makefile
@@ -203,3 +203,4 @@ intel-soc-pmic-objs := intel_soc_pmic_core.o 
intel_soc_pmic_crc.o
 intel-soc-pmic-$(CONFIG_INTEL_PMC_IPC) += intel_soc_pmic_bxtwc.o
 obj-$(CONFIG_INTEL_SOC_PMIC)   += intel-soc-pmic.o
 obj-$(CONFIG_MFD_MT6397)   += mt6397-core.o
+obj-$(CONFIG_MFD_INTEL_VUPORT) += intel-vuport.o
diff --git a/drivers/mfd/intel-vuport.c b/drivers/mfd/intel-vuport.c
new file mode 100644
index 000..1cb4ea3
--- /dev/null
+++ b/drivers/mfd/intel-vuport.c
@@ -0,0 +1,89 @@
+/*
+ * MFD driver for Intel virtual USB port
+ *
+ * Copyright(c) 2016 Intel Corporation.
+ * Author: Lu Baolu 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* ACPI GPIO Mappings */
+static const struct acpi_gpio_params id_gpio = { 0, 0, false };
+static const struct acpi_gpio_params vbus_gpio = { 1, 0, false };
+static const struct acpi_gpio_params mux_gpio = { 2, 0, false };
+static const struct acpi_gpio_mapping acpi_usb_gpios[] = {
+   { "id-gpios", &id_gpio, 1 },
+   { "vbus_en-gpios", &vbus_gpio, 1 },
+   { "usb_mux-gpios", &mux_gpio, 1 },
+   { },
+};
+
+static struct property_entry reg_properties[] = {
+   PROPERTY_ENTRY_STRING("supply-name", "regulator-usb-gpio"),
+   { },
+};
+
+static const struct property_set reg_properties_pset = {
+   .properties = reg_properties,
+};
+
+static const struct mfd_cell intel_vuport_mfd_cells[] = {
+   { .name = "extcon-usb-gpio", },
+   {
+   .name = "reg-fixed-voltage",
+   .pset = ®_properties_pset,
+   },
+   { .name = "intel-mux-gpio", },
+};
+
+static int vuport_probe(struct platform_device *pdev)
+{
+   struct device *dev = &pdev->dev;
+   int ret;
+
+   ret = acpi_dev_add_driver_gpios(ACPI_COMPANION(dev), acpi_usb_gpios);
+   if (ret)
+   return ret;
+
+   return mfd_add_devices(&pdev->dev, PLATFORM_DEVID_NONE,
+   intel_vuport_mfd_cells,
+   ARRAY_SIZE(intel_vuport_mfd_cells), NULL, 0,
+   NULL);
+}
+
+static int vuport_remove(struct platform_device *pdev)
+{
+   mfd_remove_devices(&pdev->dev);
+   acpi_dev_remove_driver_gpios(ACPI_COMPANION(&pdev->dev));
+
+   return 0;
+}
+
+static struct acpi_device_id vuport_acpi_match[] = {
+   { "INT3496" },
+   { }
+};
+MODULE_DEVICE_TABLE(acpi, vuport_acpi_match);
+
+static struct platform_driver vuport_driver = {
+   .driver = {
+   .name = "intel-vuport",
+   .acpi_match_table = ACPI_PTR(vuport_acpi_match),
+   },
+   .probe = vuport_probe,
+   .remove = vuport_remove,
+};
+
+module_platform_driver(vuport_driver);
+
+MODULE_AUTHOR("Lu Baolu ");
+MODULE_DESCRIPTION("Intel virtual USB port");
+MODULE_LICENSE("GPL v2");
-- 
2.1.4



[PATCH v3 5/5] random: add interrupt callback to VMBus IRQ handler

2016-04-28 Thread Stephan Mueller
The Hyper-V Linux Integration Services use the VMBus implementation for
communication with the Hypervisor. VMBus registers its own interrupt
handler that completely bypasses the common Linux interrupt handling.
This implies that the interrupt entropy collector is not triggered.

This patch adds the interrupt entropy collection callback into the VMBus
interrupt handler function.

Signed-off-by: Stephan Mueller 
Signed-off-by: Stephan Mueller 
---
 drivers/char/random.c  | 1 +
 drivers/hv/vmbus_drv.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 92c2174..9632976 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -947,6 +947,7 @@ void add_interrupt_randomness(int irq, int irq_flags)
/* award one bit for the contents of the fast pool */
credit_entropy_bits(r, credit + 1);
 }
+EXPORT_SYMBOL_GPL(add_interrupt_randomness);
 
 #ifdef CONFIG_BLOCK
 void add_disk_randomness(struct gendisk *disk)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 64713ff..9af61bb 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "hyperv_vmbus.h"
 
 static struct acpi_device  *hv_acpi_dev;
@@ -801,6 +802,8 @@ static void vmbus_isr(void)
else
tasklet_schedule(hv_context.msg_dpc[cpu]);
}
+
+   add_interrupt_randomness(HYPERVISOR_CALLBACK_VECTOR, 0);
 }
 
 
-- 
2.5.5




Re: linux-next: manual merge of the akpm-current tree with the tip tree

2016-04-28 Thread Ingo Molnar

* Stephen Rothwell  wrote:

> Hi Andrew,
> 
> Today's linux-next merge of the akpm-current tree got a conflict in:
> 
>   include/linux/efi.h
> 
> between commit:
> 
>   2c23b73c2d02 ("Ard Biesheuvel ")
> 
> from the tip tree and commit:
> 
>   9f2c36a7b097 ("include/linux/efi.h: redefine type, constant, macro from 
> generic code")
> 
> from the akpm-current tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Btw., while looking at this, I noticed that akpm-current introduced this 
namespace 
collision:

include/acpi/acconfig.h:#define UUID_STRING_LENGTH  36  /* Total length 
of a UUID string */
include/linux/uuid.h:#defineUUID_STRING_LEN 36

I suspect the include/acpi/acconfig.h define should be renamed:

UUID_STRING_LENGTH -> ACPI_UUID_STRING_LENGTH
UUID_BUFFER_LENGTH -> ACPI_UUID_BUFFER_LENGTH

... before the collision causes any trouble.

Thanks,

Ingo


[PATCH v3 0/5] /dev/random - a new approach

2016-04-28 Thread Stephan Mueller
Hi Herbert, Ted, Andi,

The following patch set provides a different approach to /dev/random which
I call Linux Random Number Generator (LRNG) to collect entropy within the Linux
kernel. The main improvements compared to the legacy /dev/random is to provide
sufficient entropy during boot time as well as in virtual environments and when
using SSDs. A secondary design goal is to limit the impact of the entropy
collection on massive parallel systems and also allow the use accelerated
cryptographic primitives. Also, all steps of the entropic data processing are
testable. Finally massive performance improvements are visible at /dev/urandom
and get_random_bytes.

The design and implementation is driven by a set of goals described in [1]
that the LRNG completely implements. Furthermore, [1] includes a
comparison with RNG design suggestions such as SP800-90B, SP800-90C, and
AIS20/31.

To Joe Perches: I have not forgotten the request to move the documentation and
test code into patches for the kernel tree. But I would like to first let the dust
settle before trying to integrate them.

To Andi Kleen: Would it be possible for you to test the per-NUMA secondary DRBG
code, please? Simply apply the patches, compile the LRNG (found in the
Cryptographic API menuconfig) and then run your performance tests. Note,
I tested the correctness of the implementation on a per-CPU instantiation test
and using the fake-NUMA setup. But I do not have a real NUMA system. You may
see kernel logs when you boot with the kernel command line option of:
dyndbg="file lrng.c line 1-1900 +p"

Changes v3:
* Convert debug printk to pr_debug as suggested by Joe Perches
* Add missing \n as suggested by Joe Perches
* Do not mix in stuck IRQ measurements as requested by Pavel Machek
* Add handling logic for systems without high-res timer as suggested by Pavel
  Machek -- it uses ideas from the add_interrupt_randomness of the legacy
  /dev/random implementation
* add per NUMA node secondary DRBGs as suggested by Andi Kleen -- the
  explanation of how the logic works is given in section 2.1.1 of my
  documentation [1], especially how the initial seeding is performed.

Changes v2:
* Removal of the Jitter RNG fast noise source as requested by Ted
* Addition of processing of add_input_randomness as suggested by Ted
* Update documentation and testing in [1] to cover the updates
* Addition of a SystemTap script to test add_input_randomness
* To clarify the question whether sufficient entropy is present during boot
  I added one more test in 3.3.1 [1] which demonstrates the providing of
  sufficient entropy during initialization. In the worst case of no fast noise
  sources, in the worst case of a virtual machine with only very few hardware
  devices, the testing shows that the secondary DRBG is fully seeded with 256
  bits of entropy before user space injects the random data obtained
  during shutdown of the previous boot (i.e. the requirement phrased by the
  legacy /dev/random implementation). As the writing of the random data into
  /dev/random by user space will happen before any cryptographic service
  is initialized in user space, this test demonstrates that sufficient
  entropy is already present in the LRNG at the time user space requires it
  for seeding cryptographic daemons. Note, this test result was obtained
  for different architectures, such as x86 64 bit, x86 32 bit, ARM 32 bit and
  MIPS 32 bit.

[1] http://www.chronox.de/lrng/doc/lrng.pdf

[2] http://www.chronox.de/lrng.html

Stephan Mueller (5):
  crypto: DRBG - externalize DRBG functions for LRNG
  random: conditionally compile code depending on LRNG
  crypto: Linux Random Number Generator
  crypto: LRNG - enable compile
  random: add interrupt callback to VMBus IRQ handler

 crypto/Kconfig |   10 +
 crypto/Makefile|1 +
 crypto/drbg.c  |   11 +-
 crypto/lrng.c  | 1914 
 drivers/char/random.c  |9 +
 drivers/hv/vmbus_drv.c |3 +
 include/crypto/drbg.h  |7 +
 include/linux/genhd.h  |5 +
 include/linux/random.h |7 +-
 9 files changed, 1960 insertions(+), 7 deletions(-)
 create mode 100644 crypto/lrng.c

-- 
2.5.5




[PATCH v3 2/5] random: conditionally compile code depending on LRNG

2016-04-28 Thread Stephan Mueller
When selecting the LRNG for compilation, disable the legacy /dev/random
implementation.

The LRNG is a drop-in replacement for the legacy /dev/random which
implements the same in-kernel and user space API. Only the hooks of
/dev/random into other parts of the kernel need to be disabled.

Signed-off-by: Stephan Mueller 
---
 drivers/char/random.c  | 8 
 include/linux/genhd.h  | 5 +
 include/linux/random.h | 7 ++-
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/char/random.c b/drivers/char/random.c
index b583e53..92c2174 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -267,6 +267,8 @@
 #include 
 #include 
 
+#ifndef CONFIG_CRYPTO_LRNG
+
 #define CREATE_TRACE_POINTS
 #include 
 
@@ -1620,6 +1622,7 @@ SYSCALL_DEFINE3(getrandom, char __user *, buf, size_t, 
count,
}
return urandom_read(NULL, buf, count, NULL);
 }
+#endif /* CONFIG_CRYPTO_LRNG */
 
 /***
  * Random UUID interface
@@ -1647,6 +1650,7 @@ EXPORT_SYMBOL(generate_random_uuid);
  *
  /
 
+#ifndef CONFIG_CRYPTO_LRNG
 #ifdef CONFIG_SYSCTL
 
 #include 
@@ -1784,6 +1788,8 @@ struct ctl_table random_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
+#endif /* CONFIG_CRYPTO_LRNG */
+
 static u32 random_int_secret[MD5_MESSAGE_BYTES / 4] cacheline_aligned;
 
 int random_int_secret_init(void)
@@ -1859,6 +1865,7 @@ randomize_range(unsigned long start, unsigned long end, 
unsigned long len)
return PAGE_ALIGN(get_random_int() % range + start);
 }
 
+#ifndef CONFIG_CRYPTO_LRNG
 /* Interface for in-kernel drivers of true hardware RNGs.
  * Those devices may produce endless random bits and will be throttled
  * when our pool is full.
@@ -1878,3 +1885,4 @@ void add_hwgenerator_randomness(const char *buffer, 
size_t count,
credit_entropy_bits(poolp, entropy);
 }
 EXPORT_SYMBOL_GPL(add_hwgenerator_randomness);
+#endif /* CONFIG_CRYPTO_LRNG */
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 5c70676..962c82f 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -450,8 +450,13 @@ extern void disk_flush_events(struct gendisk *disk, 
unsigned int mask);
 extern unsigned int disk_clear_events(struct gendisk *disk, unsigned int mask);
 
 /* drivers/char/random.c */
+#ifdef CONFIG_CRYPTO_LRNG
+#define add_disk_randomness(disk) do {} while (0)
+#define rand_initialize_disk(disk) do {} while (0)
+#else
 extern void add_disk_randomness(struct gendisk *disk);
 extern void rand_initialize_disk(struct gendisk *disk);
+#endif
 
 static inline sector_t get_start_sect(struct block_device *bdev)
 {
diff --git a/include/linux/random.h b/include/linux/random.h
index 9c29122..4d9fa6e 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -17,10 +17,15 @@ struct random_ready_callback {
struct module *owner;
 };
 
-extern void add_device_randomness(const void *, unsigned int);
 extern void add_input_randomness(unsigned int type, unsigned int code,
 unsigned int value);
 extern void add_interrupt_randomness(int irq, int irq_flags);
+#ifdef CONFIG_CRYPTO_LRNG
+#define add_device_randomness(buf, nbytes) do {} while (0)
+#else  /* CONFIG_CRYPTO_LRNG */
+extern void add_device_randomness(const void *, unsigned int);
+#define lrng_irq_process()
+#endif /* CONFIG_CRYPTO_LRNG */
 
 extern void get_random_bytes(void *buf, int nbytes);
 extern int add_random_ready_callback(struct random_ready_callback *rdy);
-- 
2.5.5




Re: [PATCH v2] sched/completion: convert completions to use simple wait queues

2016-04-28 Thread Daniel Wagner
On 04/28/2016 02:57 PM, Daniel Wagner wrote:
> Only one complete_all() user could been identified so far, which happens
> to be drivers/base/power/main.c. Several waiters appear when suspend
> to disk or mem is executed.

BTW, this is what I get when doing a 'echo "disk" > /sys/power/state' on
a 4 socket E5-4610 (Ivy Bridge EP) system.


swait_stat version 0.1
-
  class name 1 waiter2 waiters3 waiters 
  4+ waiters
-
[...]
 &x->wait#12   90   115 
   1
 [] dpm_wait+0x32/0x40
   20  
[] __device_suspend+0x1b4/0x370
4  
[] __device_suspend_late+0x74/0x210
   22  
[] __device_suspend_noirq+0x51/0x200
2  
[] device_resume_early+0x69/0x1b0
   59  
[] device_resume+0x50/0x1f0
[...]


[PATCH] staging: rts5208: Alignment should match open parenthesis

2016-04-28 Thread Manav Batra
Signed-off-by: Manav Batra 

Fixed alignment of parentheses.
---
 drivers/staging/rts5208/ms.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/rts5208/ms.c b/drivers/staging/rts5208/ms.c
index 0f0cd4a..6c5ef29 100644
--- a/drivers/staging/rts5208/ms.c
+++ b/drivers/staging/rts5208/ms.c
@@ -63,7 +63,7 @@ static int ms_transfer_tpc(struct rtsx_chip *chip, u8 
trans_mode,
rtsx_add_cmd(chip, WRITE_REG_CMD, MS_BYTE_CNT, 0xFF, cnt);
rtsx_add_cmd(chip, WRITE_REG_CMD, MS_TRANS_CFG, 0xFF, cfg);
rtsx_add_cmd(chip, WRITE_REG_CMD, CARD_DATA_SOURCE,
-   0x01, PINGPONG_BUFFER);
+0x01, PINGPONG_BUFFER);
 
rtsx_add_cmd(chip, WRITE_REG_CMD, MS_TRANSFER,
0xFF, MS_TRANSFER_START | trans_mode);
-- 
1.9.1



Re: [PATCH] usb: dwc3: usb/dwc3: fake dissconnect event when turn off pullup

2016-04-28 Thread Felipe Balbi

Hi,

John Youn  writes:
>> "Du, Changbin"  writes:
>>> Hi, Balbi,
>>>
>>> The step to reproduce this issue is:
>>> 1) connect device to a host and wait its enumeration.
>>> 2) trigger software disconnect by calling function
>>> usb_gadget_disconnect(), which finally call
>>>dwc3_gadget_pullup(false). Do not reconnect device
>>>   (I mean no enumeration go on, keep bit Run/Stop 0.).
>>>
>>> At here, gadget driver's disconnect callback should be
>>> Called, right? We has been disconnected. But no, as
>>> You said " not generating disconnect IRQ after you
>>> drop Run/Stop is expected".
>>>
>>> And I am testing on an Android device, Android only
>>> use dwc3_gadget_pullup(false) to issue a soft disconnection.
>>> This confused user that the UI still show usb as connected
>>> State, caused by missing a disconnect event.
>> 
>> okay, so I know what this is. This is caused by Android gadget itself
>> not notifying the gadget that a disconnect has happened. Just look at
>> udc-core's soft_connect implementation for the sysfs interface, and
>> you'll see what I mean.
>> 
>> This should be fixed at Android gadget itself. The only thing we could
>> do is introduce a new usb_gadget_soft_connect()/disconnect() to wrap the
>> logic so it's easier for Android gadget to use; but even that I'm a
>> little bit reluctant to do because Android should be using our
>> soft_connect interface instead of reimplementing it (wrongly) by its
>> own.
>> 
>
> We've run in to the same issue with our usb_gadget_driver.
>
> If the usb_gadget_disconnect() API function, which seems like it is
> intended to be called by the gadget_driver, does cause the gadget to
> disconnect, it seems reasonable to expect the gadget or the UDC core
> to notify the gadget_driver via the callback.

Well, the API is supposed to disconnect D+ pullup and that's about it.

> As you mentioned this is handled in the soft_disconnect sysfs. Why
> shouldn't usb_gadget_disconnect() do the same thing, if not the gadget

because there might be cases where we don't need/want the gadget to know
about this disconnect.

> itself? Exposing the sysfs as an API function would work too. Though

it already _is_ exported. I just don't know why people are re-inventing
the same solution :-)
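
(For reference, the existing sysfs soft_connect "disconnect" handler in
udc-core amounts to roughly the sketch below; a usb_gadget_soft_disconnect()
wrapper (hypothetical name) would just package the same two steps for
gadget drivers. Simplified, udc_stop handling omitted:)

        /* sketch only; struct usb_udc is private to udc-core.c */
        static int usb_gadget_soft_disconnect(struct usb_udc *udc)
        {
                int ret;

                /* drop the D+ pullup ... */
                ret = usb_gadget_disconnect(udc->gadget);
                if (ret)
                        return ret;

                /* ... and, unlike a bare pullup drop, tell the gadget driver */
                if (udc->driver && udc->driver->disconnect)
                        udc->driver->disconnect(udc->gadget);

                return 0;
        }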

> both functions are "soft" disconnects and both are called
> "disconnect".
>
> In our gadget_driver we do the workaround where we notify ourself that
> we called the usb_gadget_disconnect() and that we should now be

you could just rely on the sysfs interface, right ? :-)

> disconnected. It just seems more correct that we shouldn't have to
> handle that.
>
> By the way, I'm not completely sure of the correct terminology, but
> I'm referring to the struct usb_gadget (dwc3, dwc2, etc) as the
> "gadget" and the struct usb_gadget_driver as the "gadget_driver"
> (normally this would be the composite gadget framework, but we are
> using our own driver in this case). Is there a less confusing way to
> refer to these :)

what I've been doing is that I refer to dwc3, dwc2, etc as UDC (as in
USB Device Controller) and g_mass_storage, g_ether, g_zero, etc as
gadget driver.

-- 
balbi


signature.asc
Description: PGP signature


linux-next: manual merge of the akpm-current tree with the tip tree

2016-04-28 Thread Stephen Rothwell
Hi Andrew,

Today's linux-next merge of the akpm-current tree got a conflict in:

  include/linux/efi.h

between commit:

  2c23b73c2d02 ("Ard Biesheuvel ")

from the tip tree and commit:

  9f2c36a7b097 ("include/linux/efi.h: redefine type, constant, macro from 
generic code")

from the akpm-current tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc include/linux/efi.h
index aa36fb8bea4b,5b1d5c5b4080..
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@@ -21,7 -21,7 +21,8 @@@
  #include 
  #include 
  #include 
 +#include 
+ #include 
  
  #include 
  


Re: [PATCH 0/2] scop GFP_NOFS api

2016-04-28 Thread NeilBrown
On Tue, Apr 26 2016, Michal Hocko wrote:

> Hi,
> we have discussed this topic at LSF/MM this year. There was a general
> interest in the scope GFP_NOFS allocation context among some FS
> developers. For those who are not aware of the discussion or the issue
> I am trying to sort out (or at least start in that direction) please
> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> which basically copies what we have for the scope GFP_NOIO allocation
> context. I haven't converted any of the FS myself because that is way
> beyond my area of expertise but I would be happy to help with further
> changes on the MM front as well as in some more generic code paths.
>
> Dave had an idea on how to further improve the reclaim context to be
> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> and FS specific cookie set in the FS allocation context and consumed
> by the FS reclaim context to allow doing some provably save actions
> that would be skipped due to GFP_NOFS normally.  I like this idea and
> I believe we can go that direction regardless of the approach taken here.
> Many filesystems simply need to cleanup their NOFS usage first before
> diving into a more complex changes.>

This strikes me as over-engineering to work around an unnecessarily
burdensome interface but without details it is hard to be certain.

Exactly what things happen in "FS reclaim context" which may, or may
not, be safe depending on the specific FS allocation context?  Do they
need to happen at all?

My research suggests that for most filesystems the only thing that
happens in reclaim context that is at all troublesome is the final
'evict()' on an inode.  This needs to flush out dirty pages and sync the
inode to storage.  Some time ago we moved most dirty-page writeout out
of the reclaim context and into kswapd.  I think this was an excellent
advance in simplicity.
If we could similarly move evict() into kswapd (and I believe we can)
then most file systems would do nothing in reclaim context that
interferes with allocation context.

The exceptions include:
 - nfs and any filesystem using fscache can block for up to 1 second
   in ->releasepage().  They used to block waiting for some IO, but that
   caused deadlocks and wasn't really needed.  I left the timeout because
   it seemed likely that some throttling would help.  I suspect that a
   careful analysis will show that there is sufficient throttling
   elsewhere.

 - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
   for IO so it can free some quotainfo things.  If it could be changed
   to just schedule the IO without waiting for it then I think this
   would be safe to be called in any FS allocation context.  It already
   uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
   if the lock is held.

I think you/we would end up with a much simpler system if instead of
focussing on the places where GFP_NOFS is used, we focus on places where
__GFP_FS is tested, and try to remove them.  If we get rid of enough of
them the remainder could just use __GFP_IO.

> The patch 2 is a debugging aid which warns about explicit allocation
> requests from the scope context. This is should help to reduce the
> direct usage of the NOFS flags to bare minimum in favor of the scope
> API. It is not aimed to be merged upstream. I would hope Andrew took it
> into mmotm tree to give it linux-next exposure and allow developers to
> do further cleanups.  There is a new kernel command line parameter which
> has to be used for the debugging to be enabled.
>
> I think the GFP_NOIO should be seeing the same clean up.

I think you are suggesting that use of GFP_NOIO should (largely) be
deprecated in favour of memalloc_noio_save().  I think I agree.
Could we go a step further and deprecate GFP_ATOMIC in favour of some
in_atomic() test?  Maybe that is going too far.
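
For illustration, the scoped pattern looks roughly like this
(memalloc_noio_save()/restore() exist today; the nofs pair is what patch 1
proposes, and the allocating callee is a made-up name):

        #include <linux/sched.h>        /* memalloc_noio_save/restore */

        static void scoped_alloc_example(void)
        {
                unsigned int noio_flags = memalloc_noio_save();

                /*
                 * Every allocation in this scope, including plain GFP_KERNEL
                 * allocations made by callees, implicitly behaves as GFP_NOIO,
                 * so reclaim cannot recurse into the I/O path.  The proposed
                 * memalloc_nofs_save()/restore() pair would work the same way
                 * for the FS path.
                 */
                do_something_that_allocates();  /* hypothetical callee */

                memalloc_noio_restore(noio_flags);
        }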

Thanks,
NeilBrown

>
> Any feedback is highly appreciated.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


signature.asc
Description: PGP signature


linux-next: manual merge of the akpm-current tree with the tile tree

2016-04-28 Thread Stephen Rothwell
Hi Andrew,

Today's linux-next merge of the akpm-current tree got a conflict in:

  arch/tile/Kconfig

between commit:

  4ef00aa30a3f ("tile: sort the "select" lines in the TILE/TILEGX configs")

from the tile tree and commits:

  628b7a1e7049 ("exit_thread: remove empty bodies")
  803ae84888bb ("printk/nmi: generic solution for safe printk in NMI")

from the akpm-current tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/tile/Kconfig
index c3bbb295bc4a,76989b878f3c..
--- a/arch/tile/Kconfig
+++ b/arch/tile/Kconfig
@@@ -18,21 -35,17 +18,23 @@@ config TIL
select GENERIC_STRNCPY_FROM_USER
select GENERIC_STRNLEN_USER
select HAVE_ARCH_SECCOMP_FILTER
 -
 -# FIXME: investigate whether we need/want these options.
 -# select HAVE_IOREMAP_PROT
 -# select HAVE_OPTPROBES
 -# select HAVE_REGS_AND_STACK_ACCESS_API
 -# select HAVE_HW_BREAKPOINT
 -# select PERF_EVENTS
 -# select HAVE_USER_RETURN_NOTIFIER
 -# config NO_BOOTMEM
 -# config ARCH_SUPPORTS_DEBUG_PAGEALLOC
 -# config HUGETLB_PAGE_SIZE_VARIABLE
 +  select HAVE_ARCH_TRACEHOOK
 +  select HAVE_CONTEXT_TRACKING
 +  select HAVE_DEBUG_BUGVERBOSE
 +  select HAVE_DEBUG_KMEMLEAK
 +  select HAVE_DEBUG_STACKOVERFLOW
 +  select HAVE_DMA_API_DEBUG
++  select HAVE_EXIT_THREAD
 +  select HAVE_KVM if !TILEGX
++  select HAVE_NMI if USE_PMC
 +  select HAVE_PERF_EVENTS
 +  select HAVE_SYSCALL_TRACEPOINTS
 +  select MODULES_USE_ELF_RELA
 +  select SYSCTL_EXCEPTION_TRACE
 +  select SYS_HYPERVISOR
 +  select USER_STACKTRACE_SUPPORT
 +  select USE_PMC if PERF_EVENTS
 +  select VIRT_TO_BUS
  
  config MMU
def_bool y


Re: [PATCH] proc: prevent accessing /proc//environ until it's ready

2016-04-28 Thread Mathias Krause
On 28 April 2016 at 23:30, Andrew Morton  wrote:
> On Thu, 28 Apr 2016 21:04:18 +0200 Mathias Krause  
> wrote:
>
>> If /proc//environ gets read before the envp[] array is fully set
>> up in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying
>> to read more bytes than are actually written, as env_start will already
>> be set but env_end will still be zero, making the range calculation
>> underflow, allowing to read beyond the end of what has been written.
>>
>> Fix this as it is done for /proc//cmdline by testing env_end for
>> zero. It is, apparently, intentionally set last in create_*_tables().
>
> Also, if this is indeed our design then
>
> a) the various create_*_tables() should have comments in there which
>explain this subtlety to the reader.  Or, better, they use a common
>helper function for this readiness-signaling operation because..
>
> b) we'll need some barriers there to ensure that the environ_read()
>caller sees the create_*_tables() writes in the correct order.

I totally agree that this kind of "synchronization" is rather fragile.
Adding comments won't help much, I fear. Rather, a dedicated flag
signaling "process ready for inspection" may be needed. So far, that's
what env_end is (ab-)used for.
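
(For reference, the check being discussed is roughly the following fragment
at the top of environ_read(), mirroring what the cmdline handler already
does; see the patch for the exact hunk:)

        /*
         * Sketch of the guard (not the exact hunk): env_end is written last
         * in create_*_tables(), so zero means "not ready for inspection".
         */
        struct mm_struct *mm = file->private_data;

        if (!mm || !mm->env_end)
                return 0;       /* pretend the environment is empty for now */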

Regards,
Mathias


Re: [PATCH] proc: prevent accessing /proc//environ until it's ready

2016-04-28 Thread Mathias Krause
On 28 April 2016 at 23:26, Andrew Morton  wrote:
> On Thu, 28 Apr 2016 21:04:18 +0200 Mathias Krause  
> wrote:
>
>> If /proc//environ gets read before the envp[] array is fully set
>> up in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying
>> to read more bytes than are actually written, as env_start will already
>> be set but env_end will still be zero, making the range calculation
>> underflow, allowing to read beyond the end of what has been written.
>>
>> Fix this as it is done for /proc//cmdline by testing env_end for
>> zero. It is, apparently, intentionally set last in create_*_tables().
>>
>> This bug was found by the PaX size_overflow plugin that detected the
>> arithmetic underflow of 'this_len = env_end - (env_start + src)' when
>> env_end is still zero.
>
> So what are the implications of this?  From my reading, a craftily
> constructed application could occasionally read arbitrarily large
> amounts of kernel memory?

I don't think access_remote_vm() is capable of that. So, the only
consequence is that userland trying to access /proc//environ of a not
yet fully set up process may get inconsistent data, as we're in the
middle of copying in the environment variables.

Regards,
Mathias


Re: cgroup namespace and user namespace interactions

2016-04-28 Thread Aleksa Sarai
> The new cgroup namespace currently only allows for superficial
> interaction with the user namespace (it checks against the namespace
> it was created in whether or not a user has the right capabilities
> before allowing mounting, and things like that). However, there is one
> glaring feature that appears to be missing from the new cgroup
> namespace implementation: unprivileged user namespaces can't modify
> their sub-hierarchy. This is particularly frustrating for the
> containerisation community, where we are working on adding support for
> "rootless containers" in runC (the execution driver of Docker)[1]. It
> essentially means that we can't use cgroup resource limiting to limit
> *the resources of our own processes*. It also makes things like the
> freezer cgroup unusable.
>
> Here follows how I think we can solve this issue: the most obvious way
> of dealing with this would be (in the cgroupv1 view) to create a new
> subtree in every controller when you CLONE_NEWCGROUP. This new subtree
> is the root of the process's cgroup hierarchy. This doesn't affect any
> resource control, but it will result in the process only being able to
> affect its *own* resources. However, for cgroupv2 we have the "No
> Internal Process Constraint". So, maybe we could also move all of the
> other processes into a sibling subtree (with the *exact same* access
> permissions as the parent). Thus, the operation would look like this:
>
> - C0 - P00
>\ P01
>\ P02 (about to setns)
>
> becomes
>
> - C0 - C00 - P00
>  \ P01
>\ C01 - P02
>
> But then we have C00 which is just a waste of cycles (it doesn't have
> any resource settings). So maybe there's some optimisation we can do
> there, but that's as far as I've gotten into thinking about how to
> deal with the constraints of cgroupv2. After that's been solved we can
> reuse how we store the user namespace the cgroup was created in
> (cgroup_namespace.user_ns), and just check that whatever user is
> trying to modify the cgroup has CAP_SYS_ADMIN in that user namespace.
>
> Do you think this would work? Are there any recommendations on whether
> we can make this work better? Also, can you clarify whether or not
> CLONE_NEWCGROUP only works for cgroupv2 or does it also work on
> cgroupv1 (we haven't yet transitioned to cgroupv2 in runC).
>
> Thanks.
>
> [1]: https://github.com/opencontainers/runc/pull/774


Does anyone have an opinion on this proposal?

-- 
Aleksa Sarai (cyphar)
www.cyphar.com


Re: [PATCH v2 4/9] of: Add bindings of hw throttle for soctherm

2016-04-28 Thread Wei Ni


On 2016年04月28日 22:48, Eduardo Valentin wrote:
> On Thu, Apr 28, 2016 at 02:48:41PM +0800, Wei Ni wrote:
>>
>>
>> On 2016年04月28日 07:30, Eduardo Valentin wrote:
>>> From: Eduardo Valentin 
>>> To: Wei Ni 
>>> Cc: thierry.red...@gmail.com, robh...@kernel.org, rui.zh...@intel.com,
>>> mlongnec...@nvidia.com, swar...@wwwdotorg.org,
>>> mikko.perttu...@kapsi.fi, linux-te...@vger.kernel.org,
>>> linux...@vger.kernel.org, devicet...@vger.kernel.org,
>>> linux-kernel@vger.kernel.org
>>> Bcc: 
>>> Subject: Re: [PATCH v2 4/9] of: Add bindings of hw throttle for soctherm
>>> Reply-To: 
>>> In-Reply-To: <1461727554-15065-5-git-send-email-...@nvidia.com>
>>>
>>> The patch title must say something about the fact that this is specific
>>> to nvidia thermal driver.
>>
>> Yes, it's my mistake, will fix it in next series.
>>
>>>
>>> On Wed, Apr 27, 2016 at 11:25:49AM +0800, Wei Ni wrote:
 Add HW throttle configuration sub-node for soctherm, which
 is used to describe the throttle event, and worked as a
 cooling device. The "hot" type trip in thermal zone can
 be bound to this cooling device, and trigger the throttle
 function.

 Signed-off-by: Wei Ni 
 ---
  .../bindings/thermal/nvidia,tegra124-soctherm.txt  | 89 
 +-
  1 file changed, 87 insertions(+), 2 deletions(-)

 diff --git 
 a/Documentation/devicetree/bindings/thermal/nvidia,tegra124-soctherm.txt 
 b/Documentation/devicetree/bindings/thermal/nvidia,tegra124-soctherm.txt
 index edebfa0a985e..dc337d139f49 100644
 --- 
 a/Documentation/devicetree/bindings/thermal/nvidia,tegra124-soctherm.txt
 +++ 
 b/Documentation/devicetree/bindings/thermal/nvidia,tegra124-soctherm.txt
 @@ -10,8 +10,14 @@ Required properties :
  - compatible : For Tegra124, must contain "nvidia,tegra124-soctherm".
For Tegra132, must contain "nvidia,tegra132-soctherm".
For Tegra210, must contain "nvidia,tegra210-soctherm".
 -- reg : Should contain 1 entry:
 +- reg : Should contain at least 2 entries for each entry in reg-names:
- SOCTHERM register set
 +  - Tegra CAR register set: Required for Tegra124 and Tegra210.
 +  - CCROC register set: Required for Tegra132.
 +- reg-names :  Should contain at least 2 entries:
 +  - soctherm-reg
 +  - car-reg
 +  - ccroc-reg
  - interrupts : Defines the interrupt used by SOCTHERM
  - clocks : Must contain an entry for each entry in clock-names.
See ../clocks/clock-bindings.txt for details.
 @@ -25,17 +31,44 @@ Required properties :
  - #thermal-sensor-cells : Should be 1. See ./thermal.txt for a description
  of this property. See  for a
  list of valid values when referring to thermal sensors.
 +- throttle-cfgs: A sub-node which is a container of configuration for each
 +hardware throttle events. These events can be set as cooling devices.
 +  * throttle events: Sub-nodes must be named as "light" or "heavy".
 +  Properties:
 +  - priority: Each throttles has its own throttle settings, so the SW 
 need
 +to set priorities for various throttle, the HW arbiter can select 
 the
 +final throttle settings.
 +Bigger value indicates higher priority, In general, higher 
 priority
 +translates to lower target frequency. SW needs to ensure that 
 critical
 +thermal alarms are given higher priority, and ensure that there is
 +no race if priority of two vectors is set to the same value.
 +  - cpu-throt-depth:  This property is for Tegra124 and Tegra210. It 
 is
 +the throttling depth of pulse skippers, it's the percentage
 +throttling.
 +  - cpu-throt-level: This property is only for Tegra132, it is the 
 level
 +of pulse skippers, which used to throttle clock frequencies. It
 +indicates cpu clock throttling depth, and the depth can be 
 programmed.
 +Must set as following values:
 +TEGRA_SOCTHERM_THROT_LEVEL_LOW, TEGRA_SOCTHERM_THROT_LEVEL_MED
 +TEGRA_SOCTHERM_THROT_LEVEL_HIGH, TEGRA_SOCTHERM_THROT_LEVEL_NONE
>>>
>>> These properties are not generic properties. My understanding is that
>>> you must have vendor prefix in such case. Same applies to the new nodes.
>>
>> Ok, will do it.
>>
>>>
 +  - #cooling-cells: Should be 1. This cooling device only support 
 on/off state.
 +See ./thermal.txt for a description of this property.
  
  Note:
  - the "critical" type trip points will be set to SOC_THERM hardware as the
  shut down temperature. Once the temperature of this thermal zone is higher
  than it, the system will be shutdown or reset by hardware.
 +- the "hot" type trip points will be set to SOC_THERM hardware as the 
 throttle
 +temperature. Once the the temperature of thi

Re: [PATCH v3 3/3] block: avoid to call .bi_end_io() recursively

2016-04-28 Thread Ming Lei
On Fri, Apr 29, 2016 at 12:59 AM, Mikulas Patocka  wrote:
>
>
> On Fri, 29 Apr 2016, Ming Lei wrote:
>
>> On Thu, Apr 28, 2016 at 11:58 PM, Mikulas Patocka  
>> wrote:
>> >
>> >
>> > On Thu, 28 Apr 2016, Ming Lei wrote:
>> >
>> >> Hi Mikulas,
>> >>
>> >> On Thu, Apr 28, 2016 at 11:29 PM, Mikulas Patocka  
>> >> wrote:
>> >> >
>> >> >
>> >> > On Thu, 28 Apr 2016, Ming Lei wrote:
>> >> >
>> >> >> There were reports about heavy stack use by recursive calling
>> >> >> .bi_end_io()([1][2][3]). For example, more than 16K stack is
>> >> >> consumed in a single bio complete path[3], and in [2] stack
>> >> >> overflow can be triggered if 20 nested dm-crypt is used.
>> >> >>
>> >> >> Also patches[1] [2] [3] were posted for addressing the issue,
>> >> >> but never be merged. And the idea in these patches is basically
>> >> >> similar, all serializes the recursive calling of .bi_end_io() by
>> >> >> percpu list.
>> >> >>
>> >> >> This patch still takes the same idea, but uses bio_list to
>> >> >> implement it, which turns out more simple and the code becomes
>> >> >> more readable meantime.
>> >> >>
>> >> >> One corner case which wasn't covered before is that
>> >> >> bi_endio() may be scheduled to run in process context(such
>> >> >> as btrfs), and this patch just bypasses the optimizing for
>> >> >> that case because one new context should have enough stack space,
>> >> >> and this approach isn't capable of optimizing it too because
>> >> >> there isn't easy way to get a per-task linked list head.
>> >> >
>> >> > Hi
>> >> >
>> >> > You could use preempt_disable() and then you could use per-cpu list even
>> >> > in the process context.
>> >>
>> >> Image why the .bi_end_io() is scheduled to process context, and the only
>> >> workable/simple way I thought of is to use per-task list because it may 
>> >> sleep.
>> >
>> > The bi_end_io callback should not sleep, even if it is called from the
>> > process context.
>>
>> If it shouldn't sleep, why is it scheduled to run in process context by 
>> paying
>> extra context switch cost?
>
> Some device mapper (and other) drivers use a worker thread to process
> bios. So the bio may be finished from the worker thread. It would be
> advantageous to prevent stack overflow even in this case.

If .bi_end_io() doesn't sleep, it can be put back into interrupt context
for the sake of performance once this patch is merged. The cost of a context
switch in the high-IOPS case isn't cheap.

It isn't easy to avoid the recursive calling in process context unless you
add something to 'task_struct' or introduce 'alloca()' in the kernel. Or do you
have better ideas?
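
For reference, here is a rough sketch of the bio_list based serialization
(interrupt-context path only; the process-context bypass discussed in this
thread is omitted, and details differ from the actual patch):

        #include <linux/bio.h>
        #include <linux/percpu.h>

        static DEFINE_PER_CPU(struct bio_list *, bio_end_list);

        void bio_endio(struct bio *bio)
        {
                struct bio_list list;
                unsigned long flags;

                local_irq_save(flags);
                if (this_cpu_read(bio_end_list)) {
                        /* nested completion: queue for the outermost caller */
                        bio_list_add(this_cpu_read(bio_end_list), bio);
                        local_irq_restore(flags);
                        return;
                }
                bio_list_init(&list);
                this_cpu_write(bio_end_list, &list);
                local_irq_restore(flags);

                for (;;) {
                        if (bio->bi_end_io)
                                bio->bi_end_io(bio);

                        local_irq_save(flags);
                        bio = bio_list_pop(&list);
                        if (!bio) {
                                /* unpublish the list before re-enabling irqs */
                                this_cpu_write(bio_end_list, NULL);
                                local_irq_restore(flags);
                                break;
                        }
                        local_irq_restore(flags);
                }
        }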

>
>> And you can find that btrfs_subio_endio_read() does sleep for checksum stuff.
>
> I'm not an expert on btrfs. What happens if it is called from an
> interrupt? Do you have an actual stracktrace when this function is called

What do you expect if a sleepable function is called in a softirq or
hardirq handler? :-)

> from bio_endio and when it sleeps?

The problem was observed in xfstests generic/323 with v1 of this patch. Sometimes
the test hangs, and sometimes a kernel oops is triggered. The issue is fixed by
introducing an 'if (!in_interrupt())' block to handle running .bi_end_io() from
process context.

If the block of 'if (!in_interrupt())' is removed and
preempt_disable()/preempt_enable() is added around bio->bi_end_io(),
the following kernel warning can be seen easily:

[   51.086303] BUG: sleeping function called from invalid context at
mm/slab.h:388
[   51.087442] in_atomic(): 1, irqs_disabled(): 0, pid: 633, name: kworker/u8:4
[   51.088575] CPU: 3 PID: 633 Comm: kworker/u8:4 Not tainted 4.6.0-rc3+ #2017
[   51.088578] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.9.0-0-g01a84be-prebuilt.qemu-project.org 04/01/2014
[   51.088637] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
[   51.088640]   88007bbebc00 813d92d3
88007ba6ce00
[   51.088643]  0184 88007bbebc18 810a38bb
81a35310
[   51.088645]  88007bbebc40 810a3949 02400040
02400040
[   51.088648] Call Trace:
[   51.088651]  [] dump_stack+0x63/0x90
[   51.088655]  [] ___might_sleep+0xdb/0x120
[   51.088657]  [] __might_sleep+0x49/0x80
[   51.088659]  [] kmem_cache_alloc+0x1a7/0x210
[   51.088670]  [] ? alloc_extent_state+0x21/0xe0 [btrfs]
[   51.088680]  [] alloc_extent_state+0x21/0xe0 [btrfs]
[   51.088689]  [] __clear_extent_bit+0x2ae/0x3d0 [btrfs]
[   51.088698]  [] clear_extent_bit+0x2a/0x30 [btrfs]
[   51.088708]  [] btrfs_endio_direct_read+0x70/0xf0 [btrfs]
[   51.088711]  [] bio_endio+0xf7/0x140
[   51.088718]  [] end_workqueue_fn+0x3c/0x40 [btrfs]
[   51.088728]  [] normal_work_helper+0xc7/0x310 [btrfs]
[   51.088737]  [] btrfs_endio_helper+0x12/0x20 [btrfs]
[   51.088740]  [] process_one_work+0x157/0x420
[   51.088742]  [] worker_thread+0x12b/0x4d0
[   51.088744]  [] ? __schedule+0x368/0x950
[   51.088746]  [] ? rescuer_thread+0x380/0x380
[   51.088748]  [] kthrea

[git pull] drm fixes

2016-04-28 Thread Dave Airlie

Hi Linus,

A few fixes all over the place:

radeon is probably the biggest standout, it's a fix for screen 
corruption or hung black outputs so I thought it was worth pulling in.

Otherwise some amdgpu power control fixes, some misc vmwgfx fixes,
one etnaviv fix, one virtio-gpu fix, two DP MST fixes, and a single
TTM fix.

Dave.

The following changes since commit 02da2d72174c61988eb4456b53f405e3ebdebce4:

  Linux 4.6-rc5 (2016-04-24 16:17:05 -0700)

are available in the git repository at:

  git://people.freedesktop.org/~airlied/linux drm-fixes

for you to fetch changes up to ea99697814d6e64927e228675a6eb7fa76014648:

  Merge branch 'drm-fixes-4.6' of git://people.freedesktop.org/~agd5f/linux 
into drm-fixes (2016-04-29 14:31:44 +1000)


Alex Deucher (2):
  Revert "drm/amdgpu: disable runtime pm on PX laptops without dGPU power 
control"
  drm/amdgpu: print a message if ATPX dGPU power control is missing

Charmaine Lee (2):
  drm/vmwgfx: Enable SVGA_3D_CMD_DX_SET_PREDICATION
  drm/vmwgfx: use vmw_cmd_dx_cid_check for query commands.

Dave Airlie (3):
  Merge branch 'drm-etnaviv-fixes' of 
git://git.pengutronix.de:/git/lst/linux into drm-fixes
  Merge branch 'drm-vmwgfx-fixes' of 
git://people.freedesktop.org/~syeh/repos_linux into drm-fixes
  Merge branch 'drm-fixes-4.6' of git://people.freedesktop.org/~agd5f/linux 
into drm-fixes

Flora Cui (2):
  drm/ttm: fix kref count mess in ttm_bo_move_to_lru_tail
  drm/amdgpu: disable vm interrupts with vm_fault_stop=2

Gustavo Padovan (1):
  drm/virtio: send vblank event after crtc updates

Lucas Stach (1):
  drm/etnaviv: don't move linear memory window on 3D cores without MC2.0

Lyude (1):
  drm/dp/mst: Restore primary hub guid on resume

Sinclair Yeh (1):
  drm/vmwgfx: Fix order of operation

Vitaly Prosyak (1):
  drm/radeon: fix vertical bars appear on monitor (v2)

cp...@redhat.com (1):
  drm/dp/mst: Get validated port ref in drm_dp_update_payload_part1()

 drivers/gpu/drm/amd/amdgpu/amdgpu_atpx_handler.c |  11 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   |   8 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c|   5 +-
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c|   5 +-
 drivers/gpu/drm/drm_dp_mst_topology.c|  20 +++
 drivers/gpu/drm/etnaviv/etnaviv_gpu.c|  31 +++--
 drivers/gpu/drm/radeon/evergreen.c   | 154 ++-
 drivers/gpu/drm/radeon/evergreen_reg.h   |  46 +++
 drivers/gpu/drm/ttm/ttm_bo.c |  17 +--
 drivers/gpu/drm/virtio/virtgpu_display.c |  12 ++
 drivers/gpu/drm/vmwgfx/vmwgfx_execbuf.c  |  10 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_fb.c   |   6 +-
 12 files changed, 277 insertions(+), 48 deletions(-)


[PATCH] Use existing helper to convert "on/off" to boolean

2016-04-28 Thread Minfei Huang
It's more convenient to use the existing helper function to convert the
strings "on"/"off" to a boolean.

Signed-off-by: Minfei Huang 
---
 lib/kstrtox.c| 2 +-
 mm/page_alloc.c  | 9 +
 mm/page_poison.c | 8 +---
 3 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/lib/kstrtox.c b/lib/kstrtox.c
index d8a5cf6..3c66fc4 100644
--- a/lib/kstrtox.c
+++ b/lib/kstrtox.c
@@ -326,7 +326,7 @@ EXPORT_SYMBOL(kstrtos8);
  * @s: input string
  * @res: result
  *
- * This routine returns 0 iff the first character is one of 'Yy1Nn0', or
+ * This routine returns 0 if the first character is one of 'Yy1Nn0', or
  * [oO][NnFf] for "on" and "off". Otherwise it will return -EINVAL.  Value
  * pointed to by res is updated upon finding a match.
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 59de90d..d31426d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -513,14 +513,7 @@ static int __init early_debug_pagealloc(char *buf)
 {
if (!buf)
return -EINVAL;
-
-   if (strcmp(buf, "on") == 0)
-   _debug_pagealloc_enabled = true;
-
-   if (strcmp(buf, "off") == 0)
-   _debug_pagealloc_enabled = false;
-
-   return 0;
+   return kstrtobool(buf, &_debug_pagealloc_enabled);
 }
 early_param("debug_pagealloc", early_debug_pagealloc);
 
diff --git a/mm/page_poison.c b/mm/page_poison.c
index 479e7ea..1eae5fa 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -13,13 +13,7 @@ static int early_page_poison_param(char *buf)
 {
if (!buf)
return -EINVAL;
-
-   if (strcmp(buf, "on") == 0)
-   want_page_poisoning = true;
-   else if (strcmp(buf, "off") == 0)
-   want_page_poisoning = false;
-
-   return 0;
+   return strtobool(buf, &want_page_poisoning);
 }
 early_param("page_poison", early_page_poison_param);
 
-- 
2.6.3



Re: [PATCH 1/2] zsmalloc: require GFP in zs_malloc()

2016-04-28 Thread Minchan Kim
On Fri, Apr 29, 2016 at 01:17:09AM +0900, Sergey Senozhatsky wrote:
> Pass GFP flags to zs_malloc() instead of using a fixed set
> (supplied during pool creation), so we can be more flexible,
> but, more importantly, this will be need to switch zram to
> per-cpu compression streams.
> 
> Apart from that, this also align zs_malloc() interface with
> zspool/zbud.
> 
> Signed-off-by: Sergey Senozhatsky 
> ---
>  drivers/block/zram/zram_drv.c |  2 +-
>  include/linux/zsmalloc.h  |  2 +-
>  mm/zsmalloc.c | 15 ++-
>  3 files changed, 8 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 370c2f7..9030992 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -717,7 +717,7 @@ static int zram_bvec_write(struct zram *zram, struct 
> bio_vec *bvec, u32 index,
>   src = uncmem;
>   }
>  
> - handle = zs_malloc(meta->mem_pool, clen);
> + handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO | __GFP_HIGHMEM);
>   if (!handle) {
>   pr_err("Error allocating memory for compressed page: %u, 
> size=%zu\n",
>   index, clen);
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> index 34eb160..6d89f8b 100644
> --- a/include/linux/zsmalloc.h
> +++ b/include/linux/zsmalloc.h
> @@ -44,7 +44,7 @@ struct zs_pool;
>  struct zs_pool *zs_create_pool(const char *name, gfp_t flags);
>  void zs_destroy_pool(struct zs_pool *pool);
>  
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size);
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
>  void zs_free(struct zs_pool *pool, unsigned long obj);
>  
>  void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index a0890e9..2c22aff 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -247,7 +247,6 @@ struct zs_pool {
>   struct size_class **size_class;
>   struct kmem_cache *handle_cachep;
>  
> - gfp_t flags;/* allocation flags used when growing pool */
>   atomic_long_t pages_allocated;
>  
>   struct zs_pool_stats stats;
> @@ -295,10 +294,10 @@ static void destroy_handle_cache(struct zs_pool *pool)
>   kmem_cache_destroy(pool->handle_cachep);
>  }
>  
> -static unsigned long alloc_handle(struct zs_pool *pool)
> +static unsigned long alloc_handle(struct zs_pool *pool, gfp_t gfp)
>  {
>   return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
> - pool->flags & ~__GFP_HIGHMEM);
> + gfp & ~__GFP_HIGHMEM);
>  }
>  
>  static void free_handle(struct zs_pool *pool, unsigned long handle)
> @@ -335,7 +334,7 @@ static void zs_zpool_destroy(void *pool)
>  static int zs_zpool_malloc(void *pool, size_t size, gfp_t gfp,
>   unsigned long *handle)
>  {
> - *handle = zs_malloc(pool, size);
> + *handle = zs_malloc(pool, size, gfp);
>   return *handle ? 0 : -1;
>  }
>  static void zs_zpool_free(void *pool, unsigned long handle)
> @@ -1391,7 +1390,7 @@ static unsigned long obj_malloc(struct size_class 
> *class,
>   * otherwise 0.
>   * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
>   */
> -unsigned long zs_malloc(struct zs_pool *pool, size_t size)
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>  {
>   unsigned long handle, obj;
>   struct size_class *class;
> @@ -1400,7 +1399,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t 
> size)
>   if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
>   return 0;
>  
> - handle = alloc_handle(pool);
> + handle = alloc_handle(pool, gfp);
>   if (!handle)
>   return 0;
>  
> @@ -1413,7 +1412,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t 
> size)
>  
>   if (!first_page) {
>   spin_unlock(&class->lock);
> - first_page = alloc_zspage(class, pool->flags);
> + first_page = alloc_zspage(class, gfp);
>   if (unlikely(!first_page)) {
>   free_handle(pool, handle);
>   return 0;
> @@ -1945,8 +1944,6 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t 
> flags)

So, we can remove the flags parameter passing (to zs_create_pool()) and the comment about it as well.
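
A possible follow-up (sketch, not part of this patch) would then be:

        /* pool creation no longer needs to carry allocation flags */
        struct zs_pool *zs_create_pool(const char *name);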

Other than that,

Acked-by: Minchan Kim 


[PATCH] staging: rts5208: Avoid multiple assignment in one line

2016-04-28 Thread Manav Batra
Signed-off-by: Manav Batra 

Separates out an assignment on one line into two lines.
---
 drivers/staging/rts5208/rtsx.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/rts5208/rtsx.c b/drivers/staging/rts5208/rtsx.c
index 25d095a..77c2580 100644
--- a/drivers/staging/rts5208/rtsx.c
+++ b/drivers/staging/rts5208/rtsx.c
@@ -105,13 +105,13 @@ static int slave_configure(struct scsi_device *sdev)
 * the actual value or the modified one, depending on where the
 * data comes from.
 */
-   if (sdev->scsi_level < SCSI_2)
-   sdev->scsi_level = sdev->sdev_target->scsi_level = SCSI_2;
-
+   if (sdev->scsi_level < SCSI_2) {
+   sdev->scsi_level  = SCSI_2;
+   sdev->sdev_target->scsi_level = SCSI_2;
+   }
return 0;
 }
 
-
 /***
  * /proc/scsi/ functions
  ***/
-- 
1.9.1



Re: [PATCH] staging: rts5208: Avoid multiple assignment in one line

2016-04-28 Thread Greg KH
On Thu, Apr 28, 2016 at 10:30:49PM -0700, Manav Batra wrote:
> Signed-off-by: Manav Batra 

I can't take patches without any changelog text :(


Re: [PATCH] mm/zsmalloc: don't fail if can't create debugfs info

2016-04-28 Thread Minchan Kim
On Fri, Apr 29, 2016 at 09:38:24AM +0900, Sergey Senozhatsky wrote:
> On (04/28/16 15:07), Andrew Morton wrote:
> > Needed a bit of tweaking due to
> > http://ozlabs.org/~akpm/mmotm/broken-out/zsmalloc-reordering-function-parameter.patch
> 
> Thanks.
> 
> > From: Dan Streetman 
> > Subject: mm/zsmalloc: don't fail if can't create debugfs info
> > 
> > Change the return type of zs_pool_stat_create() to void, and
> > remove the logic to abort pool creation if the stat debugfs
> > dir/file could not be created.
> > 
> > The debugfs stat file is for debugging/information only, and doesn't
> > affect operation of zsmalloc; there is no reason to abort creating
> > the pool if the stat file can't be created.  This was seen with
> > zswap, which used the same name for all pool creations, which caused
> > zsmalloc to fail to create a second pool for zswap if
> > CONFIG_ZSMALLOC_STAT was enabled.
> 
> no real objections from me. given that both zram and zswap now provide
> unique names for zsmalloc stats dir, this patch does not fix any "real"
> (observed) problem /* ENOMEM in debugfs_create_dir() is a different
> case */.  so it's more of a cosmetic patch.
> 

Logically, I agree with Dan that debugfs is just optional, so it
shouldn't affect whether the module keeps running. *But* practically, a
debugfs_create_dir failure due to lack of memory would be rare. Rather,
we would see an error from duplicate entry naming, as in Dan's case.

If we remove the error propagation logic for the duplicate-naming case,
how can a zsmalloc user notice that the debugfs entry was not created
even though zs_create_pool() returned success?

Otherwise, future users of zsmalloc can easily miss it if they repeat
the same mistake. So, what's the gain from this patch in practice?


> FWIW,
> Reviewed-by: Sergey Senozhatsky 
> 
>   -ss


[PATCH] staging: rts5208: Avoid multiple assignment in one line

2016-04-28 Thread Manav Batra
Signed-off-by: Manav Batra 
---
 drivers/staging/rts5208/rtsx.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/rts5208/rtsx.c b/drivers/staging/rts5208/rtsx.c
index 25d095a..77c2580 100644
--- a/drivers/staging/rts5208/rtsx.c
+++ b/drivers/staging/rts5208/rtsx.c
@@ -105,13 +105,13 @@ static int slave_configure(struct scsi_device *sdev)
 * the actual value or the modified one, depending on where the
 * data comes from.
 */
-   if (sdev->scsi_level < SCSI_2)
-   sdev->scsi_level = sdev->sdev_target->scsi_level = SCSI_2;
-
+   if (sdev->scsi_level < SCSI_2) {
+   sdev->scsi_level  = SCSI_2;
+   sdev->sdev_target->scsi_level = SCSI_2;
+   }
return 0;
 }
 
-
 /***
  * /proc/scsi/ functions
  ***/
-- 
1.9.1



[PATCH] staging: rts5208: Unnecessary parentheses around chip->sd_card

2016-04-28 Thread Manav Batra
Signed-off-by: Manav Batra 
---
 drivers/staging/rts5208/sd.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/rts5208/sd.c b/drivers/staging/rts5208/sd.c
index 87d6976..fbd2f90 100644
--- a/drivers/staging/rts5208/sd.c
+++ b/drivers/staging/rts5208/sd.c
@@ -56,21 +56,21 @@ static u16 REG_SD_DCMPS1_CTL;
 
 static inline void sd_set_err_code(struct rtsx_chip *chip, u8 err_code)
 {
-   struct sd_info *sd_card = &(chip->sd_card);
+   struct sd_info *sd_card = &chip->sd_card;
 
sd_card->err_code |= err_code;
 }
 
 static inline void sd_clr_err_code(struct rtsx_chip *chip)
 {
-   struct sd_info *sd_card = &(chip->sd_card);
+   struct sd_info *sd_card = &chip->sd_card;
 
sd_card->err_code = 0;
 }
 
 static inline int sd_check_err_code(struct rtsx_chip *chip, u8 err_code)
 {
-   struct sd_info *sd_card = &(chip->sd_card);
+   struct sd_info *sd_card = &chip->sd_card;
 
return sd_card->err_code & err_code;
 }
-- 
1.9.1



Re: [PATCH 1/2] staging: wilc1000: fix double unlock

2016-04-28 Thread Greg Kroah-Hartman
On Thu, Apr 14, 2016 at 08:48:48PM +0530, Sudip Mukherjee wrote:
> The semaphore was being released twice, once at the beginning of the
> thread and then again when the thread is about to close.
> The semaphore is acquired immediately after creating the thread so we
> should be releasing it when the thread ends.
> 
> Signed-off-by: Sudip Mukherjee 
> ---
>  drivers/staging/wilc1000/linux_wlan.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/staging/wilc1000/linux_wlan.c 
> b/drivers/staging/wilc1000/linux_wlan.c
> index a858552..5643a3d 100644
> --- a/drivers/staging/wilc1000/linux_wlan.c
> +++ b/drivers/staging/wilc1000/linux_wlan.c
> @@ -313,7 +313,6 @@ static int linux_wlan_txq_task(void *vp)
>   vif = netdev_priv(dev);
>   wl = vif->wilc;
>  
> - up(&wl->txq_thread_started);
>   while (1) {
>   down(&wl->txq_event);
>  

Doesn't apply to my tree at all :(


Re: [PATCH v6 3/7] perf record: Split output into multiple files via '--switch-output'

2016-04-28 Thread Wangnan (F)



On 2016/4/28 5:32, Arnaldo Carvalho de Melo wrote:

Em Wed, Apr 20, 2016 at 06:59:50PM +, Wang Nan escreveu:

Allow 'perf record' to split its output into multiple files.

For example:

I squashed:

->  360   T 04/20 Wang Nan(1.7K) ├─>[PATCH v6 6/7]
perf record: Re-synthesize tracking events after output switching

Into this patch, so that we don't have the problem in the bisection
history where samples don't get resolved to the existing threads not
synthesized in the perf.data.N where N > the first timestamp.

Please holler if you disagree, I doubt you will tho :-)


Sorry for the late reply. I'm okay with your work.

Thank you.


- Arnaldo

  

   # ~/perf record -a --timestamp-filename --switch-output &
   [1] 10763
   # kill -s SIGUSR2 10763
   [ perf record: dump data: Woken up 1 times ]
   # [ perf record: Dump perf.data.2015122622314468 ]

   # kill -s SIGUSR2 10763
   [ perf record: dump data: Woken up 1 times ]
   # [ perf record: Dump perf.data.2015122622314762 ]

   # kill -s SIGUSR2 10763
   [ perf record: dump data: Woken up 1 times ]
   #[ perf record: Dump perf.data.2015122622315171 ]

   # fg
   perf record -a --timestamp-filename --switch-output
   ^C[ perf record: Woken up 1 times to write data ]
   [ perf record: Dump perf.data.2015122622315513 ]
   [ perf record: Captured and wrote 0.014 MB perf.data. (296 
samples) ]

   # ls -l
   total 920
   -rw--- 1 root root 797692 Dec 26 22:31 perf.data.2015122622314468
   -rw--- 1 root root  59960 Dec 26 22:31 perf.data.2015122622314762
   -rw--- 1 root root  59912 Dec 26 22:31 perf.data.2015122622315171
   -rw--- 1 root root  19220 Dec 26 22:31 perf.data.2015122622315513

Signed-off-by: Wang Nan 
Tested-by: Arnaldo Carvalho de Melo 
Cc: Adrian Hunter 
Cc: Jiri Olsa 
Cc: Masami Hiramatsu 
Cc: Namhyung Kim 
Cc: Zefan Li 
Cc: pi3or...@163.com
Link: 
http://lkml.kernel.org/r/1460643725-167413-3-git-send-email-wangn...@huawei.com
Signed-off-by: He Kuang 
[ Added man page entry ]
Signed-off-by: Arnaldo Carvalho de Melo 
---
  tools/perf/Documentation/perf-record.txt |  8 
  tools/perf/builtin-record.c  | 33 ++--
  2 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/tools/perf/Documentation/perf-record.txt 
b/tools/perf/Documentation/perf-record.txt
index 19aa175..a77a431 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -347,6 +347,14 @@ Configure all used events to run in kernel space.
  --all-user::
  Configure all used events to run in user space.
  
+--switch-output::

+Generate multiple perf.data files, timestamp prefixed, switching to a new one
+when receiving a SIGUSR2.
+
+A possible use case is to, given an external event, slice the perf.data file
+that gets then processed, possibly via a perf script, to decide if that
+particular perf.data snapshot should be kept or not.
+
  SEE ALSO
  
  linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index f4710c8..72246e2 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -58,6 +58,7 @@ struct record {
boolno_buildid_cache_set;
boolbuildid_all;
booltimestamp_filename;
+   boolswitch_output;
unsigned long long  samples;
  };
  
@@ -130,6 +131,7 @@ static volatile int child_finished;
  
  static volatile int auxtrace_record__snapshot_started;

  static DEFINE_TRIGGER(auxtrace_snapshot_trigger);
+static DEFINE_TRIGGER(switch_output_trigger);
  
  static void sig_handler(int sig)

  {
@@ -650,9 +652,12 @@ static int __cmd_record(struct record *rec, int argc, 
const char **argv)
signal(SIGINT, sig_handler);
signal(SIGTERM, sig_handler);
  
-	if (rec->opts.auxtrace_snapshot_mode) {

+   if (rec->opts.auxtrace_snapshot_mode || rec->switch_output) {
signal(SIGUSR2, snapshot_sig_handler);
-   trigger_on(&auxtrace_snapshot_trigger);
+   if (rec->opts.auxtrace_snapshot_mode)
+   trigger_on(&auxtrace_snapshot_trigger);
+   if (rec->switch_output)
+   trigger_on(&switch_output_trigger);
} else {
signal(SIGUSR2, SIG_IGN);
}
@@ -782,11 +787,13 @@ static int __cmd_record(struct record *rec, int argc, 
const char **argv)
}
  
  	trigger_ready(&auxtrace_snapshot_trigger);

+   trigger_ready(&switch_output_trigger);
for (;;) {
unsigned long long hits = rec->samples;
  
  		if (record__mmap_read_all(rec) < 0) {

trigger_error(&auxtrace_snapshot_trigger);
+   trigger_error(&switch_output_trigger);
err = -1;
goto out_child;
}
@@ -802,6 +809,22 @@ static int __cmd_record(struct record *rec, 

Re: random(4) changes

2016-04-28 Thread Stephan Mueller
Am Dienstag, 26. April 2016, 20:23:46 schrieb George Spelvin:

Hi George,

> > And considering that I only want to have 0.9 bits of entropy, why
> > should I not collapse it? The XOR operation does not destroy the existing
> > entropy, it only caps it to at most one bit of information theoretical
> > entropy.
> 
> No.  Absolutely, demonstrably false.
> 
> The XOR operation certainly *does* destroy entropy.
> If you have 0.9 bits of entropy to start, you will have less
> after the XOR.  It does NOT return min(input, 1) bits.

As I am having difficulties following your explanation, let us start at the 
definition:

XOR is defined as an entropy preserving operation, provided the two arguments 
to the XOR operation are statistically independent (let us remember that 
caveat for later).

That means, the entropy behavior of H(A XOR B) is max(H(A), H(B)) if they are 
independent. For example, A has 5 bits of entropy and B has 7 bits of entropy, 
A XOR B has 7 bits of entropy. Similarly, if A has zero bits of entropy the 
XORed result will still have 7 bits of entropy from B. That applies regardless 
of the size of A or B, including one bit sized chunks. The same applies when 
XORing more values:

A XOR B XOR C = (A XOR B) XOR C

Now, the entropy behaves like:

max(max(H(A), H(B)), H(C)) = max(H(A), H(B), H(C))

Now, with that definition, let us look at the LRNG method. The LRNG obtains a 
time stamp and uses the low 32 bits of it. The LRNG now slices those 32 bits 
up in individual bits, let us call them b0 through b31.

The LRNG XORs these individual bits together. This means:

b0 XOR b1 XOR b2 XOR ... XOR b31

This operation gives us one bit.

How is the entropy behaving here? Let us use the definition from above:

H(XORed bit) = max(H(b0), H(b1), ..., H(b31))

We know that each individual bit can hold at most one bit. Thus the formula 
implies that the XOR operation in the LRNG can at most get one bit of entropy.


Given these findings, I now have to show and demonstrate that:

1. the individual bits of a given 32 bit time stamp are independent (or IID in 
terms of NIST)

2. show that the maximum entropy of each of the individual bits is equal or 
more to my entropy estimate I apply.


Regarding 1: The time stamp (or cycle counter) is a 32 bit value where each
of the bits does not depend on the other bits. When considering one and only
one time stamp value and looking at, say, the first 20 bits, there is no way
to tell what the missing 12 bits will be. Note I am not saying that, when
comparing two or more time stamps, one cannot deduce the bits! And here it
is clear that the bits within one given time stamp are independent, but
multiple time stamps are not independent. This finding is supported with
measurements given in 3.4.1 (I understand that the measurements are only
supportive and not a proof). Figure 3.1 shows an (almost) rectangular
distribution, which hints at an equidistribution, which in turn supports
the finding that the individual bits within a time stamp are independent. In
addition, when you look at the Shannon/Min Entropy values (which do not give
an entropy estimate here, but only help in understanding the distribution!),
the values show that the distribution has hardly any discontinuities -- please
read the explanation surrounding the figure.

Regarding 2: I did numerous measurements that show that the low bits do have 
close to one bit of entropy per data bit. If I may ask to consider section 
3.4.1 again (please consider that I tried to break the logic by applying a 
pathological generation of interrupts here to stimulate the worst case): The 
entropy is not found in the absolute time stamps, but visible in the time 
deltas (and the uncertainty of the variations of those). So I calculated the 
time deltas from the collected set of time stamps of events. Now, when simply 
using the four (you may also use three or perhaps five) lower bits of the time 
delta values, we can calculate an interesting and very important Minimum 
Entropy value: the Markov Min Entropy. Using the table 2, I calculated the 
Markov Min Entropy of the data set of the 4 low bit time delta values. The 
result shows that the 4 bit values still have 3.92 bits of entropy (about 0.98 
bits of entropy per data bit). Ok, one worst case measurement may not be good 
enough. So I continued on other environments with the same testing. Table 3 
provides the results on those environments. And they have even more entropy 
than the first measurement! So, with all the measurements I always see that 
each of the four low bits has around 0.98 bits of entropy. Thus, for the XOR
result, I can conclude that these measurements show it will have about
0.98 bits of Markov Min Entropy.
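
For illustration only, a simplified userspace sketch of the kind of estimate
involved: it computes the plain (first-order) min entropy of the 4 low bits
of the time deltas from observed frequencies. The Markov Min Entropy used
above additionally accounts for transition probabilities between successive
values, so the real calculation is more involved than this.

#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Plain min entropy, in bits, of the 4 low bits of n time deltas. */
static double min_entropy_4bit(const uint32_t *delta, size_t n)
{
        unsigned long count[16] = { 0 };
        unsigned long max = 0;
        size_t i;

        if (!n)
                return 0.0;
        for (i = 0; i < n; i++)
                count[delta[i] & 0xf]++;
        for (i = 0; i < 16; i++)
                if (count[i] > max)
                        max = count[i];

        /* H_min = -log2(p_max), p_max = most frequent 4-bit value */
        return -log2((double)max / (double)n);
}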

Please note that I assume an entropy content of 256/288 bits of entropy per 
data bit which is slightly less than 0.9. This lower level is significantly 
less than the measured values -- a safety margin

[PATCH 04/32] perf/x86/intel/cqm: make read of RMIDs per package (Temporal)

2016-04-28 Thread David Carrillo-Cisneros
The previous version of Intel's CQM introduced pmu::count as a replacement
for reading CQM events. This was done to avoid using an IPI to read the
CQM occupancy event when reading events attached to a thread.
Using pmu->count in place of pmu->read is inconsistent with the usage by
other PMUs and introduces several problems such as:
  1) pmu::read for thread events returns bogus values when called from
  interrupt-disabled contexts.
  2) perf_event_count()'s behavior depends on whether interrupts are
  enabled or not.
  3) perf_event_count() will always read a fresh value from the PMU, which
  is inconsistent with the behavior of other events.
  4) perf_event_count() will perform slow MSR reads and writes and IPIs.

This patch removes pmu::count from CQM and makes pmu::read always
read from the local socket (package). Future patches will add a mechanism
to add the event count from other packages.

This patch also removes the unused field rmid_usecnt from intel_pqr_state.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 125 ++--
 1 file changed, 16 insertions(+), 109 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 3c1e247..afd60dd 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -20,7 +20,6 @@ static unsigned int cqm_l3_scale; /* supposedly cacheline 
size */
  * struct intel_pqr_state - State cache for the PQR MSR
  * @rmid:  The cached Resource Monitoring ID
  * @closid:The cached Class Of Service ID
- * @rmid_usecnt:   The usage counter for rmid
  *
  * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
  * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
@@ -32,7 +31,6 @@ static unsigned int cqm_l3_scale; /* supposedly cacheline 
size */
 struct intel_pqr_state {
u32 rmid;
u32 closid;
-   int rmid_usecnt;
 };
 
 /*
@@ -44,6 +42,19 @@ struct intel_pqr_state {
 static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
 
 /*
+ * Updates caller cpu's cache.
+ */
+static inline void __update_pqr_rmid(u32 rmid)
+{
+   struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
+
+   if (state->rmid == rmid)
+   return;
+   state->rmid = rmid;
+   wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
+}
+
+/*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
  * Also protects event->hw.cqm_rmid
  *
@@ -309,7 +320,7 @@ struct rmid_read {
atomic64_t value;
 };
 
-static void __intel_cqm_event_count(void *info);
+static void intel_cqm_event_read(struct perf_event *event);
 
 /*
  * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
@@ -376,12 +387,6 @@ static void intel_cqm_event_read(struct perf_event *event)
u32 rmid;
u64 val;
 
-   /*
-* Task events are handled by intel_cqm_event_count().
-*/
-   if (event->cpu == -1)
-   return;
-
raw_spin_lock_irqsave(&cache_lock, flags);
rmid = event->hw.cqm_rmid;
 
@@ -401,123 +406,28 @@ out:
raw_spin_unlock_irqrestore(&cache_lock, flags);
 }
 
-static void __intel_cqm_event_count(void *info)
-{
-   struct rmid_read *rr = info;
-   u64 val;
-
-   val = __rmid_read(rr->rmid);
-
-   if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
-   return;
-
-   atomic64_add(val, &rr->value);
-}
-
 static inline bool cqm_group_leader(struct perf_event *event)
 {
return !list_empty(&event->hw.cqm_groups_entry);
 }
 
-static u64 intel_cqm_event_count(struct perf_event *event)
-{
-   unsigned long flags;
-   struct rmid_read rr = {
-   .value = ATOMIC64_INIT(0),
-   };
-
-   /*
-* We only need to worry about task events. System-wide events
-* are handled like usual, i.e. entirely with
-* intel_cqm_event_read().
-*/
-   if (event->cpu != -1)
-   return __perf_event_count(event);
-
-   /*
-* Only the group leader gets to report values. This stops us
-* reporting duplicate values to userspace, and gives us a clear
-* rule for which task gets to report the values.
-*
-* Note that it is impossible to attribute these values to
-* specific packages - we forfeit that ability when we create
-* task events.
-*/
-   if (!cqm_group_leader(event))
-   return 0;
-
-   /*
-* Getting up-to-date values requires an SMP IPI which is not
-* possible if we're being called in interrupt context. Return
-* the cached values instead.
-*/
-   if (unlikely(in_interrupt()))
-   goto out;
-
-   /*
-* Notice that we don't perform the reading of an RMID
-* atomically, because we can't hold a spin lock across the
-

[PATCH 05/32] perf/core: remove unused pmu->count

2016-04-28 Thread David Carrillo-Cisneros
CQM was the only user of pmu->count, no need to have it anymore.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 include/linux/perf_event.h |  6 --
 kernel/events/core.c   | 10 --
 kernel/trace/bpf_trace.c   |  5 ++---
 3 files changed, 2 insertions(+), 19 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 00bb6b5..8bb1532 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -373,12 +373,6 @@ struct pmu {
 */
size_t  task_ctx_size;
 
-
-   /*
-* Return the count value for a counter.
-*/
-   u64 (*count)(struct perf_event *event); /*optional*/
-
/*
 * Set up pmu-private data structures for an AUX area
 */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index aae72d3..4aaec01 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3313,9 +3313,6 @@ unlock:
 
 static inline u64 perf_event_count(struct perf_event *event)
 {
-   if (event->pmu->count)
-   return event->pmu->count(event);
-
return __perf_event_count(event);
 }
 
@@ -3325,7 +3322,6 @@ static inline u64 perf_event_count(struct perf_event 
*event)
  *   - either for the current task, or for this CPU
  *   - does not have inherit set, for inherited task events
  * will not be local and we cannot read them atomically
- *   - must not have a pmu::count method
  */
 u64 perf_event_read_local(struct perf_event *event)
 {
@@ -3353,12 +3349,6 @@ u64 perf_event_read_local(struct perf_event *event)
WARN_ON_ONCE(event->attr.inherit);
 
/*
-* It must not have a pmu::count method, those are not
-* NMI safe.
-*/
-   WARN_ON_ONCE(event->pmu->count);
-
-   /*
 * If the event is currently on this CPU, its either a per-task event,
 * or local to this CPU. Furthermore it means its ACTIVE (otherwise
 * oncpu == -1).
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 3e4ffb3..7ef81b3 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -200,9 +200,8 @@ static u64 bpf_perf_event_read(u64 r1, u64 index, u64 r3, 
u64 r4, u64 r5)
 
event = file->private_data;
 
-   /* make sure event is local and doesn't have pmu::count */
-   if (event->oncpu != smp_processor_id() ||
-   event->pmu->count)
+   /* make sure event is local */
+   if (event->oncpu != smp_processor_id())
return -EINVAL;
 
/*
-- 
2.8.0.rc3.226.g39d4020



[PATCH 07/32] perf/x86/intel/cqm: separate CQM PMU's attributes from x86 PMU

2016-04-28 Thread David Carrillo-Cisneros
Create a CQM_EVENT_ATTR_STR to use in CQM to remove the dependency
on the unrelated x86 PMU's EVENT_ATTR_STR.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 8457dd0..d5eac8f 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -38,6 +38,13 @@ static inline void __update_pqr_rmid(u32 rmid)
 static DEFINE_MUTEX(cache_mutex);
 static DEFINE_RAW_SPINLOCK(cache_lock);
 
+#define CQM_EVENT_ATTR_STR(_name, v, str)  
\
+static struct perf_pmu_events_attr event_attr_##v = {  
\
+   .attr   = __ATTR(_name, 0444, perf_event_sysfs_show, NULL), 
\
+   .id = 0,
\
+   .event_str  = str,  
\
+}
+
 /*
  * Groups of events that have the same target(s), one RMID per group.
  */
@@ -504,11 +511,11 @@ static int intel_cqm_event_init(struct perf_event *event)
return 0;
 }
 
-EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
-EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1");
-EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
-EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
-EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
+CQM_EVENT_ATTR_STR(llc_occupancy, intel_cqm_llc, "event=0x01");
+CQM_EVENT_ATTR_STR(llc_occupancy.per-pkg, intel_cqm_llc_pkg, "1");
+CQM_EVENT_ATTR_STR(llc_occupancy.unit, intel_cqm_llc_unit, "Bytes");
+CQM_EVENT_ATTR_STR(llc_occupancy.scale, intel_cqm_llc_scale, NULL);
+CQM_EVENT_ATTR_STR(llc_occupancy.snapshot, intel_cqm_llc_snapshot, "1");
 
 static struct attribute *intel_cqm_events_attr[] = {
EVENT_PTR(intel_cqm_llc),
-- 
2.8.0.rc3.226.g39d4020



[PATCH 08/32] perf/x86/intel/cqm: prepare for next patches

2016-04-28 Thread David Carrillo-Cisneros
Move code around, delete unnecessary code and do some renaming in
order to increase readability of the next patches. Create cqm.h file.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 170 +++-
 arch/x86/events/intel/cqm.h |  42 +++
 include/linux/perf_event.h  |   8 +--
 3 files changed, 103 insertions(+), 117 deletions(-)
 create mode 100644 arch/x86/events/intel/cqm.h

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index d5eac8f..f678014 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -4,10 +4,9 @@
  * Based very, very heavily on work by Peter Zijlstra.
  */
 
-#include 
 #include 
 #include 
-#include 
+#include "cqm.h"
 #include "../perf_event.h"
 
 #define MSR_IA32_QM_CTR0x0c8e
@@ -16,13 +15,26 @@
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
+#define RMID_VAL_ERROR (1ULL << 63)
+#define RMID_VAL_UNAVAIL   (1ULL << 62)
+
+#define QOS_L3_OCCUP_EVENT_ID  (1 << 0)
+
+#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
+
+#define CQM_EVENT_ATTR_STR(_name, v, str)  
\
+static struct perf_pmu_events_attr event_attr_##v = {  
\
+   .attr   = __ATTR(_name, 0444, perf_event_sysfs_show, NULL), 
\
+   .id = 0,
\
+   .event_str  = str,  
\
+}
+
 /*
  * Updates caller cpu's cache.
  */
 static inline void __update_pqr_rmid(u32 rmid)
 {
struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
if (state->rmid == rmid)
return;
state->rmid = rmid;
@@ -30,37 +42,18 @@ static inline void __update_pqr_rmid(u32 rmid)
 }
 
 /*
- * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
- * Also protects event->hw.cqm_rmid
- *
- * Hold either for stability, both for modification of ->hw.cqm_rmid.
- */
-static DEFINE_MUTEX(cache_mutex);
-static DEFINE_RAW_SPINLOCK(cache_lock);
-
-#define CQM_EVENT_ATTR_STR(_name, v, str)  
\
-static struct perf_pmu_events_attr event_attr_##v = {  
\
-   .attr   = __ATTR(_name, 0444, perf_event_sysfs_show, NULL), 
\
-   .id = 0,
\
-   .event_str  = str,  
\
-}
-
-/*
  * Groups of events that have the same target(s), one RMID per group.
+ * Protected by cqm_mutex.
  */
 static LIST_HEAD(cache_groups);
+static DEFINE_MUTEX(cqm_mutex);
+static DEFINE_RAW_SPINLOCK(cache_lock);
 
 /*
  * Mask of CPUs for reading CQM values. We only need one per-socket.
  */
 static cpumask_t cqm_cpumask;
 
-#define RMID_VAL_ERROR (1ULL << 63)
-#define RMID_VAL_UNAVAIL   (1ULL << 62)
-
-#define QOS_L3_OCCUP_EVENT_ID  (1 << 0)
-
-#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
 
 /*
  * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
@@ -71,8 +64,6 @@ static cpumask_t cqm_cpumask;
  */
 static u32 intel_cqm_rotation_rmid;
 
-#define INVALID_RMID   (-1)
-
 /*
  * Is @rmid valid for programming the hardware?
  *
@@ -140,7 +131,7 @@ struct cqm_rmid_entry {
  * rotation worker moves RMIDs from the limbo list to the free list once
  * the occupancy value drops below __intel_cqm_threshold.
  *
- * Both lists are protected by cache_mutex.
+ * Both lists are protected by cqm_mutex.
  */
 static LIST_HEAD(cqm_rmid_free_lru);
 static LIST_HEAD(cqm_rmid_limbo_lru);
@@ -172,13 +163,13 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 
rmid)
 /*
  * Returns < 0 on fail.
  *
- * We expect to be called with cache_mutex held.
+ * We expect to be called with cqm_mutex held.
  */
 static u32 __get_rmid(void)
 {
struct cqm_rmid_entry *entry;
 
-   lockdep_assert_held(&cache_mutex);
+   lockdep_assert_held(&cqm_mutex);
 
if (list_empty(&cqm_rmid_free_lru))
return INVALID_RMID;
@@ -193,7 +184,7 @@ static void __put_rmid(u32 rmid)
 {
struct cqm_rmid_entry *entry;
 
-   lockdep_assert_held(&cache_mutex);
+   lockdep_assert_held(&cqm_mutex);
 
WARN_ON(!__rmid_valid(rmid));
entry = __rmid_entry(rmid);
@@ -237,9 +228,9 @@ static int intel_cqm_setup_rmid_cache(void)
entry = __rmid_entry(0);
list_del(&entry->list);
 
-   mutex_lock(&cache_mutex);
+   mutex_lock(&cqm_mutex);
intel_cqm_rotation_rmid = __get_rmid();
-   mutex_unlock(&cache_mutex);
+   mutex_unlock(&cqm_mutex);
 
return 0;
 fail:
@@ -250,6 +241,7 @@ fail:
return -ENOMEM;
 }
 
+
 /*
  * Determine if @a and @b measure the same set of tasks.
  *
@@ -287,49 +279,11 @@ static bool __match_event(struct perf_event *a, struct 
perf

[PATCH 11/32] perf/x86/intel/cqm: (I)state and limbo prmids

2016-04-28 Thread David Carrillo-Cisneros
CQM defines a dirty threshold that is the minimum number of dirty
cache lines that a prmid can hold before being eligible to be reused.
This threshold is zero unless there is significant contention for prmids
(more on this in the patch that introduces rotation of RMIDs).

A limbo prmid is a prmid that is no longer utilized by any pmonr, yet, its
occupancy exceeds the dirty threshold. This is a consequence of the
hardware design that do not provide a mechanism to flush cache lines
associated with a RMID.

If no pmonr schedules a limbo prmid, it is expected that its occupancy
will eventually drop below the dirty threshold. Nevertheless, the cache
lines tagged to a limbo prmid still hold valid occupancy for the previous
owner of the prmid. This creates a difference in the way the occupancy of
a pmonr is read depending on whether it has held a prmid recently or not.

This patch introduces the (I)state mentioned in previous changelog.
The (I)state is a superstate conformed by two substates:
  - (IL)state: (I)state with limbo prmid, this pmonr held a prmid in
(A)state before its transition to (I)state.
  - (IN)state: (I)state without limbo prmid, this pmonr did not held a
prmid recently.

A pmonr in (IL)state keeps the reference to its former prmid in the field
limbo_prmid; this occupancy is counted towards the occupancy
of the ancestors of the pmonr, reducing the error caused by stealing
of prmids during RMID rotation.

In future patches (rotation logic), the occupancy of limbo_prmids is
polled periodically, and (IL)state pmonrs whose limbo prmids have become
clean will transition to (IN)state.
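
For reference, the helpers added below encode these states via the
ancestor_pmonr pointer together with the prmid/limbo_prmid fields; a short
summary (field names taken from the diff):

/*
 * State encoding (see the __pmonr__in_*state() helpers below):
 *
 *      (A)  : ancestor_pmonr == NULL && prmid != NULL
 *      (U)  : ancestor_pmonr == NULL && prmid == NULL
 *      (IL) : ancestor_pmonr != NULL && limbo_prmid != NULL
 *      (IN) : ancestor_pmonr != NULL && limbo_prmid == NULL
 */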

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 203 ++--
 arch/x86/events/intel/cqm.h |  88 +--
 2 files changed, 277 insertions(+), 14 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 65551bb..caf7152 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -39,16 +39,34 @@ struct monr *monr_hrchy_root;
 
 struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
 
+static inline bool __pmonr__in_istate(struct pmonr *pmonr)
+{
+   lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+   return pmonr->ancestor_pmonr;
+}
+
+static inline bool __pmonr__in_ilstate(struct pmonr *pmonr)
+{
+   lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+   return __pmonr__in_istate(pmonr) && pmonr->limbo_prmid;
+}
+
+static inline bool __pmonr__in_instate(struct pmonr *pmonr)
+{
+   lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+   return __pmonr__in_istate(pmonr) && !__pmonr__in_ilstate(pmonr);
+}
+
 static inline bool __pmonr__in_astate(struct pmonr *pmonr)
 {
lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
-   return pmonr->prmid;
+   return pmonr->prmid && !pmonr->ancestor_pmonr;
 }
 
 static inline bool __pmonr__in_ustate(struct pmonr *pmonr)
 {
lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
-   return !pmonr->prmid;
+   return !pmonr->prmid && !pmonr->ancestor_pmonr;
 }
 
 static inline bool monr__is_root(struct monr *monr)
@@ -210,9 +228,12 @@ static int pkg_data_init_cpu(int cpu)
 
INIT_LIST_HEAD(&pkg_data->free_prmids_pool);
INIT_LIST_HEAD(&pkg_data->active_prmids_pool);
+   INIT_LIST_HEAD(&pkg_data->pmonr_limbo_prmids_pool);
INIT_LIST_HEAD(&pkg_data->nopmonr_limbo_prmids_pool);
 
INIT_LIST_HEAD(&pkg_data->astate_pmonrs_lru);
+   INIT_LIST_HEAD(&pkg_data->istate_pmonrs_lru);
+   INIT_LIST_HEAD(&pkg_data->ilstate_pmonrs_lru);
 
mutex_init(&pkg_data->pkg_data_mutex);
raw_spin_lock_init(&pkg_data->pkg_data_lock);
@@ -261,7 +282,15 @@ static struct pmonr *pmonr_alloc(int cpu)
if (!pmonr)
return ERR_PTR(-ENOMEM);
 
+   pmonr->ancestor_pmonr = NULL;
+
+   /*
+* Since (A)state and (I)state have union in members,
+* initialize one of them only.
+*/
+   INIT_LIST_HEAD(&pmonr->pmonr_deps_head);
pmonr->prmid = NULL;
+   INIT_LIST_HEAD(&pmonr->limbo_rotation_entry);
 
pmonr->monr = NULL;
INIT_LIST_HEAD(&pmonr->rotation_entry);
@@ -327,6 +356,44 @@ __pmonr__finish_to_astate(struct pmonr *pmonr, struct 
prmid *prmid)
atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
 }
 
+/*
+ * Transition to (A)state from (IN)state, given a valid prmid.
+ * Cannot fail. Updates ancestor dependants to use this pmonr as new ancestor.
+ */
+static inline void
+__pmonr__instate_to_astate(struct pmonr *pmonr, struct prmid *prmid)
+{
+   struct pmonr *pos, *tmp, *ancestor;
+   union prmid_summary old_summary, summary;
+
+   lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+
+   /* If in (I) state, cannot have limbo_prmid, otherwise prmid
+* in function's argument is superfluous.
+*/
+   WARN_

[PATCH 10/32] perf/x86/intel/cqm: basic RMID hierarchy with per package rmids

2016-04-28 Thread David Carrillo-Cisneros
Cgroups and/or tasks that need to be monitored using an RMID
are abstracted as MOnitored Resources (monrs). A CQM event points
to a monr to read occupancy (and in the future other attributes) of the
RMIDs associated with the monr.

The monrs form a hierarchy that captures the dependency within the
monitored cgroups and/or tasks/threads. The monr of a cgroup A which
contains another monitored cgroup, B, is an ancestor of B's monr.

Each monr contains one Package MONitored Resource (pmonr) per package.
The monitoring of a monr in a package starts when its corresponding
pmonr receives an RMID for that package (a prmid).

The prmids are lazily assigned to a pmonr the first time a thread
using the monr is scheduled in the package. When a pmonr with a
valid prmid is scheduled, that pmonr's prmid's RMID is written to the
msr MSR_IA32_PQR_ASSOC. If no prmid is available, the prmid of the lowest
ancestor in the monr hierarchy with a valid prmid for that package is
used instead.

A pmonr can be in one of the following three states:
  - (A)ctive: When it has a prmid available.
  - (I)nherited: When no prmid is available. In this state, it "borrows"
the prmid of its lowest ancestor in (A)ctive state during sched in
(writes its ancestor's RMID into hw while any associated thread is
executed). But, since the "borrowed" prmid does not monitor the
occupancy of this monr, the monr cannot report occupancy individually.
  - (U)nused: When the monr does not have a prmid yet and has not failed
to acquire one (either because no thread has been scheduled while
monitoring for this pmonr is active, or because it has completed
a transition to (U)state, ie. termination of the associated
event/cgroup).

To avoid synchronization overhead, each prmid contains a prmid_summary.
The union prmid_summary is a concise representation of the prmid state
and its raw RMIDs. Due to its size, the prmid_summary can be read
atomically without a LOCK instruction. Every state transition atomically
updates the prmid_summary. This avoids locking during sched in and out
of threads, except in the cases that a prmid needs to be allocated,
but this only occurs the first time a monr is scheduled in a package.
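
A rough sketch of that idea (field names taken from their usage in the diff;
see cqm.h in this patch for the real definition): both RMIDs are packed into
one 64-bit word so a plain atomic64 read/write is enough, no lock needed.

union prmid_summary {
        struct {
                u32 sched_rmid; /* RMID written to PQR_ASSOC at sched in */
                u32 read_rmid;  /* RMID used when reading occupancy */
        };
        u64 value;              /* both fields, read/written as one atomic64 */
};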

This patch introduces a first iteration of the monr hierarchy
that maintains two levels: the root monr, at top, and all other monrs
as leaves. The root monr is always (A)ctive.

This patch also implements the essential mechanism of per-package lazy
allocation of RMID.

The (I)state and the transitions from and to it are introduced in the
next patch in this series.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 633 
 arch/x86/events/intel/cqm.h | 149 +++
 include/linux/perf_event.h  |   2 +-
 3 files changed, 674 insertions(+), 110 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 541e515..65551bb 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -35,28 +35,66 @@ static struct perf_pmu_events_attr event_attr_##v = {   
\
 static LIST_HEAD(cache_groups);
 static DEFINE_MUTEX(cqm_mutex);
 
+struct monr *monr_hrchy_root;
+
 struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
 
-/*
- * Is @rmid valid for programming the hardware?
- *
- * rmid 0 is reserved by the hardware for all non-monitored tasks, which
- * means that we should never come across an rmid with that value.
- * Likewise, an rmid value of -1 is used to indicate "no rmid currently
- * assigned" and is used as part of the rotation code.
- */
-static inline bool __rmid_valid(u32 rmid)
+static inline bool __pmonr__in_astate(struct pmonr *pmonr)
 {
-   if (!rmid || rmid == INVALID_RMID)
-   return false;
+   lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+   return pmonr->prmid;
+}
 
-   return true;
+static inline bool __pmonr__in_ustate(struct pmonr *pmonr)
+{
+   lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+   return !pmonr->prmid;
 }
 
-static u64 __rmid_read(u32 rmid)
+static inline bool monr__is_root(struct monr *monr)
 {
-   /* XXX: Placeholder, will be removed in next patch. */
-   return 0;
+   return monr_hrchy_root == monr;
+}
+
+static inline bool monr__is_mon_active(struct monr *monr)
+{
+   return monr->flags & MONR_MON_ACTIVE;
+}
+
+static inline void __monr__set_summary_read_rmid(struct monr *monr, u32 rmid)
+{
+   int i;
+   struct pmonr *pmonr;
+   union prmid_summary summary;
+
+   monr_hrchy_assert_held_raw_spin_locks();
+
+   cqm_pkg_id_for_each_online(i) {
+   pmonr = monr->pmonrs[i];
+   WARN_ON_ONCE(!__pmonr__in_ustate(pmonr));
+   summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+   summary.read_rmid = rmid;
+   atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
+  

[PATCH 17/32] perf/core: adding pmu::event_terminate

2016-04-28 Thread David Carrillo-Cisneros
Allow a PMU to clean up an event before the event's teardown in
perf_events begins.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 include/linux/perf_event.h | 6 ++
 kernel/events/core.c   | 4 
 2 files changed, 10 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index b010b55..81e29c6 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -265,6 +265,12 @@ struct pmu {
int (*event_init)   (struct perf_event *event);
 
/*
+* Terminate the event for this PMU. Optional complement for a
+* successful event_init. Called before the event fields are tear down.
+*/
+   void (*event_terminate) (struct perf_event *event);
+
+   /*
 * Notification that the event was mapped or unmapped.  Called
 * in the context of the mapping task.
 */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6fd226f..2a868a6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3787,6 +3787,8 @@ static void _free_event(struct perf_event *event)
ring_buffer_attach(event, NULL);
mutex_unlock(&event->mmap_mutex);
}
+   if (event->pmu->event_terminate)
+   event->pmu->event_terminate(event);
 
if (is_cgroup_event(event))
perf_detach_cgroup(event);
@@ -8293,6 +8295,8 @@ err_per_task:
exclusive_event_destroy(event);
 
 err_pmu:
+   if (event->pmu->event_terminate)
+   event->pmu->event_terminate(event);
if (event->destroy)
event->destroy(event);
module_put(pmu->module);
-- 
2.8.0.rc3.226.g39d4020



[PATCH 15/32] perf/core: add hooks to expose architecture specific features in perf_cgroup

2016-04-28 Thread David Carrillo-Cisneros
The hooks allow architectures to extend the behavior of the
perf subsystem.

In this patch series, the hooks will be used by Intel's CQM PMU to
provide support for the llc_occupancy event.
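
For illustration, an architecture that wants to hook these callbacks would
provide its own definitions (e.g. from its asm/perf_event.h) so the #ifndef
defaults added below are not used; my_arch_cgroup_css_alloc/free are purely
hypothetical names, roughly:

/* asm/perf_event.h of a hypothetical architecture */
#define perf_cgroup_arch_css_alloc(parent_css, new_css) \
        my_arch_cgroup_css_alloc(parent_css, new_css)

#define perf_cgroup_arch_css_free(css) \
        my_arch_cgroup_css_free(css)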

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 include/linux/perf_event.h | 28 +++-
 kernel/events/core.c   | 27 +++
 2 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bf29258..b010b55 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -690,7 +690,9 @@ struct perf_cgroup_info {
 };
 
 struct perf_cgroup {
-   struct cgroup_subsys_state  css;
+   /* Architecture specific information. */
+   void *arch_info;
+   struct cgroup_subsys_state   css;
struct perf_cgroup_info __percpu *info;
 };
 
@@ -1228,4 +1230,28 @@ _name##_show(struct device *dev, 
\
\
 static struct device_attribute format_attr_##_name = __ATTR_RO(_name)
 
+
+/*
+ * Hooks for architecture specific extensions for perf_cgroup.
+ */
+#ifndef perf_cgroup_arch_css_alloc
+# define perf_cgroup_arch_css_alloc(parent_css, new_css) 0
+#endif
+
+#ifndef perf_cgroup_arch_css_online
+# define perf_cgroup_arch_css_online(css) 0
+#endif
+
+#ifndef perf_cgroup_arch_css_offline
+# define perf_cgroup_arch_css_offline(css) do { } while (0)
+#endif
+
+#ifndef perf_cgroup_arch_css_released
+# define perf_cgroup_arch_css_released(css) do { } while (0)
+#endif
+
+#ifndef perf_cgroup_arch_css_free
+# define perf_cgroup_arch_css_free(css) do { } while (0)
+#endif
+
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4aaec01..6fd226f 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9794,6 +9794,7 @@ static struct cgroup_subsys_state *
 perf_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 {
struct perf_cgroup *jc;
+   int ret;
 
jc = kzalloc(sizeof(*jc), GFP_KERNEL);
if (!jc)
@@ -9805,13 +9806,36 @@ perf_cgroup_css_alloc(struct cgroup_subsys_state 
*parent_css)
return ERR_PTR(-ENOMEM);
}
 
+   jc->arch_info = NULL;
+
+   ret = perf_cgroup_arch_css_alloc(parent_css, &jc->css);
+   if (ret)
+   return ERR_PTR(ret);
+
return &jc->css;
 }
 
+static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
+{
+   return perf_cgroup_arch_css_online(css);
+}
+
+static void perf_cgroup_css_offline(struct cgroup_subsys_state *css)
+{
+   perf_cgroup_arch_css_offline(css);
+}
+
+static void perf_cgroup_css_released(struct cgroup_subsys_state *css)
+{
+   perf_cgroup_arch_css_released(css);
+}
+
 static void perf_cgroup_css_free(struct cgroup_subsys_state *css)
 {
struct perf_cgroup *jc = container_of(css, struct perf_cgroup, css);
 
+   perf_cgroup_arch_css_free(css);
+
free_percpu(jc->info);
kfree(jc);
 }
@@ -9836,6 +9860,9 @@ static void perf_cgroup_attach(struct cgroup_taskset 
*tset)
 
 struct cgroup_subsys perf_event_cgrp_subsys = {
.css_alloc  = perf_cgroup_css_alloc,
+   .css_online = perf_cgroup_css_online,
+   .css_offline= perf_cgroup_css_offline,
+   .css_released   = perf_cgroup_css_released,
.css_free   = perf_cgroup_css_free,
.attach = perf_cgroup_attach,
 };
-- 
2.8.0.rc3.226.g39d4020



[PATCH 16/32] perf/x86/intel/cqm: add cgroup support

2016-04-28 Thread David Carrillo-Cisneros
Create a monr per monitored cgroup and insert the monrs into the monr
hierarchy. Task events are leaves of the lowest monitored ancestor cgroup
(the lowest cgroup ancestor with a monr).

CQM starts after the cgroup subsystem, and uses the cqm_initialized_key
static key to avoid interfering with the perf cgroup logic until
properly initialized. The cqm_init_mutex protects the initialization.
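
The guard is the usual static-key pattern; a minimal sketch (illustrative
only, cqm_cgroup_hook_example is not a real function and the exact call
sites live in the cgroup-related code added by this patch):

static int cqm_cgroup_hook_example(struct cgroup_subsys_state *css)
{
        if (!static_branch_unlikely(&cqm_initialized_key))
                return 0;       /* CQM not initialized yet, nothing to do */

        /* ... normal cgroup monitoring setup goes here ... */
        return 0;
}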

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c   | 594 +-
 arch/x86/events/intel/cqm.h   |  13 +
 arch/x86/include/asm/perf_event.h |  33 +++
 3 files changed, 637 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 98a919f..f000fd0 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -35,10 +35,17 @@ static struct perf_pmu_events_attr event_attr_##v = {   
\
 static LIST_HEAD(cache_groups);
 static DEFINE_MUTEX(cqm_mutex);
 
+/*
+ * Synchronizes initialization of cqm with cgroups.
+ */
+static DEFINE_MUTEX(cqm_init_mutex);
+
 struct monr *monr_hrchy_root;
 
 struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
 
+DEFINE_STATIC_KEY_FALSE(cqm_initialized_key);
+
 static inline bool __pmonr__in_istate(struct pmonr *pmonr)
 {
lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
@@ -69,6 +76,9 @@ static inline bool __pmonr__in_ustate(struct pmonr *pmonr)
return !pmonr->prmid && !pmonr->ancestor_pmonr;
 }
 
+/* Whether the monr is root. Recall that the cgroups can not be root and yet
+ * point to a root monr.
+ */
 static inline bool monr__is_root(struct monr *monr)
 {
return monr_hrchy_root == monr;
@@ -115,6 +125,23 @@ static inline void __monr__clear_mon_active(struct monr 
*monr)
monr->flags &= ~MONR_MON_ACTIVE;
 }
 
+static inline bool monr__is_cgroup_type(struct monr *monr)
+{
+   return monr->mon_cgrp;
+}
+
+static inline bool monr_is_event_type(struct monr *monr)
+{
+   return !monr->mon_cgrp && monr->mon_event_group;
+}
+
+
+static inline struct cgroup_subsys_state *get_root_perf_css(void)
+{
+   /* Get css for root cgroup */
+   return  init_css_set.subsys[perf_event_cgrp_id];
+}
+
 /*
  * Update if enough time has passed since last read.
  *
@@ -725,6 +752,7 @@ static struct monr *monr_alloc(void)
monr->parent = NULL;
INIT_LIST_HEAD(&monr->children);
INIT_LIST_HEAD(&monr->parent_entry);
+   monr->mon_cgrp = NULL;
monr->mon_event_group = NULL;
 
/* Iterate over all pkgs, even unitialized ones. */
@@ -947,7 +975,7 @@ retry:
 }
 
 /*
- * Wrappers for monr manipulation in events.
+ * Wrappers for monr manipulation in events and cgroups.
  *
  */
 static inline struct monr *monr_from_event(struct perf_event *event)
@@ -960,6 +988,100 @@ static inline void event_set_monr(struct perf_event 
*event, struct monr *monr)
WRITE_ONCE(event->hw.cqm_monr, monr);
 }
 
+#ifdef CONFIG_CGROUP_PERF
+static inline struct monr *monr_from_perf_cgroup(struct perf_cgroup *cgrp)
+{
+   struct monr *monr;
+   struct cgrp_cqm_info *cqm_info;
+
+   cqm_info = (struct cgrp_cqm_info *)READ_ONCE(cgrp->arch_info);
+   WARN_ON_ONCE(!cqm_info);
+   monr = READ_ONCE(cqm_info->monr);
+   return monr;
+}
+
+static inline struct perf_cgroup *monr__get_mon_cgrp(struct monr *monr)
+{
+   WARN_ON_ONCE(!monr);
+   return READ_ONCE(monr->mon_cgrp);
+}
+
+static inline void
+monr__set_mon_cgrp(struct monr *monr, struct perf_cgroup *cgrp)
+{
+   WRITE_ONCE(monr->mon_cgrp, cgrp);
+}
+
+static inline void
+perf_cgroup_set_monr(struct perf_cgroup *cgrp, struct monr *monr)
+{
+   WRITE_ONCE(cgrp_to_cqm_info(cgrp)->monr, monr);
+}
+
+/*
+ * A perf_cgroup is monitored when it's set in a monr->mon_cgrp.
+ * There is a many-to-one relationship between perf_cgroup's monrs
+ * and monrs' mon_cgrp. A monitored cgroup is necesarily referenced
+ * back by its monr's mon_cgrp.
+ */
+static inline bool perf_cgroup_is_monitored(struct perf_cgroup *cgrp)
+{
+   struct monr *monr;
+   struct perf_cgroup *monr_cgrp;
+
+   /* monr can be referenced by a cgroup other than the one in its
+* mon_cgrp, be careful.
+*/
+   monr = monr_from_perf_cgroup(cgrp);
+
+   monr_cgrp = monr__get_mon_cgrp(monr);
+   /* Root monr do not have a cgroup associated before initialization.
+* mon_cgrp and mon_event_group are union, so the pointer must be set
+* for all non-root monrs.
+*/
+   return  monr_cgrp && monr__get_mon_cgrp(monr) == cgrp;
+}
+
+/* Set css's monr to the monr of its lowest monitored ancestor. */
+static inline void __css_set_monr_to_lma(struct cgroup_subsys_state *css)
+{
+   lockdep_assert_held(&cqm_mutex);
+   if (!css->parent) {
+   perf_cgroup_set_monr(css_to_perf_cgroup(css), monr_hrchy_root);
+   return;
+   }
+   perf_cgroup_s

[PATCH 19/32] perf/core: introduce PMU event flag PERF_CGROUP_NO_RECURSION

2016-04-28 Thread David Carrillo-Cisneros
Some events, such as Intel's CQM llc_occupancy, need small deviations
from the traditional behavior in the generic code in a way that depends
on the event itself (and is known by the PMU) and not on a field of
perf_event_attrs.

An example is the recursive scope for cgroups: The generic code handles
the cgroup hierarchy for a cgroup C by simultaneously adding to the PMU
the events of all cgroups that are ancestors of C. This approach is
incompatible with the CQM hw, which only allows one RMID per virtual core
at a time. CQM's PMU works around this limitation by internally
maintaining the hierarchical dependency between monitored cgroups, and
only requires that the generic code add the current cgroup's event to
the PMU.

The introduction of the flag PERF_CGROUP_NO_RECURSION allows the PMU to
signal the generic code to avoid using recursive cgroup scope for
llc_occupancy events, preventing an undesired overwrite of RMIDs.

The PERF_CGROUP_NO_RECURSION, introduced in this patch, is the first flag
of this type, more will be added in this patch series.
To keep things tidy, this patch introduces the flag field pmu_event_flags,
intended to contain all flags that:
  - Are not user-configurable event attributes (not suitable for
perf_event_attributes).
  - Are known by the PMU during initialization of struct perf_event.
  - Signal something to the generic code.
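
A PMU opts out of the recursive cgroup scope by setting the flag at event
init time; a minimal sketch (my_pmu_event_init is a hypothetical PMU -- the
real user is the CQM patch later in this series):

static int my_pmu_event_init(struct perf_event *event)
{
        /* ... PMU specific setup ... */

        /* only this cgroup's event, do not add ancestors' events */
        event->pmu_event_flags |= PERF_CGROUP_NO_RECURSION;
        return 0;
}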

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 include/linux/perf_event.h | 10 ++
 kernel/events/core.c   |  3 +++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 81e29c6..e4c58b0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -594,9 +594,19 @@ struct perf_event {
 #endif
 
struct list_headsb_list;
+
+   /* Flags to generic code set by PMU. */
+   int pmu_event_flags;
+
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+/*
+ * Possible flags for mpu_event_flags.
+ */
+/* Do not enable cgroup events in descendant cgroups. */
+#define PERF_CGROUP_NO_RECURSION   (1 << 0)
+
 /**
  * struct perf_event_context - event context structure
  *
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2a868a6..33961ec 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -545,6 +545,9 @@ perf_cgroup_match(struct perf_event *event)
if (!cpuctx->cgrp)
return false;
 
+   if (event->pmu_event_flags & PERF_CGROUP_NO_RECURSION)
+   return cpuctx->cgrp->css.cgroup == event->cgrp->css.cgroup;
+
/*
 * Cgroup scoping is recursive.  An event enabled for a cgroup is
 * also enabled for all its descendant cgroups.  If @cpuctx's
-- 
2.8.0.rc3.226.g39d4020



[PATCH 12/32] perf/x86/intel/cqm: add per-package RMID rotation

2016-04-28 Thread David Carrillo-Cisneros
This version of RMID rotation improves over the original one by:
  1. Being per-package. No need for IPIs to test for occupancy.
  2. Since the monr hierarchy removed the potential conflicts between
 events, the new RMID rotation logic does not need to check and
 resolve conflicts.
  3. No need to maintain an unused RMID as rotation_rmid, effectively
 freeing one RMID per package.
  4. Guarantee that monitored events and cgroups with a valid RMID keep
 the RMID for a user-configurable time: __cqm_min_mon_slice ms.
 Previously, it was likely to receive an RMID in one execution of the
 rotation logic just to have it removed in the next. That was
 especially problematic in the presence of event conflicts
 (ie. cgroup events and thread events in a descendant cgroup).
  5. Do not increase the dirty threshold unless strictly necessary to make
 progress. The previous version simultaneously stole RMIDs and increased
 the dirty threshold (the maximum number of cache lines with spurious
 occupancy associated with a "clean" RMID). This version makes sure
 that the dirty threshold is only increased when that is the only way
 to make progress in the RMID rotation (the case when too many RMIDs
 in limbo do not drop occupancy despite having spent enough time in
 limbo).
 This change reduces spurious occupancy as a source of error.
  6. Do not steal RMIDs unnecessarily. Thanks to more detailed
 bookkeeping, this patch guarantees that the number of RMIDs in limbo
 does not exceed the number of RMIDs needed by pmonrs currently waiting
 for an RMID.
  7. Reutilize dirty limbo RMIDs when appropriate. In this new version, a
 stolen RMID remains referenced by its former pmonr owner until it is
 reutilized by another pmonr or it is moved from limbo into the pool
 of free RMIDs.
 These RMIDs that are referenced and in limbo are not written into the
 MSR_IA32_PQR_ASSOC msr, therefore, they have the chance to drop
 occupancy as any other limbo RMID. If the pmonr with a limbo RMID is
 to be activated, then it reuses its former RMID even if it's still
 dirty. The occupancy attributed to that RMID is part of the pmonr
 occupancy and therefore reusing the RMID even when dirty decreases
 the error of the read.
 This feature decreases the negative impact of RMIDs that do not drop
 occupancy in the efficiency of the rotation logic.

From a user perspective, the behavior of the new rotation logic is
controlled by SLO type parameters:
  __cqm_min_mon_slice : Minimum time a monr is to be monitored
before being eligible by rotation logic to lose any of its RMIDs.
  __cqm_max_wait_mon :  Maximum time a monr can be deactivated
before forcing rotation logic to be more aggressive (stealing more
RMIDs per iteration).
  __cqm_min_progress_rate: Minimum number of pmonrs that must be
activated per second to consider that rotation logic's progress
is acceptable.

Since the minimum progress rate is an SLO, the magnitude of the rotation
period (the rtimer_interval_ms) does not control the speed of RMID rotation;
it only controls the frequency at which the rotation logic is executed.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 727 
 arch/x86/events/intel/cqm.h |  59 +++-
 2 files changed, 784 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index caf7152..31f0fd6 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -235,9 +235,14 @@ static int pkg_data_init_cpu(int cpu)
INIT_LIST_HEAD(&pkg_data->istate_pmonrs_lru);
INIT_LIST_HEAD(&pkg_data->ilstate_pmonrs_lru);
 
+   pkg_data->nr_instate_pmonrs = 0;
+   pkg_data->nr_ilstate_pmonrs = 0;
+
mutex_init(&pkg_data->pkg_data_mutex);
raw_spin_lock_init(&pkg_data->pkg_data_lock);
 
+   INIT_DELAYED_WORK(
+   &pkg_data->rotation_work, intel_cqm_rmid_rotation_work);
/* XXX: Chose randomly*/
pkg_data->rotation_cpu = cpu;
 
@@ -295,6 +300,10 @@ static struct pmonr *pmonr_alloc(int cpu)
pmonr->monr = NULL;
INIT_LIST_HEAD(&pmonr->rotation_entry);
 
+   pmonr->last_enter_istate = 0;
+   pmonr->last_enter_astate = 0;
+   pmonr->nr_enter_istate = 0;
+
pmonr->pkg_id = topology_physical_package_id(cpu);
summary.sched_rmid = INVALID_RMID;
summary.read_rmid = INVALID_RMID;
@@ -346,6 +355,8 @@ __pmonr__finish_to_astate(struct pmonr *pmonr, struct prmid 
*prmid)
 
pmonr->prmid = prmid;
 
+   pmonr->last_enter_astate = jiffies;
+
list_move_tail(
&prmid->pool_entry, &__pkg_data(pmonr, active_prmids_pool));
list_move_tail(
@@ -373,6 +384,8 @@ __pmonr__instate_to_astate(struct pmonr *pmonr, struct 
prmid *prmid

[PATCH 18/32] perf/x86/intel/cqm: use pmu::event_terminate

2016-04-28 Thread David Carrillo-Cisneros
Utilized to detach a monr from a cgroup before the event's reference
to the cgroup is removed.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index f000fd0..dcf7f4a 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2391,7 +2391,7 @@ static int intel_cqm_event_add(struct perf_event *event, 
int mode)
return prmid_summary__is_istate(summary) ? -1 : 0;
 }
 
-static void intel_cqm_event_destroy(struct perf_event *event)
+static void intel_cqm_event_terminate(struct perf_event *event)
 {
struct perf_event *group_other = NULL;
struct monr *monr;
@@ -2438,6 +2438,17 @@ static void intel_cqm_event_destroy(struct perf_event 
*event)
if (monr__is_root(monr))
goto exit;
 
+   /* Handle cgroup event. */
+   if (event->cgrp) {
+   monr->mon_event_group = NULL;
+   if ((event->cgrp->css.flags & CSS_ONLINE) &&
+   !cgrp_to_cqm_info(event->cgrp)->cont_monitoring)
+   __css_stop_monitoring(&monr__get_mon_cgrp(monr)->css);
+
+   goto exit;
+   }
+   WARN_ON_ONCE(!monr_is_event_type(monr));
+
/* Transition all pmonrs to (U)state. */
monr_hrchy_acquire_locks(flags, i);
 
@@ -2478,8 +2489,6 @@ static int intel_cqm_event_init(struct perf_event *event)
INIT_LIST_HEAD(&event->hw.cqm_event_groups_entry);
INIT_LIST_HEAD(&event->hw.cqm_event_group_entry);
 
-   event->destroy = intel_cqm_event_destroy;
-
mutex_lock(&cqm_mutex);
 
 
@@ -2595,6 +2604,7 @@ static struct pmu intel_cqm_pmu = {
.attr_groups = intel_cqm_attr_groups,
.task_ctx_nr = perf_sw_context,
.event_init  = intel_cqm_event_init,
+   .event_terminate = intel_cqm_event_terminate,
.add = intel_cqm_event_add,
.del = intel_cqm_event_stop,
.start   = intel_cqm_event_start,
-- 
2.8.0.rc3.226.g39d4020



[PATCH 23/32] perf/core: introduce PERF_INACTIVE_*_READ_* flags

2016-04-28 Thread David Carrillo-Cisneros
Some offcore and uncore events, such as the new intel_cqm/llc_occupancy,
can be read even if the event is not active on its CPU (or on any CPU).
In those cases, a freshly read value is more recent (and therefore
preferable) than the last value stored at event sched out.

There are two cases covered in this patch to allow Intel's CQM (and
potentially other per-package events) to obtain updated values regardless
of the scheduling status of the event on a particular CPU. Each case is
covered by a new event::pmu_event_flags flag:
1) PERF_INACTIVE_CPU_READ_PKG: An event attached to a CPU that can
be read in any CPU in its event:cpu's package, even if inactive.
2) PERF_INACTIVE_EV_READ_ANY_CPU: An event that can be read in any
CPU in any package in the system even if inactive.

A consequence of reading a new value from hw on each call to
perf_event_read() is that reading and saving the event value at sched out
can be avoided, since the value will never be utilized. Therefore, a PMU
that sets any of the PERF_INACTIVE_*_READ_* flags can choose not to read
on context switch, at the cost of inherit_stats not working properly.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 include/linux/perf_event.h | 15 
 kernel/events/core.c   | 59 +++---
 2 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e4c58b0..054d7f4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -607,6 +607,21 @@ struct perf_event {
 /* Do not enable cgroup events in descendant cgroups. */
 #define PERF_CGROUP_NO_RECURSION   (1 << 0)
 
+/* CPU Event can read from event::cpu's package even if not in
+ * PERF_EVENT_STATE_ACTIVE, event::cpu must be a valid CPU.
+ */
+#define PERF_INACTIVE_CPU_READ_PKG (1 << 1)
+
+/* Event can read from any package even if not in PERF_EVENT_STATE_ACTIVE. */
+#define PERF_INACTIVE_EV_READ_ANY_CPU  (1 << 2)
+
+static inline bool __perf_can_read_inactive(struct perf_event *event)
+{
+   return (event->pmu_event_flags & PERF_INACTIVE_EV_READ_ANY_CPU) ||
+   ((event->pmu_event_flags & PERF_INACTIVE_CPU_READ_PKG) &&
+   (event->cpu != -1));
+}
+
 /**
  * struct perf_event_context - event context structure
  *
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 33961ec..28d1b51 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3266,15 +3266,28 @@ static void __perf_event_read(void *info)
struct perf_event_context *ctx = event->ctx;
struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
struct pmu *pmu = event->pmu;
+   bool read_inactive = __perf_can_read_inactive(event);
+
+   WARN_ON_ONCE(event->cpu == -1 &&
+   (event->pmu_event_flags & PERF_INACTIVE_CPU_READ_PKG));
+
+   /* If inactive, we should be reading in the adequate package. */
+   WARN_ON_ONCE(
+   event->state != PERF_EVENT_STATE_ACTIVE &&
+   (event->pmu_event_flags & PERF_INACTIVE_CPU_READ_PKG) &&
+   (topology_physical_package_id(event->cpu) !=
+   topology_physical_package_id(smp_processor_id(;
 
/*
 * If this is a task context, we need to check whether it is
-* the current task context of this cpu.  If not it has been
+* the current task context of this cpu or if the event
+* can be read while inactive.  If cannot read while inactive
+* and not in current cpu, then the event has been
 * scheduled out before the smp call arrived.  In that case
 * event->count would have been updated to a recent sample
 * when the event was scheduled out.
 */
-   if (ctx->task && cpuctx->task_ctx != ctx)
+   if (ctx->task && cpuctx->task_ctx != ctx && !read_inactive)
return;
 
raw_spin_lock(&ctx->lock);
@@ -3284,9 +3297,11 @@ static void __perf_event_read(void *info)
}
 
update_event_times(event);
-   if (event->state != PERF_EVENT_STATE_ACTIVE)
+
+   if (event->state != PERF_EVENT_STATE_ACTIVE && !read_inactive)
goto unlock;
 
+
if (!data->group) {
pmu->read(event);
data->ret = 0;
@@ -3299,7 +3314,8 @@ static void __perf_event_read(void *info)
 
list_for_each_entry(sub, &event->sibling_list, group_entry) {
update_event_times(sub);
-   if (sub->state == PERF_EVENT_STATE_ACTIVE) {
+   if (sub->state == PERF_EVENT_STATE_ACTIVE ||
+   __perf_can_read_inactive(sub)) {
/*
 * Use sibling's PMU rather than @event's since
 * sibling could be on different (eg: software) PMU.
@@ -3368,19 +3384,34 @@ u64 perf_event_read_local(struct perf_event *event)
 static int perf_event_read(

[PATCH 24/32] perf/x86/intel/cqm: use PERF_INACTIVE_*_READ_* flags in CQM

2016-04-28 Thread David Carrillo-Cisneros
Use the newly added pmu_event_flags to:
  - Allow thread events to be read from any CPU even if not in
  ACTIVE state. Since inter-package values are polled, a thread's
  occupancy is always:

local occupancy (read from hw) + remote occupancy (polled values)

  - Allow cpu/cgroup events to be read from any CPU in the package where
  they run. This potentially saves IPIs when the read function runs in the
  same package but on a different CPU than the event.

Since reading will always return a new value and inherit_stats is not
supported (due to all children events sharing the same RMID), there is no
need to read during sched_out of an event.
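
For illustration only (not part of the patch): a stand-alone sketch of the
decision the read path makes with these flags. The struct and helper below
are simplified stand-ins, not kernel API.

#include <stdbool.h>

/* Simplified model of the flags used above. */
#define PERF_INACTIVE_CPU_READ_PKG      (1 << 1)
#define PERF_INACTIVE_EV_READ_ANY_CPU   (1 << 2)

struct sketch_event {
        int cpu;                /* -1 for task (thread) events */
        unsigned int flags;     /* stands in for pmu_event_flags */
};

enum read_target { READ_ANY_CPU, READ_EVENT_PKG, READ_EVENT_CPU };

/* Where does a read of an inactive event have to run? */
static enum read_target pick_read_target(const struct sketch_event *e)
{
        if (e->flags & PERF_INACTIVE_EV_READ_ANY_CPU)
                return READ_ANY_CPU;    /* thread events: local + polled remote */
        if ((e->flags & PERF_INACTIVE_CPU_READ_PKG) && e->cpu != -1)
                return READ_EVENT_PKG;  /* any CPU in event->cpu's package */
        return READ_EVENT_CPU;          /* otherwise IPI event->cpu */
}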

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index c14f1c7..daf9fdf 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2702,6 +2702,16 @@ static int intel_cqm_event_init(struct perf_event *event)
 */
event->pmu_event_flags |= PERF_CGROUP_NO_RECURSION;
 
+   /* Events in CQM PMU are per-package and can be read even when
+* the cpu is not running the event.
+*/
+   if (event->cpu < 0) {
+   WARN_ON_ONCE(!(event->attach_state & PERF_ATTACH_TASK));
+   event->pmu_event_flags |= PERF_INACTIVE_EV_READ_ANY_CPU;
+   } else  {
+   event->pmu_event_flags |= PERF_INACTIVE_CPU_READ_PKG;
+   }
+
mutex_lock(&cqm_mutex);
 
 
-- 
2.8.0.rc3.226.g39d4020



[PATCH 13/32] perf/x86/intel/cqm: add polled update of RMID's llc_occupancy

2016-04-28 Thread David Carrillo-Cisneros
To avoid IPIs from IRQ disabled contexts, the occupancy for a RMID in a
remote package (a package other than the one the current cpu belongs to) is
obtained from a cache that is periodically updated.
This removes the need for an IPI when reading occupancy for a task event,
that was the reason to add the problematic pmu::count and dummy
perf_event_read() in the previous CQM version.

The occupancy of all active prmids is updated every
__rmid_timed_update_period ms.

To avoid holding raw_spin_locks on the prmid hierarchy for too long, the
raw rmids to be read are copied to a temporary array list. The array list
is consumed to perform the wrmsrl and rdmsrl for each RMID required to
read its llc_occupancy.

This decoupling of traversing the RMID hierarchy and read occupancy is
especially useful due to the high latency of the wrmsrl and rdmsrl for the
llc_occupancy event (thousands of cycles in my test machine).

To avoid unnecessary memory allocations, the objects used to temporarily
store RMIDs are pooled in a per-package list and allocated on demand.

The infrastructure introduced in this patch will be used in future patches
in this series to perform reads on subtrees of a prmid hierarchy.
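
For illustration only (not part of the patch): a sketch of the caching that
the timed update performs. read_hw_occupancy() and now_jiffies() are
hypothetical stand-ins for the wrmsrl/rdmsrl sequence and jiffies.

struct cached_rmid {
        unsigned int rmid;              /* raw hardware RMID */
        unsigned long long last_value;  /* last polled llc_occupancy */
        unsigned long last_read;        /* time of the last hw read */
};

extern unsigned long long read_hw_occupancy(unsigned int rmid); /* stand-in */
extern unsigned long now_jiffies(void);                         /* stand-in */

/* Stage 1 (under the package lock in the real code): copy the RMIDs to
 * read into a temporary array so the lock is not held across the slow MSR
 * accesses.  Stage 2, sketched here with the lock dropped: do the
 * expensive reads and cache value + timestamp for later remote readers. */
static void polled_update(struct cached_rmid *cache, int nr,
                          unsigned long min_delta)
{
        int i;

        for (i = 0; i < nr; i++) {
                if (now_jiffies() - cache[i].last_read < min_delta)
                        continue;       /* cached value still fresh enough */
                cache[i].last_value = read_hw_occupancy(cache[i].rmid);
                cache[i].last_read  = now_jiffies();
        }
}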

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 251 +++-
 arch/x86/events/intel/cqm.h |  36 +++
 2 files changed, 286 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 31f0fd6..904f2d3 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -189,6 +189,8 @@ static inline bool __valid_pkg_id(u16 pkg_id)
return pkg_id < PQR_MAX_NR_PKGS;
 }
 
+static int anode_pool__alloc_one(u16 pkg_id);
+
 /* Init cqm pkg_data for @cpu 's package. */
 static int pkg_data_init_cpu(int cpu)
 {
@@ -241,11 +243,19 @@ static int pkg_data_init_cpu(int cpu)
mutex_init(&pkg_data->pkg_data_mutex);
raw_spin_lock_init(&pkg_data->pkg_data_lock);
 
+   INIT_LIST_HEAD(&pkg_data->anode_pool_head);
+   raw_spin_lock_init(&pkg_data->anode_pool_lock);
+
INIT_DELAYED_WORK(
&pkg_data->rotation_work, intel_cqm_rmid_rotation_work);
/* XXX: Chose randomly*/
pkg_data->rotation_cpu = cpu;
 
+   INIT_DELAYED_WORK(
+   &pkg_data->timed_update_work, intel_cqm_timed_update_work);
+   /* XXX: Chose randomly*/
+   pkg_data->timed_update_cpu = cpu;
+
cqm_pkgs_data[pkg_id] = pkg_data;
return 0;
 }
@@ -744,6 +754,189 @@ static void monr_dealloc(struct monr *monr)
 }
 
 /*
+ * Logic for reading sets of rmids into per-package lists.
+ * This package lists can be used to update occupancies without
+ * holding locks in the hierarchies of pmonrs.
+ * @pool: free pool.
+ */
+struct astack {
+   struct list_headpool;
+   struct list_headitems;
+   int top_idx;
+   int max_idx;
+   u16 pkg_id;
+};
+
+static void astack__init(struct astack *astack, int max_idx, u16 pkg_id)
+{
+   INIT_LIST_HEAD(&astack->items);
+   INIT_LIST_HEAD(&astack->pool);
+   astack->top_idx = -1;
+   astack->max_idx = max_idx;
+   astack->pkg_id = pkg_id;
+}
+
+/* Try to enlarge astack->pool with a anode from this pkgs pool. */
+static int astack__try_add_pool(struct astack *astack)
+{
+   unsigned long flags;
+   int ret = -1;
+   struct pkg_data *pkg_data = cqm_pkgs_data[astack->pkg_id];
+
+   raw_spin_lock_irqsave(&pkg_data->anode_pool_lock, flags);
+
+   if (!list_empty(&pkg_data->anode_pool_head)) {
+   list_move_tail(pkg_data->anode_pool_head.prev, &astack->pool);
+   ret = 0;
+   }
+
+   raw_spin_unlock_irqrestore(&pkg_data->anode_pool_lock, flags);
+   return ret;
+}
+
+static int astack__push(struct astack *astack)
+{
+   if (!list_empty(&astack->items) && astack->top_idx < astack->max_idx) {
+   astack->top_idx++;
+   return 0;
+   }
+
+   if (list_empty(&astack->pool) && astack__try_add_pool(astack))
+   return -1;
+   list_move_tail(astack->pool.prev, &astack->items);
+   astack->top_idx = 0;
+   return 0;
+}
+
+/* Must be non-empty */
+# define __astack__top(astack_, member_) \
+   list_last_entry(&(astack_)->items, \
+   struct anode, entry)->member_[(astack_)->top_idx]
+
+static void astack__clear(struct astack *astack)
+{
+   list_splice_tail_init(&astack->items, &astack->pool);
+   astack->top_idx = -1;
+}
+
+/* Put back into pkg_data's pool. */
+static void astack__release(struct astack *astack)
+{
+   unsigned long flags;
+   struct pkg_data *pkg_data = cqm_pkgs_data[astack->pkg_id];
+
+   astack__clear(astack);
+   raw_spin_lock_irqsave(&pkg_data->anode_pool_lock, flags);
+   list_splice_tail_init(&astack->pool, &pkg_data->anode_pool_head);
+  

[PATCH 20/32] x86/intel/cqm: use PERF_CGROUP_NO_RECURSION in CQM

2016-04-28 Thread David Carrillo-Cisneros
The CQM hardware is not compatible with the way generic code handles
cgroup hierarchies (simultaneously adding the events for all ancestors
of the current cgroup). This version of Intel's CQM driver handles
cgroup hierarchy internally.

Set PERF_CGROUP_NO_RECURSION for llc_occupancy events to
signal perf's generic code to not add events for ancestors of the current
cgroup.
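
For illustration only (not part of the patch): a simplified model of the
ancestor matching that the flag short-circuits. The types are stand-ins,
not the kernel's cgroup structures.

#include <stdbool.h>
#include <stddef.h>

#define PERF_CGROUP_NO_RECURSION (1 << 0)

struct sketch_cgroup { struct sketch_cgroup *parent; };

struct sketch_event {
        struct sketch_cgroup *cgrp;     /* cgroup the event is attached to */
        unsigned int pmu_event_flags;
};

/* Should @event count while a task of @task_cgrp is running? */
static bool event_matches_cgroup(const struct sketch_event *event,
                                 const struct sketch_cgroup *task_cgrp)
{
        const struct sketch_cgroup *c;

        if (event->pmu_event_flags & PERF_CGROUP_NO_RECURSION)
                return event->cgrp == task_cgrp;        /* exact cgroup only */

        /* Default behaviour: an event also counts for descendant cgroups,
         * i.e. it matches the task's cgroup or any of its ancestors. */
        for (c = task_cgrp; c != NULL; c = c->parent)
                if (c == event->cgrp)
                        return true;
        return false;
}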

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index dcf7f4a..d8d3191 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2489,6 +2489,14 @@ static int intel_cqm_event_init(struct perf_event *event)
INIT_LIST_HEAD(&event->hw.cqm_event_groups_entry);
INIT_LIST_HEAD(&event->hw.cqm_event_group_entry);
 
+   /*
+* CQM driver handles cgroup recursion and since only one
+* RMID can be programmed at a time in each core, then
+* it is incompatible with the way generic code handles
+* cgroup hierarchies.
+*/
+   event->pmu_event_flags |= PERF_CGROUP_NO_RECURSION;
+
mutex_lock(&cqm_mutex);
 
 
-- 
2.8.0.rc3.226.g39d4020



[PATCH 21/32] perf/x86/intel/cqm: handle inherit event and inherit_stat flag

2016-04-28 Thread David Carrillo-Cisneros
Since inherited events are part of the same cqm cache group, they share the
RMID and therefore they cannot provide the granularity required by
inherit_stats. Changing this would require to create a subtree of monrs for
each parent event and its inherited events, a potential improvement for
future patches.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index d8d3191..6e85021 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2483,6 +2483,7 @@ static int intel_cqm_event_init(struct perf_event *event)
event->attr.exclude_idle   ||
event->attr.exclude_host   ||
event->attr.exclude_guest  ||
+   event->attr.inherit_stat   || /* cqm groups share rmid */
event->attr.sample_period) /* no sampling */
return -EINVAL;
 
-- 
2.8.0.rc3.226.g39d4020



[PATCH 28/32] perf/x86/intel/cqm: add CQM attributes to perf_event cgroup

2016-04-28 Thread David Carrillo-Cisneros
Expose the boolean attribute intel_cqm.cont_monitoring. When set, the
associated cgroup will be monitored even if no perf cgroup event is
attached to it.

The occupancy of a cgroup must be read using a perf_event, regardless of
the value of intel_cqm.cont_monitoring.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c   | 81 +++
 arch/x86/include/asm/perf_event.h |  6 +++
 2 files changed, 87 insertions(+)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 4ece0a4..33691c1 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -3194,4 +3194,85 @@ no_rmid:
 #endif
 }
 
+#ifdef CONFIG_CGROUP_PERF
+
+/* kernfs guarantees that css doesn't need to be pinned. */
+static u64 cqm_cont_monitoring_read_u64(struct cgroup_subsys_state *css,
+   struct cftype *cft)
+{
+   int ret = -1;
+   struct perf_cgroup *perf_cgrp = css_to_perf_cgroup(css);
+   struct monr *monr;
+
+   mutex_lock(&cqm_init_mutex);
+   if (!static_branch_likely(&cqm_initialized_key))
+   goto out;
+
+   mutex_lock(&cqm_mutex);
+
+   ret = css_to_cqm_info(css)->cont_monitoring;
+   monr = monr_from_perf_cgroup(perf_cgrp);
+   WARN_ON(!monr->mon_event_group &&
+   (ret != perf_cgroup_is_monitored(perf_cgrp)));
+
+   mutex_unlock(&cqm_mutex);
+out:
+   mutex_unlock(&cqm_init_mutex);
+   return ret;
+}
+
+/* kernfs guarantees that css doesn't need to be pinned. */
+static int cqm_cont_monitoring_write_u64(struct cgroup_subsys_state *css,
+struct cftype *cft, u64 value)
+{
+   int ret = 0;
+   struct perf_cgroup *perf_cgrp = css_to_perf_cgroup(css);
+   struct monr *monr;
+
+   if (value > 1)
+   return -1;
+
+   mutex_lock(&cqm_init_mutex);
+   if (!static_branch_likely(&cqm_initialized_key)) {
+   ret = -1;
+   goto out;
+   }
+
+   /* Root cgroup cannot stop being monitored. */
+   if (css == get_root_perf_css())
+   goto out;
+
+   mutex_lock(&cqm_mutex);
+
+   monr = monr_from_perf_cgroup(perf_cgrp);
+
+   if (value && !perf_cgroup_is_monitored(perf_cgrp))
+   ret = __css_start_monitoring(css);
+   else if (!value &&
+!monr->mon_event_group && perf_cgroup_is_monitored(perf_cgrp))
+   ret = __css_stop_monitoring(css);
+
+   WARN_ON(!monr->mon_event_group &&
+   (value != perf_cgroup_is_monitored(perf_cgrp)));
+
+   css_to_cqm_info(css)->cont_monitoring = value;
+
+   mutex_unlock(&cqm_mutex);
+out:
+   mutex_unlock(&cqm_init_mutex);
+   return ret;
+}
+
+struct cftype perf_event_cgrp_arch_subsys_cftypes[] = {
+   {
+   .name = "cqm_cont_monitoring",
+   .read_u64 = cqm_cont_monitoring_read_u64,
+   .write_u64 = cqm_cont_monitoring_write_u64,
+   },
+
+   {}  /* terminate */
+};
+
+#endif
+
 device_initcall(intel_cqm_init);
diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index c22d9e0..99fc206 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -326,6 +326,12 @@ inline void perf_cgroup_arch_css_released(struct 
cgroup_subsys_state *css);
perf_cgroup_arch_css_free
 inline void perf_cgroup_arch_css_free(struct cgroup_subsys_state *css);
 
+extern struct cftype perf_event_cgrp_arch_subsys_cftypes[];
+
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS \
+   .dfl_cftypes = perf_event_cgrp_arch_subsys_cftypes, \
+   .legacy_cftypes = perf_event_cgrp_arch_subsys_cftypes,
+
 #else
 
 #define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
-- 
2.8.0.rc3.226.g39d4020



[PATCH 27/32] perf/core: add perf_event cgroup hooks for subsystem attributes

2016-04-28 Thread David Carrillo-Cisneros
Allow architectures to define additional attributes for the perf cgroup.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 include/linux/perf_event.h | 4 
 kernel/events/core.c   | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 054d7f4..b0f6088 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1285,4 +1285,8 @@ static struct device_attribute format_attr_##_name = 
__ATTR_RO(_name)
 # define perf_cgroup_arch_css_free(css) do { } while (0)
 #endif
 
+#ifndef PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
+#endif
+
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 28d1b51..804fdd1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9903,5 +9903,7 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
.css_released   = perf_cgroup_css_released,
.css_free   = perf_cgroup_css_free,
.attach = perf_cgroup_attach,
+   /* Expand architecture specific attributes. */
+   PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
 };
 #endif /* CONFIG_CGROUP_PERF */
-- 
2.8.0.rc3.226.g39d4020



[PATCH 32/32] perf/stat: revamp error handling for snapshot and per_pkg events

2016-04-28 Thread David Carrillo-Cisneros
A package-wide event can return a valid read even if it has not run on a
specific cpu; this does not fit well with the assumption that run == 0
is equivalent to a <not counted>.

To fix the problem, this patch defines special error values for val,
run and ena (~0ULL), and use them to signal read errors, allowing run == 0
to be a valid value for package events. A new value, NA, is output on
read error and when event has not been enabled (timed enabled == 0).

Finally, this patch revamps calculation of deltas and scaling for snapshot
events, removing the calculation of deltas for time running and enabled in
snapshot events, as it should be.
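
For illustration only (not part of the patch): a stand-alone sketch of the
sentinel convention described above.

#include <stdint.h>
#include <stdio.h>

#define PERF_COUNTS_NA (~0ULL)          /* "could not be read" marker */

struct counts { uint64_t val, ena, run; };

static void mark_read_failed(struct counts *c)
{
        c->val = c->ena = c->run = PERF_COUNTS_NA;
}

static void print_running(const struct counts *c)
{
        /* run == 0 with a valid ena is now a legal reading (e.g. a package
         * event that never ran on this CPU); only the sentinels and
         * ena == 0 print NA. */
        if (c->run == PERF_COUNTS_NA || c->ena == PERF_COUNTS_NA || !c->ena)
                printf("  (NA)\n");
        else
                printf("  (%.2f%%)\n", 100.0 * c->run / c->ena);
}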

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 tools/perf/builtin-stat.c | 37 ++---
 tools/perf/util/counts.h  | 19 +++
 tools/perf/util/evsel.c   | 44 +---
 tools/perf/util/evsel.h   |  8 ++--
 tools/perf/util/stat.c| 35 +++
 5 files changed, 95 insertions(+), 48 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index a4e5610..f1c2166 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -63,6 +63,7 @@
 #include "util/tool.h"
 #include "asm/bug.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -290,10 +291,8 @@ static int read_counter(struct perf_evsel *counter)
 
count = perf_counts(counter->counts, cpu, thread);
if (perf_evsel__read(counter, cpu, thread, count)) {
-   counter->counts->scaled = -1;
-   perf_counts(counter->counts, cpu, thread)->ena 
= 0;
-   perf_counts(counter->counts, cpu, thread)->run 
= 0;
-   return -1;
+   /* do not write stat for failed reads. */
+   continue;
}
 
if (STAT_RECORD) {
@@ -668,12 +667,16 @@ static int run_perf_stat(int argc, const char **argv)
 
 static void print_running(u64 run, u64 ena)
 {
+   bool is_na = run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || !ena;
+
if (csv_output) {
-   fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
-   csv_sep,
-   run,
-   csv_sep,
-   ena ? 100.0 * run / ena : 100.0);
+   if (is_na)
+   fprintf(stat_config.output, "%sNA%sNA", csv_sep, 
csv_sep);
+   else
+   fprintf(stat_config.output, "%s%" PRIu64 "%s%.2f",
+   csv_sep, run, csv_sep, 100.0 * run / ena);
+   } else if (is_na) {
+   fprintf(stat_config.output, "  (NA)");
} else if (run != ena) {
fprintf(stat_config.output, "  (%.2f%%)", 100.0 * run / ena);
}
@@ -1046,7 +1049,7 @@ static void printout(int id, int nr, struct perf_evsel 
*counter, double uval,
if (counter->cgrp)
os.nfields++;
}
-   if (run == 0 || ena == 0 || counter->counts->scaled == -1) {
+   if (run == PERF_COUNTS_NA || ena == PERF_COUNTS_NA || 
counter->counts->scaled == -1) {
if (metric_only) {
pm(&os, NULL, "", "", 0);
return;
@@ -1152,12 +1155,17 @@ static void print_aggr(char *prefix)
id = aggr_map->map[s];
first = true;
evlist__for_each(evsel_list, counter) {
+   bool all_nan = true;
val = ena = run = 0;
nr = 0;
for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); 
cpu++) {
s2 = aggr_get_id(perf_evsel__cpus(counter), 
cpu);
if (s2 != id)
continue;
+   /* skip NA reads. */
+   if 
(perf_counts_values__is_na(perf_counts(counter->counts, cpu, 0)))
+   continue;
+   all_nan = false;
val += perf_counts(counter->counts, cpu, 
0)->val;
ena += perf_counts(counter->counts, cpu, 
0)->ena;
run += perf_counts(counter->counts, cpu, 
0)->run;
@@ -1171,6 +1179,10 @@ static void print_aggr(char *prefix)
fprintf(output, "%s", prefix);
 
uval = val * counter->scale;
+   if (all_nan) {
+   run = PERF_COUNTS_NA;
+   ena = PERF_COUNTS_NA;
+   }
printout(id, nr, counter, uval, prefix, run, ena, 1.0);

[PATCH 29/32] perf,perf/x86,perf/powerpc,perf/arm,perf/*: add int error return to pmu::read

2016-04-28 Thread David Carrillo-Cisneros
New PMUs, such as CQM's, do not guarantee that a read will succeed even
if pmu::add was successful.

In the generic code, this patch adds an int error return and completes the
error checking path up to perf_read().

In the CQM PMU, it adds proper handling of hw read failures.
In other PMUs, pmu::read() simply returns 0.
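
For illustration only (not part of the patch): the shape of the signature
change and how a caller propagates the error. Types are simplified
stand-ins for struct pmu and struct perf_event.

struct sketch_event;

struct sketch_pmu {
        /* was: void (*read)(struct sketch_event *event); */
        int (*read)(struct sketch_event *event);
};

/* Generic-code caller: a failed hardware read is no longer silently
 * treated as a successful count of a stale value. */
static int sketch_perf_read(struct sketch_pmu *pmu, struct sketch_event *event)
{
        int err = pmu->read(event);

        if (err)
                return err;     /* bubbles up to the read() syscall */
        /* ... update times, sum up group siblings, etc. ... */
        return 0;
}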

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/alpha/kernel/perf_event.c   |  3 +-
 arch/arc/kernel/perf_event.c |  3 +-
 arch/arm64/include/asm/hw_breakpoint.h   |  2 +-
 arch/arm64/kernel/hw_breakpoint.c|  3 +-
 arch/metag/kernel/perf/perf_event.c  |  5 ++-
 arch/mips/kernel/perf_event_mipsxx.c |  3 +-
 arch/powerpc/include/asm/hw_breakpoint.h |  2 +-
 arch/powerpc/kernel/hw_breakpoint.c  |  3 +-
 arch/powerpc/perf/core-book3s.c  | 11 +++---
 arch/powerpc/perf/core-fsl-emb.c |  5 ++-
 arch/powerpc/perf/hv-24x7.c  |  5 ++-
 arch/powerpc/perf/hv-gpci.c  |  3 +-
 arch/s390/kernel/perf_cpum_cf.c  |  5 ++-
 arch/s390/kernel/perf_cpum_sf.c  |  3 +-
 arch/sh/include/asm/hw_breakpoint.h  |  2 +-
 arch/sh/kernel/hw_breakpoint.c   |  3 +-
 arch/sparc/kernel/perf_event.c   |  2 +-
 arch/tile/kernel/perf_event.c|  3 +-
 arch/x86/events/amd/ibs.c|  2 +-
 arch/x86/events/amd/iommu.c  |  5 ++-
 arch/x86/events/amd/uncore.c |  3 +-
 arch/x86/events/core.c   |  3 +-
 arch/x86/events/intel/bts.c  |  3 +-
 arch/x86/events/intel/cqm.c  | 30 --
 arch/x86/events/intel/cstate.c   |  3 +-
 arch/x86/events/intel/pt.c   |  3 +-
 arch/x86/events/intel/rapl.c |  3 +-
 arch/x86/events/intel/uncore.c   |  3 +-
 arch/x86/events/intel/uncore.h   |  2 +-
 arch/x86/events/msr.c|  3 +-
 arch/x86/include/asm/hw_breakpoint.h |  2 +-
 arch/x86/kernel/hw_breakpoint.c  |  3 +-
 arch/x86/kvm/pmu.h   | 10 +++--
 drivers/bus/arm-cci.c|  3 +-
 drivers/bus/arm-ccn.c|  3 +-
 drivers/perf/arm_pmu.c   |  3 +-
 include/linux/perf_event.h   |  6 +--
 kernel/events/core.c | 68 +---
 38 files changed, 141 insertions(+), 86 deletions(-)

diff --git a/arch/alpha/kernel/perf_event.c b/arch/alpha/kernel/perf_event.c
index 5c218aa..3bf8a60 100644
--- a/arch/alpha/kernel/perf_event.c
+++ b/arch/alpha/kernel/perf_event.c
@@ -520,11 +520,12 @@ static void alpha_pmu_del(struct perf_event *event, int 
flags)
 }
 
 
-static void alpha_pmu_read(struct perf_event *event)
+static int alpha_pmu_read(struct perf_event *event)
 {
struct hw_perf_event *hwc = &event->hw;
 
alpha_perf_event_update(event, hwc, hwc->idx, 0);
+   return 0;
 }
 
 
diff --git a/arch/arc/kernel/perf_event.c b/arch/arc/kernel/perf_event.c
index 8b134cf..6e4f819 100644
--- a/arch/arc/kernel/perf_event.c
+++ b/arch/arc/kernel/perf_event.c
@@ -116,9 +116,10 @@ static void arc_perf_event_update(struct perf_event *event,
local64_sub(delta, &hwc->period_left);
 }
 
-static void arc_pmu_read(struct perf_event *event)
+static int arc_pmu_read(struct perf_event *event)
 {
arc_perf_event_update(event, &event->hw, event->hw.idx);
+   return 0;
 }
 
 static int arc_pmu_cache_event(u64 config)
diff --git a/arch/arm64/include/asm/hw_breakpoint.h 
b/arch/arm64/include/asm/hw_breakpoint.h
index 115ea2a..869ce97 100644
--- a/arch/arm64/include/asm/hw_breakpoint.h
+++ b/arch/arm64/include/asm/hw_breakpoint.h
@@ -126,7 +126,7 @@ extern int hw_breakpoint_exceptions_notify(struct 
notifier_block *unused,
 
 extern int arch_install_hw_breakpoint(struct perf_event *bp);
 extern void arch_uninstall_hw_breakpoint(struct perf_event *bp);
-extern void hw_breakpoint_pmu_read(struct perf_event *bp);
+extern int hw_breakpoint_pmu_read(struct perf_event *bp);
 extern int hw_breakpoint_slots(int type);
 
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/arch/arm64/kernel/hw_breakpoint.c 
b/arch/arm64/kernel/hw_breakpoint.c
index 4ef5373..ac1a6ca 100644
--- a/arch/arm64/kernel/hw_breakpoint.c
+++ b/arch/arm64/kernel/hw_breakpoint.c
@@ -942,8 +942,9 @@ static int __init arch_hw_breakpoint_init(void)
 }
 arch_initcall(arch_hw_breakpoint_init);
 
-void hw_breakpoint_pmu_read(struct perf_event *bp)
+int hw_breakpoint_pmu_read(struct perf_event *bp)
 {
+   return 0;
 }
 
 /*
diff --git a/arch/metag/kernel/perf/perf_event.c 
b/arch/metag/kernel/perf/perf_event.c
index 2478ec6..9721c1a 100644
--- a/arch/metag/kernel/perf/perf_event.c
+++ b/arch/metag/kernel/perf/perf_event.c
@@ -360,15 +360,16 @@ static void metag_pmu_del(struct perf_event *event, int 
flags)
perf_event_update_userpage(event);
 }
 
-static void metag_pmu_read(struct perf_event *event)
+static int metag_pmu_read(struct perf_event *event

[PATCH 30/32] perf,perf/x86: add hook perf_event_arch_exec

2016-04-28 Thread David Carrillo-Cisneros
perf_event context switches events to newly exec'ed tasks using
perf_event_exec. Add a hook to that path.

In x86, perf_event_arch_exec is used to synchronize the software
cache of the PQR_ASSOC msr, setting the right RMID for the new task.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/include/asm/perf_event.h | 2 ++
 include/linux/perf_event.h| 5 +
 kernel/events/core.c  | 1 +
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/perf_event.h 
b/arch/x86/include/asm/perf_event.h
index 99fc206..c13f501 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -332,6 +332,8 @@ extern struct cftype perf_event_cgrp_arch_subsys_cftypes[];
.dfl_cftypes = perf_event_cgrp_arch_subsys_cftypes, \
.legacy_cftypes = perf_event_cgrp_arch_subsys_cftypes,
 
+#define perf_event_arch_exec pqr_update
+
 #else
 
 #define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9c973bd..99b4393 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1289,4 +1289,9 @@ static struct device_attribute format_attr_##_name = 
__ATTR_RO(_name)
 #define PERF_CGROUP_ARCH_CGRP_SUBSYS_ATTS
 #endif
 
+#ifndef perf_event_arch_exec
+#define perf_event_arch_exec() do { } while (0)
+#endif
+
+
 #endif /* _LINUX_PERF_EVENT_H */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index cfffa50..5c675b4 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3248,6 +3248,7 @@ void perf_event_exec(void)
for_each_task_context_nr(ctxn)
perf_event_enable_on_exec(ctxn);
rcu_read_unlock();
+   perf_event_arch_exec();
 }
 
 struct perf_read_data {
-- 
2.8.0.rc3.226.g39d4020



[PATCH 31/32] perf/stat: fix bug in handling events in error state

2016-04-28 Thread David Carrillo-Cisneros
From: Stephane Eranian 

When an event is in error state, read() returns 0
instead of the size of the buffer. In certain modes, such
as interval printing, ignoring the 0 return value
may cause bogus count deltas to be computed and
thus invalid results printed.

This patch fixes the problem by modifying read_counters()
to mark the event as not scaled (scaled = -1) to force
the printout routine to show <not counted>.
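
For illustration only (not part of the patch): the failure mode in plain C.
An event in error state makes read() return 0 bytes rather than -1, so a
"< 0" check would happily reuse the stale buffer contents.

#include <stdint.h>
#include <unistd.h>

struct counts { uint64_t val, ena, run; };

static int read_one_counter(int fd, struct counts *c)
{
        ssize_t n = read(fd, c, sizeof(*c));

        if (n <= 0)             /* 0 bytes: event is in error state */
                return -1;      /* caller marks the counter unscaled */
        return 0;
}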

Signed-off-by: Stephane Eranian 
---
 tools/perf/builtin-stat.c | 12 +---
 tools/perf/util/evsel.c   |  4 ++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 1f19f2f..a4e5610 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -289,8 +289,12 @@ static int read_counter(struct perf_evsel *counter)
struct perf_counts_values *count;
 
count = perf_counts(counter->counts, cpu, thread);
-   if (perf_evsel__read(counter, cpu, thread, count))
+   if (perf_evsel__read(counter, cpu, thread, count)) {
+   counter->counts->scaled = -1;
+   perf_counts(counter->counts, cpu, thread)->ena 
= 0;
+   perf_counts(counter->counts, cpu, thread)->run 
= 0;
return -1;
+   }
 
if (STAT_RECORD) {
if (perf_evsel__write_stat_event(counter, cpu, 
thread, count)) {
@@ -307,12 +311,14 @@ static int read_counter(struct perf_evsel *counter)
 static void read_counters(bool close_counters)
 {
struct perf_evsel *counter;
+   int ret;
 
evlist__for_each(evsel_list, counter) {
-   if (read_counter(counter))
+   ret = read_counter(counter);
+   if (ret)
pr_debug("failed to read counter %s\n", counter->name);
 
-   if (perf_stat_process_counter(&stat_config, counter))
+   if (ret == 0 && perf_stat_process_counter(&stat_config, 
counter))
pr_warning("failed to process counter %s\n", 
counter->name);
 
if (close_counters) {
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 545bb3f..52a0c35 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1150,7 +1150,7 @@ int perf_evsel__read(struct perf_evsel *evsel, int cpu, 
int thread,
if (FD(evsel, cpu, thread) < 0)
return -EINVAL;
 
-   if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) < 0)
+   if (readn(FD(evsel, cpu, thread), count, sizeof(*count)) <= 0)
return -errno;
 
return 0;
@@ -1168,7 +1168,7 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
if (evsel->counts == NULL && perf_evsel__alloc_counts(evsel, cpu + 1, 
thread + 1) < 0)
return -ENOMEM;
 
-   if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) < 0)
+   if (readn(FD(evsel, cpu, thread), &count, nv * sizeof(u64)) <= 0)
return -errno;
 
perf_evsel__compute_deltas(evsel, cpu, thread, &count);
-- 
2.8.0.rc3.226.g39d4020



[PATCH 26/32] perf/x86/intel/cqm: integrate CQM cgroups with scheduler

2016-04-28 Thread David Carrillo-Cisneros
Allow monitored cgroups to update the PQR MSR during task switch even
without an associated perf_event.

The package RMID for the current monr associated with a monitored
cgroup is written to hw during task switch (after perf_event's context
switch has run) if perf_event did not already write an event's RMID.

perf_event and any other caller of pqr_cache_update_rmid can update the
CPU's RMID using one of two modes:
  - PQR_RMID_MODE_NOEVENT: A RMID that does not correspond to an event.
e.g. the RMID of the root pmonr when no event is scheduled.
  - PQR_RMID_MODE_EVENT:   A RMID used by an event. Set during pmu::add,
unset on pmu::del. This mode prevents a non-event cgroup RMID
from being used.

This patch also introduces caching of writes to the PQR MSR within the
per-cpu pqr state variable. This interface for updating RMIDs and CLOSIDs
will also be utilized in upcoming versions of Intel's MBM and CAT drivers.
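
For illustration only (not part of the patch): a sketch of the cached PQR
update described above. sketch_wrmsr() is a hypothetical stand-in for the
MSR_IA32_PQR_ASSOC write.

enum pqr_rmid_mode { PQR_RMID_MODE_NOEVENT, PQR_RMID_MODE_EVENT };

struct sketch_pqr_state {
        unsigned int rmid;              /* value currently in the MSR */
        unsigned int next_rmid;         /* value requested this switch */
        enum pqr_rmid_mode mode;
};

extern void sketch_wrmsr(unsigned int rmid);    /* stand-in */

/* Called by perf (mode == EVENT) or by the no-event cgroup path
 * (mode == NOEVENT); an event RMID is never overridden by a cgroup one. */
static void pqr_cache_set(struct sketch_pqr_state *s, unsigned int rmid,
                          enum pqr_rmid_mode mode)
{
        if (mode == PQR_RMID_MODE_NOEVENT && s->mode == PQR_RMID_MODE_EVENT)
                return;
        s->next_rmid = rmid;
        s->mode = mode;
}

/* Called once at the end of the task switch: one (slow) MSR write at most,
 * and only if the value actually changed. */
static void pqr_flush(struct sketch_pqr_state *s)
{
        if (s->next_rmid == s->rmid)
                return;
        s->rmid = s->next_rmid;
        sketch_wrmsr(s->rmid);
}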

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c   | 65 +--
 arch/x86/events/intel/cqm.h   |  2 --
 arch/x86/include/asm/pqr_common.h | 53 +++
 arch/x86/kernel/cpu/pqr_common.c  | 46 +++
 4 files changed, 135 insertions(+), 31 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index daf9fdf..4ece0a4 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -198,19 +198,6 @@ static inline int cqm_prmid_update(struct prmid *prmid)
return __cqm_prmid_update(prmid, __rmid_min_update_time);
 }
 
-/*
- * Updates caller cpu's cache.
- */
-static inline void __update_pqr_prmid(struct prmid *prmid)
-{
-   struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-
-   if (state->rmid == prmid->rmid)
-   return;
-   state->rmid = prmid->rmid;
-   wrmsr(MSR_IA32_PQR_ASSOC, prmid->rmid, state->closid);
-}
-
 static inline bool __valid_pkg_id(u16 pkg_id)
 {
return pkg_id < PQR_MAX_NR_PKGS;
@@ -2531,12 +2518,11 @@ static inline bool cqm_group_leader(struct perf_event 
*event)
 static inline void __intel_cqm_event_start(
struct perf_event *event, union prmid_summary summary)
 {
-   u16 pkg_id = topology_physical_package_id(smp_processor_id());
if (!(event->hw.state & PERF_HES_STOPPED))
return;
-
event->hw.state &= ~PERF_HES_STOPPED;
-   __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
+
+   pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_EVENT);
 }
 
 static void intel_cqm_event_start(struct perf_event *event, int mode)
@@ -2566,7 +2552,7 @@ static void intel_cqm_event_stop(struct perf_event 
*event, int mode)
/* Occupancy of CQM events is obtained at read. No need to read
 * when event is stopped since read on inactive cpus succeed.
 */
-   __update_pqr_prmid(__prmid_from_rmid(pkg_id, summary.sched_rmid));
+   pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
 }
 
 static int intel_cqm_event_add(struct perf_event *event, int mode)
@@ -2977,6 +2963,8 @@ static void intel_cqm_cpu_starting(unsigned int cpu)
 
state->rmid = 0;
state->closid = 0;
+   state->next_rmid = 0;
+   state->next_closid = 0;
 
/* XXX: lock */
/* XXX: Make sure this case is handled when hotplug happens. */
@@ -3152,6 +3140,12 @@ static int __init intel_cqm_init(void)
pr_info("Intel CQM monitoring enabled with at least %u rmids per 
package.\n",
min_max_rmid + 1);
 
+   /* Make sure pqr_common_enable_key is enabled after
+* cqm_initialized_key.
+*/
+   barrier();
+
+   static_branch_enable(&pqr_common_enable_key);
return ret;
 
 error_init_mutex:
@@ -3163,4 +3157,41 @@ error:
return ret;
 }
 
+/* Schedule task without a CQM perf_event. */
+inline void __intel_cqm_no_event_sched_in(void)
+{
+#ifdef CONFIG_CGROUP_PERF
+   struct monr *monr;
+   struct pmonr *pmonr;
+   union prmid_summary summary;
+   u16 pkg_id = topology_physical_package_id(smp_processor_id());
+   struct pmonr *root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
+
+   /* Assume CQM enabled is likely given that PQR is enabled. */
+   if (!static_branch_likely(&cqm_initialized_key))
+   return;
+
+   /* Safe to call from_task since we are in scheduler lock. */
+   monr = monr_from_perf_cgroup(perf_cgroup_from_task(current, NULL));
+   pmonr = monr->pmonrs[pkg_id];
+
+   /* Utilize most up to date pmonr summary. */
+   monr_hrchy_get_next_prmid_summary(pmonr);
+   summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+
+   if (!prmid_summary__is_mon_active(summary))
+   goto no_rmid;
+
+   if (WARN_ON_ONCE(!__valid_rmid(pkg_id, summary.sched_rmid)))
+   goto no_rmid;
+
+   pqr_cache_update_rmid(summary.sched_rmid, PQR_RMID_MODE_NOEVENT);
+

[PATCH 25/32] sched: introduce the finish_arch_pre_lock_switch() scheduler hook

2016-04-28 Thread David Carrillo-Cisneros
This hook allows architecture-specific code to be called at the end of
the task switch and after perf_events' context switch but before the
scheduler lock is released.

The specific use case in this series is to avoid multiple writes to a slow
MSR until all functions that modify that register during the task switch have
finished.
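
For illustration only: the opt-out-by-default pattern the new hook follows,
mirroring the existing finish_arch_post_lock_switch hook. An architecture
opts in by defining the macro itself (x86 maps it to pqr_update later in
this series).

/* Default picked up by kernel/sched/sched.h when the arch does not define
 * the hook; the call in finish_task_switch() then compiles away. */
#ifndef finish_arch_pre_lock_switch
# define finish_arch_pre_lock_switch() do { } while (0)
#endif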

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/include/asm/processor.h | 4 
 kernel/sched/core.c  | 1 +
 kernel/sched/sched.h | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 9264476..036d94a 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -22,6 +22,7 @@ struct vm86;
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -841,4 +842,7 @@ bool xen_set_default_idle(void);
 
 void stop_this_cpu(void *dummy);
 void df_debug(struct pt_regs *regs, long error_code);
+
+#define finish_arch_pre_lock_switch pqr_update
+
 #endif /* _ASM_X86_PROCESSOR_H */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fc..bcd5473 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2620,6 +2620,7 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
prev_state = prev->state;
vtime_task_switch(prev);
perf_event_task_sched_in(prev, current);
+   finish_arch_pre_lock_switch();
finish_lock_switch(rq, prev);
finish_arch_post_lock_switch();
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec2e8d2..cb48b5c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1077,6 +1077,9 @@ static inline int task_on_rq_migrating(struct task_struct 
*p)
 #ifndef prepare_arch_switch
 # define prepare_arch_switch(next) do { } while (0)
 #endif
+#ifndef finish_arch_pre_lock_switch
+# define finish_arch_pre_lock_switch() do { } while (0)
+#endif
 #ifndef finish_arch_post_lock_switch
 # define finish_arch_post_lock_switch()do { } while (0)
 #endif
-- 
2.8.0.rc3.226.g39d4020



[PATCH 09/32] perf/x86/intel/cqm: add per-package RMIDs, data and locks

2016-04-28 Thread David Carrillo-Cisneros
First part of new CQM logic. This patch introduces the struct pkg_data
that contains all per-package CQM data required by the new RMID hierarchy.

The raw RMID value is encapsulated in a Package RMID (prmid) structure
that provides atomic updates and caches recent reads. This caching
throttles the frequency at which (slow) hardware reads are performed and
ameliorates the impact of the worst case scenarios while traversing the
hierarchy of RMIDs (hierarchy and operations are introduced in future
patches within this series).

There is a set of prmids per physical package (socket) in the system. Each
package may have different number of prmids (different hw max_rmid_index).

Each package maintains its own pool of prmids (only a free pool as of this
patch, more pools to add in future patches in this series). Also, each
package has its own mutex and lock to protect the RMID pools and rotation
logic. This per-package separation reduces the contention for each lock
and mutex compared with the previous version (with system-wide mutex
and lock).
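
For illustration only (not part of the patch): the per-package layout in
miniature. Field names are simplified; pools and locks are represented by
comments.

struct sketch_prmid {
        unsigned int rmid;              /* raw hardware RMID */
        unsigned long long last_value;  /* cached llc_occupancy read */
        unsigned long last_read;        /* timestamp of that read */
};

struct sketch_pkg_data {
        unsigned int max_rmid;          /* hw max RMID for this package */
        struct sketch_prmid **prmids;   /* indexed by RMID */
        /* free/limbo pools, rotation state, one mutex and one raw spinlock
         * all live here, so contention never crosses package boundaries. */
};

/* One slot per physical package, indexed by
 * topology_physical_package_id(); sized by PQR_MAX_NR_PKGS in the patch. */
#define SKETCH_MAX_NR_PKGS 8
static struct sketch_pkg_data *sketch_pkgs_data[SKETCH_MAX_NR_PKGS];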

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c   | 426 +-
 arch/x86/events/intel/cqm.h   | 154 ++
 arch/x86/include/asm/pqr_common.h |   2 +
 3 files changed, 392 insertions(+), 190 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index f678014..541e515 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -12,7 +12,6 @@
 #define MSR_IA32_QM_CTR0x0c8e
 #define MSR_IA32_QM_EVTSEL 0x0c8d
 
-static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
 #define RMID_VAL_ERROR (1ULL << 63)
@@ -30,39 +29,13 @@ static struct perf_pmu_events_attr event_attr_##v = {   
\
 }
 
 /*
- * Updates caller cpu's cache.
- */
-static inline void __update_pqr_rmid(u32 rmid)
-{
-   struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
-   if (state->rmid == rmid)
-   return;
-   state->rmid = rmid;
-   wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
-}
-
-/*
  * Groups of events that have the same target(s), one RMID per group.
  * Protected by cqm_mutex.
  */
 static LIST_HEAD(cache_groups);
 static DEFINE_MUTEX(cqm_mutex);
-static DEFINE_RAW_SPINLOCK(cache_lock);
 
-/*
- * Mask of CPUs for reading CQM values. We only need one per-socket.
- */
-static cpumask_t cqm_cpumask;
-
-
-/*
- * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
- *
- * This rmid is always free and is guaranteed to have an associated
- * near-zero occupancy value, i.e. no cachelines are tagged with this
- * RMID, once __intel_cqm_rmid_rotate() returns.
- */
-static u32 intel_cqm_rotation_rmid;
+struct pkg_data *cqm_pkgs_data[PQR_MAX_NR_PKGS];
 
 /*
  * Is @rmid valid for programming the hardware?
@@ -82,162 +55,220 @@ static inline bool __rmid_valid(u32 rmid)
 
 static u64 __rmid_read(u32 rmid)
 {
+   /* XXX: Placeholder, will be removed in next patch. */
+   return 0;
+}
+
+/*
+ * Update if enough time has passed since last read.
+ *
+ * Must be called in a cpu in the package where prmid belongs.
+ * This function is safe to be called concurrently since it is guaranteed
+ * that entry->last_read_value is updated to a occupancy value obtained
+ * after the time set in entry->last_read_time .
+ * Return 1 if value was updated, 0 if not, negative number if error.
+ */
+static inline int __cqm_prmid_update(struct prmid *prmid,
+unsigned long jiffies_min_delta)
+{
+   unsigned long now = jiffies;
+   unsigned long last_read_time;
u64 val;
 
/*
-* Ignore the SDM, this thing is _NOTHING_ like a regular perfcnt,
-* it just says that to increase confusion.
+* Shortcut the calculation of elapsed time for the
+* case jiffies_min_delta == 0
 */
-   wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
+   if (jiffies_min_delta > 0) {
+   last_read_time = atomic64_read(&prmid->last_read_time);
+   if (time_after(last_read_time + jiffies_min_delta, now))
+   return 0;
+   }
+
+   wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, prmid->rmid);
rdmsrl(MSR_IA32_QM_CTR, val);
 
/*
-* Aside from the ERROR and UNAVAIL bits, assume this thing returns
-* the number of cachelines tagged with @rmid.
+* Ignore this reading on error states and do not update the value.
 */
-   return val;
-}
+   WARN_ON_ONCE(val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL));
+   if (val & RMID_VAL_ERROR)
+   return -EINVAL;
+   if (val & RMID_VAL_UNAVAIL)
+   return -ENODATA;
 
-enum rmid_recycle_state {
-   RMID_YOUNG = 0,
-   RMID_AVAILABLE,
-   RMID_DIRTY,
-};
+   atomic64_set(&prmid

[PATCH 03/32] perf/x86/intel/cqm: remove all code for rotation of RMIDs

2016-04-28 Thread David Carrillo-Cisneros
Remove the existing RMID rotation code in preparation for future patches
that will introduce a per-package rotation of RMIDs.

The new rotation logic follows the same ideas as the rotation logic being
removed, but takes advantage of the per-package RMID design and more
detailed bookkeeping to guarantee that user SLOs are met.
It also avoids IPIs and does not keep an unused rotation RMID in some
cases (as the present version does).

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 371 
 1 file changed, 371 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index a3fde49..3c1e247 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -312,42 +312,6 @@ struct rmid_read {
 static void __intel_cqm_event_count(void *info);
 
 /*
- * Exchange the RMID of a group of events.
- */
-static u32 intel_cqm_xchg_rmid(struct perf_event *group, u32 rmid)
-{
-   struct perf_event *event;
-   struct list_head *head = &group->hw.cqm_group_entry;
-   u32 old_rmid = group->hw.cqm_rmid;
-
-   lockdep_assert_held(&cache_mutex);
-
-   /*
-* If our RMID is being deallocated, perform a read now.
-*/
-   if (__rmid_valid(old_rmid) && !__rmid_valid(rmid)) {
-   struct rmid_read rr = {
-   .value = ATOMIC64_INIT(0),
-   .rmid = old_rmid,
-   };
-
-   on_each_cpu_mask(&cqm_cpumask, __intel_cqm_event_count,
-&rr, 1);
-   local64_set(&group->count, atomic64_read(&rr.value));
-   }
-
-   raw_spin_lock_irq(&cache_lock);
-
-   group->hw.cqm_rmid = rmid;
-   list_for_each_entry(event, head, hw.cqm_group_entry)
-   event->hw.cqm_rmid = rmid;
-
-   raw_spin_unlock_irq(&cache_lock);
-
-   return old_rmid;
-}
-
-/*
  * If we fail to assign a new RMID for intel_cqm_rotation_rmid because
  * cachelines are still tagged with RMIDs in limbo, we progressively
  * increment the threshold until we find an RMID in limbo with <=
@@ -364,44 +328,6 @@ static unsigned int __intel_cqm_threshold;
 static unsigned int __intel_cqm_max_threshold;
 
 /*
- * Test whether an RMID has a zero occupancy value on this cpu.
- */
-static void intel_cqm_stable(void *arg)
-{
-   struct cqm_rmid_entry *entry;
-
-   list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
-   if (entry->state != RMID_AVAILABLE)
-   break;
-
-   if (__rmid_read(entry->rmid) > __intel_cqm_threshold)
-   entry->state = RMID_DIRTY;
-   }
-}
-
-static bool intel_cqm_sched_in_event(u32 rmid)
-{
-   struct perf_event *leader, *event;
-
-   lockdep_assert_held(&cache_mutex);
-
-   leader = list_first_entry(&cache_groups, struct perf_event,
- hw.cqm_groups_entry);
-   event = leader;
-
-   list_for_each_entry_continue(event, &cache_groups,
-hw.cqm_groups_entry) {
-   if (__rmid_valid(event->hw.cqm_rmid))
-   continue;
-
-   intel_cqm_xchg_rmid(event, rmid);
-   return true;
-   }
-
-   return false;
-}
-
-/*
  * Initially use this constant for both the limbo queue time and the
  * rotation timer interval, pmu::hrtimer_interval_ms.
  *
@@ -411,291 +337,8 @@ static bool intel_cqm_sched_in_event(u32 rmid)
  */
 #define RMID_DEFAULT_QUEUE_TIME 250/* ms */
 
-static unsigned int __rmid_queue_time_ms = RMID_DEFAULT_QUEUE_TIME;
-
-/*
- * intel_cqm_rmid_stabilize - move RMIDs from limbo to free list
- * @nr_available: number of freeable RMIDs on the limbo list
- *
- * Quiescent state; wait for all 'freed' RMIDs to become unused, i.e. no
- * cachelines are tagged with those RMIDs. After this we can reuse them
- * and know that the current set of active RMIDs is stable.
- *
- * Return %true or %false depending on whether stabilization needs to be
- * reattempted.
- *
- * If we return %true then @nr_available is updated to indicate the
- * number of RMIDs on the limbo list that have been queued for the
- * minimum queue time (RMID_AVAILABLE), but whose data occupancy values
- * are above __intel_cqm_threshold.
- */
-static bool intel_cqm_rmid_stabilize(unsigned int *available)
-{
-   struct cqm_rmid_entry *entry, *tmp;
-
-   lockdep_assert_held(&cache_mutex);
-
-   *available = 0;
-   list_for_each_entry(entry, &cqm_rmid_limbo_lru, list) {
-   unsigned long min_queue_time;
-   unsigned long now = jiffies;
-
-   /*
-* We hold RMIDs placed into limbo for a minimum queue
-* time. Before the minimum queue time has elapsed we do
-* not recycle RMIDs.
-*
-* The reasoning is that until a sufficient time has
-* passed since we stopped usin

[PATCH 22/32] perf/x86/intel/cqm: introduce read_subtree

2016-04-28 Thread David Carrillo-Cisneros
Read llc_occupancy for cgroups by adding the occupancy of all pmonrs
that have a read RMID along the subtree rooted at the event's monr, in
the pmonr hierarchy of the event's package.

The RMID to read for a monr is the same as its RMID to schedule in hw if
the monr is in (A)state. If in (IL)state, the RMID to read is that of its
limbo_prmid. This reduces the error introduced by (IL)states since the
llc_occupancy of limbo_prmid is a lower bound of its real llc_occupancy.

monrs in (U)state can be safely ignored since they do not have any
occupancy.
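
For illustration only (not part of the patch): the per-pmonr decision made
while summing a subtree, with the states spelled out as an enum. Names are
simplified stand-ins for the prmid_summary checks.

#include <stdbool.h>

#define SKETCH_INVALID_RMID (~0u)

enum pmonr_state { PMONR_U, PMONR_A, PMONR_IL, PMONR_I };

/* Pick the RMID whose occupancy is added for one pmonr, or fail.
 * Returns 0 on success; *rmid == SKETCH_INVALID_RMID means "counts as 0". */
static int pick_read_rmid(enum pmonr_state state, unsigned int sched_rmid,
                          unsigned int limbo_rmid, bool fail_on_inherited,
                          unsigned int *rmid)
{
        *rmid = SKETCH_INVALID_RMID;

        switch (state) {
        case PMONR_U:           /* never ran in this package: no occupancy */
                return 0;
        case PMONR_A:           /* active: read the RMID scheduled in hw */
                *rmid = sched_rmid;
                return 0;
        case PMONR_IL:          /* inherited with a limbo prmid: its occupancy
                                 * is a lower bound of the real one */
                if (fail_on_inherited)
                        return -1;
                *rmid = limbo_rmid;
                return 0;
        default:                /* PMONR_I: inherited, nothing safe to read */
                return fail_on_inherited ? -1 : 0;
        }
}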

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 218 ++--
 1 file changed, 211 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 6e85021..c14f1c7 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -2305,18 +2305,222 @@ intel_cqm_setup_event(struct perf_event *event, struct 
perf_event **group)
return monr_hrchy_attach_event(event);
 }
 
+static struct monr *
+monr_next_child(struct monr *pos, struct monr *parent)
+{
+#ifdef CONFIG_LOCKDEP
+   WARN_ON(!monr_hrchy_count_held_raw_spin_locks());
+#endif
+   if (!pos)
+   return list_first_entry_or_null(
+   &parent->children, struct monr, parent_entry);
+   if (list_is_last(&pos->parent_entry, &parent->children))
+   return NULL;
+   return list_next_entry(pos, parent_entry);
+}
+
+static struct monr *
+monr_next_descendant_pre(struct monr *pos, struct monr *root)
+{
+   struct monr *next;
+
+#ifdef CONFIG_LOCKDEP
+   WARN_ON(!monr_hrchy_count_held_raw_spin_locks());
+#endif
+   if (!pos)
+   return root;
+   next = monr_next_child(NULL, pos);
+   if (next)
+   return next;
+   while (pos != root) {
+   next = monr_next_child(pos, pos->parent);
+   if (next)
+   return next;
+   pos = pos->parent;
+   }
+   return NULL;
+}
+
+/* Read pmonr's summary, safe to call without pkg's prmids lock.
+ * The possible scenarios are:
+ *  - summary's occupancy cannot be read, return -1.
+ *  - summary has no RMID but could be read as zero occupancy, return 0 and set
+ *rmid = INVALID_RMID.
+ *  - summary has valid read RMID, set rmid to it.
+ */
+static inline int
+pmonr__get_read_rmid(struct pmonr *pmonr, u32 *rmid, bool fail_on_inherited)
+{
+   union prmid_summary summary;
+
+   *rmid = INVALID_RMID;
+
+   summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+   /* A pmonr in (I)state that doesn't fail can report its limbo_prmid
+* or NULL.
+*/
+   if (prmid_summary__is_istate(summary) && fail_on_inherited)
+   return -1;
+   /* A pmonr with inactive monitoring can be safely ignored. */
+   if (!prmid_summary__is_mon_active(summary))
+   return 0;
+
+   /* A pmonr that hasn't run in a pkg is safe to ignore since it
+* cannot have occupancy there.
+*/
+   if (prmid_summary__is_ustate(summary))
+   return 0;
+   /* At this point the pmonr is either in (A)state or (I)state
+* with fail_on_inherited=false . In the latter case,
+* read_rmid is INVALID_RMID and is a successful read_rmid.
+*/
+   *rmid = summary.read_rmid;
+   return 0;
+}
+
+/* Read occupancy for all pmonrs in the subtree rooted at monr
+ * for the current package.
+ * Best effort two-stages read. First, obtain all RMIDs in subtree
+ * with locks held. The rmids are added to stack. If stack is full
+ * proceed to update and read in place. After finish storing the RMIDs,
+ * update and read occupancy for rmids in stack.
+ */
+static int pmonr__read_subtree(struct monr *monr, u16 pkg_id,
+  u64 *total, bool fail_on_inh_descendant)
+{
+   struct monr *pos = NULL;
+   struct astack astack;
+   int ret;
+   unsigned long flags;
+   u64 count;
+   struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
+
+   *total = 0;
+   /* Must run in a CPU in the package to read. */
+   if (WARN_ON_ONCE(pkg_id !=
+topology_physical_package_id(smp_processor_id(
+   return -1;
+
+   astack__init(&astack, NR_RMIDS_PER_NODE - 1, pkg_id);
+
+   /* Lock to protect againsts changes in pmonr hierarchy. */
+   raw_spin_lock_irqsave_nested(&pkg_data->pkg_data_lock, flags, pkg_id);
+
+   while ((pos = monr_next_descendant_pre(pos, monr))) {
+   struct prmid *prmid;
+   u32 rmid;
+   /* the pmonr of the monr to read cannot be inherited,
+* descendants may, depending on flag.
+*/
+   bool fail_on_inh = pos == monr || fail_on_inh_descendant;
+
+   ret = pmonr__get_read_rmid(pos->pmonrs[pkg_id],
+  

[PATCH 02/32] perf/x86/intel/cqm: remove check for conflicting events

2016-04-28 Thread David Carrillo-Cisneros
The new version of Intel's CQM uses an RMID hierarchy to avoid conflicts
between cpu, cgroup and task events, making it unnecessary to check and
resolve conflicts between events of different types (i.e. cgroup vs task).

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 148 
 1 file changed, 148 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 1b064c4..a3fde49 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -304,92 +304,6 @@ static inline struct perf_cgroup *event_to_cgroup(struct 
perf_event *event)
 }
 #endif
 
-/*
- * Determine if @a's tasks intersect with @b's tasks
- *
- * There are combinations of events that we explicitly prohibit,
- *
- *PROHIBITS
- * system-wide->   cgroup and task
- * cgroup->system-wide
- *   ->task in cgroup
- * task  ->system-wide
- *   ->task in cgroup
- *
- * Call this function before allocating an RMID.
- */
-static bool __conflict_event(struct perf_event *a, struct perf_event *b)
-{
-#ifdef CONFIG_CGROUP_PERF
-   /*
-* We can have any number of cgroups but only one system-wide
-* event at a time.
-*/
-   if (a->cgrp && b->cgrp) {
-   struct perf_cgroup *ac = a->cgrp;
-   struct perf_cgroup *bc = b->cgrp;
-
-   /*
-* This condition should have been caught in
-* __match_event() and we should be sharing an RMID.
-*/
-   WARN_ON_ONCE(ac == bc);
-
-   if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
-   cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
-   return true;
-
-   return false;
-   }
-
-   if (a->cgrp || b->cgrp) {
-   struct perf_cgroup *ac, *bc;
-
-   /*
-* cgroup and system-wide events are mutually exclusive
-*/
-   if ((a->cgrp && !(b->attach_state & PERF_ATTACH_TASK)) ||
-   (b->cgrp && !(a->attach_state & PERF_ATTACH_TASK)))
-   return true;
-
-   /*
-* Ensure neither event is part of the other's cgroup
-*/
-   ac = event_to_cgroup(a);
-   bc = event_to_cgroup(b);
-   if (ac == bc)
-   return true;
-
-   /*
-* Must have cgroup and non-intersecting task events.
-*/
-   if (!ac || !bc)
-   return false;
-
-   /*
-* We have cgroup and task events, and the task belongs
-* to a cgroup. Check for for overlap.
-*/
-   if (cgroup_is_descendant(ac->css.cgroup, bc->css.cgroup) ||
-   cgroup_is_descendant(bc->css.cgroup, ac->css.cgroup))
-   return true;
-
-   return false;
-   }
-#endif
-   /*
-* If one of them is not a task, same story as above with cgroups.
-*/
-   if (!(a->attach_state & PERF_ATTACH_TASK) ||
-   !(b->attach_state & PERF_ATTACH_TASK))
-   return true;
-
-   /*
-* Must be non-overlapping.
-*/
-   return false;
-}
-
 struct rmid_read {
u32 rmid;
atomic64_t value;
@@ -465,10 +379,6 @@ static void intel_cqm_stable(void *arg)
}
 }
 
-/*
- * If we have group events waiting for an RMID that don't conflict with
- * events already running, assign @rmid.
- */
 static bool intel_cqm_sched_in_event(u32 rmid)
 {
struct perf_event *leader, *event;
@@ -484,9 +394,6 @@ static bool intel_cqm_sched_in_event(u32 rmid)
if (__rmid_valid(event->hw.cqm_rmid))
continue;
 
-   if (__conflict_event(event, leader))
-   continue;
-
intel_cqm_xchg_rmid(event, rmid);
return true;
}
@@ -592,10 +499,6 @@ static bool intel_cqm_rmid_stabilize(unsigned int 
*available)
continue;
}
 
-   /*
-* If we have groups waiting for RMIDs, hand
-* them one now provided they don't conflict.
-*/
if (intel_cqm_sched_in_event(entry->rmid))
continue;
 
@@ -638,46 +541,8 @@ static void __intel_cqm_pick_and_rotate(struct perf_event 
*next)
 }
 
 /*
- * Deallocate the RMIDs from any events that conflict with @event, and
- * place them on the back of the group list.
- */
-static void intel_cqm_sched_out_conflicting_events(struct perf_event *event)
-{
-   struct perf_event *group, *g;
-   u32 rmid;
-
-   lockdep_assert_held(&cache_mutex);
-
-   list_

[PATCH 06/32] x86/intel,cqm: add CONFIG_INTEL_RDT configuration flag and refactor PQR

2016-04-28 Thread David Carrillo-Cisneros
Add Intel's PQR as its own build target, remove its build dependency
on CQM, and add CONFIG_INTEL_RDT as a configuration flag to build PQR
and all of its related drivers (currently CQM, future: MBM, CAT, CDP).

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/Kconfig  |  6 ++
 arch/x86/events/intel/Makefile|  3 ++-
 arch/x86/events/intel/cqm.c   | 27 +--
 arch/x86/include/asm/pqr_common.h | 31 +++
 arch/x86/kernel/cpu/Makefile  |  4 
 arch/x86/kernel/cpu/pqr_common.c  |  9 +
 include/linux/perf_event.h|  2 ++
 7 files changed, 55 insertions(+), 27 deletions(-)
 create mode 100644 arch/x86/include/asm/pqr_common.h
 create mode 100644 arch/x86/kernel/cpu/pqr_common.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a494fa3..7b81e6a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -160,6 +160,12 @@ config X86
select ARCH_USES_HIGH_VMA_FLAGS if 
X86_INTEL_MEMORY_PROTECTION_KEYS
select ARCH_HAS_PKEYS   if 
X86_INTEL_MEMORY_PROTECTION_KEYS
 
+config INTEL_RDT
+   def_bool y
+   depends on PERF_EVENTS && CPU_SUP_INTEL
+   ---help---
+   Enable Resource Director Technology for Intel Xeon Microprocessors.
+
 config INSTRUCTION_DECODER
def_bool y
depends on KPROBES || PERF_EVENTS || UPROBES
diff --git a/arch/x86/events/intel/Makefile b/arch/x86/events/intel/Makefile
index 3660b2c..7e610bf 100644
--- a/arch/x86/events/intel/Makefile
+++ b/arch/x86/events/intel/Makefile
@@ -1,4 +1,4 @@
-obj-$(CONFIG_CPU_SUP_INTEL)+= core.o bts.o cqm.o
+obj-$(CONFIG_CPU_SUP_INTEL)+= core.o bts.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= ds.o knc.o
 obj-$(CONFIG_CPU_SUP_INTEL)+= lbr.o p4.o p6.o pt.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL)   += intel-rapl.o
@@ -7,3 +7,4 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE)  += intel-uncore.o
 intel-uncore-objs  := uncore.o uncore_nhmex.o uncore_snb.o 
uncore_snbep.o
 obj-$(CONFIG_PERF_EVENTS_INTEL_CSTATE) += intel-cstate.o
 intel-cstate-objs  := cstate.o
+obj-$(CONFIG_INTEL_RDT)+= cqm.o
diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index afd60dd..8457dd0 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -7,40 +7,15 @@
 #include 
 #include 
 #include 
+#include 
 #include "../perf_event.h"
 
-#define MSR_IA32_PQR_ASSOC 0x0c8f
 #define MSR_IA32_QM_CTR0x0c8e
 #define MSR_IA32_QM_EVTSEL 0x0c8d
 
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
 
-/**
- * struct intel_pqr_state - State cache for the PQR MSR
- * @rmid:  The cached Resource Monitoring ID
- * @closid:The cached Class Of Service ID
- *
- * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
- * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
- * contains both parts, so we need to cache them.
- *
- * The cache also helps to avoid pointless updates if the value does
- * not change.
- */
-struct intel_pqr_state {
-   u32 rmid;
-   u32 closid;
-};
-
-/*
- * The cached intel_pqr_state is strictly per CPU and can never be
- * updated from a remote CPU. Both functions which modify the state
- * (intel_cqm_event_start and intel_cqm_event_stop) are called with
- * interrupts disabled, which is sufficient for the protection.
- */
-static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-
 /*
  * Updates caller cpu's cache.
  */
diff --git a/arch/x86/include/asm/pqr_common.h 
b/arch/x86/include/asm/pqr_common.h
new file mode 100644
index 000..0c2001b
--- /dev/null
+++ b/arch/x86/include/asm/pqr_common.h
@@ -0,0 +1,31 @@
+#ifndef _X86_PQR_COMMON_H_
+#define _X86_PQR_COMMON_H_
+
+#if defined(CONFIG_INTEL_RDT)
+
+#include 
+#include 
+
+#define MSR_IA32_PQR_ASSOC 0x0c8f
+
+/**
+ * struct intel_pqr_state - State cache for the PQR MSR
+ * @rmid:  The cached Resource Monitoring ID
+ * @closid:The cached Class Of Service ID
+ *
+ * The upper 32 bits of MSR_IA32_PQR_ASSOC contain closid and the
+ * lower 10 bits rmid. The update to MSR_IA32_PQR_ASSOC always
+ * contains both parts, so we need to cache them.
+ *
+ * The cache also helps to avoid pointless updates if the value does
+ * not change.
+ */
+struct intel_pqr_state {
+   u32 rmid;
+   u32 closid;
+};
+
+DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
+
+#endif
+#endif
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 4a8697f..87e6279 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -34,6 +34,10 @@ obj-$(CONFIG_CPU_SUP_CENTAUR)+= centaur.o
 obj-$(CONFIG_CPU_SUP_TRANSMETA_32) += transmeta.o
 obj-$(CONFIG_

[PATCH 14/32] perf/x86/intel/cqm: add preallocation of anodes

2016-04-28 Thread David Carrillo-Cisneros
Pre-allocate enough anodes to hold at least one full set of RMIDs per
package before running out of anodes.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 904f2d3..98a919f 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -194,6 +194,7 @@ static int anode_pool__alloc_one(u16 pkg_id);
 /* Init cqm pkg_data for @cpu 's package. */
 static int pkg_data_init_cpu(int cpu)
 {
+   int i, nr_anodes;
struct pkg_data *pkg_data;
struct cpuinfo_x86 *c = &cpu_data(cpu);
u16 pkg_id = topology_physical_package_id(cpu);
@@ -257,6 +258,15 @@ static int pkg_data_init_cpu(int cpu)
pkg_data->timed_update_cpu = cpu;
 
cqm_pkgs_data[pkg_id] = pkg_data;
+
+   /* Pre-allocate pool with one anode more than minimum needed to contain
+* all the RMIDs in the package.
+*/
+   nr_anodes = (pkg_data->max_rmid + NR_RMIDS_PER_NODE) /
+   NR_RMIDS_PER_NODE + 1;
+
+   for (i = 0; i < nr_anodes; i++)
+   anode_pool__alloc_one(pkg_id);
return 0;
 }
 
-- 
2.8.0.rc3.226.g39d4020



[PATCH 01/32] perf/x86/intel/cqm: temporarily remove MBM from CQM and cleanup

2016-04-28 Thread David Carrillo-Cisneros
Remove the MBM code from arch/x86/events/intel/cqm.c. MBM will be added
using the new RMID infrastructure introduced in this patch series.

Also, remove updates to CQM that are superseded by this series.

Reviewed-by: Stephane Eranian 
Signed-off-by: David Carrillo-Cisneros 
---
 arch/x86/events/intel/cqm.c | 486 
 include/linux/perf_event.h  |   1 -
 2 files changed, 44 insertions(+), 443 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 7b5fd81..1b064c4 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -13,16 +13,8 @@
 #define MSR_IA32_QM_CTR0x0c8e
 #define MSR_IA32_QM_EVTSEL 0x0c8d
 
-#define MBM_CNTR_WIDTH 24
-/*
- * Guaranteed time in ms as per SDM where MBM counters will not overflow.
- */
-#define MBM_CTR_OVERFLOW_TIME  1000
-
 static u32 cqm_max_rmid = -1;
 static unsigned int cqm_l3_scale; /* supposedly cacheline size */
-static bool cqm_enabled, mbm_enabled;
-unsigned int mbm_socket_max;
 
 /**
  * struct intel_pqr_state - State cache for the PQR MSR
@@ -50,37 +42,8 @@ struct intel_pqr_state {
  * interrupts disabled, which is sufficient for the protection.
  */
 static DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
-static struct hrtimer *mbm_timers;
-/**
- * struct sample - mbm event's (local or total) data
- * @total_bytes#bytes since we began monitoring
- * @prev_msr   previous value of MSR
- */
-struct sample {
-   u64 total_bytes;
-   u64 prev_msr;
-};
 
 /*
- * samples profiled for total memory bandwidth type events
- */
-static struct sample *mbm_total;
-/*
- * samples profiled for local memory bandwidth type events
- */
-static struct sample *mbm_local;
-
-#define pkg_id topology_physical_package_id(smp_processor_id())
-/*
- * rmid_2_index returns the index for the rmid in mbm_local/mbm_total array.
- * mbm_total[] and mbm_local[] are linearly indexed by socket# * max number of
- * rmids per socket, an example is given below
- * RMID1 of Socket0:  vrmid =  1
- * RMID1 of Socket1:  vrmid =  1 * (cqm_max_rmid + 1) + 1
- * RMID1 of Socket2:  vrmid =  2 * (cqm_max_rmid + 1) + 1
- */
-#define rmid_2_index(rmid)  ((pkg_id * (cqm_max_rmid + 1)) + rmid)
-/*
  * Protects cache_cgroups and cqm_rmid_free_lru and cqm_rmid_limbo_lru.
  * Also protects event->hw.cqm_rmid
  *
@@ -102,13 +65,9 @@ static cpumask_t cqm_cpumask;
 #define RMID_VAL_ERROR (1ULL << 63)
 #define RMID_VAL_UNAVAIL   (1ULL << 62)
 
-/*
- * Event IDs are used to program IA32_QM_EVTSEL before reading event
- * counter from IA32_QM_CTR
- */
-#define QOS_L3_OCCUP_EVENT_ID  0x01
-#define QOS_MBM_TOTAL_EVENT_ID 0x02
-#define QOS_MBM_LOCAL_EVENT_ID 0x03
+#define QOS_L3_OCCUP_EVENT_ID  (1 << 0)
+
+#define QOS_EVENT_MASK QOS_L3_OCCUP_EVENT_ID
 
 /*
  * This is central to the rotation algorithm in __intel_cqm_rmid_rotate().
@@ -252,21 +211,6 @@ static void __put_rmid(u32 rmid)
list_add_tail(&entry->list, &cqm_rmid_limbo_lru);
 }
 
-static void cqm_cleanup(void)
-{
-   int i;
-
-   if (!cqm_rmid_ptrs)
-   return;
-
-   for (i = 0; i < cqm_max_rmid; i++)
-   kfree(cqm_rmid_ptrs[i]);
-
-   kfree(cqm_rmid_ptrs);
-   cqm_rmid_ptrs = NULL;
-   cqm_enabled = false;
-}
-
 static int intel_cqm_setup_rmid_cache(void)
 {
struct cqm_rmid_entry *entry;
@@ -274,7 +218,7 @@ static int intel_cqm_setup_rmid_cache(void)
int r = 0;
 
nr_rmids = cqm_max_rmid + 1;
-   cqm_rmid_ptrs = kzalloc(sizeof(struct cqm_rmid_entry *) *
+   cqm_rmid_ptrs = kmalloc(sizeof(struct cqm_rmid_entry *) *
nr_rmids, GFP_KERNEL);
if (!cqm_rmid_ptrs)
return -ENOMEM;
@@ -305,9 +249,11 @@ static int intel_cqm_setup_rmid_cache(void)
mutex_unlock(&cache_mutex);
 
return 0;
-
 fail:
-   cqm_cleanup();
+   while (r--)
+   kfree(cqm_rmid_ptrs[r]);
+
+   kfree(cqm_rmid_ptrs);
return -ENOMEM;
 }
 
@@ -335,13 +281,9 @@ static bool __match_event(struct perf_event *a, struct 
perf_event *b)
 
/*
 * Events that target same task are placed into the same cache group.
-* Mark it as a multi event group, so that we update ->count
-* for every event rather than just the group leader later.
 */
-   if (a->hw.target == b->hw.target) {
-   b->hw.is_group_event = true;
+   if (a->hw.target == b->hw.target)
return true;
-   }
 
/*
 * Are we an inherited event?
@@ -450,26 +392,10 @@ static bool __conflict_event(struct perf_event *a, struct 
perf_event *b)
 
 struct rmid_read {
u32 rmid;
-   u32 evt_type;
atomic64_t value;
 };
 
 static void __intel_cqm_event_count(void *info);
-static void init_mbm_sample(u32 rmid, u32 evt_type);
-static void __intel_mbm_event_count(void *info);
-
-static bool is_mbm_event(int e)
-{
-   return 

Re: [RFC PATCH V2 2/2] vhost: device IOTLB API

2016-04-28 Thread Jason Wang


On 04/29/2016 09:12 AM, Jason Wang wrote:
> On 04/28/2016 10:43 PM, Michael S. Tsirkin wrote:
>> > On Thu, Apr 28, 2016 at 02:37:16PM +0800, Jason Wang wrote:
>>> >>
>>> >> On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
 >>> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>  This patch tries to implement an device IOTLB for vhost. This could 
>  be
>  used with for co-operation with userspace(qemu) implementation of DMA
>  remapping.
> 
>  The idea is simple. When vhost meets an IOTLB miss, it will request
>  the assistance of userspace to do the translation, this is done
>  through:
> 
>  - Fill the translation request in a preset userspace address (This
>    address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>  - Notify userspace through eventfd (This eventfd was set through 
>  ioctl
>    VHOST_SET_IOTLB_FD).
 >>> Why use an eventfd for this?
>>> >> The aim is to implement the API all through ioctls.
>>> >>
 >>>  We use them for interrupts because
 >>> that happens to be what kvm wants, but here - why don't we
 >>> just add a generic support for reading out events
 >>> on the vhost fd itself?
>>> >> I've considered this approach, but what's the advantages of this? I mean
>>> >> looks like all other ioctls could be done through vhost fd
>>> >> reading/writing too.
>> > read/write have a non-blocking flag.
>> >
>> > It's not useful for other ioctls but it's useful here.
>> >
> Ok, this looks better.
>
>  - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
> 
>  When userspace finishes the translation, it will update the vhost
>  IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge 
>  of
>  snooping the IOTLB invalidation of IOMMU IOTLB and use
>  VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
 >>> There's one problem here, and that is that VQs still do not undergo
 >>> translation.  In theory VQ could be mapped in such a way
 >>> that it's not contigious in userspace memory.
>>> >> I'm not sure I get the issue, current vhost API support setting
>>> >> desc_user_addr, used_user_addr and avail_user_addr independently. So
>>> >> looks ok? If not, looks not a problem to device IOTLB API itself.
>> > The problem is that addresses are all HVA.
>> >
>> > Without an iommu, we ask for them to be contigious and
>> > since bus address == GPA, this means contigious GPA =>
>> > contigious HVA. With an IOMMU you can map contigious
>> > bus address but non contigious GPA and non contigious HVA.
> Yes, so the issue is we should not reuse VHOST_SET_VRING_ADDR but instead
> invent a new ioctl to set the bus addr (guest iova), then access the VQ
> through the device IOTLB too.

Note that userspace has checked for this and falls back to userspace
processing if it detects non-contiguous GPA. Considering this happens
rarely, I'm not sure we should handle it.

>
>> >
>> > Another concern: what if guest changes the GPA while keeping bus address
>> > constant? Normal devices will work because they only use
>> > bus addresses, but virtio will break.
> If we access VQ through device IOTLB too, this could be solved.
>

I don't see a reason why a guest would want to change the GPA during DMA.
Even if it can, it needs a lot of other synchronization.


[PATCH 00/32] 2nd Iteration of Cache QoS Monitoring support.

2016-04-28 Thread David Carrillo-Cisneros
This series introduces the next iteration of kernel support for the
Cache QoS Monitoring (CQM) technology available in Intel Xeon processors.

One of the main limitations of the previous version is the inability
to simultaneously monitor:
  1) cpu event and any other event in that cpu.
  2) cgroup events for cgroups in same descendancy line.
  3) cgroup events and any thread event of a cgroup in the same
 descendancy line.

Another limitation is that monitoring for a cgroup was enabled/disabled by
the existence of a perf event for that cgroup. Since the event
llc_occupancy measures changes in occupancy rather than total occupancy,
in order to read meaningful llc_occupancy values, an event should be
enabled for a long enough period of time. The overhead in context switches
caused by the perf events is undesired in some sensitive scenarios.

This series of patches addresses the shortcomings mentioned above and,
add some other improvements. The main changes are:
- No more potential conflicts between different events. New
version builds a hierarchy of RMIDs that captures the dependency
between monitored cgroups. llc_occupancy for cgroup is the sum of
llc_occupancies for that cgroup RMID and all other RMIDs in the
cgroups subtree (both monitored cgroups and threads).

- A cgroup integration that allows monitoring a cgroup without
creating a perf event, decreasing the context switch overhead.
Monitoring is controlled by a boolean cgroup subsystem attribute
in each perf cgroup, this is:

echo 1 > cgroup_path/perf_event.cqm_cont_monitoring

starts CQM monitoring whether or not there is a perf_event
attached to the cgroup. Setting the attribute to 0 makes
monitoring dependent on the existence of a perf_event.
A perf_event is always required in order to read llc_occupancy.
This cgroup integration uses Intel's PQR code and is intended to
be used by upcoming versions of Intel's CAT.

- A more stable rotation algorithm: New algorithm uses SLOs that
guarantee:
- A minimum of enabled time for monitored cgroups and
threads.
- A maximum time disabled before error is introduced by
reusing dirty RMIDs.
- A minimum rate at which RMIDs recycling must progress.

- Reduced impact of stealing/rotation of RMIDs: The new algorithm
accounts the residual occupancy held by limbo RMIDs towards the
former owner of the limbo RMID, decreasing the error introduced
by RMID rotation.
It also allows a limbo RMID to be reused by its former owner when
appropriate, decreasing the potential error of reusing dirty RMIDs
and allowing to make progress even if most limbo RMIDs do not
drop occupancy fast enough.

- Elimination of pmu::count: perf generic's perf_event_count()
performs a quick add of atomic types. The introduction of
pmu::count in the previous CQM series to read occupancy for thread
events changed the behavior of perf_event_count() by performing a
potentially slow IPI and write/read to MSR. It also made pmu::read
to have different behaviors depending on whether the event was a
cpu/cgroup event or a thread. This patch series removes the custom
pmu::count from CQM and provides a consistent behavior for all
calls of perf_event_read .

- Added error return for pmu::read: Reads to CQM events may fail
due to stealing of RMIDs, even after successfully adding an event
to a PMU. This patch series expands pmu::read with an int return
value and propagates the error to callers that can fail
(ie. perf_read).
The ability to fail of pmu::read is consistent with the recent
changes that allow perf_event_read to fail for transactional
reading of event groups.

- Introduces the field pmu_event_flags that contains flags set by
the PMU to signal variations on the default behavior to perf's
generic code. In this series, three flags are introduced:
- PERF_CGROUP_NO_RECURSION : Signals generic code to add
events of the cgroup ancestors of a cgroup.
- PERF_INACTIVE_CPU_READ_PKG: Signals generic code that
this CPU event can be read in any CPU in its event::cpu's
package, even if the event is not active.
- PERF_INACTIVE_EV_READ_ANY_CPU: Signals generic code that
this event can be read in any CPU in any package in the
system even if the event is not active.
Using the above flags takes advantage of the CQM's hw ability to
read llc_occupancy even when the associated perf event is not
running in a CPU.

This patch series also updates the perf tool to fix e

Re: [PATCH v7 22/24] [media] rtl2832: change the i2c gate to be mux-locked

2016-04-28 Thread Peter Rosin
On 2016-04-28 23:47, Wolfram Sang wrote:
> On Wed, Apr 20, 2016 at 05:18:02PM +0200, Peter Rosin wrote:
>> The root i2c adapter lock is then no longer held by the i2c mux during
>> accesses behind the i2c gate, and such accesses need to take that lock
>> just like any other ordinary i2c accesses do.
>>
>> So, declare the i2c gate mux-locked, and zap the regmap overrides
>> that makes the i2c accesses unlocked and use plain old regmap
>> accesses. This also removes the need for the regmap wrappers used by
>> rtl2832_sdr, so deconvolute the code further and provide the regmap
>> handle directly instead of the wrapper functions.
>>
>> Signed-off-by: Peter Rosin 
> Antti, I'd need some tag from you since this is not the i2c realm.
>

Antti sent this:

https://lkml.org/lkml/2016/4/20/828

and I added a Tested-by in v8

https://github.com/peda-r/i2c-mux/commit/c023584d34db7aacbc59f28386378131cfa970d2

but the patch was never sent as an email, only as part of a pull request for

https://github.com/peda-r/i2c-mux/commits/mux-core-and-locking-8

So, I think all is ok, or do you need more than Tested-by?

Cheers,
Peter


[PATCH kernel v2] vfio_pci: Test for extended capabilities if config space > 256 bytes

2016-04-28 Thread Alexey Kardashevskiy
The PCI Express spec says that reading 4 bytes at offset 100h should return
zero if there is no extended capability, so VFIO reads this dword to
know whether there are extended capabilities.

However, it is not always possible to access the extended space, so
generic PCI code in pci_cfg_space_size_ext() checks whether
pci_read_config_dword() can read beyond 100h; if that check fails,
it sets the config space size to 100h.
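
For reference, that generic check is roughly the following (a simplified
sketch from memory, not the exact kernel code):

#include <linux/pci.h>

/* If the dword just past the standard 256 bytes cannot be read, or reads
 * back as all ones, treat the device as having no extended config space.
 */
static int cfg_space_size_ext_sketch(struct pci_dev *pdev)
{
	u32 dword;

	if (pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword) !=
	    PCIBIOS_SUCCESSFUL)
		return PCI_CFG_SPACE_SIZE;		/* 256 bytes */
	if (dword == 0xffffffff)
		return PCI_CFG_SPACE_SIZE;

	return PCI_CFG_SPACE_EXP_SIZE;			/* 4096 bytes */
}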

VFIO does its own extended capabilities check by reading at offset 100h,
which may produce 0xffffffff, which VFIO treats as the presence of
extended config space, and it then calls vfio_ecap_init(). That function
fails to parse the capabilities (which is expected), but right before the
exit it writes zero at offset 100h, which is beyond the buffer allocated
for vdev->vconfig (256 bytes), and that leads to random memory
corruption.

This makes VFIO only check for the extended capabilities if
the discovered config size is more than 256 bytes.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* instead of checking for 0xffffffff, this only does the check if
the device's config size is big enough
---
 drivers/vfio/pci/vfio_pci_config.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index 142c533..d0c4358 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1124,9 +1124,12 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 
cap, u8 pos)
return pcibios_err_to_errno(ret);
 
if (PCI_X_CMD_VERSION(word)) {
-   /* Test for extended capabilities */
-   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
-   vdev->extended_caps = (dword != 0);
+   if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) {
+   /* Test for extended capabilities */
+   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE,
+   &dword);
+   vdev->extended_caps = (dword != 0);
+   }
return PCI_CAP_PCIX_SIZEOF_V2;
} else
return PCI_CAP_PCIX_SIZEOF_V0;
@@ -1138,9 +1141,11 @@ static int vfio_cap_len(struct vfio_pci_device *vdev, u8 
cap, u8 pos)
 
return byte;
case PCI_CAP_ID_EXP:
-   /* Test for extended capabilities */
-   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
-   vdev->extended_caps = (dword != 0);
+   if (pdev->cfg_size > PCI_CFG_SPACE_SIZE) {
+   /* Test for extended capabilities */
+   pci_read_config_dword(pdev, PCI_CFG_SPACE_SIZE, &dword);
+   vdev->extended_caps = dword != 0;
+   }
 
/* length based on version */
if ((pcie_caps_reg(pdev) & PCI_EXP_FLAGS_VERS) == 1)
-- 
2.5.0.rc3



linux-next: manual merge of the xen-tip tree with the tip tree

2016-04-28 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the xen-tip tree got a conflict in:

  drivers/firmware/efi/arm-runtime.c

between commit:

  14c43be60166 ("efi/arm*: Drop writable mapping of the UEFI System table")

from the tip tree and commit:

  21c8dfaa2327 ("Xen: EFI: Parse DT parameters for Xen specific UEFI")

from the xen-tip tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/firmware/efi/arm-runtime.c
index 17ccf0a8787a,ac609b9f0b99..
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@@ -109,24 -90,41 +110,30 @@@ static int __init arm_enable_runtime_se
  
pr_info("Remapping and enabling EFI services.\n");
  
 -  mapsize = memmap.map_end - memmap.map;
 -  memmap.map = (__force void *)ioremap_cache(memmap.phys_map,
 - mapsize);
 -  if (!memmap.map) {
 -  pr_err("Failed to remap EFI memory map\n");
 -  return -ENOMEM;
 -  }
 -  memmap.map_end = memmap.map + mapsize;
 -  efi.memmap = &memmap;
 +  mapsize = efi.memmap.map_end - efi.memmap.map;
  
 -  efi.systab = (__force void *)ioremap_cache(efi_system_table,
 - sizeof(efi_system_table_t));
 -  if (!efi.systab) {
 -  pr_err("Failed to remap EFI System Table\n");
 +  efi.memmap.map = memremap(efi.memmap.phys_map, mapsize, MEMREMAP_WB);
 +  if (!efi.memmap.map) {
 +  pr_err("Failed to remap EFI memory map\n");
return -ENOMEM;
}
 -  set_bit(EFI_SYSTEM_TABLES, &efi.flags);
 +  efi.memmap.map_end = efi.memmap.map + mapsize;
  
-   if (!efi_virtmap_init()) {
-   pr_err("UEFI virtual mapping missing or invalid -- runtime 
services will not be available\n");
-   return -ENOMEM;
+   if (IS_ENABLED(CONFIG_XEN_EFI) && efi_enabled(EFI_PARAVIRT)) {
+   /* Set up runtime services function pointers for Xen Dom0 */
+   xen_efi_runtime_setup();
+   } else {
+   if (!efi_virtmap_init()) {
 -  pr_err("No UEFI virtual mapping was installed -- 
runtime services will not be available\n");
++  pr_err("UEFI virtual mapping missing or invalid -- 
runtime services will not be available\n");
+   return -ENOMEM;
+   }
+ 
+   /* Set up runtime services function pointers */
+   efi_native_runtime_setup();
}
  
-   /* Set up runtime services function pointers */
-   efi_native_runtime_setup();
set_bit(EFI_RUNTIME_SERVICES, &efi.flags);
  
 -  efi.runtime_version = efi.systab->hdr.revision;
 -
return 0;
  }
  early_initcall(arm_enable_runtime_services);


Re: [PATCH 3.16 000/217] 3.16.35-rc1 review

2016-04-28 Thread Guenter Roeck

On 04/26/2016 04:02 PM, Ben Hutchings wrote:

This is the start of the stable review cycle for the 3.16.35 release.
There are 217 patches in this series, which will be posted as responses
to this one.  If anyone has any issues with these being applied, please
let me know.

Responses should be made by Sat Apr 30 22:00:00 UTC 2016.
Anything received after that time might be too late.



Updated build and test results:

Build results:
total: 137 pass: 135 fail: 2
Failed builds:
arc:allnoconfig
arm64:allmodconfig

Qemu test results:
total: 97 pass: 94 fail: 3
Failed tests:
arm:xilinx-zynq-a9:multi_v7_defconfig:zynq-zc706
arm64:smp:defconfig
arm64:nosmp:defconfig

This is after dropping a couple of builds and qemu tests which are
known to be bad in 3.16, and after some fixes in the tree.

The arm64 build failure is due to gcc5, which needs a patch from a
later kernel. The other failures are new and did not occur in 3.16.7.

A bisect of the arm64 qemu failure points to commit f98ab7a1e78
("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't
make any progress"). Unfortunately, that is inconclusive, since there
have been several follow-up commits trying to fix it. After reverting
all those commits, the test still fails.

Guenter



Re: [PATCH v3 6/6] sched/fair: Move (inactive) option from code to config

2016-04-28 Thread Yuyang Du
On Thu, Apr 28, 2016 at 11:37:33AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:31PM +0800, Yuyang Du wrote:
> > The option of increased load resolution (fixed point arithmetic range) is
> > unconditionally deactivated with #if 0. But since it may still be used
> > somewhere (e.g., in Google), we want to keep this option.
> > 
> > Regardless, there should be a way to express this option. Considering the
> > current circumstances, the reconciliation is we define a config
> > CONFIG_CFS_INCREASE_LOAD_RANGE and it depends on FAIR_GROUP_SCHED and
> > 64BIT and BROKEN.
> > 
> > Suggested-by: Ingo Molnar 
> 
> So I'm very tempted to simply, unconditionally, reinstate this larger
> range for everything CONFIG_64BIT && CONFIG_FAIR_GROUP_SCHED.
> 
> There was but the single claim on increased power usage, nobody could
> reproduce / analyze and Google has been running with this for years now.
> 
> Furthermore, it seems to be leading to the obvious problems on bigger
> machines where we basically run out of precision by the sheer number of
> cpus (nr_cpus ~ SCHED_LOAD_SCALE and stuff comes apart quickly).
 
Great.


Re: [PATCH v2 2/2] cpufreq: arm_big_little: use generic OPP functions for {init,free}_opp_table

2016-04-28 Thread Viresh Kumar
On 28-04-16, 18:07, Sudeep Holla wrote:
> Currently when performing random hotplugs and suspend-to-ram(S2R) on
> systems using arm_big_little cpufreq driver, we get warnings similar to:
> 
> cpu cpu1: _opp_add: duplicate OPPs detected. Existing: freq: 6,
>   volt: 80, enabled: 1. New: freq: 6, volt: 80, enabled: 1
> 
> This is mainly because the OPPs for the shared cpus are not set. We can
> just use dev_pm_opp_of_cpumask_add_table in case the OPPs are obtained
> from DT(arm_big_little_dt.c) or use dev_pm_opp_set_sharing_cpus if the
> OPPs are obtained by other means like firmware(e.g. scpi-cpufreq.c)
> 
> Also now that the generic dev_pm_opp_cpumask_remove_table can handle
> removal of opp table and entries for all associated CPUs, we can reuse
> dev_pm_opp_cpumask_remove_table as free_opp_table in cpufreq_arm_bL_ops.
> 
> This patch makes necessary changes to reuse the generic OPP functions for
> {init,free}_opp_table and thereby eliminating the warnings.
> 
> Cc: Viresh Kumar 
> Cc: "Rafael J. Wysocki" 
> Cc: linux...@vger.kernel.org
> Signed-off-by: Sudeep Holla 
> ---
>  drivers/cpufreq/arm_big_little.c   | 54 
> ++
>  drivers/cpufreq/arm_big_little.h   |  4 +--
>  drivers/cpufreq/arm_big_little_dt.c| 21 ++---
>  drivers/cpufreq/scpi-cpufreq.c | 47 +
>  drivers/cpufreq/vexpress-spc-cpufreq.c |  4 ++-
>  5 files changed, 56 insertions(+), 74 deletions(-)

Acked-by: Viresh Kumar 

-- 
viresh


Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism

2016-04-28 Thread George Spelvin
Linus wrote:
> Having looked around at other hashes, I suspect we should look at the
> ones that do five or six shifts, and a mix of add/sub and xor. And
> because they shift the bits around more freely you don't have the
> final shift (that ends up being dependent on the size of the target
> set).

I'm not sure that final shift is a problem.  You need to mask the result
to the desired final size somehow, and a shift is no more cycles than
an AND.

> It really would be lovely to hear that we can just replace
> hash_int/long() with a better hash. And I wouldn't get too hung up on
> the multiplication trick. I suspect it's not worth it.

My main concern is that the scope of the search grows enormously
if we include such things.  I don't want to discourage someone
from looking, but I volunteered to find a better multiplication
constant with an efficient add/subtract chain, not start a thesis
project on more general hash functions.

Two places one could look for ideas, though:
http://www.burtleburtle.net/bob/hash/integer.html
https://gist.github.com/badboy/6267743
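
For example, the first page has a 6-shift 32-bit integer hash along these
lines (reproduced from memory, so double-check the constants against the
page before relying on it):

#include <stdint.h>

uint32_t hash6shift(uint32_t a)
{
	a = (a + 0x7ed55d16) + (a << 12);
	a = (a ^ 0xc761c23c) ^ (a >> 19);
	a = (a + 0x165667b1) + (a << 5);
	a = (a + 0xd3a2646c) ^ (a << 9);
	a = (a + 0xfd7046c5) + (a << 3);
	a = (a ^ 0xb55a4f09) ^ (a >> 16);
	return a;
}

That is exactly the "several shifts plus a mix of add/sub/xor" shape under
discussion, with no multiply; the low bits should come out well mixed, so a
plain mask works for the final reduction.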

Here's Thomas Wang's 64-bit hash, which is reputedly quite
good, in case it helps:

uint64_t hash(uint64_t key)
{
key  = ~key + (key << 21);  // key = (key << 21) - key - 1;
key ^= key >> 24;
key += (key << 3) + (key << 8); // key *= 265
key ^= key >> 14;
key += (key << 2) + (key << 4); // key *= 21
key ^= key >> 28;
key += key << 31;
return key;
}

And his slightly shorter 64-to-32-bit function:
unsigned hash(uint64_t key)
{
  key  = ~key + (key << 18); // key = (key << 18) - key - 1;
  key ^= key >> 31;
  key *= 21; // key += (key << 2) + (key << 4);
  key ^= key >> 11;
  key += key << 6;
  key ^= key >> 22;
  return (uint32_t)key;
}


Sticking to multiplication, using the heuristics in the
current comments (prime near golden ratio = 0x9e3779b9 = 2654435769),
I can come up with this for multiplying by 2654435599 = 0x9e37790f:

// -
// This code was generated by Spiral Multiplier Block Generator, www.spiral.net
// Copyright (c) 2006, Carnegie Mellon University
// All rights reserved.
// The generated code is distributed under a BSD style license
// (see http://www.opensource.org/licenses/bsd-license.php)
// ---
// Cost: 6 adds/subtracts 6 shifts 0 negations
// Depth: 5
// Input:
//   int t0
// Outputs:
//   int t1 = 2654435599 * t0
// ---
t3 = shl(t0, 11);   /* 2048*/
t2 = sub(t3, t0);   /* 2047*/
t5 = shl(t2, 8);   /* 524032*/
t4 = sub(t5, t2);   /* 521985*/
t7 = shl(t0, 25);   /* 33554432*/
t6 = add(t4, t7);   /* 34076417*/
t9 = shl(t0, 9);   /* 512*/
t8 = sub(t9, t0);   /* 511*/
t11 = shl(t6, 4);   /* 545222672*/
t10 = sub(t11, t6);   /* 511146255*/
t12 = shl(t8, 22);   /* 2143289344*/
t1 = add(t10, t12);   /* 2654435599*/

Which translates into C as

uint32_t multiply(uint32_t x)
{
unsigned y = (x << 11) - x;

y -= y << 8;
y -= x << 25;
x -= x << 9;
y -= y << 4;
y -= x << 22;
return y;
}
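
A quick exhaustive check (mine, not part of the generated output) that the
chain above really computes x * 2654435599 mod 2^32:

#include <stdint.h>
#include <stdio.h>

static uint32_t multiply(uint32_t x)	/* same sequence as above */
{
	unsigned y = (x << 11) - x;

	y -= y << 8;
	y -= x << 25;
	x -= x << 9;
	y -= y << 4;
	y -= x << 22;
	return y;
}

int main(void)
{
	uint32_t x = 0;

	do {
		if (multiply(x) != x * 2654435599u) {
			printf("mismatch at %#x\n", x);
			return 1;
		}
	} while (++x);	/* walks all 2^32 inputs, a handful of seconds with -O2 */

	printf("ok\n");
	return 0;
}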

Unfortunately, that utility bogs like hell on 64-bit constants.


Re: [PATCH v3 5/6] sched/fair: Rename scale_load() and scale_load_down()

2016-04-28 Thread Yuyang Du
On Thu, Apr 28, 2016 at 11:19:19AM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:30PM +0800, Yuyang Du wrote:
> > Rename scale_load() and scale_load_down() to user_to_kernel_load()
> > and kernel_to_user_load() respectively, to allow the names to bear
> > what they are really about.
> 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -189,7 +189,7 @@ static void __update_inv_weight(struct load_weight *lw)
> > if (likely(lw->inv_weight))
> > return;
> >  
> > -   w = scale_load_down(lw->weight);
> > +   w = kernel_to_user_load(lw->weight);
> >  
> > if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
> > lw->inv_weight = 1;
> > @@ -213,7 +213,7 @@ static void __update_inv_weight(struct load_weight *lw)
> >   */
> >  static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct 
> > load_weight *lw)
> >  {
> > -   u64 fact = scale_load_down(weight);
> > +   u64 fact = kernel_to_user_load(weight);
> > int shift = WMULT_SHIFT;
> >  
> > __update_inv_weight(lw);

[snip]
 
> Except these 3 really are not about user/kernel visible fixed point
> ranges _at_all_... :/

But are the above two falling back to the user fixed-point precision? And
the reason is that we can't efficiently do this multiply/divide with the
increased fixed-point range for the kernel load.


Re: [PATCH v2 1/2] PM / OPP: add non-OF versions of dev_pm_opp_{cpumask_,}remove_table

2016-04-28 Thread Viresh Kumar
On 28-04-16, 18:07, Sudeep Holla wrote:
> diff --git a/drivers/base/power/opp/core.c b/drivers/base/power/opp/core.c
> index 433b60092972..e59b9e7c31ba 100644
> --- a/drivers/base/power/opp/core.c
> +++ b/drivers/base/power/opp/core.c
> @@ -1845,13 +1845,14 @@ struct srcu_notifier_head 
> *dev_pm_opp_get_notifier(struct device *dev)
>  }
>  EXPORT_SYMBOL_GPL(dev_pm_opp_get_notifier);
>  
> -#ifdef CONFIG_OF
>  /**
> - * dev_pm_opp_of_remove_table() - Free OPP table entries created from static 
> DT
> - * entries
> + * _dev_pm_opp_remove_table() - Free OPP table entries

This is an internal routine and doesn't really require a doc-style comment at
all. Please remove it. You can add a simple comment for things you want to
mention though.

>   * @dev: device pointer used to lookup OPP table.
> + * @remove_dyn:  specify whether to remove only OPPs created using
> + *  static entries from DT or even the dynamically add OPPs.
>   *
> - * Free OPPs created using static entries present in DT.
> + * Free OPPs either created using static entries present in DT or even the
> + * dynamically added entries based on @remove_dyn param.
>   *
>   * Locking: The internal opp_table and opp structures are RCU protected.
>   * Hence this function indirectly uses RCU updater strategy with mutex locks
> @@ -1859,7 +1860,7 @@ EXPORT_SYMBOL_GPL(dev_pm_opp_get_notifier);
>   * that this function is *NOT* called under RCU protection or in contexts 
> where
>   * mutex cannot be locked.
>   */
> -void dev_pm_opp_of_remove_table(struct device *dev)
> +static void _dev_pm_opp_remove_table(struct device *dev, bool remove_dyn)

Maybe s/remove_dyn/remove_all ..

>  {
>   struct opp_table *opp_table;
>   struct dev_pm_opp *opp, *tmp;
> @@ -1884,7 +1885,7 @@ void dev_pm_opp_of_remove_table(struct device *dev)
>   if (list_is_singular(&opp_table->dev_list)) {
>   /* Free static OPPs */
>   list_for_each_entry_safe(opp, tmp, &opp_table->opp_list, node) {
> - if (!opp->dynamic)
> + if (!opp->dynamic || (opp->dynamic && remove_dyn))

Well, that's a funny one :)

The second conditional statement doesn't require opp->dynamic, as that is
guaranteed to be true, as the first condition failed.

So this should be:

if (remove_all || !opp->dynamic)

>   _opp_remove(opp_table, opp, true);
>   }
>   } else {
> @@ -1894,6 +1895,44 @@ void dev_pm_opp_of_remove_table(struct device *dev)
>  unlock:
>   mutex_unlock(&opp_table_lock);
>  }
> +
> +/**
> + * dev_pm_opp_of_remove_table() - Free OPP table entries created from static 
> DT

No, this isn't the OF specific function.

> + * entries
> + * @dev: device pointer used to lookup OPP table.
> + *
> + * Free all OPPs associated with the device

Full stop at the end.

> + *
> + * Locking: The internal opp_table and opp structures are RCU protected.
> + * Hence this function indirectly uses RCU updater strategy with mutex locks
> + * to keep the integrity of the internal data structures. Callers should 
> ensure
> + * that this function is *NOT* called under RCU protection or in contexts 
> where
> + * mutex cannot be locked.
> + */
> +void dev_pm_opp_remove_table(struct device *dev)
> +{
> + _dev_pm_opp_remove_table(dev, true);
> +}
> +EXPORT_SYMBOL_GPL(dev_pm_opp_remove_table);
> +
> +#ifdef CONFIG_OF
> +/**
> + * dev_pm_opp_of_remove_table() - Free OPP table entries created from static 
> DT
> + * entries
> + * @dev: device pointer used to lookup OPP table.
> + *
> + * Free OPPs created using static entries present in DT.
> + *
> + * Locking: The internal opp_table and opp structures are RCU protected.
> + * Hence this function indirectly uses RCU updater strategy with mutex locks
> + * to keep the integrity of the internal data structures. Callers should 
> ensure
> + * that this function is *NOT* called under RCU protection or in contexts 
> where
> + * mutex cannot be locked.
> + */
> +void dev_pm_opp_of_remove_table(struct device *dev)
> +{
> + _dev_pm_opp_remove_table(dev, false);
> +}
>  EXPORT_SYMBOL_GPL(dev_pm_opp_of_remove_table);
>  
>  /* Returns opp descriptor node for a device, caller must do of_node_put() */
> diff --git a/drivers/base/power/opp/cpu.c b/drivers/base/power/opp/cpu.c
> index 55cbf9bd8707..9df4ad809c26 100644
> --- a/drivers/base/power/opp/cpu.c
> +++ b/drivers/base/power/opp/cpu.c
> @@ -119,12 +119,54 @@ void dev_pm_opp_free_cpufreq_table(struct device *dev,
>  EXPORT_SYMBOL_GPL(dev_pm_opp_free_cpufreq_table);
>  #endif   /* CONFIG_CPU_FREQ */
>  
> +static void _dev_pm_opp_cpumask_remove_table(cpumask_var_t cpumask, bool of)
> +{
> + struct device *cpu_dev;
> + int cpu;
> +
> + WARN_ON(cpumask_empty(cpumask));
> +
> + for_each_cpu(cpu, cpumask) {
> + cpu_dev = get_cpu_device(cpu);
> + if (!cpu_de

Re: [PATCH 3.16 106/217] sd: disable discard_zeroes_data for UNMAP

2016-04-28 Thread Rafael David Tinoco
Actually, it was an objection.

Knowing that WRITESAME(16), used as the discard mechanism, can cause
storage servers to misbehave (like QEMU's SCSI WRITESAME
implementation, worked around by commit e461338b6cd4), and that those
storage servers can't rely on the LBPRZ flag to opt out of WRITESAME as
the discard mechanism (like QEMU does) since it is out of spec...

I have also seen storage servers misbehaving with this specific
change (when changing from kernel 3.13 to 3.19, for example):

[21354.827291] Write same(16): 93 08 00 00 00 00 00 00 80 00 00 40 00 00 00 00
...
[21420.471648] sd 0:0:2:1: [sdw] FAILED Result: hostbyte=DID_OK
driverbyte=DRIVER_SENSE
[21420.471665] sd 0:0:2:1: [sdw] Sense Key : Illegal Request [current]
[21420.471670] sd 0:0:2:1: [sdw] Add. Sense: Invalid field in cdb

And this happened because the storage in question didn't properly set
"max_ws_blocks" (it was 0) via VPD page 0xb0 (max_ws_blocks is calculated
from it).

Anyway, those are two examples of disk servers that had problems after
this change. IMHO the change is good for regular kernel development, and
it does guarantee that subsequent READ commands return zeros from the
LBAs, but it jeopardises already functioning storage servers.

If that argument isn't enough,

Without properly setting the NDOB bit in WRITESAME(16), the data buffer
will be read on every SCSI WRITESAME(16) command, and that will impact
the "discard method" performance (it will probably be slower than a
regular UNMAP command).

So far, I see two reasons why it shouldn't go into older kernels.

On Thu, Apr 28, 2016 at 1:11 PM, Ben Hutchings  wrote:
> On Wed, 2016-04-27 at 17:43 -0300, Rafael David Tinoco wrote:
>> It seems that changing discard method from UNMAP to WRITE SAME(16)
>> without using NDOB bit (as first described in sbc3r35b.pdf) can cause
>> performance problems on big discards (since data-out buffer will be
>> checked for every WRITE SAME command). I think this is happening after
>> this commit, since NDOB bit wasn't implemented with this change
>> (afaik, iirc).
>
> Is that an objection, or just a comment?
>
> I only picked this commit for backporting because it was referenced by
> later fixes (commits 397737223c59, f4327a95dd08) and I read the commit
> message as saying that it fixes data corruption (sd claims to be
> writing zeroes but the whole area might not read back as zeroes).  Is
> my understanding correct?
>
> Ben.
>
>> From the spec:
>> """
>> To ensure that subsequent read operations return all zeros in a
>> logical block, use the WRITE SAME (16)
>> command with the NDOB bit set to one. If the UNMAP bit is set to one,
>> then the device server may unmap the logical blocks specified by the
>> WRITE SAME (16)
>> """
>>
>> And there were some problems with this change (specifically QEMU SCSI
>> WRITE SAME implementation). So the change (commit e461338b6cd4) was
>> made to guarantee that if LBPRZ=0, after VPD 0xB2, UNMAP is still
>> picked. WRITESAME(16) is picked only if LBPRZ=1. This last commit
>> violated spec in favor of a WRITE SAME "optout" approach for QEMU.
>>
>> I wonder if this should be taken to previous versions ...
>>
>> -Rafael Tinoco


[PATCH] drm/rockchip: vop: fix iommu crash with async atomic

2016-04-28 Thread Mark Yao
In the async atomic_commit path, drm_atomic_clean_old_fb cleans up all
of the old framebuffers, but because the commit is asynchronous an old
fb may still be in use by the VOP hardware; DMA then accesses the old
fb buffer, so cleaning it up causes an iommu page fault.

Take a reference on the fb and unreference it only when the fb is
actually swapped out of the VOP hardware.

Signed-off-by: Mark Yao 
---
 drivers/gpu/drm/rockchip/rockchip_drm_vop.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c 
b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
index 28596e7..38c4de9 100644
--- a/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
+++ b/drivers/gpu/drm/rockchip/rockchip_drm_vop.c
@@ -560,6 +560,22 @@ static void vop_plane_destroy(struct drm_plane *plane)
drm_plane_cleanup(plane);
 }
 
+static int vop_plane_prepare_fb(struct drm_plane *plane,
+const struct drm_plane_state *new_state)
+{
+   if (plane->state->fb)
+   drm_framebuffer_reference(plane->state->fb);
+
+   return 0;
+}
+
+static void vop_plane_cleanup_fb(struct drm_plane *plane,
+ const struct drm_plane_state *old_state)
+{
+   if (old_state->fb)
+   drm_framebuffer_unreference(old_state->fb);
+}
+
 static int vop_plane_atomic_check(struct drm_plane *plane,
   struct drm_plane_state *state)
 {
@@ -756,6 +772,8 @@ static void vop_plane_atomic_update(struct drm_plane *plane,
 }
 
 static const struct drm_plane_helper_funcs plane_helper_funcs = {
+   .prepare_fb = vop_plane_prepare_fb,
+   .cleanup_fb = vop_plane_cleanup_fb,
.atomic_check = vop_plane_atomic_check,
.atomic_update = vop_plane_atomic_update,
.atomic_disable = vop_plane_atomic_disable,
-- 
1.9.1




linux-next: manual merge of the tip tree with the arm64 tree

2016-04-28 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the tip tree got a conflict in:

  drivers/firmware/efi/arm-init.c

between commits:

  500899c2cc3e ("efi: ARM/arm64: ignore DT memory nodes instead of removing 
them")
  7464b6e3a5fb ("efi: ARM: avoid warning about phys_addr_t cast")

from the arm64 tree and commits:

  78ce248faa3c ("efi: Iterate over efi.memmap in for_each_efi_memory_desc()")
  884f4f66ffd6 ("efi: Remove global 'memmap' EFI memory map")

from the tip tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/firmware/efi/arm-init.c
index fac567c3b66a,ef90f0c4b70a..
--- a/drivers/firmware/efi/arm-init.c
+++ b/drivers/firmware/efi/arm-init.c
@@@ -143,15 -178,7 +178,15 @@@ static __init void reserve_regions(void
if (efi_enabled(EFI_DBG))
pr_info("Processing EFI memory map:\n");
  
 +  /*
 +   * Discard memblocks discovered so far: if there are any at this
 +   * point, they originate from memory nodes in the DT, and UEFI
 +   * uses its own memory map instead.
 +   */
 +  memblock_dump_all();
 +  memblock_remove(0, (phys_addr_t)ULLONG_MAX);
 +
-   for_each_efi_memory_desc(&memmap, md) {
+   for_each_efi_memory_desc(md) {
paddr = md->phys_addr;
npages = md->num_pages;
  


Re: [PATCH v2 1/1] ASoC: fsl_ssi: add CCSR_SSI_SOR to volatile register list

2016-04-28 Thread Nicolin Chen
On Mon, Apr 25, 2016 at 11:36:18AM -0700, Caleb Crome wrote:
> The CCSR_SSI_SOR is a register that clears the TX and/or the RX fifo
> on the i.MX SSI port.  The fsl_ssi_trigger writes this register in
> order to clear the fifo at trigger time.
> 
> However, since the CCSR_SSI_SOR register is not in the volatile list,
> the caching mechanism prevented the register write in the trigger
> function.  This caused the fifo to not be cleared (because the value
> was unchanged from the last time the register was written), and thus
> causes the channels in both TDM or simple I2S mode to slip and be in
> the wrong time slots on SSI restart.
> 
> This has gone unnoticed for so long because with simple stereo mode,
> the consequence is that left and right are swapped, which isn't that
> noticeable.  However, it's catastrophic in some systems that
> require the channels to be in the right slots.
> 
> Signed-off-by: Caleb Crome 
> Suggested-by: Arnaud Mouiche 

Acked-by: Nicolin Chen 

Thanks

> 
> ---
>  sound/soc/fsl/fsl_ssi.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/sound/soc/fsl/fsl_ssi.c b/sound/soc/fsl/fsl_ssi.c
> index 216e3cb..2f3bf9c 100644
> --- a/sound/soc/fsl/fsl_ssi.c
> +++ b/sound/soc/fsl/fsl_ssi.c
> @@ -151,6 +151,7 @@ static bool fsl_ssi_volatile_reg(struct device *dev, 
> unsigned int reg)
>   case CCSR_SSI_SACDAT:
>   case CCSR_SSI_SATAG:
>   case CCSR_SSI_SACCST:
> + case CCSR_SSI_SOR:
>   return true;
>   default:
>   return false;
> -- 
> 1.9.1
> 


Re: [alsa-devel] [PATCH v2 1/1] ASoC: fsl_ssi: add CCSR_SSI_SOR to volatile register list

2016-04-28 Thread Fabio Estevam
On Mon, Apr 25, 2016 at 3:36 PM, Caleb Crome  wrote:
> The CCSR_SSI_SOR is a register that clears the TX and/or the RX fifo
> on the i.MX SSI port.  The fsl_ssi_trigger writes this register in
> order to clear the fifo at trigger time.
>
> However, since the CCSR_SSI_SOR register is not in the volatile list,
> the caching mechanism prevented the register write in the trigger
> function.  This caused the fifo to not be cleared (because the value
> was unchanged from the last time the register was written), and thus
> causes the channels in both TDM or simple I2S mode to slip and be in
> the wrong time slots on SSI restart.
>
> This has gone unnoticed for so long because with simple stereo mode,
> the consequence is that left and right are swapped, which isn't that
> noticeable.  However, it's catastrophic in some systems that
> require the channels to be in the right slots.
>
> Signed-off-by: Caleb Crome 
> Suggested-by: Arnaud Mouiche 

Reviewed-by: Fabio Estevam 


[PATCH v3] drm/rockchip: support non-iommu buffer path

2016-04-28 Thread Mark Yao
Some Rockchip VOPs do not support an iommu and need to use non-iommu
buffers. Also, if we hit iommu issues, we can compare them against the
non-iommu path, which helps with debugging.

Signed-off-by: Mark Yao 
---
Changes in v3
- fix conflict with other iommu patch.
Changes in v2
Advised by Heiko Stuebner
- use more suitable message print.

 drivers/gpu/drm/rockchip/rockchip_drm_drv.c | 64 +
 1 file changed, 46 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/rockchip/rockchip_drm_drv.c 
b/drivers/gpu/drm/rockchip/rockchip_drm_drv.c
index 1e2d88b..0bd1cea 100644
--- a/drivers/gpu/drm/rockchip/rockchip_drm_drv.c
+++ b/drivers/gpu/drm/rockchip/rockchip_drm_drv.c
@@ -36,6 +36,8 @@
 #define DRIVER_MAJOR   1
 #define DRIVER_MINOR   0
 
+static bool is_support_iommu = true;
+
 /*
  * Attach a (component) device to the shared drm dma mapping from master drm
  * device.  This is used by the VOPs to map GEM buffers to a common DMA
@@ -47,6 +49,9 @@ int rockchip_drm_dma_attach_device(struct drm_device *drm_dev,
struct dma_iommu_mapping *mapping = drm_dev->dev->archdata.mapping;
int ret;
 
+   if (!is_support_iommu)
+   return 0;
+
ret = dma_set_coherent_mask(dev, DMA_BIT_MASK(32));
if (ret)
return ret;
@@ -59,6 +64,9 @@ int rockchip_drm_dma_attach_device(struct drm_device *drm_dev,
 void rockchip_drm_dma_detach_device(struct drm_device *drm_dev,
struct device *dev)
 {
+   if (!is_support_iommu)
+   return;
+
arm_iommu_detach_device(dev);
 }
 
@@ -152,23 +160,26 @@ static int rockchip_drm_load(struct drm_device *drm_dev, 
unsigned long flags)
goto err_config_cleanup;
}
 
-   /* TODO(djkurtz): fetch the mapping start/size from somewhere */
-   mapping = arm_iommu_create_mapping(&platform_bus_type, 0x,
-  SZ_2G);
-   if (IS_ERR(mapping)) {
-   ret = PTR_ERR(mapping);
-   goto err_config_cleanup;
-   }
+   if (is_support_iommu) {
+   /* TODO(djkurtz): fetch the mapping start/size from somewhere */
+   mapping = arm_iommu_create_mapping(&platform_bus_type,
+  0x,
+  SZ_2G);
+   if (IS_ERR(mapping)) {
+   ret = PTR_ERR(mapping);
+   goto err_config_cleanup;
+   }
 
-   ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
-   if (ret)
-   goto err_release_mapping;
+   ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
+   if (ret)
+   goto err_release_mapping;
 
-   dma_set_max_seg_size(dev, DMA_BIT_MASK(32));
+   dma_set_max_seg_size(dev, DMA_BIT_MASK(32));
 
-   ret = arm_iommu_attach_device(dev, mapping);
-   if (ret)
-   goto err_release_mapping;
+   ret = arm_iommu_attach_device(dev, mapping);
+   if (ret)
+   goto err_release_mapping;
+   }
 
/* Try to bind all sub drivers. */
ret = component_bind_all(dev, drm_dev);
@@ -218,7 +229,8 @@ static int rockchip_drm_load(struct drm_device *drm_dev, 
unsigned long flags)
if (ret)
goto err_vblank_cleanup;
 
-   arm_iommu_release_mapping(mapping);
+   if (is_support_iommu)
+   arm_iommu_release_mapping(mapping);
return 0;
 err_vblank_cleanup:
drm_vblank_cleanup(drm_dev);
@@ -227,9 +239,11 @@ err_kms_helper_poll_fini:
 err_unbind:
component_unbind_all(dev, drm_dev);
 err_detach_device:
-   arm_iommu_detach_device(dev);
+   if (is_support_iommu)
+   arm_iommu_detach_device(dev);
 err_release_mapping:
-   arm_iommu_release_mapping(mapping);
+   if (is_support_iommu)
+   arm_iommu_release_mapping(mapping);
 err_config_cleanup:
drm_mode_config_cleanup(drm_dev);
drm_dev->dev_private = NULL;
@@ -244,7 +258,8 @@ static int rockchip_drm_unload(struct drm_device *drm_dev)
drm_vblank_cleanup(drm_dev);
drm_kms_helper_poll_fini(drm_dev);
component_unbind_all(dev, drm_dev);
-   arm_iommu_detach_device(dev);
+   if (is_support_iommu)
+   arm_iommu_detach_device(dev);
drm_mode_config_cleanup(drm_dev);
drm_dev->dev_private = NULL;
 
@@ -488,6 +503,8 @@ static int rockchip_drm_platform_probe(struct 
platform_device *pdev)
 * works as expected.
 */
for (i = 0;; i++) {
+   struct device_node *iommu;
+
port = of_parse_phandle(np, "ports", i);
if (!port)
break;
@@ -497,6 +514,17 @@ static int rockchip_drm_platform_probe(struct 
platform_device *pdev)
continue;
}
 
+   

Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism

2016-04-28 Thread Linus Torvalds
On Thu, Apr 28, 2016 at 7:57 PM, George Spelvin  wrote:
>
> How many 32-bit platforms need a multiplier that's easy for GCC to
> evaluate via shifts and adds?
>
> Generally, by the time you've got a machine grunty enough to
> need 64 bits, a multiplier is quite affordable.

Probably true.

That said, the whole "use a multiply to do bit shifts and adds" may be
a false economy too. It's a good trick, but it does limit the end
result in many ways: you are limited to (a) only left-shifts and (b)
only addition and subtraction.

The "only left-shifts" means that you will always be in the situation
that you'll then need to use the high bits (so you'll always need that
shift down). And being limited to just the adder tends to mean that
it's harder to get a nice spread of bits - you're basically always
going to have that same carry chain.

Having looked around at other hashes, I suspect we should look at the
ones that do five or six shifts, and a mix of add/sub and xor. And
because they shift the bits around more freely you don't have the
final shift (that ends up being dependent on the size of the target
set).

It really would be lovely to hear that we can just replace
hash_int/long() with a better hash. And I wouldn't get too hung up on
the multiplication trick. I suspect it's not worth it.

  Linus


Re: [PATCH v3 4/6] sched/fair: Remove scale_load_down() for load_avg

2016-04-28 Thread Yuyang Du
On Thu, Apr 28, 2016 at 12:25:32PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 12:12:29PM +0800, Yuyang Du wrote:
> > Currently, load_avg = scale_load_down(load) * runnable%. The extra scaling
> > down of load does not make much sense, because load_avg is primarily THE
> > load and on top of that, we take runnable time into account.
> > 
> > We therefore remove scale_load_down() for load_avg. But we need to
> > carefully consider the overflow risk if load has higher range
> > (2*SCHED_FIXEDPOINT_SHIFT). The only case an overflow may occur due
> > to us is on 64bit kernel with increased load range. In that case,
> > the 64bit load_sum can afford 4251057 (=2^64/47742/88761/1024)
> > entities with the highest load (=88761*1024) always runnable on one
> > single cfs_rq, which may be an issue, but should be fine. Even if this
> > occurs at the end of day, on the condition where it occurs, the
> > load average will not be useful anyway.
> 
> I do feel we need a little more words on the actual ramification of
> overflowing here.
> 
> Yes, having 4m tasks on a single runqueue will be somewhat unlikely, but
> if it happens, then what will the user experience? How long (if ever)
> does it take for numbers to correct themselves etc..

Well, regarding the user experience, that would need a stress-test study.

But if the system can miraculously survive, and we end up in the scenario
where we have a ~0ULL load_sum and the rq suddenly drops to 0 load, it
would take roughly 2 seconds (= 32ms * 64) to converge. This time is the bound.
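
For reference, the arithmetic behind that bound (a quick sketch, not from
the patch itself): the sum halves every 32ms, and a worst-case ~2^64 value
needs 64 halvings before it reaches zero:

#include <stdio.h>

int main(void)
{
	unsigned long long sum = ~0ULL;	/* worst-case saturated load_sum */
	unsigned int halvings = 0;

	while (sum) {
		sum >>= 1;		/* one halving per 32ms half-life */
		halvings++;
	}
	/* prints: 64 halvings -> ~2048 ms */
	printf("%u halvings -> ~%u ms\n", halvings, halvings * 32);
	return 0;
}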

