Re: [PATCH v3 1/9] ASoC: rt5514: Switch to snd_soc_register_codec

2017-08-17 Thread Mark Brown
On Thu, Aug 17, 2017 at 12:44:09PM +0800, Jeffy Chen wrote:
> Currently we are using devm_snd_soc_register_component, which would
> use legacy dai name.

> Switch to snd_soc_register_codec to use dai driver name.

This is the wrong direction to be going in, we are trying to move all
drivers to use component.  Whatever you want to do make components use
it.


signature.asc
Description: PGP signature


Re: [PATCH v2] bpf: Update sysctl documentation to list all supported architectures

2017-08-17 Thread David Miller
From: Michael Ellerman 
Date: Thu, 17 Aug 2017 20:30:39 +1000

> The sysctl documentation states that the JIT is only available on
> x86_64, which is no longer correct.
> 
> Update the list, and break it out to indicate which architectures
> support the cBPF JIT (via HAVE_CBPF_JIT) or the eBPF JIT
> (HAVE_EBPF_JIT).
> 
> Signed-off-by: Michael Ellerman 

Applied, thanks Michael.


Re: [PATCH][net-next] net: hns3: ensure media_type is unitialized

2017-08-17 Thread David Miller
From: Colin King 
Date: Thu, 17 Aug 2017 10:01:07 +0100

> From: Colin Ian King 
> 
> Media type is only set if h->ae_algo->ops->get_media_type is called
> so there is a possibility that media_type is uninitialized when it is
> used in a switch statement.  Fix this by initializing media_type to
> HNAE3_MEDIA_TYPE_UNKNOWN.
> 
> Detected by CoverityScan, CID#1452624("Uninitialized scalar variable")
> 
> Fixes: 496d03e960ae ("net: hns3: Add Ethtool support to HNS3 driver")
> Signed-off-by: Colin Ian King 

Applied.
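The fix described above is the standard pattern of giving a maybe-unset variable an explicit "unknown" default before switching on it. A minimal userspace sketch of that pattern (all names here are hypothetical stand-ins, not the actual hns3 code):

```c
#include <stddef.h>

enum media_type { MEDIA_TYPE_UNKNOWN = 0, MEDIA_TYPE_FIBER, MEDIA_TYPE_COPPER };

struct media_ops {
	void (*get_media_type)(enum media_type *mt);
};

static void get_fiber(enum media_type *mt) { *mt = MEDIA_TYPE_FIBER; }

static const struct media_ops fiber_ops = { .get_media_type = get_fiber };

/* The fix: initialize to the "unknown" value so the switch below is
 * defined even when the optional callback never runs. */
static int classify_port(const struct media_ops *ops)
{
	enum media_type mt = MEDIA_TYPE_UNKNOWN;

	if (ops && ops->get_media_type)
		ops->get_media_type(&mt);

	switch (mt) {
	case MEDIA_TYPE_FIBER:	return 1;
	case MEDIA_TYPE_COPPER:	return 2;
	default:		return 0;
	}
}
```

Without the initializer, the default branch would read an indeterminate value whenever the callback is absent — exactly the Coverity finding.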


Re: [PATCH] liquidio: fix spelling mistake: "interuupt" -> "interrupt"

2017-08-17 Thread David Miller
From: Colin King 
Date: Thu, 17 Aug 2017 09:19:30 +0100

> From: Colin Ian King 
> 
> Trivial fix to spelling mistake in dev_info message
> 
> Signed-off-by: Colin Ian King 

Applied to net-next.


Re: [PATCH] staging: vboxvideo: make drm_fb_helper_funcs const

2017-08-17 Thread Bhumika Goyal
On Thu, Aug 17, 2017 at 10:33 PM, Greg KH  wrote:
> On Tue, Aug 08, 2017 at 09:15:42PM +0530, Bhumika Goyal wrote:
>> Make the structure const as it is only passed to the function
>> drm_fb_helper_prepare and the corresponding argument is of type
>> const.
>> Done using Coccinelle.
>>
>> Signed-off-by: Bhumika Goyal 
>> ---
>>  drivers/staging/vboxvideo/vbox_fb.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Does not apply to my tree at all :(

Okay, I will resend it.

Thanks,
Bhumika
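The constification above works because the only consumer takes the pointer as const; a compact sketch of that relationship, using stand-in types rather than the real DRM structures:

```c
#include <stddef.h>

/* Stand-ins for the DRM types: the consumer, like drm_fb_helper_prepare(),
 * takes the funcs pointer as const and only stores/reads it. */
struct fb_helper_funcs {
	int (*fb_probe)(void);
};

struct fb_helper {
	const struct fb_helper_funcs *funcs;
};

static void fb_helper_prepare(struct fb_helper *helper,
			      const struct fb_helper_funcs *funcs)
{
	helper->funcs = funcs;	/* stored, never written through */
}

static int stub_probe(void) { return 42; }

/* Because every consumer takes const, the definition itself can be
 * const — letting the compiler place it in read-only memory. */
static const struct fb_helper_funcs vbox_fb_helper_funcs = {
	.fb_probe = stub_probe,
};

static int prepare_and_probe(void)
{
	struct fb_helper helper = { NULL };

	fb_helper_prepare(&helper, &vbox_fb_helper_funcs);
	return helper.funcs->fb_probe();
}
```

This is the class of change Coccinelle finds mechanically: a structure only ever passed where const is accepted can itself be declared const.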


Re: [PATCH 2/2] perf, tools: Avoid segfault on alias parse error

2017-08-17 Thread Andi Kleen
> So it will _not_ set to null a member that it doesn't have, i.e. the
> minimal fix is to just have the hunks below, making sure that the error
> field is present in both structs. No need to set
> parse_event_terms->error to anything, it will be set to null since other
> fields are set to something.

Right but I want to see the error too. That is why I added a real 
error handler.

-Andi


Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-08-17 Thread David Miller
From: Dexuan Cui 
Date: Thu, 17 Aug 2017 08:00:29 +

> @@ -73,6 +74,10 @@ struct vmci_transport_recv_pkt_info {
>   struct vmci_transport_packet pkt;
>  };
>  
> +static bool skip_hypervisor_check;
> +module_param(skip_hypervisor_check, bool, 0444);
> +MODULE_PARM_DESC(hot_add, "If set, attempt to load on non-VMware platforms");
> +

I would avoid module parameters at all costs.

It is the worst possible interface for users of your software.

You really need to fundamentally solve the problems related to making
sure the proper modules for the VM actually present on the system get
loaded when necessary rather than adding hacks like this.

Unlike a proper solution, these hacks are ugly but have to stay around
forever once you put them in place.



Re: [PATCH 07/10] cpufreq: dt: Add support for some new Allwinner SoCs

2017-08-17 Thread Chen-Yu Tsai
On Tue, Aug 15, 2017 at 8:39 PM, Chen-Yu Tsai  wrote:
> On Tue, Aug 15, 2017 at 8:25 PM, Rafael J. Wysocki  wrote:
>> On Tuesday, August 15, 2017 7:42:19 AM CEST Chen-Yu Tsai wrote:
>>> On Mon, Jul 24, 2017 at 7:46 PM, Rafael J. Wysocki  
>>> wrote:
>>> > On Monday, July 24, 2017 02:23:49 PM Viresh Kumar wrote:
>>> >> On 23-07-17, 18:27, Icenowy Zheng wrote:
>>> >> > Some new Allwinner SoCs get supported in the kernel after the
>>> >> > compatibles are added to cpufreq-dt-platdev driver.
>>> >> >
>>> >> > Add their compatible strings in the cpufreq-dt-platdev driver.
>>> >> >
>>> >> > Cc: "Rafael J. Wysocki" 
>>> >> > Cc: Viresh Kumar 
>>> >> > Signed-off-by: Icenowy Zheng 
>>> >> > ---
>>> >> >  drivers/cpufreq/cpufreq-dt-platdev.c | 6 ++
>>> >> >  1 file changed, 6 insertions(+)
>>> >> >
>>> >> > diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c 
>>> >> > b/drivers/cpufreq/cpufreq-dt-platdev.c
>>> >> > index 2eb40d46d357..c3851453e84a 100644
>>> >> > --- a/drivers/cpufreq/cpufreq-dt-platdev.c
>>> >> > +++ b/drivers/cpufreq/cpufreq-dt-platdev.c
>>> >> > @@ -24,7 +24,11 @@ static const struct of_device_id machines[] 
>>> >> > __initconst = {
>>> >> > { .compatible = "allwinner,sun8i-a23", },
>>> >> > { .compatible = "allwinner,sun8i-a33", },
>>> >> > { .compatible = "allwinner,sun8i-a83t", },
>>> >> > +   { .compatible = "allwinner,sun8i-h2-plus", },
>>> >> > { .compatible = "allwinner,sun8i-h3", },
>>> >> > +   { .compatible = "allwinner,sun8i-v3s", },
>>> >> > +   { .compatible = "allwinner,sun50i-a64", },
>>> >> > +   { .compatible = "allwinner,sun50i-h5", },
>>> >> >
>>> >> > { .compatible = "apm,xgene-shadowcat", },
>>> >> >
>>> >> > @@ -43,6 +47,8 @@ static const struct of_device_id machines[] 
>>> >> > __initconst = {
>>> >> > { .compatible = "marvell,pxa250", },
>>> >> > { .compatible = "marvell,pxa270", },
>>> >> >
>>> >> > +   { .compatible = "nextthing,gr8", },
>>> >> > +
>>> >> > { .compatible = "samsung,exynos3250", },
>>> >> > { .compatible = "samsung,exynos4210", },
>>> >> > { .compatible = "samsung,exynos4212", },
>>> >>
>>> >> Acked-by: Viresh Kumar 
>>> >>
>>> >>
>>> >
>>> > OK
>>> >
>>> > I'm assuming this will go in via arm-soc along with the rest of the 
>>> > series.
>>>
>>> The rest of the series, except for the clk patches I've already taken,
>>> doesn't seem to be going anywhere just yet. Is there a cpufreq or pm tree
>>> this can be merged through?
>>
>> You mean this particular patch or the whole series?
>>
>> If the former, then yes, there is.
>
> Just this patch. As I mentioned the rest of the series still needs some
> work, and likely won't make it in this cycle. If you could take this
> through the tree it would normally go through, avoiding any conflicts,
> that'd be great.

This patch has been obsoleted by Viresh's "cpufreq: dt-platdev:
Automatically create cpufreq device with OPP v2" patch.

ChenYu
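The whitelist being patched is a sentinel-terminated array of compatible strings; a toy sketch of how such a table is matched (simplified stand-in, not the real of_match_node() machinery):

```c
#include <string.h>

/* Toy version of the cpufreq-dt-platdev machines[] table: entries are
 * matched against the root node's compatible string, and the list ends
 * with an empty sentinel entry. */
struct toy_of_device_id {
	const char *compatible;
};

static const struct toy_of_device_id machines[] = {
	{ .compatible = "allwinner,sun8i-h3" },
	{ .compatible = "allwinner,sun50i-a64" },
	{ /* sentinel */ }
};

static int machine_whitelisted(const char *compat)
{
	const struct toy_of_device_id *id;

	for (id = machines; id->compatible; id++)
		if (!strcmp(id->compatible, compat))
			return 1;
	return 0;
}
```

Adding a SoC therefore means adding one entry before the sentinel — which is why the patch is a few one-line insertions.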


Re: [PATCH] staging: vboxvideo: make drm_fb_helper_funcs const

2017-08-17 Thread Greg KH
On Tue, Aug 08, 2017 at 09:15:42PM +0530, Bhumika Goyal wrote:
> Make the structure const as it is only passed to the function
> drm_fb_helper_prepare and the corresponding argument is of type
> const.
> Done using Coccinelle.
> 
> Signed-off-by: Bhumika Goyal 
> ---
>  drivers/staging/vboxvideo/vbox_fb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Does not apply to my tree at all :(


Re: [PATCH 2/2] perf, tools: Avoid segfault on alias parse error

2017-08-17 Thread Andi Kleen
On Thu, Aug 17, 2017 at 01:25:39PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Thu, Aug 17, 2017 at 08:34:22AM -0700, Andi Kleen escreveu:
> > > Humm, but don't we have that checked?
> > 
> > At least not in the case of the segfault below.
> 
> Again:
> 
> tools/perf/util/parse-events.c
> 
> 2523 void parse_events_evlist_error(struct parse_events_evlist *data,
> 2524int idx, const char *str)
> 2525 {
> 2526 struct parse_events_error *err = data->error;
> 2527 
> 2528 if (!err)
> 2529 return;
> 2530 err->idx = idx;
> 2531 err->str = strdup(str);
> 2532 WARN_ONCE(!err->str, "WARNING: failed to allocate error string");
> 2533 }
> 
> data->error _is_ being checked for NULL, and the hunk you added would

In the crash, data->error is not NULL, it's random stack contents
because the data types for the data pointer do not match
(parse_events_evlist vs parse_events_terms)

-Andi
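The failure mode Andi describes — a NULL check passing on garbage — comes from two differently-sized structs being accessed through one pointer type. A minimal sketch of the layout mismatch (simplified stand-in structs, not the actual perf definitions):

```c
#include <stddef.h>

/* Two layouts standing in for parse_events_evlist vs parse_events_terms:
 * only the first has an 'error' member, and the buggy path handed the
 * smaller struct to code expecting the larger one. */
struct evlist_data {
	void *evlist;
	void *error;
};

struct terms_data {
	void *terms;	/* no 'error' member at all */
};

/* Reading data->error through the wrong type dereferences memory past
 * the end of struct terms_data — whatever happens to be on the stack —
 * which is why the existing NULL check never triggered. */
static size_t bytes_past_end(void)
{
	return offsetof(struct evlist_data, error) + sizeof(void *)
	       - sizeof(struct terms_data);
}
```

The read lands strictly beyond the smaller struct on any pointer width, so the "error pointer" is indeterminate rather than reliably NULL.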



Re: [PATCH 2/2] cpufreq: dt-platdev: Drop few entries from whitelist

2017-08-17 Thread Chen-Yu Tsai
On Wed, Aug 16, 2017 at 7:24 PM, Viresh Kumar  wrote:
> On 16-08-17, 16:53, Chen-Yu Tsai wrote:
>> On Wed, Aug 16, 2017 at 1:37 PM, Viresh Kumar  
>> wrote:
>> > Drop few ARM (32 and 64 bit) platforms from the whitelist which always
>> > use "operating-points-v2" property from their DT. They should continue
>> > to work after this patch.
>> >
>> > Tested on Hikey platform (only the "hisilicon,hi6220" entry).
>> >
>> > Signed-off-by: Viresh Kumar 
>> > ---
>> >  drivers/cpufreq/cpufreq-dt-platdev.c | 11 ---
>> >  1 file changed, 11 deletions(-)
>> >
>> > diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c 
>> > b/drivers/cpufreq/cpufreq-dt-platdev.c
>> > index 061b468512a2..45f2ec3b7f7a 100644
>> > --- a/drivers/cpufreq/cpufreq-dt-platdev.c
>> > +++ b/drivers/cpufreq/cpufreq-dt-platdev.c
>> > @@ -27,7 +27,6 @@ static const struct of_device_id whitelist[] __initconst 
>> > = {
>> > { .compatible = "allwinner,sun6i-a31s", },
>> > { .compatible = "allwinner,sun7i-a20", },
>> > { .compatible = "allwinner,sun8i-a23", },
>> > -   { .compatible = "allwinner,sun8i-a33", },
>> > { .compatible = "allwinner,sun8i-a83t", },
>> > { .compatible = "allwinner,sun8i-h3", },
>> >
>>
>> Acked-by: Chen-Yu Tsai 
>>
>> In fact, we can drop all the sun8i entries. Of them, only sun8i-a33 has
>> cpufreq implemented. All the other ones are missing thermal sensor support,
>> and no OPP tables have been added yet.
>
> Then, why did you guys add all those SoCs there ? :)

The idea was to add all the SoCs to the cpufreq driver first, and sort out
the thermal sensor and device tree parts later. Obviously this progressed
very slowly.

ChenYu


Re: [PATCH] PCI: Allow PCI express root ports to find themselves

2017-08-17 Thread Bjorn Helgaas
On Thu, Aug 17, 2017 at 01:06:14PM +0200, Thierry Reding wrote:
> From: Thierry Reding <tred...@nvidia.com>
> 
> If the pci_find_pcie_root_port() function is called on a root port
> itself, return the root port rather than NULL.
> 
> This effectively reverts commit 0e405232871d6 ("PCI: fix oops when
> try to find Root Port for a PCI device") which added an extra check
> that would now be redundant.
> 
> Fixes: a99b646afa8a ("PCI: Disable PCIe Relaxed Ordering if unsupported")
> Fixes: c56d4450eb68 ("PCI: Turn off Request Attributes to avoid Chelsio T5 
> Completion erratum")
> Signed-off-by: Thierry Reding <tred...@nvidia.com>

Acked-by: Bjorn Helgaas <bhelg...@google.com>

I *think* this should work for everybody, but I can't test it personally.

> ---
> This applies on top of and was tested on next-20170817.
> 
> Michael, it'd be great if you could test this one again to clarify
> whether or not the fix that's already in Linus' tree is still needed, or
> whether it's indeed obsoleted by this patch.
> 
>  drivers/pci/pci.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b05c587e335a..dd56c1c05614 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -514,7 +514,7 @@ EXPORT_SYMBOL(pci_find_resource);
>   */
>  struct pci_dev *pci_find_pcie_root_port(struct pci_dev *dev)
>  {
> - struct pci_dev *bridge, *highest_pcie_bridge = NULL;
> + struct pci_dev *bridge, *highest_pcie_bridge = dev;
>  
>   bridge = pci_upstream_bridge(dev);
>   while (bridge && pci_is_pcie(bridge)) {
> @@ -522,11 +522,10 @@ struct pci_dev *pci_find_pcie_root_port(struct pci_dev 
> *dev)
>   bridge = pci_upstream_bridge(bridge);
>   }
>  
> - if (highest_pcie_bridge &&
> - pci_pcie_type(highest_pcie_bridge) == PCI_EXP_TYPE_ROOT_PORT)
> - return highest_pcie_bridge;
> + if (pci_pcie_type(highest_pcie_bridge) != PCI_EXP_TYPE_ROOT_PORT)
> + return NULL;
>  
> - return NULL;
> + return highest_pcie_bridge;
>  }
>  EXPORT_SYMBOL(pci_find_pcie_root_port);
>  
> -- 
> 2.13.3
> 
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
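The behavioral change in the diff is that the walk now seeds its result with the device itself. A self-contained sketch of the patched logic over a toy device tree (toy types, not the real struct pci_dev):

```c
#include <stddef.h>

enum toy_type { TOY_ENDPOINT, TOY_UPSTREAM_PORT, TOY_ROOT_PORT };

struct toy_pci_dev {
	enum toy_type type;
	struct toy_pci_dev *upstream;	/* NULL at the top of the hierarchy */
};

/* Mirrors the patched pci_find_pcie_root_port(): 'highest' starts at the
 * device itself, so a root port now resolves to itself instead of
 * falling out of the loop as NULL. */
static struct toy_pci_dev *toy_find_root_port(struct toy_pci_dev *dev)
{
	struct toy_pci_dev *bridge, *highest = dev;

	for (bridge = dev->upstream; bridge; bridge = bridge->upstream)
		highest = bridge;

	if (highest->type != TOY_ROOT_PORT)
		return NULL;
	return highest;
}

/* Tiny fixture: an endpoint behind a root port. */
static struct toy_pci_dev toy_root = { TOY_ROOT_PORT, NULL };
static struct toy_pci_dev toy_ep   = { TOY_ENDPOINT, &toy_root };
```

With the pre-patch initialization to NULL, calling this on the root port itself would return NULL — the oops-provoking case the reverted check papered over.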


Re: [tip:x86/build 2/2] kernel/bounds.c:1: error: -mpreferred-stack-boundary=3 is not between 4 and 12

2017-08-17 Thread Matthias Kaehlcke
El Thu, Aug 17, 2017 at 10:13:20PM +0800 kbuild test robot ha dit:

> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/build
> head:   8f91869766c00622b2eaa8ee567db4f333b78c1a
> commit: 8f91869766c00622b2eaa8ee567db4f333b78c1a [2/2] x86/build: Fix stack 
> alignment for CLang
> config: x86_64-randconfig-b0-08172108 (attached as .config)
> compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
> reproduce:
> git checkout 8f91869766c00622b2eaa8ee567db4f333b78c1a
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
> >> kernel/bounds.c:1: error: -mpreferred-stack-boundary=3 is not between 4 
> >> and 12

To be honest I was slightly concerned about not checking the actual
option that is applied, but expected it to be ok.

Though we can't use cc-option directly to chose between the gcc and
the clang option in this case we can use it to make sure only valid
parameters are applied. I'll prepare a fix for the fix.

Matthias


Re: [PATCH] iommu: Avoid NULL group dereference

2017-08-17 Thread Robin Murphy
On 17/08/17 16:41, Joerg Roedel wrote:
> On Thu, Aug 17, 2017 at 11:40:08AM +0100, Robin Murphy wrote:
>> The recently-removed FIXME in iommu_get_domain_for_dev() turns out to
>> have been a little misleading, since that check is still worthwhile even
>> when groups *are* universal. We have a few IOMMU-aware drivers which
>> only care whether their device is already attached to an existing domain
>> or not, for which the previous behaviour of iommu_get_domain_for_dev()
>> was ideal, and who now crash if their device does not have an IOMMU.
>>
>> With IOMMU groups now serving as a reliable indicator of whether a
>> device has an IOMMU or not (barring false-positives from VFIO no-IOMMU
>> mode), drivers could arguably do this:
>>
>>  group = iommu_group_get(dev);
>>  if (group) {
>>  domain = iommu_get_domain_for_dev(dev);
>>  iommu_group_put(group);
>>  }
> 
> Okay, so just to check I got it right: Drivers do the above to check
> whether a device is managed by an IOMMU, and that crashes now because
> the 'group == NULL' check was removed?

Indeed - the typical context is network descriptors that don't have
space to store the CPU virtual address of the buffer, so when a packet
arrives the driver has to work backwards from the DMA address, in this
sort of pattern:

addr = desc[idx]->addr;
domain = iommu_get_domain_for_dev(dev);
if (domain)
        addr = iommu_iova_to_phys(domain, addr);
buf = phys_to_virt(addr);

(the GIC thing is similar but in reverse, with a physical address which
may or may not need replacing with an IOVA). Unless we were to change
the interface to be iommu_get_domain_for_group(), I think it makes sense
for it to remain valid to call for any device.

Robin.
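The translate-only-if-behind-an-IOMMU pattern above can be modelled with a toy domain that maps one linear IOVA window (a stand-in for the real iommu_iova_to_phys(), which walks page tables):

```c
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for an IOMMU domain: a single linear IOVA->phys window. */
struct toy_domain {
	uint64_t iova_base, phys_base, size;
};

/* Analogue of iommu_iova_to_phys(); returns 0 for an unmapped address. */
static uint64_t toy_iova_to_phys(const struct toy_domain *d, uint64_t iova)
{
	if (iova < d->iova_base || iova >= d->iova_base + d->size)
		return 0;
	return d->phys_base + (iova - d->iova_base);
}

/* Robin's pattern: translate only when the device sits behind an IOMMU
 * (domain non-NULL); otherwise the DMA address already is physical. */
static uint64_t desc_addr_to_phys(const struct toy_domain *domain,
				  uint64_t addr)
{
	if (domain)
		addr = toy_iova_to_phys(domain, addr);
	return addr;
}

static const struct toy_domain toy_dom = { 0x1000, 0x80000000u, 0x1000 };
```

The point of the thread is precisely that the NULL-domain branch must remain reachable: a device with no IOMMU should fall through with its address untouched, not crash.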


Re: [PATCH v2 1/2] wireless: move prism54 out to staging

2017-08-17 Thread Greg KH
On Mon, Aug 07, 2017 at 03:30:10PM -0700, Luis R. Rodriguez wrote:
> prism54 is deprecated in favor of the p54pci device driver. Although
> only *one soul* had reported issues with it long ago, most Linux
> distributions these days just disable the device driver given the
> conflicts with the PCI IDs of p54pci and the *very* unlikely situation
> of folks really needing this driver anymore.
> 
> Before trying to do away with prism54 once more, stuff it into staging,
> which is our hospice for dying drivers.
> 
> Acked-by: Kalle Valo 
> Signed-off-by: Luis R. Rodriguez 

Did you build this?

I get the following error:

drivers/staging/Kconfig:117: can't open file "drivers/staging/prism54/Kconfig"

Can you fix this and _test it_ and resend it?

thanks,

greg k-h


Re: [PATCH RFC 2/2] KVM: RCU protected dynamic vcpus array

2017-08-17 Thread Paolo Bonzini
On 17/08/2017 18:50, Radim Krčmář wrote:
> 2017-08-17 13:14+0200, David Hildenbrand:
>>> atomic_set(&kvm->online_vcpus, 0);
>>> mutex_unlock(&kvm->lock);
>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>> index c8df733eed41..eb9fb5b493ac 100644
>>> --- a/include/linux/kvm_host.h
>>> +++ b/include/linux/kvm_host.h
>>> @@ -386,12 +386,17 @@ struct kvm_memslots {
>>> int used_slots;
>>>  };
>>>  
>>> +struct kvm_vcpus {
>>> +   u32 online;
>>> +   struct kvm_vcpu *array[];
>>
>> One option could be to simply chunk it:
>>
>> +struct kvm_vcpus {
>> +   struct kvm_vcpu vcpus[32];
> 
> I'm thinking of 128/256.
> 
>> +};
>> +
>>  /*
>>   * Note:
>>   * memslots are not sorted by id anymore, please use id_to_memslot()
>> @@ -391,7 +395,7 @@ struct kvm {
>> struct mutex slots_lock;
>> struct mm_struct *mm; /* userspace tied to this vm */
>> struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
>> -   struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>> +   struct kvm_vcpus vcpus[(KVM_MAX_VCPUS + 31) / 32];
>> /*
>>  * created_vcpus is protected by kvm->lock, and is incremented
>> @@ -483,12 +487,14 @@ static inline struct kvm_io_bus
>> *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx)
>>
>>
>> 1. make nobody access kvm->vcpus directly (factor out)
>> 2. allocate next chunk if necessary when creating a VCPU and store
>> pointer using WRITE_ONCE
>> 3. use READ_ONCE to test for availability of the current chunk
> 
> We can also use kvm->online_vcpus exactly like we did now.
> 
>> kvm_for_each_vcpu just has to use READ_ONCE to access/test for the right
>> chunk. Pointers never get invalid. No RCU needed. Sleeping in the loop
>> is possible.
> 
> I like this better than SRCU because it keeps the internal code mostly
> intact, even though it is a compromise solution with a tunable.
> (SRCU gives us more protection than we need.)
>
> I'd do this for v2,

Sounds good!

Paolo


Re: warning: objtool: ...: call without frame pointer save/setup; return with modified stack frame

2017-08-17 Thread Josh Poimboeuf
On Thu, Aug 17, 2017 at 04:09:58PM +0300, Meelis Roos wrote:
> Today I noticed the following objtool warnings when compiling 
> 4.13-rc5+git for my Atom 330 PC:
> 
> call without frame pointer save/setup
> return with modified stack frame
> 
> Examples (there are a lot more of them in practice):
> 
> arch/x86/kernel/smpboot.o: warning: objtool: arch_update_cpu_topology()+0x29: call without frame pointer save/setup
> arch/x86/kernel/smpboot.o: warning: objtool: native_smp_prepare_cpus()+0x530: return with modified stack frame
> 
> mm/vmalloc.o: warning: objtool: __get_vm_area()+0x26: call without frame pointer save/setup
> mm/vmalloc.o: warning: objtool: get_vm_area()+0x34: call without frame pointer save/setup
> 
> fs/debugfs/inode.o: warning: objtool: debugfs_initialized()+0x22: call without frame pointer save/setup
> fs/debugfs/inode.o: warning: objtool: debugfs_create_file()+0x20: call without frame pointer save/setup
> fs/debugfs/inode.o: warning: objtool: debugfs_create_file_unsafe()+0x20: call without frame pointer save/setup
> 
> mm/nobootmem.o: warning: objtool: __alloc_memory_core_early()+0xb2: return with modified stack frame

Hi Meelis,

Thanks for reporting this issue.  This has already been fixed in
the tip tree with the following commit:

  5b8de48e82ba ("objtool: Fix '-mtune=atom' decoding support in objtool 2.0")

Ingo, would you be willing to put this patch in the 4.13 urgent queue?

-- 
Josh


Re: [PATCH 02/13] iommu: Introduce Interface for IOMMU TLB Flushing

2017-08-17 Thread Joerg Roedel
Hi Will,

On Thu, Aug 17, 2017 at 05:32:35PM +0100, Will Deacon wrote:
> I really like the idea of this, but I have a couple of questions and
> comments below.

Great, together with the common iova-flush this should make it
possible to solve the performance problems of the dma-iommu code.

> > +int iommu_map_sync(struct iommu_domain *domain, unsigned long iova,
> > +  phys_addr_t paddr, size_t size, int prot)
> > +{
> > +   int ret = iommu_map(domain, iova, paddr, size, prot);
> > +
> > +   iommu_tlb_range_add(domain, iova, size);
> > +   iommu_tlb_sync(domain);
> 
> Many IOMMUs don't need these callbacks on ->map operations, but they won't
> be able to distinguish them easily with this API. Could you add a flags
> parameter or something to the iommu_tlb_* functions, please?

Yeah, this is only needed for virtualized IOMMUs that have a non-present
cache. My idea was to let the iommu-drivers tell the common code whether
the iommu needs it and the code above just checks a flag and omits the
calls to the flush-functions then.

Problem currently is how to get this information from
'struct iommu_device' to 'struct iommu_domain'. As a workaround I
consider a per-domain flag in the iommu drivers which checks whether any
unmap has happened and just do nothing on the flush-call-back if there
were none.

> I think we will struggle to implement this efficiently on ARM SMMUv3. The
> way invalidation works there is that there is a single in-memory command
> queue into which we can put TLB invalidation commands (they are inserted
> under a lock). These are then processed asynchronously by the hardware, and
> you can complete them by inserting a SYNC command and waiting for that to
> be consumed by the SMMU. Sounds like a perfect fit, right?

Yes, it's basically the same way as it works on AMD-Vi and Intel VT-d.

> The problem is that we want to add those invalidation commands as early
> as possible, so that they can be processed by the hardware concurrently
> with us unmapping other pages.

I think that's a bad idea, because then you re-introduce the performance
problems again because everyone will spin on the cmd-queue lock in the
unmap path of the dma-api.

> That means adding the invalidation commands in the ->unmap callback
> and not bothering to implement ->iotlb_range_add callback at all.
> Then, we will do the sync in ->iotlb_sync. This falls apart if
> somebody decides to use iommu_flush_tlb_all(), where we would prefer
> not to insert all of the invalidation commands in unmap and instead
> insert a single invalidate-all command, followed up with a SYNC.

This problem can be solved with the deferred iova flushing code I posted
to the ML. When a queue fills up, iommu_flush_tlb_all() is called and
every entry that was unmapped before can be released. This works well on
x86, are there reasons it wouldn't on ARM?

Regards,

Joerg



Re: [PATCH RFC 2/2] KVM: RCU protected dynamic vcpus array

2017-08-17 Thread Radim Krčmář
2017-08-17 13:14+0200, David Hildenbrand:
> > atomic_set(&kvm->online_vcpus, 0);
> > mutex_unlock(&kvm->lock);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index c8df733eed41..eb9fb5b493ac 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -386,12 +386,17 @@ struct kvm_memslots {
> > int used_slots;
> >  };
> >  
> > +struct kvm_vcpus {
> > +   u32 online;
> > +   struct kvm_vcpu *array[];
> 
> One option could be to simply chunk it:
> 
> +struct kvm_vcpus {
> +   struct kvm_vcpu vcpus[32];

I'm thinking of 128/256.

> +};
> +
>  /*
>   * Note:
>   * memslots are not sorted by id anymore, please use id_to_memslot()
> @@ -391,7 +395,7 @@ struct kvm {
> struct mutex slots_lock;
> struct mm_struct *mm; /* userspace tied to this vm */
> struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
> -   struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
> +   struct kvm_vcpus vcpus[(KVM_MAX_VCPUS + 31) / 32];
> /*
>  * created_vcpus is protected by kvm->lock, and is incremented
> @@ -483,12 +487,14 @@ static inline struct kvm_io_bus
> *kvm_get_bus(struct kvm *kvm, enum kvm_bus idx)
> 
> 
> 1. make nobody access kvm->vcpus directly (factor out)
> 2. allocate next chunk if necessary when creating a VCPU and store
> pointer using WRITE_ONCE
> 3. use READ_ONCE to test for availability of the current chunk

We can also use kvm->online_vcpus exactly like we did now.

> kvm_for_each_vcpu just has to use READ_ONCE to access/test for the right
> chunk. Pointers never get invalid. No RCU needed. Sleeping in the loop
> is possible.

I like this better than SRCU because it keeps the internal code mostly
intact, even though it is a compromise solution with a tunable.
(SRCU gives us more protection than we need.)

It is very similar to (3).  Chunks will usually take memory and have
slower access to the first chunk than an extendable list would, but it is
faster from the third chunk, which is a reasonable deal.

We are just postponing the problem until a higher number of VCPUs, but
then we can extend it into more levels using the same principle and I
think it will still get better trade-offs than SRCU.

I'd do this for v2,

thanks.


Re: [patch-rt] hotplug, hrtimer: Migrate expired/deferred timers during cpu offline

2017-08-17 Thread Sebastian Andrzej Siewior
On 2017-08-14 09:59:48 [+0200], Mike Galbraith wrote:
> On Fri, 2017-08-11 at 10:15 +0200, Mike Galbraith wrote:
> > On Fri, 2017-08-11 at 09:55 +0200, Mike Galbraith wrote:
> > > The below fixes the list debug explosion up.
> > > 
> > > If we do not migrate expired/deferred timers during cpu offline,
> > > ->cb_entry will be corrupted by online initialization of base->expired,
> > > leading to a loud list debug complaint should someone call
> > > __remove_hrtimer() thereafter.
> > > 
> > > Signed-off-by: Mike Galvraith 
> > ahem.b
> 
> (actually, I shouldn't have signed, question being why we now leave
> them lying about when we _apparently_ previously did not)

takedown_cpu() invokes smpboot_park_threads() early, which parks/stops
the ksoftirqd. That means each hrtimer that fires after that and is not
marked as irqsafe won't be processed but just enqueued onto the
->expired list. The timer would be processed once the CPU goes back
online. But not really. The thing is that once the CPU goes back online
the "expired" list head will be initialized and that timer is lost. Once
you try to cancel it, it will remove itself from the expired list and
this is when the list corruption is noticed.
My guess here is that the hotplug rework changed the timing and the bug
is more obvious now: if you cancel the timer before the CPU goes back
online then nothing happens.

> > > ---
> > >  kernel/time/hrtimer.c |   13 +
> > >  1 file changed, 13 insertions(+)
> > > 
> > > --- a/kernel/time/hrtimer.c
> > > +++ b/kernel/time/hrtimer.c
> > > @@ -1802,6 +1802,19 @@ static void migrate_hrtimer_list(struct
> > >*/
> > >   enqueue_hrtimer(timer, new_base);
> > >   }
> > > +
> > > + /*
> > > +  * Finally, migrate any expired timers deferred by RT.
> > > +  */
> > > + while (!list_empty(&old_base->expired)) {
> > > + struct list_head *entry = old_base->expired.next;
> > > +
> > > + timer = container_of(entry, struct hrtimer, cb_entry);
> 
> (oops, forgot to change that back too. [scribble scribble])
> 
> > > + /* XXX: hm, perhaps defer again instead of enqueueing. */
> > > + __remove_hrtimer(timer, old_base, HRTIMER_STATE_ENQUEUED, 0);
> > > + timer->base = new_base;
> > > + enqueue_hrtimer(timer, new_base);

__remove_hrtimer() shouldn't be required because it has been done
already. It should be enough to just list_splice() one list to the
other and raise the softirq afterwards.

> > > + }
> > >  }
> > >  
> > >  int hrtimers_dead_cpu(unsigned int scpu)

Sebastian


Re: [PATCH 4.4 018/101] netfilter: synproxy: fix conntrackd interaction

2017-08-17 Thread Greg Kroah-Hartman
On Thu, Aug 17, 2017 at 07:57:07AM +0200, Stefan Bader wrote:
> On 03.07.2017 15:34, Greg Kroah-Hartman wrote:
> > 4.4-stable review patch.  If anyone has any objections, please let me know.
> 
> We found that pulling below patch into stable trees without also pulling
> 
> commit 9c3f3794926a997b1cab6c42480ff300efa2d162
> Author: Liping Zhang 
> Date:   Sat Mar 25 16:35:29 2017 +0800
> 
> netfilter: nf_ct_ext: fix possible panic after nf_ct_extend_unregister
> 
> will result in a regression, at least in 4.4.y[1]. Stable maintainers who
> picked up below patch might want to consider picking up above fix.

Thanks, I've now picked this one up too.

greg k-h


Re: [PATCH 2/2] perf, tools: Avoid segfault on alias parse error

2017-08-17 Thread Arnaldo Carvalho de Melo
Em Thu, Aug 17, 2017 at 12:28:16PM -0300, Arnaldo Carvalho de Melo escreveu:
> Em Wed, Aug 16, 2017 at 03:02:01PM -0700, Andi Kleen escreveu:
> > +++ b/tools/perf/util/parse-events.h
> > @@ -108,15 +108,17 @@ struct parse_events_error {
> > char *help; /* optional help string */
> >  };
> >  
> > +/* error field must match parse_events_terms */

Argh, this seems to be the oddity that causes these problems, a void
pointer is passed around and sometimes code thinks it is a
struct parse_events_evlist pointer, like in

#line 117 "util/parse-events.y" /* yacc.c:1646  */
{
struct parse_events_evlist *data = _data;

parse_events_update_lists((yyvsp[0].head), &data->list);
}
#line 1503 "/tmp/build/perf/util/parse-events-bison.c" /* yacc.c:1646  */

while sometimes it thinks it is something else, like in:

  case 52: 
#line 496 "util/parse-events.y" /* yacc.c:1646  */ 
{ 
struct parse_events_terms *data = _data; 
data->terms = (yyvsp[0].head); 
} 
#line 1931 "/tmp/build/perf/util/parse-events-bison.c" /* yacc.c:1646  */

So it will _not_ set to null a member that it doesn't have, i.e. the
minimal fix is to just have the hunks below, making sure that the error
field is present in both structs. No need to set
parse_events_terms->error to anything, it will be set to null since other
fields are set to something.

Longer term we need to fix this mess wrt that "_data" void pointer...

- Arnaldo

> >  struct parse_events_evlist {
> > +   struct parse_events_error *error;
> > struct list_head   list;
> > intidx;
> > intnr_groups;
> > -   struct parse_events_error *error;
> > struct perf_evlist*evlist;
> >  };
> >  
> >  struct parse_events_terms {
> > +   struct parse_events_error *error;
> > struct list_head *terms;
> >  };
> >  
> > -- 
> > 2.9.4


[PATCH 3/3] KVM: x86: fix use of L1 MMIO areas in nested guests

2017-08-17 Thread Paolo Bonzini
There is currently some confusion between nested and L1 GPAs.  The
assignment to "direct" in kvm_mmu_page_fault tries to fix that, but
it is not enough.  What this patch does is fence off the MMIO cache
completely when using shadow nested page tables, since we have neither
a GVA nor an L1 GPA to put in the cache.  This also allows some
simplifications in kvm_mmu_page_fault and FNAME(page_fault).

The EPT misconfig likewise does not have an L1 GPA to pass to
kvm_io_bus_write, so that must be skipped for guest mode.

Signed-off-by: Paolo Bonzini 
---
v1->v2: standardize on "nGPA" moniker, replace nested ifs with &&

 arch/x86/kvm/mmu.c | 10 +-
 arch/x86/kvm/paging_tmpl.h |  3 +--
 arch/x86/kvm/vmx.c |  7 ++-
 arch/x86/kvm/x86.h |  6 +-
 4 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a2c592b14617..02f8c507b160 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3596,6 +3596,14 @@ static bool is_shadow_zero_bits_set(struct kvm_mmu *mmu, 
u64 spte, int level)
 
 static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 {
+   /*
+* A nested guest cannot use the MMIO cache if it is using nested
+* page tables, because cr2 is a nGPA while the cache stores L1's
+* physical addresses.
+*/
+   if (mmu_is_nested(vcpu))
+   return false;
+
if (direct)
return vcpu_match_mmio_gpa(vcpu, addr);
 
@@ -4841,7 +4849,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, 
u64 error_code,
 {
int r, emulation_type = EMULTYPE_RETRY;
enum emulation_result er;
-   bool direct = vcpu->arch.mmu.direct_map || mmu_is_nested(vcpu);
+   bool direct = vcpu->arch.mmu.direct_map;
 
/* With shadow page tables, fault_address contains a GVA or nGPA.  */
if (vcpu->arch.mmu.direct_map) {
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3bb90ceeb52d..86b68dc5a649 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -790,8 +790,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
 &map_writable))
return 0;
 
-   if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr,
-   walker.gfn, pfn, walker.pte_access, &r))
+   if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access,
+   &r))
return r;
 
/*
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e2c8b33c35d1..61389ad784e4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6402,8 +6402,13 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
int ret;
gpa_t gpa;
 
+   /*
+* A nested guest cannot optimize MMIO vmexits, because we have an
+* nGPA here instead of the required GPA.
+*/
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-   if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+   if (!is_guest_mode(vcpu) &&
+   !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
trace_kvm_fast_mmio(gpa);
return kvm_skip_emulated_instruction(vcpu);
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 612067074905..113460370a7f 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -90,7 +90,11 @@ static inline u32 bit(int bitno)
 static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
gva_t gva, gfn_t gfn, unsigned access)
 {
-   vcpu->arch.mmio_gva = gva & PAGE_MASK;
+   /*
+* If this is a shadow nested page table, the "GVA" is
+* actually a nGPA.
+*/
+   vcpu->arch.mmio_gva = mmu_is_nested(vcpu) ? 0 : gva & PAGE_MASK;
vcpu->arch.access = access;
vcpu->arch.mmio_gfn = gfn;
vcpu->arch.mmio_gen = kvm_memslots(vcpu->kvm)->generation;
-- 
1.8.3.1


[PATCH 3/3] KVM: x86: fix use of L1 MMIO areas in nested guests

2017-08-17 Thread Paolo Bonzini
There is currently some confusion between nested and L1 GPAs.  The
assignment to "direct" in kvm_mmu_page_fault tries to fix that, but
it is not enough.  What this patch does is fence off the MMIO cache
completely when using shadow nested page tables, since we have neither
a GVA nor an L1 GPA to put in the cache.  This also allows some
simplifications in kvm_mmu_page_fault and FNAME(page_fault).

The EPT misconfig likewise does not have an L1 GPA to pass to
kvm_io_bus_write, so that must be skipped for guest mode.

Signed-off-by: Paolo Bonzini 
---
v1->v2: standardize on "nGPA" moniker, replace nested ifs with &&

 arch/x86/kvm/mmu.c | 10 +-
 arch/x86/kvm/paging_tmpl.h |  3 +--
 arch/x86/kvm/vmx.c |  7 ++-
 arch/x86/kvm/x86.h |  6 +-
 4 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a2c592b14617..02f8c507b160 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3596,6 +3596,14 @@ static bool is_shadow_zero_bits_set(struct kvm_mmu *mmu, 
u64 spte, int level)
 
 static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 {
+   /*
+* A nested guest cannot use the MMIO cache if it is using nested
+* page tables, because cr2 is a nGPA while the cache stores L1's
+* physical addresses.
+*/
+   if (mmu_is_nested(vcpu))
+   return false;
+
if (direct)
return vcpu_match_mmio_gpa(vcpu, addr);
 
@@ -4841,7 +4849,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, 
u64 error_code,
 {
int r, emulation_type = EMULTYPE_RETRY;
enum emulation_result er;
-   bool direct = vcpu->arch.mmu.direct_map || mmu_is_nested(vcpu);
+   bool direct = vcpu->arch.mmu.direct_map;
 
/* With shadow page tables, fault_address contains a GVA or nGPA.  */
if (vcpu->arch.mmu.direct_map) {
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 3bb90ceeb52d..86b68dc5a649 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -790,8 +790,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
addr, u32 error_code,
 _writable))
return 0;
 
-   if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr,
-   walker.gfn, pfn, walker.pte_access, ))
+   if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, 
))
return r;
 
/*
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e2c8b33c35d1..61389ad784e4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6402,8 +6402,13 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
int ret;
gpa_t gpa;
 
+   /*
+* A nested guest cannot optimize MMIO vmexits, because we have an
+* nGPA here instead of the required GPA.
+*/
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-   if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+   if (!is_guest_mode(vcpu) &&
+   !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
trace_kvm_fast_mmio(gpa);
return kvm_skip_emulated_instruction(vcpu);
}
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 612067074905..113460370a7f 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -90,7 +90,11 @@ static inline u32 bit(int bitno)
 static inline void vcpu_cache_mmio_info(struct kvm_vcpu *vcpu,
gva_t gva, gfn_t gfn, unsigned access)
 {
-   vcpu->arch.mmio_gva = gva & PAGE_MASK;
+   /*
+* If this is a shadow nested page table, the "GVA" is
+* actually a nGPA.
+*/
+   vcpu->arch.mmio_gva = mmu_is_nested(vcpu) ? 0 : gva & PAGE_MASK;
vcpu->arch.access = access;
vcpu->arch.mmio_gfn = gfn;
vcpu->arch.mmio_gen = kvm_memslots(vcpu->kvm)->generation;
-- 
1.8.3.1


[PATCH 2/3] KVM: x86: Avoid guest page table walk when gpa_available is set

2017-08-17 Thread Paolo Bonzini
From: Brijesh Singh 

When a guest causes a page fault which requires emulation, the
vcpu->arch.gpa_available flag is set to indicate that cr2 contains a
valid GPA.

Currently, emulator_read_write_onepage() makes use of the gpa_available
flag to avoid a guest page walk for known MMIO regions. Let's not limit
the gpa_available optimization to just MMIO regions. This patch extends
the check to avoid a page walk whenever the gpa_available flag is set.

Signed-off-by: Brijesh Singh 
[Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the
 new code. - Paolo]
Signed-off-by: Paolo Bonzini 
---
v1->v2: standardize on "nGPA" moniker, move gpa_available
assignment to x86.c

 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu.c  |  6 ++
 arch/x86/kvm/svm.c  |  2 --
 arch/x86/kvm/vmx.c  |  4 
 arch/x86/kvm/x86.c  | 20 +++-
 5 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9e4862e0e978..6db0ed9cf59e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -685,8 +685,9 @@ struct kvm_vcpu_arch {
int pending_ioapic_eoi;
int pending_external_vector;
 
-   /* GPA available (AMD only) */
+   /* GPA available */
bool gpa_available;
+   gpa_t gpa_val;
 
/* be preempted when it's in kernel-mode(cpl=0) */
bool preempted_in_kernel;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f7598883920a..a2c592b14617 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4843,6 +4843,12 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, 
u64 error_code,
enum emulation_result er;
bool direct = vcpu->arch.mmu.direct_map || mmu_is_nested(vcpu);
 
+   /* With shadow page tables, fault_address contains a GVA or nGPA.  */
+   if (vcpu->arch.mmu.direct_map) {
+   vcpu->arch.gpa_available = true;
+   vcpu->arch.gpa_val = cr2;
+   }
+
if (unlikely(error_code & PFERR_RSVD_MASK)) {
r = handle_mmio_page_fault(vcpu, cr2, direct);
if (r == RET_MMIO_PF_EMULATE) {
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1fa9ee5660f4..a603d0c14a5a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4236,8 +4236,6 @@ static int handle_exit(struct kvm_vcpu *vcpu)
 
trace_kvm_exit(exit_code, vcpu, KVM_ISA_SVM);
 
-   vcpu->arch.gpa_available = (exit_code == SVM_EXIT_NPF);
-
if (!is_cr_intercept(svm, INTERCEPT_CR0_WRITE))
vcpu->arch.cr0 = svm->vmcb->save.cr0;
if (npt_enabled)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 45fb0ea78ee8..e2c8b33c35d1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6393,9 +6393,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
error_code |= (exit_qualification & 0x100) != 0 ?
   PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
 
-   vcpu->arch.gpa_available = true;
vcpu->arch.exit_qualification = exit_qualification;
-
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
@@ -6410,7 +6408,6 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}
 
-   vcpu->arch.gpa_available = true;
ret = kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
if (ret >= 0)
return ret;
@@ -8644,7 +8641,6 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
u32 vectoring_info = vmx->idt_vectoring_info;
 
trace_kvm_exit(exit_reason, vcpu, KVM_ISA_VMX);
-   vcpu->arch.gpa_available = false;
 
/*
 * Flush logged GPAs PML buffer, this will make dirty_bitmap more
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e10eda86bc7b..3f34d5f3db8d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4657,25 +4657,18 @@ static int emulator_read_write_onepage(unsigned long 
addr, void *val,
 */
if (vcpu->arch.gpa_available &&
emulator_can_use_gpa(ctxt) &&
-   vcpu_is_mmio_gpa(vcpu, addr, exception->address, write) &&
-   (addr & ~PAGE_MASK) == (exception->address & ~PAGE_MASK)) {
-   gpa = exception->address;
-   goto mmio;
+   (addr & ~PAGE_MASK) == (vcpu->arch.gpa_val & ~PAGE_MASK)) {
+   gpa = vcpu->arch.gpa_val;
+   ret = vcpu_is_mmio_gpa(vcpu, addr, gpa, write);
+   } else {
+   ret = vcpu_mmio_gva_to_gpa(vcpu, addr, , exception, write);
}
 
-   ret = vcpu_mmio_gva_to_gpa(vcpu, addr, , exception, write);
-
if (ret < 0)
return X86EMUL_PROPAGATE_FAULT;
-
-   /* For APIC access vmexit */
-   if (ret)
-   goto mmio;
-
-   if (ops->read_write_emulate(vcpu, gpa, val, bytes))
+   if (!ret && 

[PATCH v2 0/3] KVM: MMU: pending MMU and nEPT patches

2017-08-17 Thread Paolo Bonzini
v2 has all the suggestions from David.

Paolo

Brijesh Singh (1):
  KVM: x86: Avoid guest page table walk when gpa_available is set

Paolo Bonzini (3):
  KVM: x86: simplify ept_misconfig
  KVM: x86: fix use of L1 MMIO areas in nested guests
  KVM: x86: fix PML in nested guests

 arch/x86/include/asm/kvm_host.h |  3 ++-
 arch/x86/kvm/mmu.c  | 35 +--
 arch/x86/kvm/mmu.h  | 17 -
 arch/x86/kvm/paging_tmpl.h  |  3 +--
 arch/x86/kvm/svm.c  |  2 --
 arch/x86/kvm/vmx.c  | 28 
 arch/x86/kvm/x86.c  | 20 +++-
 arch/x86/kvm/x86.h  |  6 +-
 8 files changed, 60 insertions(+), 54 deletions(-)

-- 
1.8.3.1




[PATCH 1/3] KVM: x86: simplify ept_misconfig

2017-08-17 Thread Paolo Bonzini
Calling handle_mmio_page_fault() has been unnecessary since commit
e9ee956e311d ("KVM: x86: MMU: Move handle_mmio_page_fault() call to
kvm_mmu_page_fault()", 2016-02-22).

handle_mmio_page_fault() can now be made static.

Signed-off-by: Paolo Bonzini 
---
v1->v2: make the function static.

 arch/x86/kvm/mmu.c | 19 ++-
 arch/x86/kvm/mmu.h | 17 -
 arch/x86/kvm/vmx.c | 13 +++--
 3 files changed, 21 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e721e10afda1..f7598883920a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3648,7 +3648,23 @@ static bool mmio_info_in_cache(struct kvm_vcpu *vcpu, 
u64 addr, bool direct)
return reserved;
 }
 
-int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
+/*
+ * Return values of handle_mmio_page_fault:
+ * RET_MMIO_PF_EMULATE: it is a real mmio page fault, emulate the instruction
+ * directly.
+ * RET_MMIO_PF_INVALID: invalid spte is detected then let the real page
+ * fault path update the mmio spte.
+ * RET_MMIO_PF_RETRY: let CPU fault again on the address.
+ * RET_MMIO_PF_BUG: a bug was detected (and a WARN was printed).
+ */
+enum {
+   RET_MMIO_PF_EMULATE = 1,
+   RET_MMIO_PF_INVALID = 2,
+   RET_MMIO_PF_RETRY = 0,
+   RET_MMIO_PF_BUG = -1
+};
+
+static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 {
u64 spte;
bool reserved;
@@ -4837,6 +4853,7 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, 
u64 error_code,
return 1;
if (r < 0)
return r;
+   /* Must be RET_MMIO_PF_INVALID.  */
}
 
r = vcpu->arch.mmu.page_fault(vcpu, cr2, lower_32_bits(error_code),
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index d7d248a000dd..3ed6192d93b1 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -56,23 +56,6 @@ static inline u64 rsvd_bits(int s, int e)
 void
 reset_shadow_zero_bits_mask(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 
-/*
- * Return values of handle_mmio_page_fault:
- * RET_MMIO_PF_EMULATE: it is a real mmio page fault, emulate the instruction
- * directly.
- * RET_MMIO_PF_INVALID: invalid spte is detected then let the real page
- * fault path update the mmio spte.
- * RET_MMIO_PF_RETRY: let CPU fault again on the address.
- * RET_MMIO_PF_BUG: a bug was detected (and a WARN was printed).
- */
-enum {
-   RET_MMIO_PF_EMULATE = 1,
-   RET_MMIO_PF_INVALID = 2,
-   RET_MMIO_PF_RETRY = 0,
-   RET_MMIO_PF_BUG = -1
-};
-
-int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, bool execonly,
 bool accessed_dirty);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index df8d2f127508..45fb0ea78ee8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6410,17 +6410,10 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
return kvm_skip_emulated_instruction(vcpu);
}
 
-   ret = handle_mmio_page_fault(vcpu, gpa, true);
vcpu->arch.gpa_available = true;
-   if (likely(ret == RET_MMIO_PF_EMULATE))
-   return x86_emulate_instruction(vcpu, gpa, 0, NULL, 0) ==
- EMULATE_DONE;
-
-   if (unlikely(ret == RET_MMIO_PF_INVALID))
-   return kvm_mmu_page_fault(vcpu, gpa, 0, NULL, 0);
-
-   if (unlikely(ret == RET_MMIO_PF_RETRY))
-   return 1;
+   ret = kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
+   if (ret >= 0)
+   return ret;
 
/* It is the real ept misconfig */
WARN_ON(1);
-- 
1.8.3.1





Re: [RFC v2 2/4] iommu/arm-smmu-v3: Add tlbi_on_map option

2017-08-17 Thread Will Deacon
On Fri, Aug 11, 2017 at 03:45:28PM +0200, Eric Auger wrote:
> When running a virtual SMMU in a guest we sometimes need to trap
> all changes to the translation structures. This is especially useful
> for integrating with VFIO. This patch adds a new option that forces
> the IO_PGTABLE_QUIRK_TLBI_ON_MAP to be applied on LPAE page tables.
> 
> TLBI commands can then be trapped.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> v1 -> v2:
> - rebase on v4.13-rc2
> ---
>  Documentation/devicetree/bindings/iommu/arm,smmu-v3.txt | 4 
>  drivers/iommu/arm-smmu-v3.c | 5 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/iommu/arm,smmu-v3.txt 
> b/Documentation/devicetree/bindings/iommu/arm,smmu-v3.txt
> index c9abbf3..ebb85e9 100644
> --- a/Documentation/devicetree/bindings/iommu/arm,smmu-v3.txt
> +++ b/Documentation/devicetree/bindings/iommu/arm,smmu-v3.txt
> @@ -52,6 +52,10 @@ the PCIe specification.
>  devicetree/bindings/interrupt-controller/msi.txt
>for a description of the msi-parent property.
>  
> +- tlbi-on-map   : invalidate caches whenever there is an update of
> +  any remapping structure (updates to not-present or
> +  present entries).
> +

My position on this hasn't changed, so NAK for this patch. If you want to
emulate something outside of the SMMUv3 architecture, please do so, but
don't pretend that it's an SMMUv3.

Will



Re: [PATCH 02/13] iommu: Introduce Interface for IOMMU TLB Flushing

2017-08-17 Thread Will Deacon
Hi Joerg,

I really like the idea of this, but I have a couple of questions and
comments below.

On Thu, Aug 17, 2017 at 02:56:25PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> With the current IOMMU-API the hardware TLBs have to be
> flushed in every iommu_map(), iommu_map_sg(), and
> iommu_unmap() call.
> 
> For unmapping large amounts of address space, as happens
> when a KVM domain with assigned devices is destroyed, this
> causes thousands of unnecessary TLB flushes
> in the IOMMU hardware because the unmap call-back runs for
> every unmapped physical page.
> 
> With the TLB Flush Interface introduced here the need to
> clean the hardware TLBs is removed from the iommu_map/unmap
> functions. Users now have to explicitly call these functions
> to sync the page-table changes to the hardware.
> 
> Three functions are introduced:
> 
>   * iommu_flush_tlb_all() - Flushes all TLB entries
> associated with that
> domain. TLB entries are
> flushed when this function
> returns.
> 
>   * iommu_tlb_range_add() - This will add a given
> range to the flush queue
> for this domain.
> 
>   * iommu_tlb_sync() - Flushes all queued ranges from
>the hardware TLBs. Returns when
>the flush is finished.
> 
> The semantics of this interface are intentionally similar to
> the iommu_gather_ops from the io-pgtable code.
> 
> Additionally, this patch introduces synchronized versions of
> the iommu_map(), iommu_map_sg(), and iommu_unmap()
> functions. They will be used by current users of the
> IOMMU-API, before they are optimized to the unsynchronized
> versions.
> 
> Cc: Alex Williamson 
> Cc: Will Deacon 
> Cc: Robin Murphy 
> Signed-off-by: Joerg Roedel 
> ---
>  drivers/iommu/iommu.c | 26 +
>  include/linux/iommu.h | 80 
> ++-
>  2 files changed, 105 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3f6ea16..816e248 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -527,6 +527,8 @@ static int iommu_group_create_direct_mappings(struct 
> iommu_group *group,
>  
>   }
>  
> + iommu_flush_tlb_all(domain);
> +
>  out:
>   iommu_put_resv_regions(dev, );
>  
> @@ -1556,6 +1558,18 @@ int iommu_map(struct iommu_domain *domain, unsigned 
> long iova,
>  }
>  EXPORT_SYMBOL_GPL(iommu_map);
>  
> +int iommu_map_sync(struct iommu_domain *domain, unsigned long iova,
> +phys_addr_t paddr, size_t size, int prot)
> +{
> + int ret = iommu_map(domain, iova, paddr, size, prot);
> +
> + iommu_tlb_range_add(domain, iova, size);
> + iommu_tlb_sync(domain);

Many IOMMUs don't need these callbacks on ->map operations, but they won't
be able to distinguish them easily with this API. Could you add a flags
parameter or something to the iommu_tlb_* functions, please?

> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_map_sync);
> +
>  size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t 
> size)
>  {
>   size_t unmapped_page, unmapped = 0;
> @@ -1608,6 +1622,18 @@ size_t iommu_unmap(struct iommu_domain *domain, 
> unsigned long iova, size_t size)
>  }
>  EXPORT_SYMBOL_GPL(iommu_unmap);
>  
> +size_t iommu_unmap_sync(struct iommu_domain *domain,
> + unsigned long iova, size_t size)
> +{
> + size_t ret = iommu_unmap(domain, iova, size);
> +
> + iommu_tlb_range_add(domain, iova, size);
> + iommu_tlb_sync(domain);

I think we will struggle to implement this efficiently on ARM SMMUv3. The
way invalidation works there is that there is a single in-memory command
queue into which we can put TLB invalidation commands (they are inserted
under a lock). These are then processed asynchronously by the hardware, and
you can complete them by inserting a SYNC command and waiting for that to
be consumed by the SMMU. Sounds like a perfect fit, right?

The problem is that we want to add those invalidation commands as early
as possible, so that they can be processed by the hardware concurrently
with us unmapping other pages. That means adding the invalidation commands
in the ->unmap callback and not bothering to implement ->iotlb_range_add
callback at all. Then, we will do the sync in ->iotlb_sync. This falls
apart if somebody decides to use iommu_flush_tlb_all(), where we would
prefer not to insert all of the invalidation commands in unmap and instead
insert a single invalidate-all command, followed up with a SYNC.

In other words, we really need the information about the invalidation as
part of the unmap call.

Any ideas?

Will


Re: [PATCH 02/13] iommu: Introduce Interface for IOMMU TLB Flushing

2017-08-17 Thread Will Deacon
Hi Joerg,

I really like the idea of this, but I have a couple of questions and
comments below.

On Thu, Aug 17, 2017 at 02:56:25PM +0200, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> With the current IOMMU-API the hardware TLBs have to be
> flushed in every iommu_map(), iommu_map_sg(), and
> iommu_unmap() call.
> 
> For unmapping large amounts of address space, like it
> happens when a KVM domain with assigned devices is
> destroyed, this causes thousands of unnecessary TLB flushes
> in the IOMMU hardware because the unmap call-back runs for
> every unmapped physical page.
> 
> With the TLB Flush Interface introduced here the need to
> clean the hardware TLBs is removed from the iommu_map/unmap
> functions. Users now have to explicitly call these functions
> to sync the page-table changes to the hardware.
> 
> Three functions are introduced:
> 
>   * iommu_flush_tlb_all() - Flushes all TLB entries
> associated with that
> domain. TLBs entries are
> flushed when this function
> returns.
> 
>   * iommu_tlb_range_add() - This will add a given
> range to the flush queue
> for this domain.
> 
>   * iommu_tlb_sync() - Flushes all queued ranges from
>the hardware TLBs. Returns when
>the flush is finished.
> 
> The semantic of this interface is intentionally similar to
> the iommu_gather_ops from the io-pgtable code.
> 
> Additionally, this patch introduces synchronized versions of
> the iommu_map(), iommu_map_sg(), and iommu_unmap()
> functions. They will be used by current users of the
> IOMMU-API, before they are optimized to the unsynchronized
> versions.
> 
> Cc: Alex Williamson 
> Cc: Will Deacon 
> Cc: Robin Murphy 
> Signed-off-by: Joerg Roedel 
> ---
>  drivers/iommu/iommu.c | 26 +
>  include/linux/iommu.h | 80 
> ++-
>  2 files changed, 105 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 3f6ea16..816e248 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -527,6 +527,8 @@ static int iommu_group_create_direct_mappings(struct 
> iommu_group *group,
>  
>   }
>  
> + iommu_flush_tlb_all(domain);
> +
>  out:
>   iommu_put_resv_regions(dev, );
>  
> @@ -1556,6 +1558,18 @@ int iommu_map(struct iommu_domain *domain, unsigned 
> long iova,
>  }
>  EXPORT_SYMBOL_GPL(iommu_map);
>  
> +int iommu_map_sync(struct iommu_domain *domain, unsigned long iova,
> +phys_addr_t paddr, size_t size, int prot)
> +{
> + int ret = iommu_map(domain, iova, paddr, size, prot);
> +
> + iommu_tlb_range_add(domain, iova, size);
> + iommu_tlb_sync(domain);

Many IOMMUs don't need these callbacks on ->map operations, but they won't
be able to distinguish them easily with this API. Could you add a flags
parameter or something to the iommu_tlb_* functions, please?

> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_map_sync);
> +
>  size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t 
> size)
>  {
>   size_t unmapped_page, unmapped = 0;
> @@ -1608,6 +1622,18 @@ size_t iommu_unmap(struct iommu_domain *domain, 
> unsigned long iova, size_t size)
>  }
>  EXPORT_SYMBOL_GPL(iommu_unmap);
>  
> +size_t iommu_unmap_sync(struct iommu_domain *domain,
> + unsigned long iova, size_t size)
> +{
> + size_t ret = iommu_unmap(domain, iova, size);
> +
> + iommu_tlb_range_add(domain, iova, size);
> + iommu_tlb_sync(domain);

I think we will struggle to implement this efficiently on ARM SMMUv3. The
way invalidation works there is that there is a single in-memory command
queue into which we can put TLB invalidation commands (they are inserted
under a lock). These are then processed asynchronously by the hardware, and
you can complete them by inserting a SYNC command and waiting for that to
be consumed by the SMMU. Sounds like a perfect fit, right?

The problem is that we want to add those invalidation commands as early
as possible, so that they can be processed by the hardware concurrently
with us unmapping other pages. That means adding the invalidation commands
in the ->unmap callback and not bothering to implement ->iotlb_range_add
callback at all. Then, we will do the sync in ->iotlb_sync. This falls
apart if somebody decides to use iommu_flush_tlb_all(), where we would
prefer not to insert all of the invalidation commands in unmap and instead
insert a single invalidate-all command, followed up with a SYNC.

In other words, we really need the information about the invalidation as
part of the unmap call.
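The command-queue flow described above can be sketched as a tiny userspace model (all names are hypothetical and purely illustrate the ordering, not the SMMUv3 driver): ->unmap queues a per-range invalidation command immediately, ->iotlb_sync emits one SYNC that completes everything queued so far, and flush-all collapses to a single invalidate-all plus SYNC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the SMMUv3 flow described above.
 * Names are illustrative, not the real driver API. */

enum cmd_type { CMD_INV_RANGE, CMD_INV_ALL, CMD_SYNC };

struct cmd {
	enum cmd_type type;
	unsigned long iova;
	size_t size;
};

#define QUEUE_DEPTH 64

struct cmdq {
	struct cmd cmds[QUEUE_DEPTH];
	int head;	/* next free slot */
	int consumed;	/* commands "processed by hardware" */
};

static void cmdq_issue(struct cmdq *q, struct cmd c)
{
	assert(q->head < QUEUE_DEPTH);
	q->cmds[q->head++] = c;
}

/* ->unmap: queue the invalidation as early as possible. */
static void model_unmap(struct cmdq *q, unsigned long iova, size_t size)
{
	cmdq_issue(q, (struct cmd){ CMD_INV_RANGE, iova, size });
}

/* ->iotlb_sync: one SYNC completes everything queued so far. */
static void model_sync(struct cmdq *q)
{
	cmdq_issue(q, (struct cmd){ CMD_SYNC, 0, 0 });
	q->consumed = q->head;	/* "wait" until hardware catches up */
}

/* flush_tlb_all: prefer a single invalidate-all + SYNC instead of
 * replaying every per-range invalidation command. */
static void model_flush_all(struct cmdq *q)
{
	cmdq_issue(q, (struct cmd){ CMD_INV_ALL, 0, 0 });
	model_sync(q);
}
```

The tension in the thread is visible here: once per-range commands are queued eagerly in model_unmap(), a later flush-all cannot retract them and replace them with one invalidate-all.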

Any ideas?

Will


Re: [PATCH][i2c-next] i2c-cht-wc: make cht_wc_i2c_adap_driver static

2017-08-17 Thread Wolfram Sang
On Wed, Aug 16, 2017 at 10:16:59AM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> The structure cht_wc_i2c_adap_driver is local to the source
> and does not need to be in global scope, so make it static.
> 
> Cleans up sparse warning:
> symbol 'cht_wc_i2c_adap_driver' was not declared. Should it be static?
> 
> Signed-off-by: Colin Ian King 

CCing Hans to ask him if he maybe wants to create a MAINTAINERS entry
for this driver, so he'll get CCed for such patches? :)

> ---
>  drivers/i2c/busses/i2c-cht-wc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/i2c/busses/i2c-cht-wc.c b/drivers/i2c/busses/i2c-cht-wc.c
> index fe5caf70c7fe..01c94c918d4c 100644
> --- a/drivers/i2c/busses/i2c-cht-wc.c
> +++ b/drivers/i2c/busses/i2c-cht-wc.c
> @@ -322,7 +322,7 @@ static struct platform_device_id 
> cht_wc_i2c_adap_id_table[] = {
>  };
>  MODULE_DEVICE_TABLE(platform, cht_wc_i2c_adap_id_table);
>  
> -struct platform_driver cht_wc_i2c_adap_driver = {
> +static struct platform_driver cht_wc_i2c_adap_driver = {
>   .probe = cht_wc_i2c_adap_i2c_probe,
>   .remove = cht_wc_i2c_adap_i2c_remove,
>   .driver = {
> -- 
> 2.11.0
> 


signature.asc
Description: PGP signature


Re: [PATCH v4] livepatch: introduce shadow variable API

2017-08-17 Thread Josh Poimboeuf
On Thu, Aug 17, 2017 at 12:01:33PM -0400, Joe Lawrence wrote:
> Without a good real-world example, you've convinced me that
> klp_shadow_update_or_attach() could be dropped.  I think this will also
> simplify the requirements of a shared __klp_shadow_get_or_attach() like
> you sketched out earlier.
> 
> If Josh and Miroslav don't mind, I'd like to continue churning this
> patch with the suggestions that Petr has made.

By all means, go ahead.  I rescind my Ack ;-)

-- 
Josh


[RFC] memory allocations in genalloc

2017-08-17 Thread Igor Stoppa
Foreword:
If I should direct this message to someone else, please let me know.
I couldn't get a clear idea by looking at both MAINTAINERS and git blame.



Hi,

I'm currently trying to convert the SE Linux policy db into using a
protectable memory allocator (pmalloc) that I have developed.

Such an allocator is based on genalloc: I had come up with an
implementation that was pretty similar to what genalloc already does, so
it was pointed out to me that I could have a look at it.

And, indeed, it seemed a perfect choice.

But ... when freeing memory, genalloc requires that the caller also
state how large each specific memory allocation is.

This, per se, is not an issue, although genalloc doesn't seem to check
whether the memory being freed really matches a previous allocation request.

However, this design doesn't sit well with the use case I have in mind.

In particular, when the SE Linux policy db is populated, the creation of
one or more specific entries of the db might fail.
In this case, the memory previously allocated for said entry is
released with kfree, which doesn't need to know the size of the chunk
being freed.

I would like to add a similar capability to genalloc.

genalloc already uses a bitmap to track which words are allocated (1)
and which are free (0).

What I would like to do is to add another bitmap, which would track the
beginning of each individual allocation (1 on the first allocation unit
of each allocation, 0 otherwise).

Such an enhancement would also enable the detection of calls to free
with incorrect / misaligned addresses - right now it is possible to
successfully free a memory area that straddles the boundary of two
adjacent allocations, without fully covering either of them.
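A toy model of the proposed second bitmap (hypothetical names, not the genalloc code): marking the first unit of each allocation lets free() both validate the passed address and recover the allocation size on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the proposal above: alongside the existing used/free
 * bitmap, a second "boundary" bitmap marks the first unit of every
 * allocation. Illustrative only -- not the actual genalloc code. */

#define POOL_UNITS 64

static bool used[POOL_UNITS];	/* 1 = unit allocated */
static bool first[POOL_UNITS];	/* 1 = first unit of an allocation */

static int toy_alloc(size_t units)
{
	size_t run = 0;

	for (int i = 0; i < POOL_UNITS; i++) {
		run = used[i] ? 0 : run + 1;
		if (run == units) {
			int start = i - (int)units + 1;

			for (int j = start; j <= i; j++)
				used[j] = true;
			first[start] = true;
			return start;
		}
	}
	return -1;
}

/* Size-less free: walks from the boundary bit to the next boundary
 * (or the first free unit). Rejects pointers that do not hit a
 * recorded allocation start, catching misaligned frees. */
static int toy_free(int start)
{
	if (start < 0 || start >= POOL_UNITS || !first[start])
		return -1;	/* misaligned or bogus free */

	first[start] = false;
	for (int i = start; i < POOL_UNITS && used[i]; i++) {
		if (i > start && first[i])
			break;	/* next allocation begins here */
		used[i] = false;
	}
	return 0;
}
```

Two bits per allocation unit are enough: free() no longer needs a size argument, and a free aimed at the middle of an allocation fails instead of silently corrupting two neighbours.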

Would this change be acceptable?
Is there any better way to achieve what I want?


---

I have also a question wrt the use of spinlocks in genalloc.
Why a spinlock?

Freeing a chunk of memory previously allocated with vmalloc requires
invoking vfree_atomic, instead of vfree, because the list of chunks is
walked with the spinlock held, and vfree can sleep.

Why not use a mutex?


--
TIA, igor


Re: [PATCH] staging: lustre: fix structure size for ARM OABI

2017-08-17 Thread Greg KH
On Wed, Aug 16, 2017 at 05:44:15PM +0300, Cihangir Akturk wrote:
> When building the kernel for the ARM architecture without setting
> CONFIG_AEABI, size of struct lov_user_md_v3 and struct lov_mds_md_v3
> differs, due to different alignment requirements of OABI and EABI.
> 
> Marking the anonymous union within struct lov_user_md_v3 as
> '__packed' solves this issue. Otherwise we get the following
> error:
> 
> drivers/staging/lustre/lustre/lov/lov_pack.c:352:2: note: in expansion
> of macro ‘BUILD_BUG_ON’
>   BUILD_BUG_ON(sizeof(lum) != sizeof(struct lov_mds_md_v3));
> 
> Signed-off-by: Cihangir Akturk 
> ---
>  drivers/staging/lustre/lustre/include/lustre/lustre_user.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

This file is no longer in the tree :(

thanks,

greg k-h
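As a side note, the padding effect the commit message describes can be demonstrated with a minimal, hypothetical pair of structures. The sizes assumed below come from a typical 64-bit ABI rather than the ARM OABI/EABI pair in the patch, but the mechanism - a packed union losing its alignment padding - is the same.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal illustration (not the lustre structures): the same members
 * yield different sizeofs depending on the alignment rules in force,
 * and marking the member union packed pins the layout down. */

struct unpacked {
	char tag;
	union {
		unsigned long long bits;	/* forces 8-byte alignment */
		char pool[8];
	} u;					/* preceded by padding */
};

struct packed_layout {
	char tag;
	union {
		unsigned long long bits;
		char pool[8];
	} __attribute__((packed)) u;		/* alignment 1: no padding */
};
```

Without the attribute, a BUILD_BUG_ON comparing the two layouts across ABIs with different alignment rules (as OABI vs EABI for 64-bit types) trips exactly as in the quoted error.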


Re: [PATCH 2/2] perf, tools: Avoid segfault on alias parse error

2017-08-17 Thread Arnaldo Carvalho de Melo
On Thu, Aug 17, 2017 at 08:34:22AM -0700, Andi Kleen wrote:
> > Humm, but don't we have that checked?
> 
> At least not in the case of the segfault below.

Again:

tools/perf/util/parse-events.c

2523 void parse_events_evlist_error(struct parse_events_evlist *data,
2524int idx, const char *str)
2525 {
2526 struct parse_events_error *err = data->error;
2527 
2528 if (!err)
2529 return;
2530 err->idx = idx;
2531 err->str = strdup(str);
2532 WARN_ONCE(!err->str, "WARNING: failed to allocate error string");
2533 }

data->error _is_ being checked for NULL, and the hunk you added would
just change data->error from NULL (some other member is being assigned,
so all the others left unassigned will be set to zero) to something
that will then be set, only to be ignored.

Looking at the intervening functions in the .y files...

- Arnaldo
 
> > > Segfault:
> > > 
> > > #'0  0x004d66d2 in parse_events_evlist_error 
> > > (data=0x7fffce20, idx=6, str=0x76cda8 "parser error") at 
> > > util/parse-events.c:2530
> > > #'1  0x00516d0a in parse_events_error (loc=0x7fffb210, 
> > > data=0x7fffce20, scanner=0x245b400, msg=0x76cb13 "syntax error") at 
> > > util/parse-events.y:692
> > > #'2  0x0051675b in parse_events_parse (_data=0x7fffce20, 
> > > scanner=0x245b400) at /home/ak/hle/obj-perf/util/parse-events-bison.c:2213
> > > #'3  0x004d3fd6 in parse_events__scanner (str=0x725cff 
> > > "event=0,", data=0x7fffce20, start_token=259) at 
> > > util/parse-events.c:1646
> > > #'4  0x004d4063 in parse_events_terms (terms=0x245b398, 
> > > str=0x725cff "event=0,") at util/parse-events.c:1664
> > > #'5  0x005179f1 in __perf_pmu__new_alias 
> > > (list=0x7fffcf90, dir=0x0, name=0x725cec "unc_cha_clockticks", 
> > > desc=0x725d08 "Uncore cache clock ticks. Unit: uncore_cha ",
> > > val=0x725cff "event=0,", long_desc=0x0, topic=0x725d34 "uncore 
> > > other", unit=0x0, perpkg=0x6ca7c6 "1", metric_expr=0x0, metric_name=0x0) 
> > > at util/pmu.c:255
> > > #'6  0x00518789 in pmu_add_cpu_aliases (head=0x7fffcf90, 
> > > name=0x2450903 "uncore_cha_9") at util/pmu.c:571
> > > #'7  0x005188ac in pmu_lookup (name=0x2450903 "uncore_cha_9") 
> > > at util/pmu.c:613
> > > #'8  0x00518aff in perf_pmu__find (name=0x2450903 
> > > "uncore_cha_9") at util/pmu.c:672
> > > #'9  0x005183d5 in pmu_read_sysfs () at util/pmu.c:467
> > > #'10 0x00518a54 in perf_pmu__scan (pmu=0x0) at util/pmu.c:651
> > > #'11 0x00519f26 in print_pmu_events (event_glob=0x0, 
> > > name_only=false, quiet_flag=false, long_desc=false, details_flag=false) 
> > > at util/pmu.c:1173
> > > #'12 0x004d5ef0 in print_events (event_glob=0x0, 
> > > name_only=false, quiet_flag=false, long_desc=false, details_flag=false) 
> > > at util/parse-events.c:2343
> > > #'13 0x0043c7d4 in cmd_list (argc=0, argv=0x7fffeb90) at 
> > > builtin-list.c:56
> > > #'14 0x004ab2c8 in run_builtin (p=0xa281a0 , 
> > > argc=1, argv=0x7fffeb90) at perf.c:296
> > > #15 0x004ab535 in handle_internal_command (argc=1, 
> > > argv=0x7fffeb90) at perf.c:348
> > > #16 0x004ab687 in run_argv (argcp=0x7fffe9ec, 
> > > argv=0x7fffe9e0) at perf.c:392
> > > #17 0x004aba55 in main (argc=1, argv=0x7fffeb90) at 
> > > perf.c:530
> > > 


Re: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Linus Torvalds
On Thu, Aug 17, 2017 at 9:17 AM, Liang, Kan  wrote:
>
> We tried both 12 and 16 bit table and that didn't make a difference.
> The long wake ups are mostly on the same page when we do instrumentation

Ok.

> Here is the wake_up_page_bit call stack when the workaround is running, which
> is collected by perf record -g -a -e probe:wake_up_page_bit -- sleep 10

It's actually not really wake_up_page_bit() that is all that
interesting, it would be more interesting to see which path it is that
*adds* the entries.

So it's mainly wait_on_page_bit_common(), but also add_page_wait_queue().

Can you get that call stack instead (or in addition to)?

  Linus


Re: [PATCH v2] blktrace: Fix potential deadlock between delete & sysfs ops

2017-08-17 Thread Waiman Long
On 08/17/2017 09:34 AM, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 16:40:40 -0400
> Waiman Long  wrote:
>
>> The lockdep code had reported the following unsafe locking scenario:
>>
>>        CPU0                    CPU1
>>        ----                    ----
>>   lock(s_active#228);
>>                                lock(&bdev->bd_mutex/1);
>>                                lock(s_active#228);
>>   lock(&bdev->bd_mutex);
> Can you show the exact locations of these locks. I have no idea where
> this "s_active" is.
The s_active isn't an actual lock. It is a reference count (kn->count)
on the sysfs (kernfs) file. Removal of a sysfs file, however, requires
waiting until all the references are gone. The reference count is
treated like a rwsem by the lockdep instrumentation code.

>>  *** DEADLOCK ***
>>
>> The deadlock may happen when one task (CPU1) is trying to delete
>> a partition in a block device and another task (CPU0) is accessing
>> tracing sysfs file in that partition.
>>
>> To avoid that, accessing tracing sysfs file will now use a mutex
>> trylock loop and the operation will fail if a delete operation is
>> in progress.
>>
>> Signed-off-by: Waiman Long 
>> ---
>>
>>  v2:
>>- Use READ_ONCE() and smp_store_mb() to read and write bd_deleting.
>>- Check for signal in the mutex_trylock loops.
>>- Use usleep() instead of schedule() for RT tasks.
> I'm sorry but I really do hate this patch.

Any suggestion on how to make it better?

>>  block/ioctl.c   |  3 +++
>>  include/linux/fs.h  |  1 +
>>  kernel/trace/blktrace.c | 39 +--
>>  3 files changed, 41 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/ioctl.c b/block/ioctl.c
>> index 0de02ee..b920329 100644
>> --- a/block/ioctl.c
>> +++ b/block/ioctl.c
>> @@ -86,12 +86,15 @@ static int blkpg_ioctl(struct block_device *bdev, struct 
>> blkpg_ioctl_arg __user
>>  return -EBUSY;
>>  }
>>  /* all seems OK */
>> +smp_store_mb(bdev->bd_deleting, 1);
> No comment to explain what is happening here, and why.

I am going to add a comment block here.

>>  fsync_bdev(bdevp);
>>  invalidate_bdev(bdevp);
>>  
>>  mutex_lock_nested(&bdev->bd_mutex, 1);
>>  delete_partition(disk, partno);
>>  mutex_unlock(&bdev->bd_mutex);
>> +smp_store_mb(bdev->bd_deleting, 0);
>> +
> ditto.
>
>>  mutex_unlock(&bdevp->bd_mutex);
>>  bdput(bdevp);
>>  
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 6e1fd5d..c2ba35e 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -427,6 +427,7 @@ struct block_device {
>>  #endif
>>  struct block_device *   bd_contains;
>>  unsignedbd_block_size;
>> +int bd_deleting;
>>  struct hd_struct *  bd_part;
>>  /* number of times partitions within this device have been opened. */
>>  unsignedbd_part_count;
>> diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
>> index bc364f8..b2dffa9 100644
>> --- a/kernel/trace/blktrace.c
>> +++ b/kernel/trace/blktrace.c
>> @@ -27,6 +27,8 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>> +#include 
>>  
>>  #include "../../block/blk.h"
>>  
>> @@ -1605,6 +1607,23 @@ static struct request_queue 
>> *blk_trace_get_queue(struct block_device *bdev)
>>  return bdev_get_queue(bdev);
>>  }
>>  
>> +/*
>> + * Read/write to the tracing sysfs file requires taking references to the
> What's the "tracing sysfs" file? tracefs?

I am referring to the blk_trace sysfs files, which are used to enable
tracing of block operations.

>> + * sysfs file and then acquiring the bd_mutex. Deleting a block device
>> + * requires acquiring the bd_mutex and then waiting for all the sysfs
>> + * references to be gone. This can lead to deadlock if both operations
>> + * happen simultaneously. To avoid this problem, read/write to the
>> + * the tracing sysfs files can now fail if the bd_mutex cannot be
>> + * acquired while a deletion operation is in progress.
>> + *
>> + * A mutex trylock loop is used assuming that tracing sysfs operations
> A mutex trylock loop is not enough to stop a deadlock. But I'm guessing
> the undocumented bd_deleting may prevent that.

Yes, that is what the bd_deleting flag is for.

I was thinking about failing the sysfs operation after a certain number
of trylock attempts, but that would require changes to user space code
to handle the occasional failures. Finally I decided to fail it only
when a delete operation is in progress, as all hope is lost in this case.
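The scheme described here can be modelled in a few lines of userspace C (hypothetical names; the mutex is reduced to an atomic flag, and the signal checking and usleep handling of the real patch are omitted): the accessor spins on trylock but bails out with -EBUSY as soon as the delete-in-progress flag is observed, so it can never deadlock against a deleter that already holds the lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the trylock-loop-plus-flag scheme above.
 * Illustrative only -- not the blktrace code. */

static atomic_bool bd_mutex_held;	/* stands in for bd_mutex */
static atomic_bool bd_deleting;		/* delete-in-progress flag */

static bool mutex_trylock_model(void)
{
	bool expected = false;

	return atomic_compare_exchange_strong(&bd_mutex_held,
					      &expected, true);
}

/* Returns 0 with the "mutex" held, or -EBUSY if a delete is running. */
static int trace_lock_or_fail(void)
{
	while (!mutex_trylock_model()) {
		if (atomic_load(&bd_deleting))
			return -EBUSY;	/* all hope lost: bail out */
		/* real code: check signals, usleep for RT tasks */
	}
	/* Re-check: the flag may have been set just before we got in. */
	if (atomic_load(&bd_deleting)) {
		atomic_store(&bd_mutex_held, false);
		return -EBUSY;
	}
	return 0;
}
```

The re-check after a successful trylock is what makes the flag and the lock compose: without it, an accessor could sneak in between the deleter setting the flag and taking the mutex.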

>> + * aren't frequently enough to cause any contention problem.
>> + *
>> + * For RT tasks, a running high priority task will prevent any lower
>> + * priority RT tasks from being run. So they do need to actually 

Re: [PATCH v2 2/3] i2c: davinci: Add PM Runtime Support

2017-08-17 Thread Wolfram Sang
On Wed, Aug 16, 2017 at 05:17:14PM -0500, Franklin S Cooper Jr wrote:
> 66AK2G has I2C instances that are not a part of the ALWAYS_ON power domain
> like other Keystone 2 SoCs and OMAPL138. Therefore, pm_runtime
> is required to ensure the power domain used by the specific I2C instance is
> properly turned on along with its functional clock.
> 
> Signed-off-by: Franklin S Cooper Jr 

I'd need a review from the driver maintainers here. Which are Sekhar
Nori and Kevin Hilman(?) according to MAINTAINERS. Is that still
correct?

> ---
> Version 2 changes:
> Move initial calls to pm runtime autosuspend before pm_runtime_enable
> 
>  drivers/i2c/busses/i2c-davinci.c | 62 
> ++--
>  1 file changed, 54 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-davinci.c 
> b/drivers/i2c/busses/i2c-davinci.c
> index b8c4353..6b1930d 100644
> --- a/drivers/i2c/busses/i2c-davinci.c
> +++ b/drivers/i2c/busses/i2c-davinci.c
> @@ -36,6 +36,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* - global defines --- */
>  
> @@ -122,6 +123,9 @@
>  /* set the SDA GPIO low */
>  #define DAVINCI_I2C_DCLR_PDCLR1  BIT(1)
>  
> +/* timeout for pm runtime autosuspend */
> +#define DAVINCI_I2C_PM_TIMEOUT   1000/* ms */
> +
>  struct davinci_i2c_dev {
>   struct device   *dev;
>   void __iomem*base;
> @@ -541,10 +545,17 @@ i2c_davinci_xfer(struct i2c_adapter *adap, struct 
> i2c_msg msgs[], int num)
>  
>   dev_dbg(dev->dev, "%s: msgs: %d\n", __func__, num);
>  
> + ret = pm_runtime_get_sync(dev->dev);
> + if (ret < 0) {
> + dev_err(dev->dev, "Failed to runtime_get device: %d\n", ret);
> + pm_runtime_put_noidle(dev->dev);
> + return ret;
> + }
> +
>   ret = i2c_davinci_wait_bus_not_busy(dev);
>   if (ret < 0) {
>   dev_warn(dev->dev, "timeout waiting for bus ready\n");
> - return ret;
> + goto out;
>   }
>  
>   for (i = 0; i < num; i++) {
> @@ -552,14 +563,19 @@ i2c_davinci_xfer(struct i2c_adapter *adap, struct 
> i2c_msg msgs[], int num)
>   dev_dbg(dev->dev, "%s [%d/%d] ret: %d\n", __func__, i + 1, num,
>   ret);
>   if (ret < 0)
> - return ret;
> + goto out;
>   }
>  
> + ret = num;
>  #ifdef CONFIG_CPU_FREQ
>   complete(&dev->xfr_complete);
>  #endif
>  
> - return num;
> +out:
> + pm_runtime_mark_last_busy(dev->dev);
> + pm_runtime_put_autosuspend(dev->dev);
> +
> + return ret;
>  }
>  
>  static u32 i2c_davinci_func(struct i2c_adapter *adap)
> @@ -599,6 +615,9 @@ static irqreturn_t i2c_davinci_isr(int this_irq, void 
> *dev_id)
>   int count = 0;
>   u16 w;
>  
> + if (pm_runtime_suspended(dev->dev))
> + return IRQ_NONE;
> +
>   while ((stat = davinci_i2c_read_reg(dev, DAVINCI_I2C_IVR_REG))) {
>   dev_dbg(dev->dev, "%s: stat=0x%x\n", __func__, stat);
>   if (count++ == 100) {
> @@ -802,7 +821,6 @@ static int davinci_i2c_probe(struct platform_device *pdev)
>   dev->clk = devm_clk_get(&pdev->dev, NULL);
>   if (IS_ERR(dev->clk))
>   return PTR_ERR(dev->clk);
> - clk_prepare_enable(dev->clk);
>  
>   mem = platform_get_resource(pdev, IORESOURCE_MEM, 0);
>   dev->base = devm_ioremap_resource(&pdev->dev, mem);
> @@ -811,6 +829,18 @@ static int davinci_i2c_probe(struct platform_device 
> *pdev)
>   goto err_unuse_clocks;
>   }
>  
> + pm_runtime_set_autosuspend_delay(dev->dev,
> +  DAVINCI_I2C_PM_TIMEOUT);
> + pm_runtime_use_autosuspend(dev->dev);
> +
> + pm_runtime_enable(dev->dev);
> +
> + r = pm_runtime_get_sync(dev->dev);
> + if (r < 0) {
> + dev_err(dev->dev, "failed to runtime_get device: %d\n", r);
> + goto err_unuse_clocks;
> + }
> +
>   i2c_davinci_init(dev);
>  
>   r = devm_request_irq(&pdev->dev, dev->irq, i2c_davinci_isr, 0,
> @@ -849,10 +879,16 @@ static int davinci_i2c_probe(struct platform_device 
> *pdev)
>   if (r)
>   goto err_unuse_clocks;
>  
> + pm_runtime_mark_last_busy(dev->dev);
> + pm_runtime_put_autosuspend(dev->dev);
> +
>   return 0;
>  
>  err_unuse_clocks:
> - clk_disable_unprepare(dev->clk);
> + pm_runtime_dont_use_autosuspend(dev->dev);
> + pm_runtime_put_sync(dev->dev);
> + pm_runtime_disable(dev->dev);
> +
>   dev->clk = NULL;
>   return r;
>  }
> @@ -860,16 +896,26 @@ static int davinci_i2c_probe(struct platform_device 
> *pdev)
>  static int davinci_i2c_remove(struct platform_device *pdev)
>  {
>   struct davinci_i2c_dev *dev = platform_get_drvdata(pdev);
> + int ret;
>  
>   i2c_davinci_cpufreq_deregister(dev);
>  
>   i2c_del_adapter(>adapter);
>  
> - 

Re: [PATCH v2 2/3] i2c: davinci: Add PM Runtime Support

2017-08-17 Thread Wolfram Sang
On Wed, Aug 16, 2017 at 05:17:14PM -0500, Franklin S Cooper Jr wrote:
> 66AK2G has I2C instances that are not a part of the ALWAYS_ON power domain
> like other Keystone 2 SoCs and OMAPL138. Therefore, pm_runtime
> is required to ensure the power domain used by the specific I2C instance is
> properly turned on along with its functional clock.
> 
> Signed-off-by: Franklin S Cooper Jr 

I'd need a review from the driver maintainers here. Which are Sekhar
Nori and Kevin Hilman(?) according to MAINTAINERS. Is that still
correct?

> ---
> Version 2 changes:
> Move initial calls to pm runtime autosuspend before pm_runtime_enable
> 
>  drivers/i2c/busses/i2c-davinci.c | 62 
> ++--
>  1 file changed, 54 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/i2c/busses/i2c-davinci.c 
> b/drivers/i2c/busses/i2c-davinci.c
> index b8c4353..6b1930d 100644
> --- a/drivers/i2c/busses/i2c-davinci.c
> +++ b/drivers/i2c/busses/i2c-davinci.c
> @@ -36,6 +36,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/pm_runtime.h>
>  
>  /* - global defines --- */
>  
> @@ -122,6 +123,9 @@
>  /* set the SDA GPIO low */
>  #define DAVINCI_I2C_DCLR_PDCLR1  BIT(1)
>  
> +/* timeout for pm runtime autosuspend */
> +#define DAVINCI_I2C_PM_TIMEOUT   1000 /* ms */
> +
>  struct davinci_i2c_dev {
>   struct device   *dev;
>   void __iomem*base;
> @@ -541,10 +545,17 @@ i2c_davinci_xfer(struct i2c_adapter *adap, struct 
> i2c_msg msgs[], int num)
>  
>   dev_dbg(dev->dev, "%s: msgs: %d\n", __func__, num);
>  
> + ret = pm_runtime_get_sync(dev->dev);
> + if (ret < 0) {
> + dev_err(dev->dev, "Failed to runtime_get device: %d\n", ret);
> + pm_runtime_put_noidle(dev->dev);
> + return ret;
> + }
> +
>   ret = i2c_davinci_wait_bus_not_busy(dev);
>   if (ret < 0) {
>   dev_warn(dev->dev, "timeout waiting for bus ready\n");
> - return ret;
> + goto out;
>   }
>  
>   for (i = 0; i < num; i++) {
> @@ -552,14 +563,19 @@ i2c_davinci_xfer(struct i2c_adapter *adap, struct 
> i2c_msg msgs[], int num)
>   dev_dbg(dev->dev, "%s [%d/%d] ret: %d\n", __func__, i + 1, num,
>   ret);
>   if (ret < 0)
> - return ret;
> + goto out;
>   }
>  
> + ret = num;
>  #ifdef CONFIG_CPU_FREQ
>   complete(&dev->xfr_complete);
>  #endif
>  
> - return num;
> +out:
> + pm_runtime_mark_last_busy(dev->dev);
> + pm_runtime_put_autosuspend(dev->dev);
> +
> + return ret;
>  }
>  
>  static u32 i2c_davinci_func(struct i2c_adapter *adap)
> @@ -599,6 +615,9 @@ static irqreturn_t i2c_davinci_isr(int this_irq, void 
> *dev_id)
>   int count = 0;
>   u16 w;
>  
> + if (pm_runtime_suspended(dev->dev))
> + return IRQ_NONE;
> +
>   while ((stat = davinci_i2c_read_reg(dev, DAVINCI_I2C_IVR_REG))) {
>   dev_dbg(dev->dev, "%s: stat=0x%x\n", __func__, stat);
>   if (count++ == 100) {
> @@ -802,7 +821,6 @@ static int davinci_i2c_probe(struct platform_device *pdev)
>   dev->clk = devm_clk_get(&pdev->dev, NULL);
>   if (IS_ERR(dev->clk))
>   return PTR_ERR(dev->clk);
> - clk_prepare_enable(dev->clk);
>  
>   mem = platform_get_resource(pdev, IORESOURCE_MEM, 0);
>   dev->base = devm_ioremap_resource(&pdev->dev, mem);
> @@ -811,6 +829,18 @@ static int davinci_i2c_probe(struct platform_device 
> *pdev)
>   goto err_unuse_clocks;
>   }
>  
> + pm_runtime_set_autosuspend_delay(dev->dev,
> +  DAVINCI_I2C_PM_TIMEOUT);
> + pm_runtime_use_autosuspend(dev->dev);
> +
> + pm_runtime_enable(dev->dev);
> +
> + r = pm_runtime_get_sync(dev->dev);
> + if (r < 0) {
> + dev_err(dev->dev, "failed to runtime_get device: %d\n", r);
> + goto err_unuse_clocks;
> + }
> +
>   i2c_davinci_init(dev);
>  
>   r = devm_request_irq(&pdev->dev, dev->irq, i2c_davinci_isr, 0,
> @@ -849,10 +879,16 @@ static int davinci_i2c_probe(struct platform_device 
> *pdev)
>   if (r)
>   goto err_unuse_clocks;
>  
> + pm_runtime_mark_last_busy(dev->dev);
> + pm_runtime_put_autosuspend(dev->dev);
> +
>   return 0;
>  
>  err_unuse_clocks:
> - clk_disable_unprepare(dev->clk);
> + pm_runtime_dont_use_autosuspend(dev->dev);
> + pm_runtime_put_sync(dev->dev);
> + pm_runtime_disable(dev->dev);
> +
>   dev->clk = NULL;
>   return r;
>  }
> @@ -860,16 +896,26 @@ static int davinci_i2c_probe(struct platform_device 
> *pdev)
>  static int davinci_i2c_remove(struct platform_device *pdev)
>  {
>   struct davinci_i2c_dev *dev = platform_get_drvdata(pdev);
> + int ret;
>  
>   i2c_davinci_cpufreq_deregister(dev);
>  
>   i2c_del_adapter(&dev->adapter);
>  
> - clk_disable_unprepare(dev->clk);
> + ret = 

Re: [PATCH v2 01/14] staging: lustre: llite: Remove filtering of seclabel xattr

2017-08-17 Thread Greg Kroah-Hartman
On Mon, Aug 14, 2017 at 12:20:51PM -0400, James Simmons wrote:
> From: Robin Humble 
> 
> The security.capability xattr is used to implement File
> Capabilities in recent Linux versions. Capabilities are a
> fine grained approach to granting executables elevated
> privileges. eg. /bin/ping can have capabilities
> cap_net_admin, cap_net_raw+ep instead of being setuid root.
> 
> This xattr has long been filtered out by llite, initially for
> stability reasons (b15587), and later over performance
> concerns as this xattr is read for every file with eg.
> 'ls --color'. Since LU-2869 xattrs are cached on clients,
> alleviating most performance concerns.
> 
> Removing llite's filtering of the security.capability xattr
> enables using Lustre as a root filesystem, which is used on
> some large clusters.
> 
> Signed-off-by: Robin Humble 
> Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-9562
> Reviewed-on: https://review.whamcloud.com/27292
> Reviewed-by: John L. Hammond 
> Reviewed-by: Sebastien Buisson 
> Reviewed-by: Oleg Drokin 
> Signed-off-by: James Simmons 
> ---
> Changelog:
> 
> v1) Initial submit with wrong patch attached.
> v2) Proper patch this time.

I don't see a v1 patch anywhere...

Anyway, when you do this, please make your subject such that I can sort
the emails properly and they show up in the correct order, so put the
"v2" after the patch number like this:

   Subject: [PATCH 01/14 v2] staging: lustre: llite: Remove filtering of

I think git does that correctly for you automatically if you use it...

thanks,

greg k-h


Re: [PATCH 00/64] staging: lustre: uapi: normalize the lustre headers

2017-08-17 Thread Greg Kroah-Hartman
On Mon, Aug 14, 2017 at 11:46:28AM -0400, James Simmons wrote:
> The headers for lustre/LNet for a long time lacked a clean separation in
> its internal headers which resulted in kernel specific data structures
> being exposed in user land code. This work unravels this mess and creates
> a clear separation between lustre kernel space and lustre user land.
> With this work done the include paths in the lustre kernel code can now
> be normalized.

Nice work, all now applied.

greg k-h


RE: [PATCH 1/2] sched/wait: Break up long wake list walk

2017-08-17 Thread Liang, Kan

> On Mon, Aug 14, 2017 at 5:52 PM, Tim Chen 
> wrote:
> > We encountered workloads that have very long wake up list on large
> > systems. A waker takes a long time to traverse the entire wake list
> > and execute all the wake functions.
> >
> > We saw page wait list that are up to 3700+ entries long in tests of
> > large
> > 4 and 8 socket systems.  It took 0.8 sec to traverse such list during
> > wake up.  Any other CPU that contends for the list spin lock will spin
> > for a long time.  As page wait list is shared by many pages so it
> > could get very long on systems with large memory.
> 
> I really dislike this patch.
> 
> The patch seems a band-aid for really horrible kernel behavior, rather than
> fixing the underlying problem itself.
> 
> Now, it may well be that we do end up needing this band-aid in the end, so
> this isn't a NAK of the patch per se. But I'd *really* like to see if we can 
> fix the
> underlying cause for what you see somehow..
> 
> In particular, if this is about the page wait table, maybe we can just make 
> the
> wait table bigger. IOW, are people actually waiting on the
> *same* page, or are they mainly waiting on totally different pages, just
> hashing to the same wait queue?
> 
> Because right now that page wait table is a small fixed size, and the only
> reason it's a small fixed size is that nobody reported any issues with it -
> particularly since we now avoid the wait table entirely for the common cases
> by having that "contention" bit.
> 
> But it really is a *small* table. We literally have
> 
>#define PAGE_WAIT_TABLE_BITS 8
> 
> so it's just 256 entries. We could easily make it much bigger, if we are
> actually seeing a lot of collisions.
> 
> We *used* to have a very complex per-zone thing for bit-waitqueues, but
> that was because we got lots and lots of contention issues, and everybody
> *always* touched the wait-queues whether they waited or not (so being per-
> zone was a big deal)
> 
> We got rid of all that per-zone complexity when the normal case didn't hit in
> the page wait queues at all, but we may have over-done the simplification a
> bit since nobody showed any issue.
> 
> In particular, we used to size the per-zone thing by amount of memory.
> We could easily re-introduce that for the new simpler page queues.
> 
> The page_waitiqueue() is a simple helper function inside mm/filemap.c, and
> thanks to the per-page "do we have actual waiters" bit that we have now, we
> can actually afford to make it bigger and more complex now if we want to.
> 
> What happens to your load if you just make that table bigger? You can
> literally test by just changing the constant from 8 to 16 or something, making
> us use twice as many bits for hashing. A "real"
> patch would size it by amount of memory, but just for testing the contention
> on your load, you can do the hacky one-liner.

Hi Linus,

We tried both 12- and 16-bit tables and that didn't make a difference.
The long wake-ups are mostly on the same page when we do instrumentation.

Here is the wake_up_page_bit call stack when the workaround is running, which
is collected by perf record -g -a -e probe:wake_up_page_bit -- sleep 10

# To display the perf.data header info, please use --header/--header-only 
options.
#
#
# Total Lost Samples: 0
#
# Samples: 374  of event 'probe:wake_up_page_bit'
# Event count (approx.): 374
#
# Overhead  Trace output  
#   ..
#
   100.00%  (ae1ad000)
|
---wake_up_page_bit
   |  
   |--49.73%--migrate_misplaced_transhuge_page
   |  do_huge_pmd_numa_page
   |  __handle_mm_fault
   |  handle_mm_fault
   |  __do_page_fault
   |  do_page_fault
   |  page_fault
   |  |  
   |  |--28.07%--0x2b7b7
   |  |  |  
   |  |  |--13.64%--0x127a2
   |  |  |  0x7fb5247eddc5
   |  |  |  
   |  |  |--13.37%--0x127d8
   |  |  |  0x7fb5247eddc5
   |  |  |  
   |  |  |--0.53%--0x1280e
   |  |  |  0x7fb5247eddc5
   |  |  |  
   |  |   --0.27%--0x12844
   |  | 0x7fb5247eddc5
   |  |  
   |  |--18.18%--0x2b788
   |  |  |  
   |  |  |--14.97%--0x127a2
   |  |  |  0x7fb5247eddc5
   |  |  |  
   |  |  |--1.34%--0x1287a
   

Re: [GIT PULL 2/2] bcm2835-soc-next-2017-08-15

2017-08-17 Thread Eric Anholt
Florian Fainelli  writes:

> On 08/15/2017 11:03 AM, Eric Anholt wrote:
>> The following changes since commit f29c256853b7412961d3ee80ca525bd2530573db:
>> 
>>   ARM: dts: bcm283x: Add 32-bit enable method for SMP (2017-08-14 20:09:44 
>> +0200)
>> 
>> are available in the git repository at:
>> 
>>   git://github.com/anholt/linux tags/bcm2835-soc-next-2017-08-15
>> 
>> for you to fetch changes up to 067b437e55a892e3ebb13e40c98825fcfa1e2d99:
>> 
>>   ARM: bcm2836: Send event when onlining other cores (2017-08-15 10:52:26 
>> -0700)
>> 
>> 
>> This pull request brings in two things.
>> 
>> One is to use sev() to wake up CPUs that might be sleeping when doing
>> the custom spin-table boot process in 32-bit mode (new firmware
>> versions will have the CPUs sleeping waiting for an event instead of
>> just spinning).  However, the irqchip maintainer objected to our SMP
>> code continuing to live in the driver, so we had to move it to
>> platsmp.c, and to do that we needed a new SMP enable-method to the DT
>> for the platsmp.c to attach to (thus the DT cross-merge in this PR).
>> The platsmp.c patch was acked by irqchip for going through arm-soc.
>
> This does make us pull quite a lot of changes, how about I just
> cherry-pick "ARM: dts: bcm283x: Add 32-bit enable method for SMP" such
> that the branch in itself is functional as-is, but we don't pull in
> everything else from devicetree/next?

Then you get the commit duplicated in the history, which people
generally dislike even more.  Also, it depends on the arm64->arm move,
so you'd need that as well.


signature.asc
Description: PGP signature


[PATCH] xen/events: events_fifo: Don't use {get,put}_cpu() in xen_evtchn_fifo_init()

2017-08-17 Thread Julien Grall
When booting Linux as a Xen guest with CONFIG_DEBUG_ATOMIC_SLEEP, the following
splat appears:

[0.002323] Mountpoint-cache hash table entries: 1024 (order: 1, 8192 bytes)
[0.019717] ASID allocator initialised with 65536 entries
[0.020019] xen:grant_table: Grant tables using version 1 layout
[0.020051] Grant table initialized
[0.020069] BUG: sleeping function called from invalid context at 
/data/src/linux/mm/page_alloc.c:4046
[0.020100] in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: swapper/0
[0.020123] no locks held by swapper/0/1.
[0.020143] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc5 #598
[0.020166] Hardware name: FVP Base (DT)
[0.020182] Call trace:
[0.020199] [] dump_backtrace+0x0/0x270
[0.020222] [] show_stack+0x24/0x30
[0.020244] [] dump_stack+0xb8/0xf0
[0.020267] [] ___might_sleep+0x1c8/0x1f8
[0.020291] [] __might_sleep+0x58/0x90
[0.020313] [] __alloc_pages_nodemask+0x1c0/0x12e8
[0.020338] [] alloc_page_interleave+0x38/0x88
[0.020363] [] alloc_pages_current+0xdc/0xf0
[0.020387] [] __get_free_pages+0x28/0x50
[0.020411] [] evtchn_fifo_alloc_control_block+0x2c/0xa0
[0.020437] [] xen_evtchn_fifo_init+0x38/0xb4
[0.020461] [] xen_init_IRQ+0x44/0xc8
[0.020484] [] xen_guest_init+0x250/0x300
[0.020507] [] do_one_initcall+0x44/0x130
[0.020531] [] kernel_init_freeable+0x120/0x288
[0.020556] [] kernel_init+0x18/0x110
[0.020578] [] ret_from_fork+0x10/0x40
[0.020606] xen:events: Using FIFO-based ABI
[0.020658] Xen: initializing cpu0
[0.027727] Hierarchical SRCU implementation.
[0.036235] EFI services will not be available.
[0.043810] smp: Bringing up secondary CPUs ...

This is because get_cpu() in xen_evtchn_fifo_init() will disable
preemption, but __get_free_page() might sleep (GFP_ATOMIC is not set).

xen_evtchn_fifo_init() will always be called before SMP is initialized,
so {get,put}_cpu() could be replaced by a simple smp_processor_id().

This also avoids modifying evtchn_fifo_alloc_control_block(), which may be
called in other contexts.

Signed-off-by: Julien Grall 
Reported-by: Andre Przywara 
Fixes: 1fe565517b57 ("xen/events: use the FIFO-based ABI if available")
---
 drivers/xen/events/events_fifo.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/xen/events/events_fifo.c b/drivers/xen/events/events_fifo.c
index 3c41470c7fc4..76b318e88382 100644
--- a/drivers/xen/events/events_fifo.c
+++ b/drivers/xen/events/events_fifo.c
@@ -432,12 +432,12 @@ static int xen_evtchn_cpu_dead(unsigned int cpu)
 
 int __init xen_evtchn_fifo_init(void)
 {
-   int cpu = get_cpu();
+   int cpu = smp_processor_id();
int ret;
 
ret = evtchn_fifo_alloc_control_block(cpu);
if (ret < 0)
-   goto out;
+   return ret;
 
pr_info("Using FIFO-based ABI\n");
 
@@ -446,7 +446,6 @@ int __init xen_evtchn_fifo_init(void)
cpuhp_setup_state_nocalls(CPUHP_XEN_EVTCHN_PREPARE,
  "xen/evtchn:prepare",
  xen_evtchn_cpu_prepare, xen_evtchn_cpu_dead);
-out:
-   put_cpu();
+
return ret;
 }
-- 
2.11.0



[tip:locking/core 37/41] drivers/block//drbd/drbd_receiver.c:1159:1: warning: the frame size of 1072 bytes is larger than 1024 bytes

2017-08-17 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git locking/core
head:   e26f34a407aec9c65bce2bc0c838fabe4f051fc6
commit: d0541b0fa64b36665d6261079974a26943c75009 [37/41] locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
config: s390-default_defconfig (attached as .config)
compiler: s390x-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git checkout d0541b0fa64b36665d6261079974a26943c75009
        # save the attached .config to linux build tree
        make.cross ARCH=s390

All warnings (new ones prefixed by >>):

   drivers/block//drbd/drbd_receiver.c: In function 'conn_connect':
>> drivers/block//drbd/drbd_receiver.c:1159:1: warning: the frame size of 1072 bytes is larger than 1024 bytes [-Wframe-larger-than=]
 }
 ^
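A `-Wframe-larger-than=` warning like this one means the function keeps large aggregates on its stack — here the two `struct drbd_socket` locals in `conn_connect()` shown below. A hedged userspace sketch of the usual remedy, moving such objects to the heap (the struct layout is invented for illustration; the kernel analogue would be kzalloc()/kfree()):

```c
#include <assert.h>
#include <stdlib.h>

/* Invented stand-in for a large on-stack aggregate. */
struct big_socket {
	char sbuf[256];
	char rbuf[256];
};

/* Heap-allocating the pair keeps the stack frame small, at the cost of
 * an allocation that can fail and must be checked. */
static int connect_like(void)
{
	struct big_socket *pair = calloc(2, sizeof(*pair));

	if (!pair)
		return -1;
	/* ... use pair[0] as sock, pair[1] as msock ... */
	free(pair);
	return 0;
}
```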

vim +1159 drivers/block//drbd/drbd_receiver.c

b411b3637 Philipp Reisner 2009-09-25   923  
b411b3637 Philipp Reisner 2009-09-25   924  /*
b411b3637 Philipp Reisner 2009-09-25   925   * return values:
b411b3637 Philipp Reisner 2009-09-25   926   *   1 yes, we have a valid connection
b411b3637 Philipp Reisner 2009-09-25   927   *   0 oops, did not work out, please try again
b411b3637 Philipp Reisner 2009-09-25   928   *  -1 peer talks different language,
b411b3637 Philipp Reisner 2009-09-25   929   * no point in trying again, please go standalone.
b411b3637 Philipp Reisner 2009-09-25   930   *  -2 We do not have a network config...
b411b3637 Philipp Reisner 2009-09-25   931   */
bde89a9e1 Andreas Gruenbacher 2011-05-30   932  static int conn_connect(struct drbd_connection *connection)
b411b3637 Philipp Reisner 2009-09-25   933  {
7da358625 Philipp Reisner 2011-12-19   934  	struct drbd_socket sock, msock;
c06ece6ba Andreas Gruenbacher 2011-06-21   935  	struct drbd_peer_device *peer_device;
44ed167da Philipp Reisner 2011-04-19   936  	struct net_conf *nc;
5d0b17f1a Philipp Reisner 2014-03-18   937  	int vnr, timeout, h;
5d0b17f1a Philipp Reisner 2014-03-18   938  	bool discard_my_data, ok;
197296ffe Philipp Reisner 2012-03-26   939  	enum drbd_state_rv rv;
7a426fd8d Philipp Reisner 2012-07-12   940  	struct accept_wait_data ad = {
bde89a9e1 Andreas Gruenbacher 2011-05-30   941  		.connection = connection,
7a426fd8d Philipp Reisner 2012-07-12   942  		.door_bell = COMPLETION_INITIALIZER_ONSTACK(ad.door_bell),
7a426fd8d Philipp Reisner 2012-07-12   943  	};
b411b3637 Philipp Reisner 2009-09-25   944  
bde89a9e1 Andreas Gruenbacher 2011-05-30   945  	clear_bit(DISCONNECT_SENT, &connection->flags);
bde89a9e1 Andreas Gruenbacher 2011-05-30   946  	if (conn_request_state(connection, NS(conn, C_WF_CONNECTION), CS_VERBOSE) < SS_SUCCESS)
b411b3637 Philipp Reisner 2009-09-25   947  		return -2;
b411b3637 Philipp Reisner 2009-09-25   948  
7da358625 Philipp Reisner 2011-12-19   949  	mutex_init(&sock.mutex);
bde89a9e1 Andreas Gruenbacher 2011-05-30   950  	sock.sbuf = connection->data.sbuf;
bde89a9e1 Andreas Gruenbacher 2011-05-30   951  	sock.rbuf = connection->data.rbuf;
7da358625 Philipp Reisner 2011-12-19   952  	sock.socket = NULL;
7da358625 Philipp Reisner 2011-12-19   953  	mutex_init(&msock.mutex);
bde89a9e1 Andreas Gruenbacher 2011-05-30   954  	msock.sbuf = connection->meta.sbuf;
bde89a9e1 Andreas Gruenbacher 2011-05-30   955  	msock.rbuf = connection->meta.rbuf;
7da358625 Philipp Reisner 2011-12-19   956  	msock.socket = NULL;
7da358625 Philipp Reisner 2011-12-19   957  
0916e0e30 Andreas Gruenbacher 2011-03-21   958  	/* Assume that the peer only understands protocol 80 until we know better.  */
bde89a9e1 Andreas Gruenbacher 2011-05-30   959  	connection->agreed_pro_version = 80;
b411b3637 Philipp Reisner 2009-09-25   960  
bde89a9e1 Andreas Gruenbacher 2011-05-30   961  	if (prepare_listen_socket(connection, &ad))
7a426fd8d Philipp Reisner 2012-07-12   962  		return 0;
b411b3637 Philipp Reisner 2009-09-25   963  
b411b3637 Philipp Reisner 2009-09-25   964  	do {
2bf896213 Andreas Gruenbacher 2011-03-28   965  		struct socket *s;
b411b3637 Philipp Reisner 2009-09-25   966  
bde89a9e1 Andreas Gruenbacher 2011-05-30   967  		s = drbd_try_connect(connection);
b411b3637 Philipp Reisner 2009-09-25   968  		if (s) {
7da358625 Philipp Reisner 2011-12-19   969  			if (!sock.socket) {
7da358625 Philipp Reisner 2011-12-19   970  				sock.socket = s;
bde89a9e1 Andreas Gruenbacher 2011-05-30   971  				send_first_packet(connection, &sock, 

Re: [PATCH v7] serio: PS/2 gpio bit banging driver for serio bus

2017-08-17 Thread Danilo Krummrich

Hi Rob,

thanks for reviewing.

On 2017-08-17 17:43, Rob Herring wrote:

On Fri, Aug 11, 2017 at 03:17:36PM +0200, Danilo Krummrich wrote:
This driver provides PS/2 serio bus support by implementing bit banging
with the GPIO API. The GPIO pins, data and clock, can be configured with
a node in the device tree or by generic device properties (GDP).

Writing to a device is supported as well, though it is possible the
timings cannot be held, as they are tough and difficult to reach with
bit banging. Therefore it can be configured (also in DT and GDP) whether
the serio write function should be available for clients.

This driver is for development purposes and not recommended for
productive use. However, this driver can be useful e.g. when no USB port
is available or using old peripherals is desired, as PS/2 controller
chips are getting rare.

This driver was tested on RPI1 and on Hikey960 and it worked well
together with the atkbd and psmouse driver.

Signed-off-by: Danilo Krummrich 
---
v2: Removed one verbose print statement, changed another one to dev_dbg.
v3: - fixed compiler warning on blackfin
    - depends on GPIOLIB
    - clarify documentation
v4: - fixed concurrent calls to ps2_gpio_write (serio->write)
    - use gpiod API
    - use generic device properties
    - request irq separately, do not use gpiod_to_irq
    - abort when gpio is connected via slow bus
    - Fixed a bug where PS2_CMD_RESEND is always sent after tx failed
      once. This makes the write functionality work better, though
      timing is still critical.
    - disable irq initially until ps2_gpio_open (serio->open) is called
v5: Checked again why timings are that hard to reach while in tx mode
    and discovered that there is an extra clock pulse between the stop
    bit sent from host and the acknowledgement from device. By just
    skipping this clock pulse tx works fine now, though it still happens
    sometimes that the timing can not be reached of course.
v6: - fixed typos
    - use of_match_ptr
v7: remove unnecessary barriers

Sorry for resending, forgot the version tag in the subject.
---
 .../devicetree/bindings/serio/ps2-gpio.txt |  22 +


It's preferred to split bindings to separate patch.

Together with Documentation/gpio/drivers-on-gpio.txt, or would you prefer to have a separate patch for this as well?

 Documentation/gpio/drivers-on-gpio.txt |   5 +
 drivers/input/serio/Kconfig|  11 +
 drivers/input/serio/Makefile   |   1 +
 drivers/input/serio/ps2-gpio.c | 453 +
 5 files changed, 492 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/serio/ps2-gpio.txt
 create mode 100644 drivers/input/serio/ps2-gpio.c

diff --git a/Documentation/devicetree/bindings/serio/ps2-gpio.txt b/Documentation/devicetree/bindings/serio/ps2-gpio.txt
new file mode 100644
index ..099dd6d46cb3
--- /dev/null
+++ b/Documentation/devicetree/bindings/serio/ps2-gpio.txt
@@ -0,0 +1,22 @@
+Device-Tree bindings for ps/2 gpio driver


Bindings don't describe drivers.


Will fix.

+
+Required properties:
+   - compatible = "ps2-gpio"
+   - gpios: data and clock gpio
+   - interrupts: Should trigger on the falling edge of the clock line.
+
+Optional properties:
+	- ps2-gpio,write-enable: Indicates whether write function is provided

ps2-gpio is not a vendor prefix, so drop it.


I will do.
+	to serio device. Possibly providing the write fn will not work, because
+   of the tough timing requirements.
+
+Example nodes:
+
+ps2@0 {
+   compatible = "ps2-gpio";
+   interrupt-parent = <&gpio>;
+   interrupts = <23 IRQ_TYPE_EDGE_FALLING>;
+   data-gpios = <&gpio 24 GPIO_ACTIVE_HIGH>;
+   clk-gpios = <&gpio 23 GPIO_ACTIVE_HIGH>;
+   ps2-gpio,write-enable;
+};


Thanks,
Danilo
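As background for the timing discussion above: a PS/2 frame is 11 bits — a start bit (0), eight data bits LSB first, an odd-parity bit, and a stop bit (1), with the device clocking each bit. A small userspace sketch of the frame construction (the helper names are made up; this is not code from the driver):

```c
#include <assert.h>
#include <stdint.h>

/* Odd parity as used by PS/2: the eight data bits plus the parity bit
 * must contain an odd number of 1s. */
static int ps2_parity(uint8_t data)
{
	int ones = __builtin_popcount(data);

	return (ones % 2) ? 0 : 1;
}

/* Build the 11-bit PS/2 frame: bits[0] = start (0), bits[1..8] = data
 * LSB first, bits[9] = parity, bits[10] = stop (1). */
static void ps2_frame(uint8_t data, int bits[11])
{
	bits[0] = 0;
	for (int i = 0; i < 8; i++)
		bits[1 + i] = (data >> i) & 1;
	bits[9] = ps2_parity(data);
	bits[10] = 1;
}
```

In the driver each of these bits has to be driven (or sampled) on a clock edge generated by the device, which is exactly where the tough bit-banging timing comes from.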


Re: [RFC PATCH 5/7] dt-bindings: ti-omap-hsmmc: Document new compatible for sdhci omap

2017-08-17 Thread Tony Lindgren
* Kishon Vijay Abraham I  [170816 22:44]:
> Hi Tony,
> 
> On Thursday 10 August 2017 03:42 AM, Tony Lindgren wrote:
> > * Kishon Vijay Abraham I  [170807 09:03]:
> >> Document the new compatible string "ti,dra7-sdhci" to be used for
> >> MMC controllers in DRA7 and DRA72 SoCs.
> > 
> > I wonder if this should really be documented for sdhci
> > instead of ti-omap-hsmmc.txt?
> 
> hmm.. yeah, having a separate binding document for sdhci would make the 
> binding
> really clean and we also don't have to carry legacy binding code. But we'll be
> never able to remove omap-hsmmc driver then and end up maintaining 2 drivers
> for the same controller.

Well I think it's best to be able to enable sdhci one board at
a time after testing. Then eventually when fully supported, we
can just have sdhci also parse the legacy omap binding and just
drop the omap-hsmmc driver.
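The migration path Tony describes relies on the usual devicetree compatible-list fallback: a board lists the most specific string first, and whichever driver matches wins, so sdhci can later claim boards still carrying the legacy string. A simplified userspace sketch of that matching (the matcher and the returned driver names are illustrative, not the real OF code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Walk the compatible list from most- to least-specific and return the
 * first driver that claims a string; the strings follow the thread. */
static const char *match_driver(const char *const *compat, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!strcmp(compat[i], "ti,dra7-sdhci"))
			return "sdhci-omap";	/* new driver */
		if (!strcmp(compat[i], "ti,omap4-hsmmc"))
			return "omap_hsmmc";	/* legacy driver */
	}
	return NULL;
}
```

Once sdhci also claims the legacy string, both list entries resolve to it and the old driver can be dropped without touching board files.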

Regards,

Tony


Re: smp-induced oops/NULL pointer dereference in mpt3sas, from kernel >= 4.11

2017-08-17 Thread Bart Van Assche
On Thu, 2017-08-17 at 15:18 +0530, Chaitra Basappa wrote:
> We analyzed this issue and could figure out it is not because of the driver;
> it's because the "sense" field of the 'struct scsi_request' is not being
> populated properly from the upper layer.
> And this "sense" member is being referenced in our driver code for kernel
> versions >= 4.11 as shown below in the snippet:
> Whereas as for < 4.11 kernel version this "sense" member was referenced
> via 'struct request'
> 
> 
> static int
> _transport_smp_handler (.) {
> .
> .
> > > memcpy(scsi_req(req)->sense, mpi_reply, sizeof(*mpi_reply));
> 
> .
> .
> }
> 
> And hence the NULL pointer dereference call trace is seen for the above
> chunk of mpt3sas. This needs to be addressed from upper layer, so please
> help us in getting this resolved.

Hello Chaitra,

Have you noticed the following e-mail thread: "[RFC PATCH 0/6] bsg: fix
regression resulting in panics when sending commands via BSG and some
sanity cleanups" (http://www.spinics.net/lists/linux-scsi/msg111724.html)?

Bart.
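For reference, the oops pattern under discussion is a memcpy() into a sense pointer the upper layer never populated; a userspace mock of the failing code path with the obvious defensive check (the structures are hypothetical stand-ins for the kernel ones):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures involved. */
struct scsi_request_mock { char *sense; };
struct mpi_reply_mock { char raw[32]; };

/* Mirrors the mpt3sas copy: dereferencing req->sense crashes when the
 * upper layer never allocated it, so bail out (or allocate) first. */
static int copy_reply(struct scsi_request_mock *req,
		      const struct mpi_reply_mock *reply)
{
	if (!req->sense)	/* upper layer did not provide a buffer */
		return -1;	/* kernel code would return an error here */
	memcpy(req->sense, reply, sizeof(*reply));
	return 0;
}
```

The real fix discussed in the referenced bsg thread is to have the upper layer populate the buffer, rather than to paper over it in every driver.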

Re: [PATCH v4] livepatch: introduce shadow variable API

2017-08-17 Thread Joe Lawrence
On 08/17/2017 10:05 AM, Petr Mladek wrote:
> On Mon 2017-08-14 16:02:43, Joe Lawrence wrote:
>> Add exported API for livepatch modules:
>>
>>   klp_shadow_get()
>>   klp_shadow_attach()
>>   klp_shadow_get_or_attach()
>>   klp_shadow_update_or_attach()
>>   klp_shadow_detach()
>>   klp_shadow_detach_all()
>>
>> that implement "shadow" variables, which allow callers to associate new
>> shadow fields to existing data structures.  This is intended to be used
>> by livepatch modules seeking to emulate additions to data structure
>> definitions.
>>
>> See Documentation/livepatch/shadow-vars.txt for a summary of the new
>> shadow variable API, including a few common use cases.
>>
>> See samples/livepatch/livepatch-shadow-* for example modules that
>> demonstrate shadow variables.
>>
>> diff --git a/kernel/livepatch/shadow.c b/kernel/livepatch/shadow.c
>> new file mode 100644
>> index ..0ebd4b635e4f
>> --- /dev/null
>> +++ b/kernel/livepatch/shadow.c
>> +/**
>> + * klp_shadow_match() - verify a shadow variable matches given <obj, id>
>> + * @shadow: shadow variable to match
>> + * @obj:pointer to parent object
>> + * @id: data identifier
>> + *
>> + * Return: true if the shadow variable matches.
>> + *
>> + * Callers should hold the klp_shadow_lock.
>> + */
>> +static inline bool klp_shadow_match(struct klp_shadow *shadow, void *obj,
>> +unsigned long id)
>> +{
>> +return shadow->obj == obj && shadow->id == id;
>> +}
> 
> Do we really need this function? It is called only in situations
> where shadow->obj == obj is always true. Especially the use in
> klp_shadow_detach_all() is funny because we pass shadow->obj as
> the shadow parameter.

Personal preference.  Abstracting out all of the routines that operated
on the shadow variables (setting up, comparison) did save some code
lines and centralized these common bits.
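The (obj, id) keying that klp_shadow_match() implements can be modelled in a few lines of userspace C — a sketch only, using a plain linked list in place of the kernel's RCU-protected hash table:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of a shadow variable: extra data keyed to an existing
 * object by the (obj, id) pair. */
struct shadow {
	void *obj;
	unsigned long id;
	void *data;
	struct shadow *next;
};

static struct shadow *head;

/* Counterpart of klp_shadow_get(): walk the candidates for obj and
 * apply the klp_shadow_match() predicate. */
static void *shadow_get(void *obj, unsigned long id)
{
	for (struct shadow *s = head; s; s = s->next)
		if (s->obj == obj && s->id == id)	/* the match predicate */
			return s->data;
	return NULL;
}
```

The kernel version hashes on the object pointer so only a small bucket is walked, but the match condition is exactly this two-field comparison.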

>> +
>> +/**
>> + * klp_shadow_get() - retrieve a shadow variable data pointer
>> + * @obj:pointer to parent object
>> + * @id: data identifier
>> + *
>> + * Return: the shadow variable data element, NULL on failure.
>> + */
>> +void *klp_shadow_get(void *obj, unsigned long id)
>> +{
>> +struct klp_shadow *shadow;
>> +
>> +rcu_read_lock();
>> +
>> +hash_for_each_possible_rcu(klp_shadow_hash, shadow, node,
>> +   (unsigned long)obj) {
>> +
>> +if (klp_shadow_match(shadow, obj, id)) {
>> +rcu_read_unlock();
>> +return shadow->data;
>> +}
>> +}
>> +
>> +rcu_read_unlock();
>> +
>> +return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(klp_shadow_get);
>> +
>> +/*
>> + * klp_shadow_set() - initialize a shadow variable
>> + * @shadow: shadow variable to initialize
>> + * @obj:pointer to parent object
>> + * @id: data identifier
>> + * @data:   pointer to data to attach to parent
>> + * @size:   size of attached data
>> + *
>> + * Callers should hold the klp_shadow_lock.
>> + */
>> +static inline void klp_shadow_set(struct klp_shadow *shadow, void *obj,
>> +  unsigned long id, void *data, size_t size)
>> +{
>> +shadow->obj = obj;
>> +shadow->id = id;
>> +
>> +if (data)
>> +memcpy(shadow->data, data, size);
>> +}
> 
> The function name suggests that it is a counterpart of
> klp_shadow_get() but it is not. Which is a bit confusing.
I should have called it "setup" or "init", but perhaps that's moot ...

> Hmm, the purpose of this function is to reduce the size of cut&paste
> code between all that klp_shadow_*attach() variants. But there
> is still too much cut&paste code. In fact, the base logic of all
> variants is basically the same. The only small difference should be
> how they handle the situation when the variable is already there.

... this is true.  An earlier draft revision that I had discarded
attempted combining all cases.  I had used two extra function arguments,
"update" and "duplicates", to key off for each behavior... it turned
into a complicated, full-screen page of conditional logic, so I threw it
out.

However, I like the way you pulled it off using a jump-to-switch
statement at the bottom of the function...

> OK, there is a locking difference in the update variant but
> it is questionable, see below.
> 
> I would suggest to do something like this:
> 
> static enum klp_shadow_attach_existing_handling {
>  KLP_SHADOW_EXISTING_RETURN,
>  KLP_SHADOW_EXISTING_WARN,
>  KLP_SHADOW_EXISTING_UPDATE,
> };
> 
> void *__klp_shadow_get_or_attach(void *obj, unsigned long id, void *data,
>  size_t size, gfp_t gfp_flags,
>  enum klp_shadow_attach_existing_handling 
> existing_handling)
> {
>   struct klp_shadow *new_shadow;
>   void *shadow_data;
>   unsigned long flags;
> 
>   /* Check if the shadow variable already exists */
>   shadow_data = klp_shadow_get(obj, id);
>   if 

Re: [PATCH v5 6/6] perf: ARM DynamIQ Shared Unit PMU support

2017-08-17 Thread Mark Rutland
On Thu, Aug 17, 2017 at 03:52:24PM +0100, Suzuki K Poulose wrote:
> On 16/08/17 15:10, Mark Rutland wrote:
> >On Tue, Aug 08, 2017 at 12:37:26PM +0100, Suzuki K Poulose wrote:

> >>+static struct attribute *dsu_pmu_event_attrs[] = {
> >>+   DSU_EVENT_ATTR(cycles, 0x11),
> >>+   DSU_EVENT_ATTR(bus_access, 0x19),
> >>+   DSU_EVENT_ATTR(memory_error, 0x1a),
> >>+   DSU_EVENT_ATTR(bus_cycles, 0x1d),
> >>+   DSU_EVENT_ATTR(l3d_cache_allocate, 0x29),
> >>+   DSU_EVENT_ATTR(l3d_cache_refill, 0x2a),
> >>+   DSU_EVENT_ATTR(l3d_cache, 0x2b),
> >>+   DSU_EVENT_ATTR(l3d_cache_wb, 0x2c),
> >
> >MAX_COMMON_EVENTS seems to be 0x40, so are we just assuming the below
> >are implemented?
> >
> >If so, why bother exposing them at all? We can't know that they're going
> >to work...
> 
> That's right. The only defending argument is that the event codes are
> specific to the DynamIQ, unlike the COMMON_EVENTS, which match the ARMv8
> PMU event codes. So, someone would need to carefully find the event code
> for a particular event. Having these entries would make it easier for the
> user to do the profiling.
> 
> Also, future revisions of the DSU could potentially expose more events. So
> there wouldn't be any way to tell the user (provided there aren't any
> changes to the programming model and we decide to reuse the same
> compatible string) what we *could* potentially support. In short, this is
> not a problem at the moment and we could do something about it as and when
> required.

I'd rather that we only describe those that we can probe from the
PMCEID* registers, and for the rest, leave the user to consult a manual.

I can well imagine future variants of this IP supporting different
events, and I'd prefer to avoid proliferating tables for those.

[...]

> >>+static struct attribute *dsu_pmu_cpumask_attrs[] = {
> >>+   DSU_CPUMASK_ATTR(cpumask, DSU_ACTIVE_CPU_MASK),
> >>+   DSU_CPUMASK_ATTR(supported_cpus, DSU_SUPPORTED_CPU_MASK),
> >>+   NULL,
> >>+};
> >
> >Normally we only expose one mask.
> >
> >Why do we need the supported cpu mask? What is the intended use-case?
> 
> That's just to let the user know the CPUs bound to this PMU instance.

I guess that can be useful, though the cpumasks we expose today are
confusing as-is, and this is another point of confusion.

We could drop this for now, and add it when requested, or we should try
to make the naming clearer somehow -- "supported" can be read in a
number of ways.

Further, it would be worth documenting this PMU under
Documentation/perf/.

[...]

> >>+static int dsu_pmu_add(struct perf_event *event, int flags)
> >>+{
> >>+   struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> >>+   struct dsu_hw_events *hw_events = &dsu_pmu->hw_events;
> >>+   struct hw_perf_event *hwc = &event->hw;
> >>+   int idx;
> >>+
> >>+   if (!cpumask_test_cpu(smp_processor_id(), &dsu_pmu->supported_cpus))
> >>+   return -ENOENT;
> >
> >This shouldn't ever happen, and we can check against the active cpumask,
> >with a WARN_ON_ONCE(). We have to do this for CPU PMUs since they
> >support events which can migrate with tasks, but that's not the case
> >here.
> >
> >[...]
> 
> But, we have to make sure we are reading the events from a CPU within this
> DSU in case we have multiple DSU units.

Regardless of how many instances there are, the core should *never*
add() a CPU-bound event (which DSU events are) on another CPU. To do so
would be a major bug.

So if this is just a paranoid check, we should WARN_ON_ONCE().
Otherwise, it's unnecessary.

> >>+/**
> >>+ * dsu_pmu_dt_get_cpus: Get the list of CPUs in the cluster.
> >>+ */
> >>+static int dsu_pmu_dt_get_cpus(struct device_node *dev, cpumask_t *mask)
> >>+{
> >>+   int i = 0, n, cpu;
> >>+   struct device_node *cpu_node;
> >>+
> >>+   n = of_count_phandle_with_args(dev, "cpus", NULL);
> >>+   if (n <= 0)
> >>+   return -ENODEV;
> >>+   for (; i < n; i++) {
> >>+   cpu_node = of_parse_phandle(dev, "cpus", i);
> >>+   if (!cpu_node)
> >>+   break;
> >>+   cpu = of_cpu_node_to_id(cpu_node);
> >>+   of_node_put(cpu_node);
> >>+   if (cpu < 0)
> >>+   break;
> >
> >I believe this can happen if the kernel's nr_cpus were capped below the
> >number of CPUs in the system. So we need to iterate all entries of the
> >cpus list regardless.
> >
> 
> Good point. Initial version of the driver used to ignore the failures, but
> was changed later. I will roll it back.

Thanks. If you could add a comment as to why, that'll hopefully avoid
anyone trying to "fix" the logic later.

[...]

> >>+   cpmcr = __dsu_pmu_read_pmcr();
> >>+   dsu_pmu->num_counters = ((cpmcr >> CLUSTERPMCR_N_SHIFT) &
> >>+   CLUSTERPMCR_N_MASK);
> >>+   if (!dsu_pmu->num_counters)
> >>+   return;
> >
> >Is that valid? What range of values makes sense?
> >
> >[...]
> >
> 
> We should at least have one counter implemented (excluding the cycle counter).
> And yes, we should 

Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

2017-08-17 Thread Bart Van Assche
On Wed, 2017-08-16 at 18:18 -0500, Brian King wrote:
> On 08/16/2017 12:21 PM, Bart Van Assche wrote:
> > On Wed, 2017-08-16 at 22:30 +0530, Abdul Haleem wrote:
> > > As of next-20170809, linux-next on powerpc boot hung with below trace
> > > message.
> > > 
> > > [ ... ]
> > > 
> > > A bisection resulted in first bad commit (270065e92 - scsi: scsi-mq:
> > > Always unprepare ...) in the merge branch 'scsi/for-next'
> > > 
> > > System booted fine when the below commit is reverted: 
> > > 
> > > commit 270065e92c317845d69095ec8e3d18616b5b39d5
> > > Author: Bart Van Assche 
> > > Date:   Thu Aug 3 14:40:14 2017 -0700
> > > 
> > > scsi: scsi-mq: Always unprepare before requeuing a request
> > 
> > Hello Brian and Michael,
> > 
> > Do you agree that this probably indicates a bug in the PowerPC block driver
> > that is used to access the boot disk? Anyway, since a solution is not yet
> > available, I will submit a revert for this patch.
> 
> I've been looking at this a bit, and can recreate the issue, but haven't
> gotten to the root cause of the issue as of yet. If I do a sysrq-w while
> the system is hung during boot, I see this:
> 
> [   25.561523] Workqueue: events_unbound async_run_entry_fn
> [   25.561527] Call Trace:
> [   25.561529] [c001697873f0] [c00169701600] 0xc00169701600 (unreliable)
> [   25.561534] [c001697875c0] [c001ab78] __switch_to+0x2e8/0x430
> [   25.561539] [c00169787620] [c091ccb0] __schedule+0x310/0xa00
> [   25.561543] [c001697876f0] [c091d3e0] schedule+0x40/0xb0
> [   25.561548] [c00169787720] [c0921e40] schedule_timeout+0x200/0x430
> [   25.561553] [c00169787810] [c091db10] io_schedule_timeout+0x30/0x70
> [   25.561558] [c00169787840] [c091e978] wait_for_common_io.constprop.3+0x178/0x280
> [   25.561563] [c001697878c0] [c047f7ec] blk_execute_rq+0x7c/0xd0
> [   25.561567] [c00169787910] [c0614cd0] scsi_execute+0x100/0x230
> [   25.561572] [c00169787990] [c060d29c] scsi_report_opcode+0xbc/0x170
> [   25.561577] [c00169787a50] [d4fe6404] sd_revalidate_disk+0xe04/0x1620 [sd_mod]
> [   25.561583] [c00169787b80] [d4fe6d84] sd_probe_async+0xb4/0x230 [sd_mod]
> [   25.561588] [c00169787c00] [c010fc44] async_run_entry_fn+0x74/0x210
> [   25.561593] [c00169787c90] [c0102f48] process_one_work+0x198/0x480
> [   25.561598] [c00169787d30] [c01032b8] worker_thread+0x88/0x510
> [   25.561603] [c00169787dc0] [c010b030] kthread+0x160/0x1a0
> [   25.561608] [c00169787e30] [c000b3a4] ret_from_kernel_thread+0x5c/0xb8
> 
> I was noticing that we are commonly in scsi_report_opcode. Since ipr RAID
> arrays don't support MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES,
> I tried setting sdev->no_report_opcodes = 1 in ipr's slave_configure. This
> seems to eliminate the boot hang for me, but is only working around the
> issue. Since this command is not supported by ipr, it should return with
> an illegal request. When I'm hung at this point, there is nothing
> outstanding to the adapter / driver. I'll continue debugging...

(+linux-scsi)

Hello Brian,

Is kernel debugging enabled on your test system? Is lockdep enabled?
Anyway, stack traces like the above usually mean that a request got stuck in
a block or scsi driver (ipr in this case). Information about pending requests,
including the SCSI CDB, is available under /sys/kernel/debug/block (see also
commit 0eebd005dd07 ("scsi: Implement blk_mq_ops.show_rq()")).

Bart.

Re: [PATCH 3/3] x86/efi: Use efi_switch_mm() rather than manually twiddling with cr3

2017-08-17 Thread Andy Lutomirski
On Thu, Aug 17, 2017 at 3:35 AM, Will Deacon  wrote:
> On Tue, Aug 15, 2017 at 11:35:41PM +0100, Mark Rutland wrote:
>> On Wed, Aug 16, 2017 at 09:14:41AM -0700, Andy Lutomirski wrote:
>> > On Wed, Aug 16, 2017 at 5:57 AM, Matt Fleming  
>> > wrote:
>> > > On Wed, 16 Aug, at 12:03:22PM, Mark Rutland wrote:
>> > >>
>> > >> I'd expect we'd abort at a higher level, not taking any sample. i.e.
>> > >> we'd have the core overflow handler check in_funny_mm(), and if so, skip
>> > >> the sample, as with the skid case.
>> > >
>> > > FYI, this is my preferred solution for x86 too.
>> >
>> > One option for the "funny mm" flag would be literally the condition
>> > current->mm != current->active_mm.  I *think* this gets all the cases
>> > right as long as efi_switch_mm is careful with its ordering and that
>> > the arch switch_mm() code can handle the resulting ordering.  (x86's
>> > can now, I think, or at least will be able to in 4.14 -- not sure
>> > about other arches).
>>
>> For arm64 we'd have to rework things a bit to get the ordering right
>> (especially when we flip to/from the idmap), but otherwise this sounds
>> sane to me.
>>
>> > That being said, there's a totally different solution: run EFI
>> > callbacks in a kernel thread.  This has other benefits: we could run
>> > those callbacks in user mode some day, and doing *that* in a user
>> > thread seems like a mistake.
>>
>> I think that wouldn't work for CPU-bound perf events (which are not
>> ctx-switched with the task).
>>
>> It might be desirable to do that anyway, though.
>
> I'm still concerned that we're treating perf specially here -- are we
> absolutely sure that nobody else is going to attempt user accesses off the
> back of an interrupt?

Reasonably sure?  If nothing else, an interrupt taken while mmap_sem()
is held for write that tries to access user memory is asking for
serious trouble.  There are still a few callers of pagefault_disable()
and copy...inatomic(), though.

> If not, then I'd much prefer a solution that catches
> anybody doing that with the EFI page table installed, rather than trying
> to play whack-a-mole like this.

Using a kernel thread solves the problem for real.  Anything that
blindly accesses user memory in kernel thread context is terminally
broken no matter what.

>
> Will

