RE: [RFC][PATCH 0/2] x86/boot/KASLR: Restrict kernel to be randomized in mirror regions if existed

2017-06-15 Thread Izumi, Taku
Dear Baoquan,

> > Our customer reported that Kernel text may be located on non-mirror
> > region (movable zone) when both address range mirroring feature and
> > KASLR are enabled.

   I know your customer :)

> > The address range mirroring feature works as follows:
> > - Physical memory regions whose descriptors in the EFI memory map have
> >   the EFI_MEMORY_MORE_RELIABLE attribute (bit 16) are mirrored.
> > - The feature arranges such mirrored regions into the normal zone and
> >   the other regions into the movable zone, so that kernel code and data
> >   are located in mirrored regions.
> >
> > So we need to restrict the kernel to be located inside a mirror region
> > if one exists.
> >
> > The method is very simple. If EFI is enabled, just iterate over the EFI
> > memory map and pick up the mirror regions as slot candidates. If EFI is
> > disabled or no mirror region exists, still process the e820 memory map.
> > This won't cost much efficiency; at worst we just go through the whole
> > EFI memory map and find no mirror region.
> >
> > One question:
> > From the code, even though mirror regions exist, they are meaningful only
> > if the kernelcore=mirror kernel option is specified. Not sure if my
> > understanding is correct.

   Your understanding is almost correct.
   The above procedure works only when "kernelcore=mirror" is specified.
   However, if mirrored regions exist, the bootmem allocator tries to
   allocate from them regardless of the "kernelcore=mirror" option.

   So, IMHO, since kernel text is important, putting it in a mirrored
   (more reliable) region is reasonable whether or not "kernelcore=mirror"
   is specified.

   Anyway, thanks for submitting the patch.
   We have an Address Range Mirroring capable machine, so we'll test your patch.

  Sincerely,
  Taku Izumi

> 
> Since you are the author of kernelcore=mirror related code and expert on
> mirror feature, could you help answer above question?
> 
> Thanks
> Baoquan
> >
> > NOTE:
> > I haven't got a machine with EFI mirror regions enabled, so I only tested
> > the e820 map processing case and the case of no mirror region on an EFI
> > machine. So I set this as an RFC patchset and will post a formal one
> > after the above question is cleared up and the mirror test has passed.
> >
> > Baoquan He (2):
> >   x86/boot/KASLR: Adapt process_e820_entry for all kinds of memory map
> >   x86/boot/KASLR: Restrict kernel to be randomized in mirror regions if
> >     existed
> >
> >  arch/x86/boot/compressed/kaslr.c | 129 +++
> >  1 file changed, 104 insertions(+), 25 deletions(-)
> >
> > --
> > 2.5.5
> >
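
The selection step described in the quoted cover letter (walk the EFI memory map, keep only the regions that carry EFI_MEMORY_MORE_RELIABLE, and fall back to the e820 map if none are found) can be sketched in user space roughly as follows. This is only an illustration, not the code from the patch: `struct mem_desc` and `collect_mirror_regions()` are hypothetical names, and only the attribute bit (bit 16) is taken from the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 16 of the descriptor attribute marks a mirrored (more reliable) range. */
#define EFI_MEMORY_MORE_RELIABLE (1ULL << 16)

/* Hypothetical, simplified stand-in for an EFI memory descriptor. */
struct mem_desc {
    uint64_t start;      /* physical start address */
    uint64_t size;       /* length in bytes */
    uint64_t attribute;  /* EFI attribute bits */
};

/*
 * Copy mirrored descriptors from map[] into out[] until out[] is full.
 * Returns the number of descriptors copied. A return value of 0 means
 * no mirror region was found and the caller should fall back to e820.
 */
size_t collect_mirror_regions(const struct mem_desc *map, size_t n,
                              struct mem_desc *out, size_t out_cap)
{
    size_t found = 0;

    assert(map != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Skip every region that is not marked as mirrored. */
        if (!(map[i].attribute & EFI_MEMORY_MORE_RELIABLE))
            continue;
        if (found == out_cap)
            break;
        out[found++] = map[i];
    }
    return found;
}
```

In this sketch the worst case is one full pass over the map that returns 0, which matches the remark above that little efficiency is lost when no mirror region exists.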



RE: [bug discuss] fjes driver call trace warning, "PNP0C02" used in fjes seems like a bug,

2016-06-09 Thread Izumi, Taku
Dear Gao,

> From a SW perspective it looks like an ACPI driver that uses "PNP0C02"
> as a driver ID to perform the driver match in the ACPI table.
> 
> From my understanding this is wrong in principle because that identifier
> must be used to reserve motherboard resources (see par 4.1.2 of the PCI
> Firmware Specification v3.2).
> 
> Therefore this identifier is used by
> http://lxr.free-electrons.com/source/drivers/pnp/system.c
> to reserve such resources.
> 
> Basically your driver is breaking any other device that
> needs to reserve motherboard resources through system.c
> driver.
> 
> @David Miller, what is your opinion about this?
> I think this driver should be reverted...

 I'm willing to revise my driver if something is wrong with it.
 I can't reproduce this problem. Could you please show me how to
 reproduce it?

 Sincerely,
 Taku Izumi


RE: [PATCH] net: fjes: fjes_main: Remove create_workqueue

2016-06-02 Thread Izumi, Taku
Dear Bhaktipriya,

Thanks. Looks good to me.

Sincerely,
Taku Izumi

> -----Original Message-----
> From: Bhaktipriya Shridhar [mailto:bhaktipriy...@gmail.com]
> Sent: Thursday, June 02, 2016 6:31 PM
> To: David S. Miller; Izumi, Taku/泉 拓; Florian Westphal; Bhaktipriya Shridhar
> Cc: Tejun Heo; net...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH] net: fjes: fjes_main: Remove create_workqueue
> 
> alloc_workqueue replaces deprecated create_workqueue().
> 
> The workqueue adapter->txrx_wq has workitem
> &adapter->raise_intr_rxdata_task per adapter. Extended Socket Network
> Device is shared memory based, so someone's transmission denotes other's
> reception.  raise_intr_rxdata_task raises interruption of receivers from
> the sender in order to notify receivers.
> 
> The workqueue adapter->control_wq has workitem
> &adapter->interrupt_watch_task per adapter. interrupt_watch_task is used
> to prevent delay of interrupts.
> 
> Dedicated workqueues have been used in both cases since the workitems
> on the workqueues are involved in normal device operation and require
> forward progress under memory pressure.
> 
> max_active has been set to 0 since there is no need for throttling
> the number of active work items.
> 
> Since network devices  may be used for memory reclaim,
> WQ_MEM_RECLAIM has been set to guarantee forward progress.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>
> ---
>  drivers/net/fjes/fjes_main.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/fjes/fjes_main.c b/drivers/net/fjes/fjes_main.c
> index 86c331b..9006877 100644
> --- a/drivers/net/fjes/fjes_main.c
> +++ b/drivers/net/fjes/fjes_main.c
> @@ -1187,8 +1187,9 @@ static int fjes_probe(struct platform_device *plat_dev)
>   adapter->force_reset = false;
>   adapter->open_guard = false;
> 
> - adapter->txrx_wq = create_workqueue(DRV_NAME "/txrx");
> - adapter->control_wq = create_workqueue(DRV_NAME "/control");
> + adapter->txrx_wq = alloc_workqueue(DRV_NAME "/txrx", WQ_MEM_RECLAIM, 0);
> + adapter->control_wq = alloc_workqueue(DRV_NAME "/control",
> +   WQ_MEM_RECLAIM, 0);
> 
>   INIT_WORK(&adapter->tx_stall_task, fjes_tx_stall_task);
>   INIT_WORK(&adapter->raise_intr_rxdata_task,
> --
> 2.1.4
> 



RE: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-16 Thread Izumi, Taku
Dear Xishi,

 Sorry for the late reply.

> -----Original Message-----
> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
> Sent: Friday, December 11, 2015 6:44 PM
> To: Izumi, Taku/泉 拓
> Cc: Luck, Tony; linux-kernel@vger.kernel.org; linux...@kvack.org;
> a...@linux-foundation.org; Kamezawa, Hiroyuki/亀澤 寛之; m...@csn.ul.ie;
> Hansen, Dave; m...@codeblueprint.co.uk
> Subject: Re: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option
> 
> On 2015/12/11 13:53, Izumi, Taku wrote:
> 
> > Dear Xishi,
> >
> >> Hi Taku,
> >>
> >> Whether it is possible that we rewrite the fallback function in buddy
> >> system when zone_movable and mirrored_kernelcore are both enabled?
> >
> >   What does "when zone_movable and mirrored_kernelcore are both enabled?"
> >   mean ?
> >
> >   My patchset just provides a new way to create ZONE_MOVABLE.
> >
> 
> Hi Taku,
> 
> I mean when zone_movable is from kernelcore=mirror, not kernelcore=nn[KMG].

  I'm not quite sure what you are saying, but if you want to separate user
  memory so that some is allocated from the mirrored zone and some from the
  non-mirrored zone, I think it is possible to reuse my patchset.

  Sincerely,
  Taku Izumi

> Thanks,
> Xishi Qiu
> 
> >   Sincerely,
> >   Taku Izumi
> >>
> >> It seems something like that we add a new zone but the name is
> >> zone_movable, not zone_mirror. And the prerequisite is that we won't
> >> enable these two features (movable memory and mirrored memory) at the
> >> same time. Thus we can reuse the code of movable zone.
> >>
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
> > .
> >
> 
> 

RE: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-10 Thread Izumi, Taku
Dear Xishi,

> Hi Taku,
> 
> Whether it is possible that we rewrite the fallback function in buddy system
> when zone_movable and mirrored_kernelcore are both enabled?

  What does "when zone_movable and mirrored_kernelcore are both enabled?" mean ?
  
  My patchset just provides a new way to create ZONE_MOVABLE.

  Sincerely,
  Taku Izumi
> 
> It seems something like that we add a new zone but the name is zone_movable,
> not zone_mirror. And the prerequisite is that we won't enable these two
> features(movable memory and mirrored memory) at the same time. Thus we can
> reuse the code of movable zone.
> 

RE: [PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-09 Thread Izumi, Taku
Dear Tony, Xishi,

> >> How about adding a comment: if mirrored memory is too small, then the
> >> normal zone is small, so it may OOM.
> >> The mirrored memory is at least 1/64 of whole memory, because struct
> >> pages usually take 64 bytes per page.
> >
> > 1/64th is the absolute lower bound (for the page structures as you say). I
> > expect people will need to configure 10% or more to run any real workloads.
> >
> > I made the memblock boot time allocator fall back to non-mirrored memory
> > if mirrored memory ran out.  What happens in the run time allocator if the
> > non-movable zones run out of pages? Will we allocate kernel pages from
> > movable memory?
> >
> 
> As far as I know, kernel pages will not be allocated from the movable zone.

 Yes, kernel pages are not allocated from ZONE_MOVABLE.

 In this case, the administrator must review and reconfigure the mirror ratio
 via the "MirrorRequest" EFI variable.
 
  Sincerely,
  Taku Izumi
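
The 1/64 lower bound quoted above follows from simple arithmetic: with a 64-byte struct page describing each 4096-byte page, the page structures alone take 64/4096 = 1/64 of total memory. A small sketch of that calculation is shown below. The 4 KiB page size is an assumption (the thread only gives the 64-byte figure), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: 4 KiB pages and a 64-byte struct page (figure quoted above). */
#define PAGE_SIZE_BYTES   4096ULL
#define STRUCT_PAGE_BYTES 64ULL

/* Bytes of struct page metadata needed to describe total_bytes of RAM. */
uint64_t memmap_bytes(uint64_t total_bytes)
{
    uint64_t pages = total_bytes / PAGE_SIZE_BYTES;

    return pages * STRUCT_PAGE_BYTES;
}

/* Mirrored memory that corresponds to a given percentage of total memory. */
uint64_t mirrored_bytes_for_percent(uint64_t total_bytes, unsigned int percent)
{
    assert(percent <= 100);

    return total_bytes / 100 * percent;
}
```

Under these assumptions a 64 GiB machine needs 1 GiB of mirrored memory for the page structures alone, which is why a ratio around 10% is suggested above for real workloads.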


RE: [PATCH v2 2/2] mm: Introduce kernelcore=reliable option

2015-12-08 Thread Izumi, Taku
Dear Xishi,

 Thanks for reviewing.

> -----Original Message-----
> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
> Sent: Wednesday, December 09, 2015 11:26 AM
> To: Izumi, Taku/泉 拓
> Cc: linux-kernel@vger.kernel.org; linux...@kvack.org; tony.l...@intel.com;
> Kamezawa, Hiroyuki/亀澤 寛之; m...@csn.ul.ie; a...@linux-foundation.org;
> dave.han...@intel.com; m...@codeblueprint.co.uk
> Subject: Re: [PATCH v2 2/2] mm: Introduce kernelcore=reliable option
> 
> On 2015/11/27 23:04, Taku Izumi wrote:
> 
> > This patch extends existing "kernelcore" option and
> > introduces kernelcore=reliable option. By specifying
> > "reliable" instead of specifying the amount of memory,
> > non-reliable region will be arranged into ZONE_MOVABLE.
> >
> > v1 -> v2:
> >  - Refine so that the following case also can be
> >handled properly:
> >
> >  Node X:  |MM--MM|
> >(legend) M: mirrored  -: not mirrored
> >
> >  In this case, ZONE_NORMAL and ZONE_MOVABLE are
> >  arranged like below:
> >
> >  Node X:  |--|
> >   |ooxxxxxxoo| ZONE_NORMAL
> > |ooxx| ZONE_MOVABLE
> >(legend) o: present  x: absent
> >
> > Signed-off-by: Taku Izumi 
> > ---
> >  Documentation/kernel-parameters.txt |   9 ++-
> >  mm/page_alloc.c | 110 
> > ++--
> >  2 files changed, 112 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/kernel-parameters.txt 
> > b/Documentation/kernel-parameters.txt
> > index f8aae63..ed44c2c8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be 
> > entirely omitted.
> >
> > keepinitrd  [HW,ARM]
> >
> > -   kernelcore=nn[KMG]  [KNL,X86,IA-64,PPC] This parameter
> > +   kernelcore= Format: nn[KMG] | "reliable"
> > +   [KNL,X86,IA-64,PPC] This parameter
> > specifies the amount of memory usable by the kernel
> > for non-movable allocations.  The requested amount is
> > spread evenly throughout all nodes in the system. The
> > @@ -1711,6 +1712,12 @@ bytes respectively. Such letter suffixes can also be 
> > entirely omitted.
> > use the HighMem zone if it exists, and the Normal
> > zone if it does not.
> >
> > +   Instead of specifying the amount of memory (nn[KMS]),
> > +   you can specify "reliable" option. In case "reliable"
> > +   option is specified, reliable memory is used for
> > +   non-movable allocations and remaining memory is used
> > +   for Movable pages.
> > +
> > kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
> > Format: [,poll interval]
> > The controller # is the number of the ehci usb debug
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index acb0b4e..006a3d8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -251,6 +251,7 @@ static unsigned long __meminitdata 
> > arch_zone_highest_possible_pfn[MAX_NR_ZONES];
> >  static unsigned long __initdata required_kernelcore;
> >  static unsigned long __initdata required_movablecore;
> >  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> > +static bool reliable_kernelcore;
> >
> >  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
> >  int movable_zone;
> > @@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, 
> > int nid, unsigned long zone,
> > unsigned long pfn;
> > struct zone *z;
> > unsigned long nr_initialised = 0;
> > +   struct memblock_region *r = NULL, *tmp;
> >
> > if (highest_memmap_pfn < end_pfn - 1)
> > highest_memmap_pfn = end_pfn - 1;
> > @@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, 
> > int nid, unsigned long zone,
> > if (!update_defer_init(pgdat, pfn, end_pfn,
> > &nr_initialised))
> > break;
> > +
> > +   /*
> > +* if not reliable_kernelcore and ZONE_MOVABLE exists,
> > +* range from zone_mova

[PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-08 Thread Taku Izumi
This patch extends existing "kernelcore" option and
introduces kernelcore=mirror option. By specifying
"mirror" instead of specifying the amount of memory,
non-mirrored (non-reliable) region will be arranged
into ZONE_MOVABLE.

v1 -> v2:
 - Refine so that the following case also can be
   handled properly:

 Node X:  |MM--MM|
   (legend) M: mirrored  -: not mirrored

 In this case, ZONE_NORMAL and ZONE_MOVABLE are
 arranged like below:

 Node X:  |MM--MM|
  |ooxxoo| ZONE_NORMAL
|ooxx| ZONE_MOVABLE
   (legend) o: present  x: absent

v2 -> v3:
 - change the option name from kernelcore=reliable
   into kernelcore=mirror
 - documentation fix so that users can understand
   that nn[KMG] and mirror are mutually exclusive
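
The layout rule behind the diagram above can be illustrated with a small user-space sketch. It assumes, as the diagram and the patch comments suggest, that ZONE_NORMAL spans the whole node with non-mirrored chunks absent, while ZONE_MOVABLE starts at the first non-mirrored chunk and treats mirrored chunks as absent. `layout_zones()` and its string encoding are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * node:    one character per chunk of the node, 'M' = mirrored, '-' = not.
 * normal:  receives 'o' (present) / 'x' (absent) for ZONE_NORMAL.
 * movable: receives the same for ZONE_MOVABLE, with ' ' before the zone start.
 * Both output buffers must hold strlen(node) + 1 bytes.
 * Returns the index where ZONE_MOVABLE starts, or strlen(node) if the
 * whole node is mirrored (no ZONE_MOVABLE in this sketch).
 */
size_t layout_zones(const char *node, char *normal, char *movable)
{
    size_t n, movable_start;

    assert(node != NULL && normal != NULL && movable != NULL);

    n = strlen(node);
    movable_start = n;

    /* ZONE_MOVABLE begins at the first non-mirrored chunk. */
    for (size_t i = 0; i < n; i++) {
        if (node[i] != 'M') {
            movable_start = i;
            break;
        }
    }

    for (size_t i = 0; i < n; i++) {
        int mirrored = (node[i] == 'M');

        /* Mirrored chunks are present in ZONE_NORMAL only. */
        normal[i] = mirrored ? 'o' : 'x';
        if (i < movable_start)
            movable[i] = ' ';
        else
            movable[i] = mirrored ? 'x' : 'o';
    }
    normal[n] = '\0';
    movable[n] = '\0';
    return movable_start;
}
```

For a node written as "MM--MM", the sketch yields "ooxxoo" for ZONE_NORMAL and "  ooxx" for ZONE_MOVABLE, so the two zones overlap in span but never both claim the same chunk as present.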

Signed-off-by: Taku Izumi 
---
 Documentation/kernel-parameters.txt |  11 +++-
 mm/page_alloc.c | 110 ++--
 2 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index f8aae63..b0ffc76 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
 
keepinitrd  [HW,ARM]
 
-   kernelcore=nn[KMG]  [KNL,X86,IA-64,PPC] This parameter
+   kernelcore= Format: nn[KMG] | "mirror"
+   [KNL,X86,IA-64,PPC] This parameter
specifies the amount of memory usable by the kernel
for non-movable allocations.  The requested amount is
spread evenly throughout all nodes in the system. The
@@ -1711,6 +1712,14 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.
 
+   Instead of specifying the amount of memory (nn[KMS]),
+   you can specify "mirror" option. In case "mirror"
+   option is specified, mirrored (reliable) memory is used
+   for non-movable allocations and remaining memory is used
+   for Movable pages. nn[KMS] and "mirror" are exclusive,
+   so you can NOT specify nn[KMG] and "mirror" at the same
+   time.
+
kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
Format: [,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index acb0b4e..4157476 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -251,6 +251,7 @@ static unsigned long __meminitdata 
arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool mirrored_kernelcore;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
unsigned long pfn;
struct zone *z;
unsigned long nr_initialised = 0;
+   struct memblock_region *r = NULL, *tmp;
 
if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;
@@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
if (!update_defer_init(pgdat, pfn, end_pfn,
&nr_initialised))
break;
+
+   /*
+* if not mirrored_kernelcore and ZONE_MOVABLE exists,
+* range from zone_movable_pfn[nid] to end of each node
+* should be ZONE_MOVABLE not ZONE_NORMAL. skip it.
+*/
+   if (!mirrored_kernelcore && zone_movable_pfn[nid])
+   if (zone == ZONE_NORMAL &&
+   pfn >= zone_movable_pfn[nid])
+   continue;
+
+   /*
+* check given memblock attribute by firmware which
+* can affect kernel memory layout.
+* if zone==ZONE_MOVABLE but memory is mirrored,
+* it's an overlapped memmap init. skip it.
+*/
+   if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
+   if (!r ||
+

[PATCH v3 1/2] mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()

2015-12-08 Thread Taku Izumi
Currently each zone's zone_start_pfn is calculated in
free_area_init_core(). However, the zone's range is already fixed
by the time zone_spanned_pages_in_node() is invoked.

This patch changes the code so that each zone->zone_start_pfn is
calculated in zone_spanned_pages_in_node().
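
As a rough user-space illustration of the interface change, the sketch below clamps a zone's range to a node's range, hands the clamped range back through out parameters, and returns the number of spanned pages. `spanned_pages()` is a hypothetical simplification: it takes the zone limits as plain arguments instead of reading the arch_zone_*_possible_pfn arrays, and it leaves out the ZONE_MOVABLE adjustment.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Intersect the zone range [zone_lo, zone_hi) with the node range
 * [node_start_pfn, node_end_pfn). The clamped range is reported through
 * zone_start_pfn / zone_end_pfn; the return value is the spanned size.
 */
unsigned long spanned_pages(unsigned long zone_lo, unsigned long zone_hi,
                            unsigned long node_start_pfn,
                            unsigned long node_end_pfn,
                            unsigned long *zone_start_pfn,
                            unsigned long *zone_end_pfn)
{
    assert(zone_start_pfn != NULL && zone_end_pfn != NULL);

    *zone_start_pfn = zone_lo;
    *zone_end_pfn = zone_hi;

    /* The node has no pages within the zone's range. */
    if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
        return 0;

    /* Move the zone boundaries inside the node if necessary. */
    if (*zone_end_pfn > node_end_pfn)
        *zone_end_pfn = node_end_pfn;
    if (*zone_start_pfn < node_start_pfn)
        *zone_start_pfn = node_start_pfn;

    return *zone_end_pfn - *zone_start_pfn;
}
```

A caller can then store the start value only when the returned size is non-zero, which mirrors what the patch does in calculate_node_totalpages().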

Signed-off-by: Taku Izumi 
---
 mm/page_alloc.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17a3c66..acb0b4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4928,31 +4928,31 @@ static unsigned long __meminit 
zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
unsigned long node_start_pfn,
unsigned long node_end_pfn,
+   unsigned long *zone_start_pfn,
+   unsigned long *zone_end_pfn,
unsigned long *ignored)
 {
-   unsigned long zone_start_pfn, zone_end_pfn;
-
/* When hotadd a new node from cpu_up(), the node should be empty */
if (!node_start_pfn && !node_end_pfn)
return 0;
 
/* Get the start and end of the zone */
-   zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-   zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+   *zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+   *zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
adjust_zone_range_for_zone_movable(nid, zone_type,
node_start_pfn, node_end_pfn,
-   &zone_start_pfn, &zone_end_pfn);
+   zone_start_pfn, zone_end_pfn);
 
/* Check that this node has pages within the zone's required range */
-   if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+   if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
return 0;
 
/* Move the zone boundaries inside the node if necessary */
-   zone_end_pfn = min(zone_end_pfn, node_end_pfn);
-   zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+   *zone_end_pfn = min(*zone_end_pfn, node_end_pfn);
+   *zone_start_pfn = max(*zone_start_pfn, node_start_pfn);
 
/* Return the spanned pages */
-   return zone_end_pfn - zone_start_pfn;
+   return *zone_end_pfn - *zone_start_pfn;
 }
 
 /*
@@ -5017,6 +5017,8 @@ static inline unsigned long __meminit 
zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
unsigned long node_start_pfn,
unsigned long node_end_pfn,
+   unsigned long *zone_start_pfn,
+   unsigned long *zone_end_pfn,
unsigned long *zones_size)
 {
return zones_size[zone_type];
@@ -5047,15 +5049,22 @@ static void __meminit calculate_node_totalpages(struct 
pglist_data *pgdat,
 
for (i = 0; i < MAX_NR_ZONES; i++) {
struct zone *zone = pgdat->node_zones + i;
+   unsigned long zone_start_pfn, zone_end_pfn;
unsigned long size, real_size;
 
size = zone_spanned_pages_in_node(pgdat->node_id, i,
  node_start_pfn,
  node_end_pfn,
+ &zone_start_pfn,
+ &zone_end_pfn,
  zones_size);
real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
  node_start_pfn, node_end_pfn,
  zholes_size);
+   if (size)
+   zone->zone_start_pfn = zone_start_pfn;
+   else
+   zone->zone_start_pfn = 0;
zone->spanned_pages = size;
zone->present_pages = real_size;
 
@@ -5176,7 +5185,6 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat)
 {
enum zone_type j;
int nid = pgdat->node_id;
-   unsigned long zone_start_pfn = pgdat->node_start_pfn;
int ret;
 
pgdat_resize_init(pgdat);
@@ -5192,6 +5200,7 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat)
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize, freesize, memmap_pages;
+   unsigned long zone_start_pfn = zone->zone_start_pfn;
 
size = zone->spanned_pages;
realsize = freesize = zone->present_pages

[PATCH v3 0/2] mm: Introduce kernelcore=mirror option

2015-12-08 Thread Taku Izumi
Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS compliant with UEFI spec 2.5 can report which
ranges are mirrored (reliable) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.
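
As a rough userspace illustration of the accounting described above (not
the kernel's efi_find_mirror() itself), the sketch below sums the total
and mirrored sizes over a memory map. The struct mem_desc type and the
sum_mirrored() helper are hypothetical names made up for this example;
only the attribute bit (bit 16, EFI_MEMORY_MORE_RELIABLE) comes from the
UEFI 2.5 definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Attribute bit 16 of an EFI memory descriptor (UEFI 2.5). */
#define EFI_MEMORY_MORE_RELIABLE 0x10000ULL
#define EFI_PAGE_SHIFT 12

/* Hypothetical, simplified descriptor: only the fields this sketch needs. */
struct mem_desc {
    uint64_t phys_addr;
    uint64_t num_pages;   /* in 4 KiB EFI pages */
    uint64_t attribute;
};

/* Sum total and mirrored bytes over a memory map. */
static void sum_mirrored(const struct mem_desc *map, size_t n,
                         uint64_t *total, uint64_t *mirrored)
{
    assert(total && mirrored);
    *total = 0;
    *mirrored = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t size = map[i].num_pages << EFI_PAGE_SHIFT;

        *total += size;
        if (map[i].attribute & EFI_MEMORY_MORE_RELIABLE)
            *mirrored += size;
    }
}
```

Shifting both sums right by 20 would give the megabyte figures that the
kernel's "mirrored memory" boot message reports.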

My requirement is:
  - allocate kernel memory from mirrored region 
  - allocate user memory from non-mirrored region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging non-mirrored range into ZONE_MOVABLE, 
mirrored memory is used for kernel allocations.

My idea is to extend the existing "kernelcore" option and
introduce a kernelcore=mirror option. By specifying
"mirror" instead of an amount of memory, the
non-mirrored region is arranged into ZONE_MOVABLE.
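
To make the two accepted forms concrete, here is a minimal userspace
sketch of such an option parser. It is an assumption-laden illustration,
not the parser from this patch set: struct kernelcore_opt and
parse_kernelcore() are hypothetical names, and the suffix handling is
written with plain strtoull() rather than any kernel helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for this sketch. */
struct kernelcore_opt {
    bool mirror;               /* true for kernelcore=mirror */
    unsigned long long bytes;  /* meaningful only when mirror is false */
};

/* Parse "mirror" or nn[KMG]; returns 0 on success, -1 on malformed input. */
static int parse_kernelcore(const char *arg, struct kernelcore_opt *opt)
{
    char *end;

    assert(opt);
    if (!arg || !*arg)
        return -1;

    opt->mirror = false;
    opt->bytes = 0;

    /* The keyword and the size are alternatives, never combined. */
    if (strcmp(arg, "mirror") == 0) {
        opt->mirror = true;
        return 0;
    }

    opt->bytes = strtoull(arg, &end, 0);
    if (end == arg)
        return -1;             /* no digits at all */

    switch (*end) {
    case 'G': case 'g':
        opt->bytes <<= 10;     /* fall through */
    case 'M': case 'm':
        opt->bytes <<= 10;     /* fall through */
    case 'K': case 'k':
        opt->bytes <<= 10;
        end++;
        break;
    default:
        break;
    }
    return *end ? -1 : 0;      /* reject trailing junk */
}
```

With this sketch, "mirror" and "512M" are both accepted, while a combined
string such as "mirror,100M" is rejected, which matches the intent that
the two forms are exclusive.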

Earlier discussions are at: 
 https://lkml.org/lkml/2015/10/9/24
 https://lkml.org/lkml/2015/10/15/9
 https://lkml.org/lkml/2015/11/27/18

For example, suppose a 2-node system with the following memory
ranges:
  node 0 [mem 0x1000-0x00109fff] 
  node 1 [mem 0x0010a000-0x00209fff]

and the following ranges are marked as reliable (mirrored):
  [0x-0x0001] 
  [0x0001-0x00018000] 
  [0x0008-0x00088000] 
  [0x0010a000-0x00112000]
  [0x0017a000-0x00182000] 

If you specify kernelcore=mirror, ZONE_NORMAL and ZONE_MOVABLE
are arranged as follows:

 - node 0:
  ZONE_NORMAL : [0x0001-0x0010a000]
  ZONE_MOVABLE: [0x00018000-0x0010a000]
 - node 1: 
  ZONE_NORMAL : [0x0010a000-0x0020a000]
  ZONE_MOVABLE: [0x00112000-0x0020a000] 

In the overlapping range, pages that belong to ZONE_MOVABLE are
treated as absent pages in ZONE_NORMAL, and vice versa.
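
The arrangement above can be sketched in userspace C as follows. This is
only an illustration under simplifying assumptions: a single kernel zone
per node, mirrored ranges sorted and non-overlapping, and all values in
PFNs. The names struct range, struct zone_layout and layout_node() are
hypothetical and do not appear in the patches.

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long start, end; };   /* [start, end) in PFNs */

struct zone_layout {
    unsigned long movable_start;   /* first non-mirrored PFN in the node */
    unsigned long normal_present;  /* mirrored pages inside the node */
    unsigned long movable_present; /* non-mirrored pages inside the node */
};

/* Number of PFNs shared by two ranges. */
static unsigned long overlap(struct range a, struct range b)
{
    unsigned long s = a.start > b.start ? a.start : b.start;
    unsigned long e = a.end < b.end ? a.end : b.end;

    return e > s ? e - s : 0;
}

/* mirrored[] must be sorted by start and non-overlapping. */
static struct zone_layout layout_node(struct range node,
                                      const struct range *mirrored, size_t n)
{
    struct zone_layout z = { node.start, 0, 0 };

    assert(node.end >= node.start);
    for (size_t i = 0; i < n; i++) {
        z.normal_present += overlap(node, mirrored[i]);
        /* Advance past mirrored memory contiguous with the node start. */
        if (mirrored[i].start <= z.movable_start &&
            mirrored[i].end > z.movable_start)
            z.movable_start = mirrored[i].end < node.end ?
                              mirrored[i].end : node.end;
    }
    /* Whatever is not mirrored is present in ZONE_MOVABLE instead. */
    z.movable_present = (node.end - node.start) - z.normal_present;
    return z;
}
```

For a toy node [1, 100) with mirrored ranges [0, 10), [10, 20) and
[50, 60), this gives a movable start of 20, with 29 pages present in
ZONE_NORMAL and 70 in ZONE_MOVABLE; each zone counts the other's pages
as absent.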

v1 -> v2:
 Refine so that the above example case can also be
 handled properly.
v2 -> v3:
 Change the option name from kernelcore=reliable
 to kernelcore=mirror and fix the documentation
 as pointed out by Andrew Morton.

 
Taku Izumi (2):
  mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()
  mm: Introduce kernelcore=mirror option

 Documentation/kernel-parameters.txt |  11 ++-
 mm/page_alloc.c | 140 +++-
 2 files changed, 133 insertions(+), 18 deletions(-)

-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH v2 0/2] mm: Introduce kernelcore=reliable option

2015-12-08 Thread Izumi, Taku
Dear Tony,


> >  Which do you think is better?
> >    - change into kernelcore="mirrored"
> >    - keep kernelcore="reliable" and a minimal printk fix
> 
> UEFI came up with the "reliable" wording (as a more generic term ...
> as Andrew said it could cover differences in ECC modes, or some
> alternate memory technology that has lower error rates).
> 
> But I personally like "mirror" more ... it matches current
> implementation. Of course I'll look silly if some future system
> does something other than mirror.
> 

 Okay, I'll change the option name to kernelcore=mirror.

Sincerely,
Taku Izumi


RE: [PATCH v2 0/2] mm: Introduce kernelcore=reliable option

2015-12-08 Thread Izumi, Taku
Dear Tony,

  Thanks for testing!

Dear Andrew,


> > Xeon E7 v3 based systems supports Address Range Mirroring
> > and UEFI BIOS complied with UEFI spec 2.5 can notify which
> > ranges are reliable (mirrored) via EFI memory map.
> > Now Linux kernel utilize its information and allocates
> > boot time memory from reliable region.
> >
> > My requirement is:
> >   - allocate kernel memory from reliable region
> >   - allocate user memory from non-reliable region
> >
> > In order to meet my requirement, ZONE_MOVABLE is useful.
> > By arranging non-reliable range into ZONE_MOVABLE,
> > reliable memory is only used for kernel allocations.
> >
> > My idea is to extend existing "kernelcore" option and
> > introduces kernelcore=reliable option. By specifying
> > "reliable" instead of specifying the amount of memory,
> > non-reliable region will be arranged into ZONE_MOVABLE.
> 
> It is unfortunate that the kernel presently refers to this memory as
> "mirrored", but this patchset introduces the new term "reliable".  I
> think it would be better if we use "mirrored" throughout.
> Of course, mirroring isn't the only way to get reliable memory.

  YES. "mirroring" is not the only way.
  So, in my opinion, we should change "mirrored" into "reliable" in order
  to match terms of UEFI 2.5 spec.

> Perhaps if a part of the system memory has ECC correction then this
> also can be accessed using "reliable", in which case your proposed
> naming makes sense.  reliable == mirrored || ecc?

  "reliable" is better.

  But, I'm willing to change "reliable" into "mirrored".

  Otherwise, I'll keep "kernelcore=reliable" and add the following minimal fix as
  a separate patch:

diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -134,7 +134,7 @@ void __init efi_find_mirror(void)
}
}
if (mirror_size)
-   pr_info("Memory: %lldM/%lldM mirrored memory\n",
+   pr_info("Memory: %lldM/%lldM reliable memory\n",
mirror_size>>20, total_size>>20);
 }

 
 Which do you think is better?
   - change into kernelcore="mirrored"
   - keep kernelcore="reliable" and a minimal printk fix

> 
> Secondly, does this patchset mean that kernelcore=reliable and
> kernelcore=100M are exclusive?  Or can the user specify
> "kernelcore=reliable,kernelcore=100M" to use 100M of reliable memory
> for kernelcore?

  No, these are exclusive.
> 
> This is unclear from the documentation and I suggest that this be
> spelled out.

  Thanks. I'll update the documentation.

 Sincerely,
 Taku Izumi


RE: [PATCH v2 2/2] mm: Introduce kernelcore=reliable option

2015-12-08 Thread Izumi, Taku
Dear Xishi,

 Thanks for reviewing.

> -----Original Message-----
> From: Xishi Qiu [mailto:qiuxi...@huawei.com]
> Sent: Wednesday, December 09, 2015 11:26 AM
> To: Izumi, Taku/泉 拓
> Cc: linux-kernel@vger.kernel.org; linux...@kvack.org; tony.l...@intel.com; 
> Kamezawa, Hiroyuki/亀澤 寛之; m...@csn.ul.ie;
> a...@linux-foundation.org; dave.han...@intel.com; m...@codeblueprint.co.uk
> Subject: Re: [PATCH v2 2/2] mm: Introduce kernelcore=reliable option
> 
> On 2015/11/27 23:04, Taku Izumi wrote:
> 
> > This patch extends existing "kernelcore" option and
> > introduces kernelcore=reliable option. By specifying
> > "reliable" instead of specifying the amount of memory,
> > non-reliable region will be arranged into ZONE_MOVABLE.
> >
> > v1 -> v2:
> >  - Refine so that the following case also can be
> >handled properly:
> >
> >  Node X:  |MM--MM|
> >(legend) M: mirrored  -: not mirrrored
> >
> >  In this case, ZONE_NORMAL and ZONE_MOVABLE are
> >  arranged like bellow:
> >
> >  Node X:  |--|
> >   |ooxxxxxxoo| ZONE_NORMAL
> > |ooxx| ZONE_MOVABLE
> >(legend) o: present  x: absent
> >
> > Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
> > ---
> >  Documentation/kernel-parameters.txt |   9 ++-
> >  mm/page_alloc.c | 110 ++--
> >  2 files changed, 112 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/kernel-parameters.txt 
> > b/Documentation/kernel-parameters.txt
> > index f8aae63..ed44c2c8 100644
> > --- a/Documentation/kernel-parameters.txt
> > +++ b/Documentation/kernel-parameters.txt
> > @@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be 
> > entirely omitted.
> >
> > keepinitrd  [HW,ARM]
> >
> > -   kernelcore=nn[KMG]  [KNL,X86,IA-64,PPC] This parameter
> > +   kernelcore= Format: nn[KMG] | "reliable"
> > +   [KNL,X86,IA-64,PPC] This parameter
> > specifies the amount of memory usable by the kernel
> > for non-movable allocations.  The requested amount is
> > spread evenly throughout all nodes in the system. The
> > @@ -1711,6 +1712,12 @@ bytes respectively. Such letter suffixes can also be 
> > entirely omitted.
> > use the HighMem zone if it exists, and the Normal
> > zone if it does not.
> >
> > > +   Instead of specifying the amount of memory (nn[KMG]),
> > +   you can specify "reliable" option. In case "reliable"
> > +   option is specified, reliable memory is used for
> > +   non-movable allocations and remaining memory is used
> > +   for Movable pages.
> > +
> > kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
> > Format: <Controller#>[,poll interval]
> > The controller # is the number of the ehci usb debug
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index acb0b4e..006a3d8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -251,6 +251,7 @@ static unsigned long __meminitdata 
> > arch_zone_highest_possible_pfn[MAX_NR_ZONES];
> >  static unsigned long __initdata required_kernelcore;
> >  static unsigned long __initdata required_movablecore;
> >  static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
> > +static bool reliable_kernelcore;
> >
> >  /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
> >  int movable_zone;
> > @@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, 
> > int nid, unsigned long zone,
> > unsigned long pfn;
> > struct zone *z;
> > unsigned long nr_initialised = 0;
> > +   struct memblock_region *r = NULL, *tmp;
> >
> > if (highest_memmap_pfn < end_pfn - 1)
> > highest_memmap_pfn = end_pfn - 1;
> > @@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, 
> > int nid, unsigned long zone,
> > if (!update_defer_init(pgdat, pfn, end_pfn,
> > > &nr_initialised))
> > break;
> > +
> > +   /*
> > +* if not reliable_kernelcore and ZONE_MOVABLE exists,
>

[PATCH v3 2/2] mm: Introduce kernelcore=mirror option

2015-12-08 Thread Taku Izumi
This patch extends the existing "kernelcore" option and
introduces a kernelcore=mirror option. By specifying
"mirror" instead of an amount of memory, the
non-mirrored (non-reliable) region is arranged
into ZONE_MOVABLE.

v1 -> v2:
 - Refine so that the following case also can be
   handled properly:

 Node X:  |MM--MM|
   (legend) M: mirrored  -: not mirrored

 In this case, ZONE_NORMAL and ZONE_MOVABLE are
 arranged as below:

 Node X:  |MM--MM|
  |ooxxoo| ZONE_NORMAL
|ooxx| ZONE_MOVABLE
   (legend) o: present  x: absent

v2 -> v3:
 - change the option name from kernelcore=reliable
   to kernelcore=mirror
 - documentation fix so that users can understand that
   nn[KMG] and mirror are mutually exclusive
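
The overlapped memmap initialization handled by this patch can be
pictured with the userspace sketch below. It is a sketch under stated
assumptions, not the patch's memmap_init_zone() hunk: struct region is a
hypothetical stand-in for a memblock region, the region array is assumed
sorted, and the "jump over the mirrored region" step is my reading of the
intent. The sketch walks a PFN range for ZONE_MOVABLE, caches the region
that covers the current PFN, and skips PFNs that sit in mirrored memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memblock region, in PFNs: [base, end). */
struct region { unsigned long base, end; bool mirrored; };

/*
 * Count the PFNs in [start, end) that a ZONE_MOVABLE init pass would
 * touch when mirrored memory is left to the kernel zone.
 * regions[] must be sorted by base. PFNs in holes between regions are
 * counted here; validity checks are outside the scope of this sketch.
 */
static unsigned long count_movable_pfns(const struct region *regions, size_t n,
                                        unsigned long start, unsigned long end)
{
    const struct region *r = NULL;   /* cached region for recent PFNs */
    unsigned long count = 0;

    assert(regions || n == 0);
    for (unsigned long pfn = start; pfn < end; pfn++) {
        if (!r || pfn >= r->end) {
            /* Cache miss: find the first region ending after pfn. */
            r = NULL;
            for (size_t i = 0; i < n; i++) {
                if (pfn < regions[i].end) {
                    r = &regions[i];
                    break;
                }
            }
        }
        if (r && pfn >= r->base && r->mirrored) {
            pfn = r->end - 1;        /* loop increment lands on r->end */
            continue;
        }
        count++;
    }
    return count;
}
```

Caching the last region keeps the common case to one comparison per PFN,
since consecutive PFNs almost always fall in the same region.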

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  11 +++-
 mm/page_alloc.c | 110 ++--
 2 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index f8aae63..b0ffc76 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
 
keepinitrd  [HW,ARM]
 
-   kernelcore=nn[KMG]  [KNL,X86,IA-64,PPC] This parameter
+   kernelcore= Format: nn[KMG] | "mirror"
+   [KNL,X86,IA-64,PPC] This parameter
specifies the amount of memory usable by the kernel
for non-movable allocations.  The requested amount is
spread evenly throughout all nodes in the system. The
@@ -1711,6 +1712,14 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.
 
+   Instead of specifying the amount of memory (nn[KMG]),
+   you can specify the "mirror" option. When "mirror"
+   is specified, mirrored (reliable) memory is used
+   for non-movable allocations and the remaining memory is used
+   for Movable pages. nn[KMG] and "mirror" are mutually exclusive,
+   so you cannot specify nn[KMG] and "mirror" at the same
+   time.
+
kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index acb0b4e..4157476 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -251,6 +251,7 @@ static unsigned long __meminitdata 
arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool mirrored_kernelcore;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
unsigned long pfn;
struct zone *z;
unsigned long nr_initialised = 0;
+   struct memblock_region *r = NULL, *tmp;
 
if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;
@@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, int 
nid, unsigned long zone,
if (!update_defer_init(pgdat, pfn, end_pfn,
&nr_initialised))
break;
+
+   /*
+* if not mirrored_kernelcore and ZONE_MOVABLE exists,
+* range from zone_movable_pfn[nid] to end of each node
+* should be ZONE_MOVABLE not ZONE_NORMAL. skip it.
+*/
+   if (!mirrored_kernelcore && zone_movable_pfn[nid])
+   if (zone == ZONE_NORMAL &&
+   pfn >= zone_movable_pfn[nid])
+   continue;
+
+   /*
+* check given memblock attribute by firmware which
+* can affect kernel memory layout.
+* if zone==ZONE_MOVABLE but memory is mirrored,
+* it's an overlapped memmap init. skip it.
+*/
+   if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
+   if (!r ||
+

[PATCH v3 1/2] mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()

2015-12-08 Thread Taku Izumi
Currently each zone's zone_start_pfn is calculated in
free_area_init_core(). However, the zone's range is already fixed
by the time zone_spanned_pages_in_node() is invoked.

This patch changes the code so that each zone->zone_start_pfn
is calculated in zone_spanned_pages_in_node().
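
The idea can be sketched in userspace as below: the function that
computes the spanned size also hands the clamped zone boundaries back to
the caller through out-pointers. This is a simplified illustration with a
hypothetical name (spanned_pages()); it omits the ZONE_MOVABLE adjustment
step that the real function performs, and takes the zone's possible range
as plain arguments instead of reading the per-arch arrays.

```c
#include <assert.h>

/*
 * Returns the number of spanned pages and writes the zone range, clamped
 * to the node, through zone_start and zone_end.
 */
static unsigned long spanned_pages(unsigned long zone_lo, unsigned long zone_hi,
                                   unsigned long node_start,
                                   unsigned long node_end,
                                   unsigned long *zone_start,
                                   unsigned long *zone_end)
{
    assert(zone_start && zone_end);
    *zone_start = zone_lo;
    *zone_end = zone_hi;

    /* The node has no pages within the zone's possible range. */
    if (*zone_end < node_start || *zone_start > node_end)
        return 0;

    /* Move the zone boundaries inside the node if necessary. */
    *zone_end = *zone_end < node_end ? *zone_end : node_end;
    *zone_start = *zone_start > node_start ? *zone_start : node_start;

    return *zone_end - *zone_start;
}
```

A caller would then store *zone_start as the zone's start PFN when the
returned size is non-zero, and 0 otherwise, as the
calculate_node_totalpages() hunk in this patch does.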

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 mm/page_alloc.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17a3c66..acb0b4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4928,31 +4928,31 @@ static unsigned long __meminit 
zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
unsigned long node_start_pfn,
unsigned long node_end_pfn,
+   unsigned long *zone_start_pfn,
+   unsigned long *zone_end_pfn,
unsigned long *ignored)
 {
-   unsigned long zone_start_pfn, zone_end_pfn;
-
/* When hotadd a new node from cpu_up(), the node should be empty */
if (!node_start_pfn && !node_end_pfn)
return 0;
 
/* Get the start and end of the zone */
-   zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-   zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+   *zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+   *zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
adjust_zone_range_for_zone_movable(nid, zone_type,
node_start_pfn, node_end_pfn,
-   &zone_start_pfn, &zone_end_pfn);
+   zone_start_pfn, zone_end_pfn);
 
/* Check that this node has pages within the zone's required range */
-   if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+   if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
return 0;
 
/* Move the zone boundaries inside the node if necessary */
-   zone_end_pfn = min(zone_end_pfn, node_end_pfn);
-   zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+   *zone_end_pfn = min(*zone_end_pfn, node_end_pfn);
+   *zone_start_pfn = max(*zone_start_pfn, node_start_pfn);
 
/* Return the spanned pages */
-   return zone_end_pfn - zone_start_pfn;
+   return *zone_end_pfn - *zone_start_pfn;
 }
 
 /*
@@ -5017,6 +5017,8 @@ static inline unsigned long __meminit 
zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
unsigned long node_start_pfn,
unsigned long node_end_pfn,
+   unsigned long *zone_start_pfn,
+   unsigned long *zone_end_pfn,
unsigned long *zones_size)
 {
return zones_size[zone_type];
@@ -5047,15 +5049,22 @@ static void __meminit calculate_node_totalpages(struct 
pglist_data *pgdat,
 
for (i = 0; i < MAX_NR_ZONES; i++) {
struct zone *zone = pgdat->node_zones + i;
+   unsigned long zone_start_pfn, zone_end_pfn;
unsigned long size, real_size;
 
size = zone_spanned_pages_in_node(pgdat->node_id, i,
  node_start_pfn,
  node_end_pfn,
+ &zone_start_pfn,
+ &zone_end_pfn,
  zones_size);
real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
  node_start_pfn, node_end_pfn,
  zholes_size);
+   if (size)
+   zone->zone_start_pfn = zone_start_pfn;
+   else
+   zone->zone_start_pfn = 0;
zone->spanned_pages = size;
zone->present_pages = real_size;
 
@@ -5176,7 +5185,6 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat)
 {
enum zone_type j;
int nid = pgdat->node_id;
-   unsigned long zone_start_pfn = pgdat->node_start_pfn;
int ret;
 
pgdat_resize_init(pgdat);
@@ -5192,6 +5200,7 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat)
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize, freesize, memmap_pages;
+   unsigned long zone_start_pfn = zone->zone_start_pfn;
 
size = zone->spanned_pages;
reals

[PATCH v2 0/2] mm: Introduce kernelcore=reliable option

2015-11-26 Thread Taku Izumi
Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS compliant with UEFI spec 2.5 can report which
ranges are reliable (mirrored) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.

My requirement is:
  - allocate kernel memory from reliable region
  - allocate user memory from non-reliable region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging non-reliable range into ZONE_MOVABLE,
reliable memory is only used for kernel allocations.

My idea is to extend the existing "kernelcore" option and
introduce a kernelcore=reliable option. By specifying
"reliable" instead of an amount of memory, the
non-reliable region is arranged into ZONE_MOVABLE.

Earlier discussions are at:
 https://lkml.org/lkml/2015/10/9/24
 https://lkml.org/lkml/2015/10/15/9

For example, suppose a 2-node system with the following memory
ranges:
  node 0 [mem 0x1000-0x00109fff]
  node 1 [mem 0x0010a000-0x00209fff]

and the following ranges are marked as reliable:
  [0x-0x0001]
  [0x0001-0x00018000]
  [0x0008-0x00088000]
  [0x0010a000-0x00112000]
  [0x0017a000-0x00182000]

If you specify kernelcore=reliable, ZONE_NORMAL and ZONE_MOVABLE
are arranged as follows:

 - node 0:
  ZONE_NORMAL : [0x0001-0x0010a000]
  ZONE_MOVABLE: [0x00018000-0x0010a000]
 - node 1:
  ZONE_NORMAL : [0x0010a000-0x0020a000]
  ZONE_MOVABLE: [0x00112000-0x0020a000]

In the overlapping range, pages that belong to ZONE_MOVABLE are
treated as absent pages in ZONE_NORMAL, and vice versa.

v1 -> v2:
 Refine so that the above example case can also be
 handled properly.


Taku Izumi (2):
  mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()
  mm: Introduce kernelcore=reliable option

 Documentation/kernel-parameters.txt |   9 ++-
 mm/page_alloc.c | 140 +++-
 2 files changed, 131 insertions(+), 18 deletions(-)

-- 
1.8.3.1



[PATCH v2 1/2] mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()

2015-11-26 Thread Taku Izumi
Currently each zone's zone_start_pfn is calculated in
free_area_init_core(). However, the zone's range is already fixed
by the time zone_spanned_pages_in_node() is invoked.

This patch changes the code so that each zone->zone_start_pfn
is calculated in zone_spanned_pages_in_node().

Signed-off-by: Taku Izumi 
---
 mm/page_alloc.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17a3c66..acb0b4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4928,31 +4928,31 @@ static unsigned long __meminit 
zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
unsigned long node_start_pfn,
unsigned long node_end_pfn,
+   unsigned long *zone_start_pfn,
+   unsigned long *zone_end_pfn,
unsigned long *ignored)
 {
-   unsigned long zone_start_pfn, zone_end_pfn;
-
/* When hotadd a new node from cpu_up(), the node should be empty */
if (!node_start_pfn && !node_end_pfn)
return 0;
 
/* Get the start and end of the zone */
-   zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-   zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+   *zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+   *zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
adjust_zone_range_for_zone_movable(nid, zone_type,
node_start_pfn, node_end_pfn,
-   &zone_start_pfn, &zone_end_pfn);
+   zone_start_pfn, zone_end_pfn);
 
/* Check that this node has pages within the zone's required range */
-   if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+   if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
return 0;
 
/* Move the zone boundaries inside the node if necessary */
-   zone_end_pfn = min(zone_end_pfn, node_end_pfn);
-   zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+   *zone_end_pfn = min(*zone_end_pfn, node_end_pfn);
+   *zone_start_pfn = max(*zone_start_pfn, node_start_pfn);
 
/* Return the spanned pages */
-   return zone_end_pfn - zone_start_pfn;
+   return *zone_end_pfn - *zone_start_pfn;
 }
 
 /*
@@ -5017,6 +5017,8 @@ static inline unsigned long __meminit 
zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
unsigned long node_start_pfn,
unsigned long node_end_pfn,
+   unsigned long *zone_start_pfn,
+   unsigned long *zone_end_pfn,
unsigned long *zones_size)
 {
return zones_size[zone_type];
@@ -5047,15 +5049,22 @@ static void __meminit calculate_node_totalpages(struct 
pglist_data *pgdat,
 
for (i = 0; i < MAX_NR_ZONES; i++) {
struct zone *zone = pgdat->node_zones + i;
+   unsigned long zone_start_pfn, zone_end_pfn;
unsigned long size, real_size;
 
size = zone_spanned_pages_in_node(pgdat->node_id, i,
  node_start_pfn,
  node_end_pfn,
+ &zone_start_pfn,
+ &zone_end_pfn,
  zones_size);
real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
  node_start_pfn, node_end_pfn,
  zholes_size);
+   if (size)
+   zone->zone_start_pfn = zone_start_pfn;
+   else
+   zone->zone_start_pfn = 0;
zone->spanned_pages = size;
zone->present_pages = real_size;
 
@@ -5176,7 +5185,6 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat)
 {
enum zone_type j;
int nid = pgdat->node_id;
-   unsigned long zone_start_pfn = pgdat->node_start_pfn;
int ret;
 
pgdat_resize_init(pgdat);
@@ -5192,6 +5200,7 @@ static void __paginginit free_area_init_core(struct 
pglist_data *pgdat)
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize, freesize, memmap_pages;
+   unsigned long zone_start_pfn = zone->zone_start_pfn;
 
size = zone->spanned_pages;
realsize = freesize = zone->present_pages

[PATCH v2 2/2] mm: Introduce kernelcore=reliable option

2015-11-26 Thread Taku Izumi
This patch extends the existing "kernelcore" option and
introduces a kernelcore=reliable option. By specifying
"reliable" instead of an amount of memory, the
non-reliable region is arranged into ZONE_MOVABLE.

v1 -> v2:
 - Refine so that the following case also can be
   handled properly:

 Node X:  |MM--MM|
   (legend) M: mirrored  -: not mirrored

 In this case, ZONE_NORMAL and ZONE_MOVABLE are
 arranged as below:

 Node X:  |--|
  |ooxxoo| ZONE_NORMAL
|ooxx| ZONE_MOVABLE
   (legend) o: present  x: absent

Signed-off-by: Taku Izumi 
---
 Documentation/kernel-parameters.txt |   9 ++-
 mm/page_alloc.c | 110 ++--
 2 files changed, 112 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index f8aae63..ed44c2c8 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1695,7 +1695,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
 
keepinitrd  [HW,ARM]
 
-   kernelcore=nn[KMG]  [KNL,X86,IA-64,PPC] This parameter
+   kernelcore= Format: nn[KMG] | "reliable"
+   [KNL,X86,IA-64,PPC] This parameter
specifies the amount of memory usable by the kernel
for non-movable allocations.  The requested amount is
spread evenly throughout all nodes in the system. The
@@ -1711,6 +1712,12 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.
 
+   Instead of specifying the amount of memory (nn[KMG]),
+   you can specify the "reliable" option. When "reliable"
+   is specified, reliable memory is used for
+   non-movable allocations and the remaining memory is used
+   for Movable pages.
+
kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index acb0b4e..006a3d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -251,6 +251,7 @@ static unsigned long __meminitdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -4472,6 +4473,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	unsigned long pfn;
 	struct zone *z;
 	unsigned long nr_initialised = 0;
+	struct memblock_region *r = NULL, *tmp;
 
 	if (highest_memmap_pfn < end_pfn - 1)
 		highest_memmap_pfn = end_pfn - 1;
@@ -4491,6 +4493,38 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			if (!update_defer_init(pgdat, pfn, end_pfn,
 						&nr_initialised))
 				break;
+
+			/*
+			 * if not reliable_kernelcore and ZONE_MOVABLE exists,
+			 * range from zone_movable_pfn[nid] to end of each node
+			 * should be ZONE_MOVABLE not ZONE_NORMAL. skip it.
+			 */
+			if (!reliable_kernelcore && zone_movable_pfn[nid])
+				if (zone == ZONE_NORMAL &&
+				    pfn >= zone_movable_pfn[nid])
+					continue;
+
+			/*
+			 * check given memblock attribute by firmware which
+			 * can affect kernel memory layout.
+			 * if zone==ZONE_MOVABLE but memory is mirrored,
+			 * it's an overlapped memmap init. skip it.
+			 */
+			if (reliable_kernelcore && zone == ZONE_MOVABLE) {
+				if (!r ||
+				    pfn >= memblock_region_memory_end_pfn(r)) {
+					for_each_memblock(memory, tmp)
+						if (pfn < memblock_region_memory_end_pfn(tmp))
+							break;
+					r = tmp;
+  
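
The hunk above is cut off in this archive, so the rest of the lookup is
not shown here. As a rough userspace sketch of the idea described in the
comment (while initializing ZONE_MOVABLE, skip pfns that fall in mirrored
memory, and remember the last region found so the region list is not
rescanned for every pfn), something like the following could work.
struct region, find_region() and pfn_is_mirrored() are hypothetical names
used only for illustration; they are not kernel APIs. The sketch assumes
the regions are sorted by address and that pfns are queried in ascending
order, as in the memmap_init_zone() loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a memblock region, in page frame numbers. */
struct region {
	unsigned long base_pfn;	/* first pfn in the region */
	unsigned long end_pfn;	/* one past the last pfn */
	bool mirrored;		/* firmware marked the range as reliable */
};

/*
 * Return the first region whose end lies above pfn. *cache is reused
 * while pfn is still below the cached region's end, so an ascending scan
 * walks the region list only when it crosses a region boundary.
 * Returns NULL when pfn is past every region.
 */
const struct region *find_region(const struct region *regions, size_t nr,
				 unsigned long pfn,
				 const struct region **cache)
{
	size_t i;

	assert(cache != NULL);

	if (*cache && pfn < (*cache)->end_pfn)
		return *cache;

	for (i = 0; i < nr; i++) {
		if (pfn < regions[i].end_pfn) {
			*cache = &regions[i];
			return *cache;
		}
	}

	*cache = NULL;
	return NULL;
}

/* True when pfn lies inside a mirrored region and would be skipped. */
bool pfn_is_mirrored(const struct region *regions, size_t nr,
		     unsigned long pfn, const struct region **cache)
{
	const struct region *r = find_region(regions, nr, pfn, cache);

	return r && pfn >= r->base_pfn && r->mirrored;
}
```

With regions [0, 2) mirrored, [2, 4) not mirrored and [4, 6) mirrored,
pfns 0, 1, 4 and 5 are reported as mirrored and pfns 2 and 3 are not.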

[PATCH v2 0/2] mm: Introduce kernelcore=reliable option

2015-11-26 Thread Taku Izumi
Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS compliant with UEFI spec 2.5 can report which
ranges are reliable (mirrored) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.

My requirement is:
  - allocate kernel memory from reliable region
  - allocate user memory from non-reliable region

In order to meet my requirement, ZONE_MOVABLE is useful.
By arranging the non-reliable range into ZONE_MOVABLE,
reliable memory is used only for kernel allocations.

My idea is to extend the existing "kernelcore" option and
introduce a kernelcore=reliable option. When "reliable"
is specified instead of an amount of memory, non-reliable
regions are arranged into ZONE_MOVABLE.

Earlier discussions are at:
 https://lkml.org/lkml/2015/10/9/24
 https://lkml.org/lkml/2015/10/15/9

For example, suppose a 2-node system with the following memory
 ranges:
  node 0 [mem 0x1000-0x00109fff]
  node 1 [mem 0x0010a000-0x00209fff]

and the following ranges are marked as reliable:
  [0x-0x0001]
  [0x0001-0x00018000]
  [0x0008-0x00088000]
  [0x0010a000-0x00112000]
  [0x0017a000-0x00182000]

If you specify kernelcore=reliable, ZONE_NORMAL and ZONE_MOVABLE
are arranged as shown below:

 - node 0:
  ZONE_NORMAL : [0x0001-0x0010a000]
  ZONE_MOVABLE: [0x00018000-0x0010a000]
 - node 1:
  ZONE_NORMAL : [0x0010a000-0x0020a000]
  ZONE_MOVABLE: [0x00112000-0x0020a000]

In overlapped range, pages to be ZONE_MOVABLE in ZONE_NORMAL
are treated as absent pages, and vice versa.
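
To make the present/absent accounting concrete, here is a small
userspace sketch. It is not the kernel implementation. It assumes a node
that has only ZONE_NORMAL and ZONE_MOVABLE, no memory holes, ZONE_NORMAL
spanning the whole node, and ZONE_MOVABLE starting at the first pfn that
is not mirrored. struct pfn_range, struct zone_layout and layout_node()
are hypothetical names, and the mirrored ranges are assumed to be sorted,
non-overlapping and inside the node.

```c
#include <assert.h>
#include <stddef.h>

struct pfn_range {
	unsigned long start;	/* first pfn */
	unsigned long end;	/* one past the last pfn */
};

struct zone_layout {
	unsigned long movable_start;	/* first pfn of ZONE_MOVABLE */
	unsigned long normal_present;
	unsigned long normal_absent;
	unsigned long movable_present;
	unsigned long movable_absent;
};

/* Number of pfns shared by [a_start, a_end) and [b_start, b_end). */
unsigned long overlap(unsigned long a_start, unsigned long a_end,
		      unsigned long b_start, unsigned long b_end)
{
	unsigned long lo = a_start > b_start ? a_start : b_start;
	unsigned long hi = a_end < b_end ? a_end : b_end;

	return hi > lo ? hi - lo : 0;
}

struct zone_layout layout_node(unsigned long node_start,
			       unsigned long node_end,
			       const struct pfn_range *mirrored, size_t nr)
{
	struct zone_layout z = {0};
	unsigned long mirrored_total = 0, mirrored_in_movable = 0;
	size_t i;

	assert(node_start <= node_end);

	/* ZONE_MOVABLE starts at the first pfn that is not mirrored. */
	z.movable_start = node_start;
	for (i = 0; i < nr; i++) {
		if (mirrored[i].start > z.movable_start)
			break;
		if (mirrored[i].end > z.movable_start)
			z.movable_start = mirrored[i].end;
	}
	if (z.movable_start > node_end)
		z.movable_start = node_end;

	for (i = 0; i < nr; i++) {
		mirrored_total += overlap(node_start, node_end,
					  mirrored[i].start, mirrored[i].end);
		mirrored_in_movable += overlap(z.movable_start, node_end,
					       mirrored[i].start,
					       mirrored[i].end);
	}

	/* Mirrored pages are present in NORMAL and absent in MOVABLE. */
	z.normal_present = mirrored_total;
	z.normal_absent = (node_end - node_start) - mirrored_total;
	z.movable_absent = mirrored_in_movable;
	z.movable_present = (node_end - z.movable_start) - mirrored_in_movable;
	return z;
}
```

For a node of six pfns with mirrored ranges [0, 2) and [4, 6), that is
the schematic layout |MM--MM|, layout_node() gives movable_start = 2,
4 present and 2 absent pages in ZONE_NORMAL, and 2 present and 2 absent
pages in ZONE_MOVABLE.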

v1 -> v2:
 Refined so that the above example case can also be
 handled properly.


Taku Izumi (2):
  mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()
  mm: Introduce kernelcore=reliable option

 Documentation/kernel-parameters.txt |   9 ++-
 mm/page_alloc.c | 140 +++-
 2 files changed, 131 insertions(+), 18 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 1/2] mm: Calculate zone_start_pfn at zone_spanned_pages_in_node()

2015-11-26 Thread Taku Izumi
Currently each zone's zone_start_pfn is calculated in
free_area_init_core(). However, a zone's range is already fixed
at the time zone_spanned_pages_in_node() is invoked.

This patch changes the code so that each zone->zone_start_pfn
is calculated in zone_spanned_pages_in_node().
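
The effect of the new out-parameters can be sketched in plain C outside
the kernel. zone_span() below is a hypothetical simplification: it takes
the zone limits directly instead of reading
arch_zone_lowest_possible_pfn[] and arch_zone_highest_possible_pfn[],
and it leaves out adjust_zone_range_for_zone_movable().
recorded_zone_start() mirrors what the caller does with the result.

```c
#include <assert.h>

/* Hypothetical userspace simplification of zone_spanned_pages_in_node(). */
unsigned long zone_span(unsigned long node_start_pfn,
			unsigned long node_end_pfn,
			unsigned long zone_lowest_pfn,
			unsigned long zone_highest_pfn,
			unsigned long *zone_start_pfn,
			unsigned long *zone_end_pfn)
{
	assert(node_start_pfn <= node_end_pfn);

	*zone_start_pfn = zone_lowest_pfn;
	*zone_end_pfn = zone_highest_pfn;

	/* No pages of this zone fall inside the node. */
	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
		return 0;

	/* Move the zone boundaries inside the node if necessary. */
	if (*zone_end_pfn > node_end_pfn)
		*zone_end_pfn = node_end_pfn;
	if (*zone_start_pfn < node_start_pfn)
		*zone_start_pfn = node_start_pfn;

	return *zone_end_pfn - *zone_start_pfn;
}

/* Caller side: record the start only when the zone is not empty. */
unsigned long recorded_zone_start(unsigned long size,
				  unsigned long zone_start_pfn)
{
	return size ? zone_start_pfn : 0;
}
```

Because the out-parameters are not clamped when the function returns 0,
the caller has to check the returned size before trusting them, which is
why an empty zone gets a start of 0.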

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 mm/page_alloc.c | 30 +++---
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17a3c66..acb0b4e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4928,31 +4928,31 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *ignored)
 {
-	unsigned long zone_start_pfn, zone_end_pfn;
-
 	/* When hotadd a new node from cpu_up(), the node should be empty */
 	if (!node_start_pfn && !node_end_pfn)
 		return 0;
 
 	/* Get the start and end of the zone */
-	zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
-	zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
+	*zone_start_pfn = arch_zone_lowest_possible_pfn[zone_type];
+	*zone_end_pfn = arch_zone_highest_possible_pfn[zone_type];
 	adjust_zone_range_for_zone_movable(nid, zone_type,
 				node_start_pfn, node_end_pfn,
-				&zone_start_pfn, &zone_end_pfn);
+				zone_start_pfn, zone_end_pfn);
 
 	/* Check that this node has pages within the zone's required range */
-	if (zone_end_pfn < node_start_pfn || zone_start_pfn > node_end_pfn)
+	if (*zone_end_pfn < node_start_pfn || *zone_start_pfn > node_end_pfn)
 		return 0;
 
 	/* Move the zone boundaries inside the node if necessary */
-	zone_end_pfn = min(zone_end_pfn, node_end_pfn);
-	zone_start_pfn = max(zone_start_pfn, node_start_pfn);
+	*zone_end_pfn = min(*zone_end_pfn, node_end_pfn);
+	*zone_start_pfn = max(*zone_start_pfn, node_start_pfn);
 
 	/* Return the spanned pages */
-	return zone_end_pfn - zone_start_pfn;
+	return *zone_end_pfn - *zone_start_pfn;
 }
 
 /*
@@ -5017,6 +5017,8 @@ static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
 					unsigned long zone_type,
 					unsigned long node_start_pfn,
 					unsigned long node_end_pfn,
+					unsigned long *zone_start_pfn,
+					unsigned long *zone_end_pfn,
 					unsigned long *zones_size)
 {
 	return zones_size[zone_type];
@@ -5047,15 +5049,22 @@ static void __meminit calculate_node_totalpages(struct pglist_data *pgdat,
 
 	for (i = 0; i < MAX_NR_ZONES; i++) {
 		struct zone *zone = pgdat->node_zones + i;
+		unsigned long zone_start_pfn, zone_end_pfn;
 		unsigned long size, real_size;
 
 		size = zone_spanned_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn,
 						  node_end_pfn,
+						  &zone_start_pfn,
+						  &zone_end_pfn,
 						  zones_size);
 		real_size = size - zone_absent_pages_in_node(pgdat->node_id, i,
 						  node_start_pfn, node_end_pfn,
 						  zholes_size);
+		if (size)
+			zone->zone_start_pfn = zone_start_pfn;
+		else
+			zone->zone_start_pfn = 0;
 		zone->spanned_pages = size;
 		zone->present_pages = real_size;
 
@@ -5176,7 +5185,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 {
 	enum zone_type j;
 	int nid = pgdat->node_id;
-	unsigned long zone_start_pfn = pgdat->node_start_pfn;
 	int ret;
 
 	pgdat_resize_init(pgdat);
@@ -5192,6 +5200,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
 		unsigned long size, realsize, freesize, memmap_pages;
+		unsigned long zone_start_pfn = zone->zone_start_pfn;
 
 		size = zone->spanned_pages;
reals

RE: [PATCH] fjes: fix inconsistent indenting

2015-11-11 Thread Izumi, Taku
Thanks, Colin.

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>

> -----Original Message-----
> From: Colin King [mailto:colin.k...@canonical.com]
> Sent: Thursday, November 12, 2015 12:23 AM
> To: David S. Miller; Izumi, Taku/泉 拓; Markus Elfring; net...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Subject: [PATCH] fjes: fix inconsistent indenting
> 
> From: Colin Ian King <colin.k...@canonical.com>
> 
> minor change, indenting is one tab out.
> 
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
> ---
>  drivers/net/fjes/fjes_hw.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/fjes/fjes_hw.c b/drivers/net/fjes/fjes_hw.c
> index bb8b530..b103adb 100644
> --- a/drivers/net/fjes/fjes_hw.c
> +++ b/drivers/net/fjes/fjes_hw.c
> @@ -599,7 +599,7 @@ int fjes_hw_unregister_buff_addr(struct fjes_hw *hw, int 
> dest_epid)
>   FJES_CMD_REQ_RES_CODE_BUSY) &&
>  (timeout > 0)) {
>   msleep(200 + hw->my_epid * 20);
> - timeout -= (200 + hw->my_epid * 20);
> + timeout -= (200 + hw->my_epid * 20);
> 
>   res_buf->unshare_buffer.length = 0;
>   res_buf->unshare_buffer.code = 0;
> --
> 2.5.0


[tip:core/efi] efi: Fix warning of int-to-pointer-cast on x86 32-bit builds

2015-10-28 Thread tip-bot for Taku Izumi
Commit-ID:  78b9bc947b18ed16b6c2c573d774e6d54ad9452d
Gitweb: http://git.kernel.org/tip/78b9bc947b18ed16b6c2c573d774e6d54ad9452d
Author: Taku Izumi <izumi.t...@jp.fujitsu.com>
AuthorDate: Fri, 23 Oct 2015 11:48:17 +0200
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 28 Oct 2015 12:28:06 +0100

efi: Fix warning of int-to-pointer-cast on x86 32-bit builds

Commit:

  0f96a99dab36 ("efi: Add "efi_fake_mem" boot option")

introduced the following warning message:

  drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]

new_memmap_phy was defined as a u64 value and cast to void*,
causing an int-to-pointer-cast warning on x86 32-bit builds.
However, since the void* type is inappropriate for a physical
address, the definition of struct efi_memory_map::phys_map has
been changed to phys_addr_t in the previous patch, and so the
cast can be dropped entirely.

This patch also changes the type of the "new_memmap_phy"
variable from "u64" to "phys_addr_t" to align with the types of
memblock_alloc() and struct efi_memory_map::phys_map.
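
As a standalone illustration of why the original cast warned: on a
32-bit build a pointer is 32 bits wide, so casting a 64-bit integer
straight to void* silently drops the upper half. The patch itself avoids
the problem differently, by keeping the address in a phys_addr_t and
dropping the cast. The helper below is only a hypothetical userspace
sketch of a checked conversion; phys_to_ptr() is not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a 64-bit physical address to a pointer
 * only when no bits would be lost. Where uintptr_t is 32 bits wide,
 * addresses above 4 GiB are rejected instead of being truncated.
 */
bool phys_to_ptr(uint64_t phys, void **out)
{
	assert(out != NULL);

	if (phys > (uint64_t)UINTPTR_MAX)
		return false;

	/* Going through uintptr_t makes the width change explicit. */
	*out = (void *)(uintptr_t)phys;
	return true;
}
```

On a 64-bit build every value passes the range check; on a 32-bit build
the check is what the plain (void *) cast was missing.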

Reported-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
[ Removed void* cast, updated commit log]
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
Reviewed-by: Matt Fleming <m...@codeblueprint.co.uk>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: kamezawa.hir...@jp.fujitsu.com
Cc: linux-...@vger.kernel.org
Cc: matt.flem...@intel.com
Link: http://lkml.kernel.org/r/1445593697-1342-2-git-send-email-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 drivers/firmware/efi/fake_mem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 32bcb14..ed3a854 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -59,7 +59,7 @@ void __init efi_fake_memmap(void)
 	u64 start, end, m_start, m_end, m_attr;
 	int new_nr_map = memmap.nr_map;
 	efi_memory_desc_t *md;
-	u64 new_memmap_phy;
+	phys_addr_t new_memmap_phy;
 	void *new_memmap;
 	void *old, *new;
 	int i;
@@ -183,7 +183,7 @@ void __init efi_fake_memmap(void)
 	/* swap into new EFI memmap */
 	efi_unmap_memmap();
 	memmap.map = new_memmap;
-	memmap.phys_map = (void *)new_memmap_phy;
+	memmap.phys_map = new_memmap_phy;
 	memmap.nr_map = new_nr_map;
 	memmap.map_end = memmap.map + memmap.nr_map * memmap.desc_size;
 	set_bit(EFI_MEMMAP, &efi.flags);

RE: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-22 Thread Izumi, Taku
 Dear Tony,

> -----Original Message-----
> From: Luck, Tony [mailto:tony.l...@intel.com]
> Sent: Friday, October 23, 2015 8:27 AM
> To: Kamezawa, Hiroyuki/亀澤 寛之; Izumi, Taku/泉 拓; linux-kernel@vger.kernel.org; 
> linux...@kvack.org
> Cc: qiuxi...@huawei.com; m...@csn.ul.ie; a...@linux-foundation.org; Hansen, 
> Dave; m...@codeblueprint.co.uk
> Subject: RE: [PATCH] mm: Introduce kernelcore=reliable option
> 
> > I think /proc/zoneinfo can show detailed numbers per zone. Do we need some 
> > for meminfo ?
> 
> I wrote a little script (attached) to summarize /proc/zoneinfo ... on my 
> system it says
> 
> $ zoneinfo
> Node      Normal     Movable     DMA     DMA32
>    0        0.00   103020.07    8.94   1554.46
>    1     9284.54    89870.43
>    2     9626.33    94050.09
>    3     9602.82    93650.04
> 
> Not sure why I have zero Normal memory free on node0.  The sum of all those
> free counts is 410667.72 MB ... which is close enough to the boot time message
> showing the amount of mirror/total memory:
> 
> [0.00] efi: Memory: 80979/420096M mirrored memory
> 
> but a fair amount of the 80G of mirrored memory seems to have been miscounted
> as Movable instead of Normal. Perhaps this is because I have two blocks of
> mirrored memory on each node and the movable zone code doesn't expect that?

 Are you saying that the OS view of a node's memory is something like the
 following?

 Node X:  |MM--MM|
   (legend) M: mirrored  -: not mirrored

 If so, is this a real box's configuration?
 Sorry, I don't have a real Address Range Mirroring capable box yet ...
 I thought the mirrored range was concatenated at the first part of each node.

 Sincerely,
 Taku Izumi


RE: [PATCH] efi: Fix warning of int-to-pointer-cast on x86 32-bit builds

2015-10-22 Thread Izumi, Taku
Dear Ard,

> > commit-0f96a99 introduces the following warning message:
> >
> >  drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer
> >  from integer of different size [-Wint-to-pointer-cast]
> >
> > new_memmap_phy was defined as a u64 value and casted to void*.
> > This causes a warning of int-to-pointer-cast on x86 32-bit
> > environment.
> >
> > This patch changes the type of "new_memmap_phy" variable
> > from "u64" into "phys_addr_t" to avoid it.
> 
> This assumes sizeof(void*) == sizeof(phys_addr_t), which is not always true, 
> e.g., on 32-bit ARM (whose UEFI support is
> in development but not yet merged) with LPAE enabled.
> 
> Could we use unsigned long instead?

  Okay. I'll update my patch.

Sincerely,
Taku Izumi

[PATCH] efi: Fix warning of int-to-pointer-cast on x86 32-bit builds

2015-10-22 Thread Taku Izumi
commit-0f96a99 introduces the following warning message:

  drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer
  from integer of different size [-Wint-to-pointer-cast]

new_memmap_phy was defined as a u64 value and cast to void*.
This causes an int-to-pointer-cast warning in the x86 32-bit
environment.

This patch changes the type of the "new_memmap_phy" variable
from "u64" to "phys_addr_t" to avoid it.

Reported-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/fake_mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 32bcb14..b65bc07 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -59,7 +59,7 @@ void __init efi_fake_memmap(void)
 	u64 start, end, m_start, m_end, m_attr;
 	int new_nr_map = memmap.nr_map;
 	efi_memory_desc_t *md;
-	u64 new_memmap_phy;
+	phys_addr_t new_memmap_phy;
 	void *new_memmap;
 	void *old, *new;
 	int i;
-- 
1.8.3.1


[PATCH v2] efi: Fix warning of int-to-pointer-cast on x86 32-bit builds

2015-10-22 Thread Taku Izumi
commit-0f96a99 introduces the following warning message:

  drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer
  from integer of different size [-Wint-to-pointer-cast]

new_memmap_phy was defined as a u64 value and cast to void*.
This causes an int-to-pointer-cast warning in the x86 32-bit
environment.

This patch changes the type of the "new_memmap_phy" variable
from "u64" to "ulong" to avoid it.

v1 -> v2:
 - change the type of "new_memmap_phy" from phys_addr_t
   to ulong according to Ard's comment

Reported-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/fake_mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/fake_mem.c b/drivers/firmware/efi/fake_mem.c
index 32bcb14..1f483b4 100644
--- a/drivers/firmware/efi/fake_mem.c
+++ b/drivers/firmware/efi/fake_mem.c
@@ -59,7 +59,7 @@ void __init efi_fake_memmap(void)
 	u64 start, end, m_start, m_end, m_attr;
 	int new_nr_map = memmap.nr_map;
 	efi_memory_desc_t *md;
-	u64 new_memmap_phy;
+	ulong new_memmap_phy;
 	void *new_memmap;
 	void *old, *new;
 	int i;
-- 
1.8.3.1


RE: [PATCH] mm: Introduce kernelcore=reliable option

2015-10-19 Thread Izumi, Taku
 Hi Xishi,

> On 2015/10/15 21:32, Taku Izumi wrote:
> 
> > Xeon E7 v3 based systems support Address Range Mirroring,
> > and a UEFI BIOS that complies with UEFI spec 2.5 can report
> > which ranges are reliable (mirrored) via the EFI memory map.
> > The Linux kernel now uses this information and allocates
> > boot-time memory from the reliable region.
> >
> > My requirements are:
> >   - allocate kernel memory from the reliable region
> >   - allocate user memory from the non-reliable region
> >
> > ZONE_MOVABLE is useful for meeting these requirements.
> > By arranging the non-reliable range into ZONE_MOVABLE,
> > reliable memory is used only for kernel allocations.
> >
> > This patch extends the existing "kernelcore" option and
> > introduces the kernelcore=reliable option. When "reliable"
> > is specified instead of an amount of memory, the
> > non-reliable region is arranged into ZONE_MOVABLE.
> >
> > Earlier discussion is at:
> >  https://lkml.org/lkml/2015/10/9/24
> >
> 
> Hi Taku,
> 
> If the user doesn't want to waste a lot of memory and only
> configures a small amount of memory as mirrored, then the
> kernelcore is very small, right? That means the OS will have
> a very small normal zone and a very large movable zone.

 Right.

> Kernel allocations can only use the unmovable zone. As the
> normal zone is very small, kernel allocations may hit OOM,
> right?

 Right.

> Do you mean that we will reuse the movable zone as a short-term
> solution and create a new zone (mirrored zone) in the future?

 If there is that kind of requirement, I don't oppose
 creating a new zone.

 Sincerely,
 Taku Izumi


[PATCH] mm: Introduce kernelcore=reliable option

2015-10-14 Thread Taku Izumi
Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS that complies with UEFI spec 2.5 can report
which ranges are reliable (mirrored) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.

My requirements are:
  - allocate kernel memory from the reliable region
  - allocate user memory from the non-reliable region

ZONE_MOVABLE is useful for meeting these requirements.
By arranging the non-reliable range into ZONE_MOVABLE,
reliable memory is used only for kernel allocations.

This patch extends the existing "kernelcore" option and
introduces the kernelcore=reliable option. When "reliable"
is specified instead of an amount of memory, the
non-reliable region is arranged into ZONE_MOVABLE.

Earlier discussion is at:
 https://lkml.org/lkml/2015/10/9/24

For example, suppose a 2-node system with the following
memory ranges:
  node 0 [mem 0x0000000000001000-0x000000109fffffff]
  node 1 [mem 0x00000010a0000000-0x000000209fffffff]

and the following ranges are marked as reliable (*):
  [0x0000000000000000-0x0000000100000000]
  [0x0000000100000000-0x0000000180000000]
  [0x00000010a0000000-0x0000001120000000]

If you specify kernelcore=reliable, the Movable zones are
arranged as follows:
  Movable zone start for each node
    Node 0: 0x0000000180000000
    Node 1: 0x0000001120000000

(*) I specified the following instead of using a UEFI BIOS
that complies with UEFI spec 2.5:
 efi_fake_mem=4G@0:0x10000,2G@0x10a0000000:0x10000,2G@4G:0x10000
efi_fake_mem is available at:
 git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi.git
 tags/efi-next
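
The rule in this example (each node's Movable zone starts at the lowest non-mirrored address on that node) can be modelled in userspace. The following is only a sketch: `struct region`, `find_movable_start` and the `example` table are hypothetical stand-ins for the kernel's memblock data, 4 KiB pages are assumed, and the addresses are the example's ranges read as full 64-bit values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages, as on x86 */
#define MAX_NODES  2

/* Hypothetical stand-in for a memblock region; not the kernel's struct. */
struct region {
    uint64_t base;    /* physical start address */
    uint64_t end;     /* physical end address (exclusive) */
    int      nid;     /* NUMA node id */
    bool     mirror;  /* true if the range is marked reliable */
};

/*
 * For each node, record the lowest page frame number (PFN) of any
 * non-mirrored region. 0 means "no non-mirrored region seen", the same
 * zero-means-unset convention the patch uses for zone_movable_pfn[].
 */
void find_movable_start(const struct region *r, int n,
                        uint64_t movable_pfn[MAX_NODES])
{
    for (int i = 0; i < MAX_NODES; i++)
        movable_pfn[i] = 0;

    for (int i = 0; i < n; i++) {
        if (r[i].mirror)
            continue;                 /* reliable memory stays kernel-usable */
        if (r[i].nid < 0 || r[i].nid >= MAX_NODES)
            continue;                 /* ignore regions outside the model */

        uint64_t pfn = r[i].base >> PAGE_SHIFT;   /* like PFN_DOWN() */
        uint64_t cur = movable_pfn[r[i].nid];

        movable_pfn[r[i].nid] = (cur && cur < pfn) ? cur : pfn;
    }
}

/* The 2-node layout from the example, split at the reliable boundaries. */
const struct region example[] = {
    { 0x0000000000001000ULL, 0x0000000180000000ULL, 0, true  },
    { 0x0000000180000000ULL, 0x00000010a0000000ULL, 0, false },
    { 0x00000010a0000000ULL, 0x0000001120000000ULL, 1, true  },
    { 0x0000001120000000ULL, 0x00000020a0000000ULL, 1, false },
};
```

Calling `find_movable_start(example, 4, pfn)` and shifting each result left by `PAGE_SHIFT` gives 0x180000000 for node 0 and 0x1120000000 for node 1, matching the Movable zone starts listed in the example. Note that with this rule any mirrored range located above the first non-mirrored range on a node would also fall into the Movable zone.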

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  9 ++++++++-
 mm/page_alloc.c                     | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index cd5312f..b2c8c13 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1663,7 +1663,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
 
keepinitrd  [HW,ARM]
 
-   kernelcore=nn[KMG]  [KNL,X86,IA-64,PPC] This parameter
+   kernelcore= Format: nn[KMG] | "reliable"
+   [KNL,X86,IA-64,PPC] This parameter
specifies the amount of memory usable by the kernel
for non-movable allocations.  The requested amount is
spread evenly throughout all nodes in the system. The
@@ -1679,6 +1680,12 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.
 
+   Instead of specifying the amount of memory (nn[KMG]),
+   you can specify "reliable" option. In case "reliable"
+   option is specified, reliable memory is used for
+   non-movable allocations and remaining memory is used
+   for Movable pages.
+
kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index beda417..d0b3ac9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -221,6 +221,7 @@ static unsigned long __meminitdata 
arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore __initdata;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -5618,6 +5619,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
}
 
/*
+* If kernelcore=reliable is specified, ignore movablecore option
+*/
+   if (reliable_kernelcore) {
+   for_each_memblock(memory, r) {
+   if (memblock_is_mirror(r))
+   continue;
+
+   nid = r->nid;
+
+   usable_startpfn = PFN_DOWN(r->base);
+   zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+   min(usable_startpfn, zone_movable_pfn[nid]) :
+   usable_startpfn;
+   }
+
+   goto out2;
+   }
+
+   /*
 * If movablecore=nn[KMG] was specified, calculate what size of
 * kernelcore that corresponds so that memory usable for
 * any allocation type is evenly spread. If both kernelcore
@@ -5873,6 +5893,12 @@ static int __init cmdline_parse_core(char *p, unsigned 
long *core)
  */
 stati

RE: [PATCH][RFC] mm: Introduce kernelcore=reliable option

2015-10-13 Thread Izumi, Taku
> > I remember Kame has already suggested this idea. In my opinion,
> > I still think it's better to add a new migratetype or a new zone,
> > so both user and kernel could use mirrored memory.
> 
> A new zone would be more flexible ... and probably the right long
> term solution.  But this looks like a very clever way to try out the
> feature with a minimally invasive patch.

 Yes. I agree that creating a new zone is the right long-term solution.
 I believe this approach using ZONE_MOVABLE is good and reasonable
 as a short-term solution.

Sincerely,
Taku Izumi


[PATCH][RFC] mm: Introduce kernelcore=reliable option

2015-10-08 Thread Taku Izumi
Xeon E7 v3 based systems support Address Range Mirroring,
and a UEFI BIOS that complies with UEFI spec 2.5 can report
which ranges are reliable (mirrored) via the EFI memory map.
The Linux kernel now uses this information and allocates
boot-time memory from the reliable region.

My requirements are:
  - allocate kernel memory from the reliable region
  - allocate user memory from the non-reliable region

ZONE_MOVABLE is useful for meeting these requirements.
By arranging the non-reliable range into ZONE_MOVABLE,
reliable memory is used only for kernel allocations.

This patch extends the existing "kernelcore" option and
introduces the kernelcore=reliable option. When "reliable"
is specified instead of an amount of memory, the
non-reliable region is arranged into ZONE_MOVABLE.

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  9 ++++++++-
 mm/page_alloc.c                     | 26 ++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 50fc09b..6791cbb 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1669,7 +1669,8 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
 
keepinitrd  [HW,ARM]
 
-   kernelcore=nn[KMG]  [KNL,X86,IA-64,PPC] This parameter
+   kernelcore= Format: nn[KMG] | "reliable"
+   [KNL,X86,IA-64,PPC] This parameter
specifies the amount of memory usable by the kernel
for non-movable allocations.  The requested amount is
spread evenly throughout all nodes in the system. The
@@ -1685,6 +1686,12 @@ bytes respectively. Such letter suffixes can also be 
entirely omitted.
use the HighMem zone if it exists, and the Normal
zone if it does not.
 
+   Instead of specifying the amount of memory (nn[KMG]),
+   you can specify "reliable" option. In case "reliable"
+   option is specified, reliable memory is used for
+   non-movable allocations and remaining memory is used
+   for Movable pages.
+
kgdbdbgp=   [KGDB,HW] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b..91d7556 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,6 +242,7 @@ static unsigned long __meminitdata 
arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 static unsigned long __initdata required_kernelcore;
 static unsigned long __initdata required_movablecore;
 static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+static bool reliable_kernelcore __initdata;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -5652,6 +5653,25 @@ static void __init find_zone_movable_pfns_for_nodes(void)
}
 
/*
+* If kernelcore=reliable is specified, ignore movablecore option
+*/
+   if (reliable_kernelcore) {
+   for_each_memblock(memory, r) {
+   if (memblock_is_mirror(r))
+   continue;
+
+   nid = r->nid;
+
+   usable_startpfn = PFN_DOWN(r->base);
+   zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+   min(usable_startpfn, zone_movable_pfn[nid]) :
+   usable_startpfn;
+   }
+
+   goto out2;
+   }
+
+   /*
 * If movablecore=nn[KMG] was specified, calculate what size of
 * kernelcore that corresponds so that memory usable for
 * any allocation type is evenly spread. If both kernelcore
@@ -5907,6 +5927,12 @@ static int __init cmdline_parse_core(char *p, unsigned 
long *core)
  */
 static int __init cmdline_parse_kernelcore(char *p)
 {
+   /* parse kernelcore=reliable */
+   if (parse_option_str(p, "reliable")) {
+   reliable_kernelcore = true;
+   return 0;
+   }
+
return cmdline_parse_core(p, &required_kernelcore);
 }
 
-- 
1.8.3.1



[tip:perf/core] perf/x86/intel/uncore: Fix multi-segment problem of perf_event_intel_uncore

2015-10-06 Thread tip-bot for Taku Izumi
Commit-ID:  712df65ccb63da08a484bf57c40b250dfd4103a7
Gitweb: http://git.kernel.org/tip/712df65ccb63da08a484bf57c40b250dfd4103a7
Author: Taku Izumi <izumi.t...@jp.fujitsu.com>
AuthorDate: Thu, 24 Sep 2015 21:10:21 +0900
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Tue, 6 Oct 2015 17:31:51 +0200

perf/x86/intel/uncore: Fix multi-segment problem of perf_event_intel_uncore

In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0:

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes the problem by introducing the segment-aware pci2phy_map
instead.
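
The collision can be illustrated with a small userspace model. This is a sketch only: `flat_map`, `struct seg_map`, `set_physid` and `lookup_physid` are hypothetical names, the physical ids in the usage note are invented for illustration, and the locking that the real patch needs is left out.

```c
#include <assert.h>
#include <stdlib.h>

#define NBUS 256

/* Old scheme: one flat table indexed by bus number only. */
int flat_map[NBUS];

/*
 * Segment-aware scheme: one 256-entry table per PCI segment, kept on a
 * singly linked list. Hypothetical model of the patch's pci2phy_map.
 */
struct seg_map {
    int segment;
    int bus_to_physid[NBUS];
    struct seg_map *next;
};

struct seg_map *seg_maps;

/* Record the physical id for (segment, bus); returns 0 on success. */
int set_physid(int segment, int bus, int physid)
{
    struct seg_map *m;

    if (bus < 0 || bus >= NBUS)
        return -1;

    for (m = seg_maps; m; m = m->next)
        if (m->segment == segment)
            break;

    if (!m) {
        /* First bus seen in this segment: create its table, all unset. */
        m = malloc(sizeof(*m));
        if (!m)
            return -1;
        m->segment = segment;
        for (int i = 0; i < NBUS; i++)
            m->bus_to_physid[i] = -1;
        m->next = seg_maps;
        seg_maps = m;
    }

    m->bus_to_physid[bus] = physid;
    return 0;
}

/* Look up the physical id for (segment, bus); -1 if unknown. */
int lookup_physid(int segment, int bus)
{
    if (bus < 0 || bus >= NBUS)
        return -1;

    for (struct seg_map *m = seg_maps; m; m = m->next)
        if (m->segment == segment)
            return m->bus_to_physid[bus];

    return -1;
}
```

With the flat table, storing an id for bus 0xff of segment 0 and then for bus 0xff of segment 1 leaves only the second value in `flat_map[0xff]`. With the per-segment tables, `set_physid(0, 0xff, 0)` followed by `set_physid(1, 0xff, 3)` keeps both, and `lookup_physid` returns the id that belongs to the requested segment.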

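Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: a...@kernel.org
Cc: h...@zytor.com
Link: http://lkml.kernel.org/r/1443096621-4119-1-git-send-email-izumi.t...@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>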
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c  | 61 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h  | 12 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 16 --
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 32 +---
 4 files changed, 106 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c 
b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 560e525..61215a6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev 
*uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,59 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+   struct pci2phy_map *map;
+   int phys_id = -1;
+
+   raw_spin_lock(&pci2phy_map_lock);
+   list_for_each_entry(map, &pci2phy_map_head, list) {
+   if (map->segment == pci_domain_nr(bus)) {
+   phys_id = map->pbus_to_physid[bus->number];
+   break;
+   }
+   }
+   raw_spin_unlock(&pci2phy_map_lock);
+
+   return phys_id;
+}
+
+struct pci2phy_map *__find_pci2phy_map(int segment)
+{
+   struct pci2phy_map *map, *alloc = NULL;
+   int i;
+
+   lockdep_assert_held(&pci2phy_map_lock);
+
+lookup:
+   list_for_each_entry(map, &pci2phy_map_head, list) {
+   if (map->segment == segment)
+   goto end;
+   }
+
+   if (!alloc) {
+   raw_spin_unlock(&pci2phy_map_lock);
+   alloc = kmalloc(sizeof(struct pci2phy_map), GFP_KERNEL);
+   raw_spin_lock(&pci2phy_map_lock);
+
+   if (!alloc)
+   return NULL;
+
+   goto lookup;
+   }
+
+   map = alloc;
+   alloc = NULL;
+   map->segment = segment;
+   for (i = 0; i < 256; i++)
+   map->pbus_to_physid[i] = -1;
+   list_add_tail(&map->list, &pci2phy_map_head);
+
+end:
+   kfree(alloc);
+   return map;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +863,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id
int phys_id;
bool first_box = false;
 
-   phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+   phys_id = uncore_pcibus_to_physid(pdev->bus);
if (phys_id < 0)
return -ENODEV;
 
@@ -856,9 +910,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
struct intel_uncore_box *box = pci_get_drvdata(pdev);
struct intel_uncore_pmu *pmu;
-   int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+   int i, cpu, phys_id;
bool last_box = false;
 
+   phys_id = uncore_pcib

[PATCH 2/2] x86, efi: Add "efi_fake_mem" boot option

2015-09-29 Thread Taku Izumi
This patch introduces a new boot option named "efi_fake_mem".
By specifying this parameter, you can add an arbitrary attribute
to a specific memory range.
This is useful for debugging the Address Range Mirroring feature.

For example, if "efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000"
is specified, the original (firmware provided) EFI memmap will be
updated so that the specified memory regions have the
EFI_MEMORY_MORE_RELIABLE attribute (0x10000):

 <original EFI memmap>
   efi: mem36: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB)

 <updated EFI memmap>
   efi: mem36: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB)
   efi: mem37: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB)
   efi: mem38: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB)
   efi: mem39: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB)

And you will find that the following message is output:

   efi: Memory: 4096M/131455M mirrored memory
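
To make the option format concrete, here is a rough userspace sketch of parsing a string of the form nn[KMG]@ss[KMG]:aa[,...]. It is not the code in fake_mem.c: `struct fake_mem`, `parse_size` and `parse_fake_mem` are hypothetical names, and `parse_size` only approximates the kernel's suffix handling using `strtoull`. The attribute value 0x10000 is bit 16, which the UEFI specification assigns to EFI_MEMORY_MORE_RELIABLE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One parsed "size@start:attribute" entry (hypothetical type). */
struct fake_mem {
    uint64_t start;  /* ss: start address of the range */
    uint64_t size;   /* nn: length of the range in bytes */
    uint64_t attr;   /* aa: attribute bits to add */
};

/* Parse a number with an optional K/M/G suffix; *end is set past it. */
uint64_t parse_size(const char *s, char **end)
{
    uint64_t v = strtoull(s, end, 0);

    if (*end == s)          /* no digits at all: leave *end == s */
        return 0;

    switch (**end) {
    case 'G': case 'g':
        v <<= 10;           /* fall through */
    case 'M': case 'm':
        v <<= 10;           /* fall through */
    case 'K': case 'k':
        v <<= 10;
        (*end)++;
        break;
    default:
        break;
    }
    return v;
}

/*
 * Parse up to max entries into out[]. Returns the number of entries
 * parsed, or -1 if the string does not match the expected format.
 */
int parse_fake_mem(const char *p, struct fake_mem *out, int max)
{
    int n = 0;

    while (*p && n < max) {
        char *end;
        uint64_t size, start, attr;

        size = parse_size(p, &end);
        if (end == p || *end != '@')
            return -1;
        p = end + 1;

        start = parse_size(p, &end);
        if (end == p || *end != ':')
            return -1;
        p = end + 1;

        attr = strtoull(p, &end, 0);
        if (end == p)
            return -1;
        p = end;

        out[n].start = start;
        out[n].size = size;
        out[n].attr = attr;
        n++;

        if (*p == ',')
            p++;
        else if (*p)
            return -1;
    }
    return n;
}
```

Feeding it "2G@4G:0x10000,2G@0x10a0000000:0x10000" yields two entries whose ranges, start to start+size, are 0x100000000-0x180000000 and 0x10a0000000-0x1120000000, the same two ranges that appear with the MR flag in the updated memmap.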

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |  15 +++
 arch/x86/kernel/setup.c |   4 +-
 drivers/firmware/efi/Kconfig|  22 
 drivers/firmware/efi/Makefile   |   1 +
 drivers/firmware/efi/fake_mem.c | 238 
 include/linux/efi.h |   6 +
 6 files changed, 285 insertions(+), 1 deletion(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 22a4b68..50fc09b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1094,6 +1094,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
you are really sure that your UEFI does sane gc and
fulfills the spec otherwise your board may brick.
 
+   efi_fake_mem=   nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI; X86]
+   Add arbitrary attribute to specific memory range by
+   updating original EFI memory map.
+   Region of memory which aa attribute is added to is
+   from ss to ss+nn.
+   If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000
+   is specified, EFI_MEMORY_MORE_RELIABLE(0x10000)
+   attribute is added to range 0x100000000-0x180000000 and
+   0x10a0000000-0x1120000000.
+
+   Using this parameter you can do debugging of EFI memmap
+   related feature. For example, you can do debugging of
+   Address Range Mirroring feature even if your box
+   doesn't support it.
+
eisa_irq_edge=  [PARISC,HW]
See header of drivers/parisc/eisa.c.
 
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fdb7f2a..30b4c44 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1079,8 +1079,10 @@ void __init setup_arch(char **cmdline_p)
memblock_set_current_limit(ISA_END_ADDRESS);
memblock_x86_fill();
 
-   if (efi_enabled(EFI_BOOT))
+   if (efi_enabled(EFI_BOOT)) {
+   efi_fake_memmap();
efi_find_mirror();
+   }
 
/*
 * The EFI specification says that boot service code won't be called
diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig
index 84533e0..ac47cc4d 100644
--- a/drivers/firmware/efi/Kconfig
+++ b/drivers/firmware/efi/Kconfig
@@ -52,6 +52,28 @@ config EFI_RUNTIME_MAP
 
  See also Documentation/ABI/testing/sysfs-firmware-efi-runtime-map.
 
+config EFI_FAKE_MEMMAP
+   bool "Enable EFI fake memory map"
+   depends on EFI && X86
+   default n
+   help
+ Saying Y here will enable "efi_fake_mem" boot option.
+ By specifying this parameter, you can add arbitrary attribute
+ to specific memory range by updating original (firmware provided)
+ EFI memmap.
+ This is useful for debugging of EFI memmap related feature.
+ e.g. Address Range Mirroring feature.
+
+config EFI_MAX_FAKE_MEM
+   int "maximum allowable number of ranges in efi_fake_mem boot option"
+   depends on EFI && X86 && EFI_FAKE_MEMMAP
+   range 1 128
+   default 8
+   help
+ Maximum allowable number of ranges in efi_fake_mem boot option.
+ Ranges can be set up to this value using comma-separated list.
+ The default value is 8.
+
 config EFI_PARAMS_FROM_FDT
bool
help
diff --git a/drivers/firmware/efi/Makefile b/drivers/firmware/efi/Makefile
index 6fd3da9..c24
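
As an illustration of the "nn[KMG]@ss[KMG]:aa" syntax documented in the kernel-parameters.txt hunk above, here is a minimal userspace sketch of how such a string could be parsed. It is not the code from fake_mem.c (which is cut off in this archive); parse_size() is only a stand-in for the kernel's memparse(), and the names struct fake_range and parse_fake_mem() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EFI_MEMORY_MORE_RELIABLE 0x10000ULL	/* bit 16 */

/* Hypothetical container for one parsed "nn@ss:aa" entry. */
struct fake_range {
	uint64_t start;		/* first byte of the range */
	uint64_t end;		/* last byte of the range (inclusive) */
	uint64_t attribute;	/* attribute bits to add */
};

/* Parse a number with an optional K/M/G suffix (stand-in for memparse()). */
static uint64_t parse_size(const char *s, char **endp)
{
	uint64_t v = strtoull(s, endp, 0);

	switch (**endp) {
	case 'G': case 'g': v <<= 10;	/* fall through */
	case 'M': case 'm': v <<= 10;	/* fall through */
	case 'K': case 'k': v <<= 10; (*endp)++; break;
	default: break;
	}
	return v;
}

/*
 * Parse "nn@ss:aa[,nn@ss:aa...]" into out[].
 * Returns the number of entries, or -1 on a syntax error.
 * Entries beyond 'max' are silently ignored in this sketch.
 */
static int parse_fake_mem(const char *p, struct fake_range *out, int max)
{
	int n = 0;

	assert(p && out && max > 0);
	while (*p && n < max) {
		char *end;
		uint64_t size, start, attr;

		size = parse_size(p, &end);
		if (end == p || *end != '@' || size == 0)
			return -1;
		p = end + 1;

		start = parse_size(p, &end);
		if (end == p || *end != ':')
			return -1;
		p = end + 1;

		attr = strtoull(p, &end, 0);
		if (end == p)
			return -1;

		out[n].start = start;
		out[n].end = start + size - 1;
		out[n].attribute = attr;
		n++;

		p = end;
		if (*p == ',')
			p++;
		else if (*p)
			return -1;
	}
	return n;
}
```

With the example string from the commit message, this sketch yields two entries covering 0x100000000-0x17fffffff and 0x10a0000000-0x111fffffff, both with attribute 0x10000.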

RE: [PATCH 2/2] x86, efi: Add "efi_fake_mem" boot option

2015-09-29 Thread Izumi, Taku
 I forgot to re-run git format-patch after rebasing.
 I'll resend the correct version.

> -Original Message-
> From: kbuild test robot [mailto:l...@intel.com]
> Sent: Wednesday, September 30, 2015 10:37 AM
> To: Izumi, Taku/泉 拓
> Cc: kbuild-...@01.org; linux-kernel@vger.kernel.org; 
> linux-...@vger.kernel.org; x...@kernel.org; matt.flem...@intel.com;
> t...@linutronix.de; mi...@redhat.com; h...@zytor.com; tony.l...@intel.com; 
> qinxi...@huawei.com; Kamezawa, Hiroyuki/亀
> 澤 寛之; ard.biesheu...@linaro.org; Izumi, Taku/泉 拓
> Subject: Re: [PATCH 2/2] x86, efi: Add "efi_fake_mem" boot option
> 
> Hi Taku,
> 
> [auto build test results on v4.3-rc3 -- if it's inappropriate base, please 
> ignore]
> 
> config: i386-allmodconfig (attached as .config)
> reproduce:
>   git checkout afcc94d3f91a00ce97d735a563a8e5d595f45a03
>   # save the attached .config to linux build tree
>   make ARCH=i386
> 
> All error/warnings (new ones prefixed by >>):
> 
> >> drivers/firmware/efi/fake_mem.c:36:25: error: 'CONFIG_EFI_MAX_FAKEMEM' 
> >> undeclared here (not in a function)
> #define EFI_MAX_FAKEMEM CONFIG_EFI_MAX_FAKEMEM
> ^
> >> drivers/firmware/efi/fake_mem.c:42:34: note: in expansion of macro 
> >> 'EFI_MAX_FAKEMEM'
> static struct fake_mem fake_mems[EFI_MAX_FAKEMEM];
>  ^
>drivers/firmware/efi/fake_mem.c: In function 'efi_fake_memmap':
> >> drivers/firmware/efi/fake_mem.c:186:20: warning: cast to pointer from 
> >> integer of different size [-Wint-to-pointer-cast]
>  memmap.phys_map = (void *)new_memmap_phy;
>^
>drivers/firmware/efi/fake_mem.c: At top level:
> >> drivers/firmware/efi/fake_mem.c:42:24: warning: 'fake_mems' defined but 
> >> not used [-Wunused-variable]
> static struct fake_mem fake_mems[EFI_MAX_FAKEMEM];
>^
> 
> vim +/CONFIG_EFI_MAX_FAKEMEM +36 drivers/firmware/efi/fake_mem.c
> 
> 30#include 
> 31#include 
> 32#include 
> 33#include 
> 34#include 
> 35
>   > 36#define EFI_MAX_FAKEMEM CONFIG_EFI_MAX_FAKEMEM
> 37
> 38struct fake_mem {
> 39struct range range;
> 40u64 attribute;
> 41};
>   > 42static struct fake_mem fake_mems[EFI_MAX_FAKEMEM];
> 43static int nr_fake_mem;
> 44
> 45static int __init cmp_fake_mem(const void *x1, const void *x2)
> 46{
> 47const struct fake_mem *m1 = x1;
> 48const struct fake_mem *m2 = x2;
> 49
> 50if (m1->range.start < m2->range.start)
> 51return -1;
> 52if (m1->range.start > m2->range.start)
> 53return 1;
> 54return 0;
> 55}
> 56
> 57void __init efi_fake_memmap(void)
> 58{
> 59u64 start, end, m_start, m_end, m_attr;
> 60int new_nr_map = memmap.nr_map;
> 61efi_memory_desc_t *md;
> 62u64 new_memmap_phy;
> 63void *new_memmap;
> 64void *old, *new;
> 65int i;
> 66
> 67if (!nr_fake_mem || !efi_enabled(EFI_MEMMAP))
> 68return;
> 69
> 70/* count up the number of EFI memory descriptor */
> 71for (old = memmap.map; old < memmap.map_end; old += 
> memmap.desc_size) {
> 72md = old;
> 73start = md->phys_addr;
> 74end = start + (md->num_pages << EFI_PAGE_SHIFT) 
> - 1;
> 75
> 76for (i = 0; i < nr_fake_mem; i++) {
> 77/* modifying range */
> 78m_start = fake_mems[i].range.start;
> 79m_end = fake_mems[i].range.end;
> 80
> 81if (m_start <= start) {
> 82/* split into 2 parts */
> 83if (start < m_end && m_end < 
> end)
> 84new_nr_map++;
> 85}
> 86if (start < m_start && m_start < end) {
> 87/
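
The quoted efi_fake_memmap() loop counts how many extra descriptors are needed when a fake range cuts through an existing descriptor. The splitting itself can be sketched in userspace as follows. This is an illustration only, not the truncated kernel code: struct piece and split_desc() are hypothetical names, ranges use inclusive end addresses, and the fake ranges are assumed to be sorted by start and non-overlapping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical range type; 'end' is inclusive. */
struct piece {
	uint64_t start;
	uint64_t end;
	uint64_t attr;
};

/*
 * Split one descriptor [start, end] with attributes 'attr' against
 * 'nr_fake' fake ranges (sorted by start, non-overlapping).
 * Parts covered by a fake range get that range's attribute OR-ed in.
 * Returns the number of pieces written to out[].
 */
static int split_desc(uint64_t start, uint64_t end, uint64_t attr,
		      const struct piece *fake, int nr_fake,
		      struct piece *out, int max_out)
{
	uint64_t cur = start;
	int n = 0;

	for (int i = 0; i < nr_fake; i++) {
		uint64_t lo, hi;

		if (fake[i].end < cur || fake[i].start > end)
			continue;	/* no overlap with the remaining part */

		lo = fake[i].start > cur ? fake[i].start : cur;
		hi = fake[i].end < end ? fake[i].end : end;

		if (lo > cur) {		/* untouched part before the fake range */
			assert(n < max_out);
			out[n++] = (struct piece){ cur, lo - 1, attr };
		}
		assert(n < max_out);
		out[n++] = (struct piece){ lo, hi, attr | fake[i].attr };

		if (hi == end)
			return n;	/* descriptor fully consumed */
		cur = hi + 1;
	}

	/* untouched tail after the last overlapping fake range */
	assert(n < max_out);
	out[n++] = (struct piece){ cur, end, attr };
	return n;
}
```

Applied to the mem36 descriptor from the patch description with the two 2G fake ranges, this sketch produces four pieces, matching the mem36..mem39 split shown there.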

[PATCH 0/2] Introduce "efi_fake_mem" boot option

2015-09-29 Thread Taku Izumi
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE, which indicates which memory ranges are
mirrored. The Linux kernel can now recognize which memory ranges are
mirrored by handling the EFI_MEMORY_MORE_RELIABLE attribute.
However, testing this feature necessitates boxes with UEFI spec 2.5
compliant firmware.

This patchset introduces a new boot option named "efi_fake_mem".
By specifying this parameter, you can add an arbitrary attribute to
a specific memory range. This is useful for debugging the
Address Range Mirroring feature.
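
To make the role of EFI_MEMORY_MORE_RELIABLE concrete, here is a minimal userspace sketch of how mirrored memory could be totalled from a memory map. It is an assumption-laden illustration rather than the kernel's efi_find_mirror(): struct mem_desc is a simplified stand-in for efi_memory_desc_t, and count_mirror() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EFI_PAGE_SHIFT			12
#define EFI_MEMORY_MORE_RELIABLE	(1ULL << 16)

/* Simplified stand-in for efi_memory_desc_t. */
struct mem_desc {
	uint64_t phys_addr;
	uint64_t num_pages;
	uint64_t attribute;
};

struct mirror_stats {
	uint64_t total;		/* bytes described by the map */
	uint64_t mirrored;	/* bytes marked MORE_RELIABLE */
};

/* Sum up all memory and the part flagged as mirrored. */
static struct mirror_stats count_mirror(const struct mem_desc *map, size_t n)
{
	struct mirror_stats s = { 0, 0 };

	assert(map != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		uint64_t size = map[i].num_pages << EFI_PAGE_SHIFT;

		s.total += size;
		if (map[i].attribute & EFI_MEMORY_MORE_RELIABLE)
			s.mirrored += size;
	}
	return s;
}
```

Shifting both sums right by 20 gives values in MB, the unit used in the "mirrored memory" boot message quoted in patch 2/2.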

This is an updated version of the former patch posted at
 http://www.mail-archive.com/linux-efi@vger.kernel.org/msg05936.html

changelog:
 - change boot option name and spec
   efi_fake_mem_mirror=nn@ss -> efi_fake_mem=nn@ss:aa
 - rename print_efi_memmap() to efi_print_memmap()
 - introduce new config named CONFIG_EFI_MAX_FAKE_MEM
 - and some fixes pointed out by Matt Fleming

Taku Izumi (2):
  x86, efi: rename print_efi_memmap() to efi_print_memmap()
  x86, efi: Add "efi_fake_mem" boot option

 Documentation/kernel-parameters.txt |  15 +++
 arch/x86/include/asm/efi.h  |   1 +
 arch/x86/kernel/setup.c |   4 +-
 arch/x86/platform/efi/efi.c |   4 +-
 drivers/firmware/efi/Kconfig|  22 
 drivers/firmware/efi/Makefile   |   1 +
 drivers/firmware/efi/fake_mem.c | 238 
 include/linux/efi.h |   6 +
 8 files changed, 288 insertions(+), 3 deletions(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] x86, efi: rename print_efi_memmap() to efi_print_memmap()

2015-09-29 Thread Taku Izumi
This patch renames print_efi_memmap() to efi_print_memmap() and
makes it a global function so that we can invoke it outside of
arch/x86/platform/efi/efi.c.

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 arch/x86/include/asm/efi.h  | 1 +
 arch/x86/platform/efi/efi.c | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index ab5f1d4..f8b93d6 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -103,6 +103,7 @@ extern void __init efi_set_executable(efi_memory_desc_t *md, bool executable);
 extern int __init efi_memblock_x86_reserve_range(void);
 extern pgd_t * __init efi_call_phys_prolog(void);
 extern void __init efi_call_phys_epilog(pgd_t *save_pgd);
+extern void __init efi_print_memmap(void);
 extern void __init efi_unmap_memmap(void);
 extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index 1db84c0..1f95caf 100644
--- a/arch/x86/platform/efi/efi.c
+++ b/arch/x86/platform/efi/efi.c
@@ -222,7 +222,7 @@ int __init efi_memblock_x86_reserve_range(void)
return 0;
 }
 
-static void __init print_efi_memmap(void)
+void __init efi_print_memmap(void)
 {
 #ifdef EFI_DEBUG
efi_memory_desc_t *md;
@@ -524,7 +524,7 @@ void __init efi_init(void)
return;
 
if (efi_enabled(EFI_DBG))
-   print_efi_memmap();
+   efi_print_memmap();
 
efi_esrt_init();
 }
-- 
1.8.3.1

[PATCH v5][RESEND] perf, x86: Fix multi-segment problem of perf_event_intel_uncore

2015-09-23 Thread Taku Izumi
On a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.
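
The collision and its fix can be sketched in plain userspace C. The following is an illustration of the idea only, not the kernel code in this patch: it uses a simple singly linked list and no locking, and the names struct seg_map, set_physid() and bus_to_physid() are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* One bus-to-socket table per PCI segment (domain). */
struct seg_map {
	int segment;
	int pbus_to_physid[256];
	struct seg_map *next;
};

static struct seg_map *seg_maps;	/* head of the per-segment list */

static struct seg_map *find_map(int segment)
{
	for (struct seg_map *m = seg_maps; m; m = m->next)
		if (m->segment == segment)
			return m;
	return NULL;
}

/*
 * Record that bus 'bus' in 'segment' belongs to socket 'physid'.
 * A table for the segment is created on first use, with every entry
 * initialized to -1. Returns 0 on success, -1 if allocation fails.
 */
static int set_physid(int segment, unsigned char bus, int physid)
{
	struct seg_map *m = find_map(segment);

	assert(physid >= 0);
	if (!m) {
		m = malloc(sizeof(*m));
		if (!m)
			return -1;
		m->segment = segment;
		for (int i = 0; i < 256; i++)
			m->pbus_to_physid[i] = -1;
		m->next = seg_maps;
		seg_maps = m;
	}
	m->pbus_to_physid[bus] = physid;
	return 0;
}

/* Look up the socket for (segment, bus); -1 if unknown. */
static int bus_to_physid(int segment, unsigned char bus)
{
	const struct seg_map *m = find_map(segment);

	return m ? m->pbus_to_physid[bus] : -1;
}
```

With a single 256-entry array indexed only by bus number, buses 0000:ff and 0001:ff would overwrite each other; keyed by segment first, each keeps its own entry.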

 v4 -> v5:
   - Add initialization code for pci2phy_map when it is newly allocated in __find_pci2phy_map()

 v3 -> v4:
   - avoid GFP_ATOMIC allocation at __find_pci2phy_map()
   - Add missing pci_dev_put at snb_pci2phy_map_init()
   - Add missing raw_spin_unlock at snbep_pci2phy_map_init()

Signed-off-by: Taku Izumi 
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c  | 61 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h  | 12 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 16 --
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 32 +---
 4 files changed, 106 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 560e525..61215a6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,59 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+   struct pci2phy_map *map;
+   int phys_id = -1;
+
+   raw_spin_lock(_map_lock);
+   list_for_each_entry(map, _map_head, list) {
+   if (map->segment == pci_domain_nr(bus)) {
+   phys_id = map->pbus_to_physid[bus->number];
+   break;
+   }
+   }
+   raw_spin_unlock(_map_lock);
+
+   return phys_id;
+}
+
+struct pci2phy_map *__find_pci2phy_map(int segment)
+{
+   struct pci2phy_map *map, *alloc = NULL;
+   int i;
+
+   lockdep_assert_held(_map_lock);
+
+lookup:
+   list_for_each_entry(map, _map_head, list) {
+   if (map->segment == segment)
+   goto end;
+   }
+
+   if (!alloc) {
+   raw_spin_unlock(_map_lock);
+   alloc = kmalloc(sizeof(struct pci2phy_map), GFP_KERNEL);
+   raw_spin_lock(_map_lock);
+
+   if (!alloc)
+   return NULL;
+
+   goto lookup;
+   }
+
+   map = alloc;
+   alloc = NULL;
+   map->segment = segment;
+   for (i = 0; i < 256; i++)
+   map->pbus_to_physid[i] = -1;
+   list_add_tail(>list, _map_head);
+
+end:
+   kfree(alloc);
+   return map;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +863,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id
int phys_id;
bool first_box = false;
 
-   phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+   phys_id = uncore_pcibus_to_physid(pdev->bus);
if (phys_id < 0)
return -ENODEV;
 
@@ -856,9 +910,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
struct intel_uncore_box *box = pci_get_drvdata(pdev);
struct intel_uncore_pmu *pmu;
-   int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+   int i, cpu, phys_id;
bool last_box = false;
 
+   phys_id = uncore_pcibus_to_physid(pdev->bus);
box = pci_get_drvdata(pdev);
if (!box) {
for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h 
b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 72c54c2..2f0a4a9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event

[PATCH v5][RESEND] perf, x86: Fix multi-segment problem of perf_event_intel_uncore

2015-09-23 Thread Taku Izumi
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.
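
The collision described above can be reproduced outside the kernel. The following is only a minimal userspace sketch, not code from the patch: the names flat_set(), flat_get(), seg_find(), seg_set() and seg_get(), the SEG_MAX limit and the fixed-size array of per-segment tables are all assumptions made for illustration (the patch itself keeps the per-segment tables on a linked list protected by a raw spinlock).

```c
#include <assert.h>
#include <stddef.h>

#define BUS_MAX 256	/* bus numbers are 8 bits wide */
#define SEG_MAX 4	/* arbitrary limit for this sketch */

/* Old scheme: a single table indexed by bus number only. */
static int flat_map[BUS_MAX];

/* New scheme: one bus table per PCI segment. */
struct seg_map {
	int segment;
	int pbus_to_physid[BUS_MAX];
};

static struct seg_map seg_maps[SEG_MAX];
static size_t nr_seg_maps;

static void flat_set(int bus, int physid)
{
	assert(bus >= 0 && bus < BUS_MAX);
	flat_map[bus] = physid;
}

static int flat_get(int bus)
{
	assert(bus >= 0 && bus < BUS_MAX);
	return flat_map[bus];
}

/* Return the table for @segment, or NULL if none has been created. */
static struct seg_map *seg_find(int segment)
{
	size_t i;

	for (i = 0; i < nr_seg_maps; i++)
		if (seg_maps[i].segment == segment)
			return &seg_maps[i];
	return NULL;
}

/* Record @physid for (@segment, @bus); returns 0, or -1 if out of tables. */
static int seg_set(int segment, int bus, int physid)
{
	struct seg_map *map = seg_find(segment);
	int i;

	assert(bus >= 0 && bus < BUS_MAX);
	if (!map) {
		if (nr_seg_maps == SEG_MAX)
			return -1;
		map = &seg_maps[nr_seg_maps++];
		map->segment = segment;
		for (i = 0; i < BUS_MAX; i++)
			map->pbus_to_physid[i] = -1;	/* -1: no socket known */
	}
	map->pbus_to_physid[bus] = physid;
	return 0;
}

/* Look up (@segment, @bus); returns -1 when nothing was recorded. */
static int seg_get(int segment, int bus)
{
	struct seg_map *map = seg_find(segment);

	assert(bus >= 0 && bus < BUS_MAX);
	return map ? map->pbus_to_physid[bus] : -1;
}
```

With made-up socket numbers, calling flat_set(0xff, 0) and then flat_set(0xff, 3) leaves only the second value in the flat table, whereas seg_set(0, 0xff, 0) and seg_set(1, 0xff, 3) keep both, because the segment selects a separate table.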

 v4 -> v5:
   - Add initialization code for a newly allocated pci2phy_map in
     __find_pci2phy_map()

 v3 -> v4:
   - avoid GFP_ATOMIC allocation at __find_pci2phy_map()
   - Add missing pci_dev_put at snb_pci2phy_map_init()
   - Add missing raw_spin_unlock at snbep_pci2phy_map_init()
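
The two __find_pci2phy_map() items in this changelog (no GFP_ATOMIC allocation, and initialization of a freshly allocated map) amount to a "drop the lock, allocate, retake the lock, look up again" pattern. Below is a hedged userspace sketch of that control flow only. It assumes a pthread mutex as a stand-in for the raw spinlock, malloc() for kmalloc(), and a hand-rolled singly linked list instead of list_head; the names find_seg_map(), set_physid() and get_physid() are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define BUS_MAX 256

struct seg_map {
	struct seg_map *next;
	int segment;
	int pbus_to_physid[BUS_MAX];
};

static pthread_mutex_t map_lock = PTHREAD_MUTEX_INITIALIZER;
static struct seg_map *map_head;

/* Caller must hold map_lock; it is dropped and retaken around malloc(). */
static struct seg_map *find_seg_map(int segment)
{
	struct seg_map *map, *alloc = NULL;
	int i;

lookup:
	for (map = map_head; map; map = map->next)
		if (map->segment == segment)
			goto end;

	if (!alloc) {
		/* Allocate without holding the lock. */
		pthread_mutex_unlock(&map_lock);
		alloc = malloc(sizeof(*alloc));
		pthread_mutex_lock(&map_lock);

		if (!alloc)
			return NULL;

		/* Someone else may have added the segment meanwhile. */
		goto lookup;
	}

	/* Not found on the second pass either: publish our allocation. */
	map = alloc;
	alloc = NULL;
	map->segment = segment;
	for (i = 0; i < BUS_MAX; i++)
		map->pbus_to_physid[i] = -1;	/* malloc() does not initialize */
	map->next = map_head;		/* sketch pushes at the head */
	map_head = map;

end:
	free(alloc);			/* free(NULL) is a no-op */
	return map;
}

/* Record @physid for (@segment, @bus); returns 0, or -1 on allocation failure. */
static int set_physid(int segment, int bus, int physid)
{
	struct seg_map *map;
	int ret = -1;

	assert(bus >= 0 && bus < BUS_MAX);
	pthread_mutex_lock(&map_lock);
	map = find_seg_map(segment);
	if (map) {
		map->pbus_to_physid[bus] = physid;
		ret = 0;
	}
	pthread_mutex_unlock(&map_lock);
	return ret;
}

/* Look up (@segment, @bus) without creating a table; -1 if unknown. */
static int get_physid(int segment, int bus)
{
	struct seg_map *map;
	int physid = -1;

	assert(bus >= 0 && bus < BUS_MAX);
	pthread_mutex_lock(&map_lock);
	for (map = map_head; map; map = map->next) {
		if (map->segment == segment) {
			physid = map->pbus_to_physid[bus];
			break;
		}
	}
	pthread_mutex_unlock(&map_lock);
	return physid;
}
```

The second lookup after retaking the lock is what makes the unlocked allocation safe: if another thread inserted the same segment in the meantime, the spare allocation is freed at the end label instead of being added twice. Filling the table with -1 matters because an uninitialized entry could otherwise be mistaken for a valid physical id.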

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c  | 61 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h  | 12 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 16 --
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 32 +---
 4 files changed, 106 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 560e525..61215a6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,59 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	struct pci2phy_map *map;
+	int phys_id = -1;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
+struct pci2phy_map *__find_pci2phy_map(int segment)
+{
+	struct pci2phy_map *map, *alloc = NULL;
+	int i;
+
+	lockdep_assert_held(&pci2phy_map_lock);
+
+lookup:
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == segment)
+			goto end;
+	}
+
+	if (!alloc) {
+		raw_spin_unlock(&pci2phy_map_lock);
+		alloc = kmalloc(sizeof(struct pci2phy_map), GFP_KERNEL);
+		raw_spin_lock(&pci2phy_map_lock);
+
+		if (!alloc)
+			return NULL;
+
+		goto lookup;
+	}
+
+	map = alloc;
+	alloc = NULL;
+	map->segment = segment;
+	for (i = 0; i < 256; i++)
+		map->pbus_to_physid[i] = -1;
+	list_add_tail(&map->list, &pci2phy_map_head);
+
+end:
+	kfree(alloc);
+	return map;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +863,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,9 +910,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;
 
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 72c54c2..2f0a4a9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+

[PATCH v4] perf, x86: Fix multi-segment problem of perf_event_intel_uncore

2015-09-16 Thread Taku Izumi
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.

 v3 -> v4:
   - avoid GFP_ATOMIC allocation at __find_pci2phy_map()
   - Add missing pci_dev_put at snb_pci2phy_map_init()
   - Add missing raw_spin_unlock at snbep_pci2phy_map_init()

Signed-off-by: Taku Izumi 
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c  | 58 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h  | 12 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 16 --
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 32 +---
 4 files changed, 103 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 560e525..3fba445 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,56 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	struct pci2phy_map *map;
+	int phys_id = -1;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
+struct pci2phy_map *__find_pci2phy_map(int segment)
+{
+	struct pci2phy_map *map, *alloc = NULL;
+
+	lockdep_assert_held(&pci2phy_map_lock);
+
+lookup:
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == segment)
+			goto end;
+	}
+
+	if (!alloc) {
+		raw_spin_unlock(&pci2phy_map_lock);
+		alloc = kmalloc(sizeof(struct pci2phy_map), GFP_KERNEL);
+		raw_spin_lock(&pci2phy_map_lock);
+
+		if (!alloc)
+			return NULL;
+
+		goto lookup;
+	}
+
+	map = alloc;
+	alloc = NULL;
+	map->segment = segment;
+	list_add_tail(&map->list, &pci2phy_map_head);
+
+end:
+	kfree(alloc);
+	return map;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +860,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,9 +907,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;
 
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 72c54c2..2f0a4a9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,15 @@ struct uncore_event_desc {
 	const char *config;
 };
 
+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[2

[PATCH v3] perf, x86: Fix multi-segment problem of perf_event_intel_uncore

2015-09-03 Thread Taku Izumi
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.

 v2 -> v3:
   - fix up according to Peter's comment
   - introduce __find_pci2phy_map() to avert repetition

Signed-off-by: Taku Izumi 
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c  | 45 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h  | 12 +-
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 13 ++-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 31 +++
 4 files changed, 87 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 560e525..1ddac35 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,43 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	struct pci2phy_map *map;
+	int phys_id = -1;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
+struct pci2phy_map *__find_pci2phy_map(int segment)
+{
+	struct pci2phy_map *map;
+
+	lockdep_assert_held(&pci2phy_map_lock);
+
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == segment)
+			return map;
+	}
+
+	map = kmalloc(sizeof(struct pci2phy_map), GFP_ATOMIC);
+	if (map) {
+		map->segment = segment;
+		list_add_tail(&map->list, &pci2phy_map_head);
+	}
+
+	return map;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +847,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,9 +894,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;
 
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 72c54c2..2f0a4a9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,15 @@ struct uncore_event_desc {
 	const char *config;
 };
 
+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[256];
+};
+
+int uncore_pcibus_to_physid(struct pci_bus *bus);
+struct pci2phy_map *__find_pci2phy_map(int segment);
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf);
 
@@ -317,7 +326,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, int idx);
 exter

[PATCH v2][RESEND] perf, x86: Fix multi-segment problem of perf_event_intel_uncore

2015-08-26 Thread Taku Izumi
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case, the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.

 v1 -> v2:
   - Extract a helper named uncore_pcibus_to_physid() to avoid repeating
     the code that retrieves phys_id

Signed-off-by: Taku Izumi 
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c  | 25 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h  | 11 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 23 +-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 53 --
 4 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 21b5e38..0ed6f2b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,23 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	int phys_id = -1;
+	struct pci2phy_map *map;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +827,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,9 +874,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;
 
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 0f77f0a..6c96ee9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,14 @@ struct uncore_event_desc {
 	const char *config;
 };
 
+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[256];
+};
+
+int uncore_pcibus_to_physid(struct pci_bus *bus);
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf);
 
@@ -317,7 +325,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, int idx);
 extern struct intel_uncore_type **uncore_msr_uncores;
 extern struct intel_uncore_type **uncore_pci_uncores;
 extern struct pci_driver *uncore_pci_driver;
-extern int uncore_pcibus_to_physid[256];
+extern raw_spinlock_t pci2phy_map_lock;
+extern struct list_head pci2phy_map_head;
 extern struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 extern struct event_constraint uncore_constraint_empty;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
index b005a78..ccbc817 1

RE: [PATCH 2/2] x86, efi: Add "efi_fake_mem_mirror" boot option

2015-08-26 Thread Izumi, Taku
Dear Matt,

Thank you for reviewing.

I have updated my patchset.
I would be happy if you could review the new version.

Sincerely,
Taku Izumi

> -----Original Message-----
> From: Matt Fleming [mailto:m...@codeblueprint.co.uk]
> Sent: Wednesday, August 26, 2015 8:46 AM
> To: Izumi, Taku/泉 拓
> Cc: linux-kernel@vger.kernel.org; linux-...@vger.kernel.org; x...@kernel.org;
>     matt.flem...@intel.com; t...@linutronix.de; mi...@redhat.com; h...@zytor.com;
>     tony.l...@intel.com; qiuxi...@huawei.com; Kamezawa, Hiroyuki/亀澤 寛之
> Subject: Re: [PATCH 2/2] x86, efi: Add "efi_fake_mem_mirror" boot option
> 
> On Fri, 21 Aug, at 02:16:00AM, Taku Izumi wrote:
> > This patch introduces new boot option named "efi_fake_mem_mirror".
> > By specifying this parameter, you can mark specific memory as
> > mirrored memory. This is useful for debugging of Address Range
> > Mirroring feature.
> >
> > For example, if you specify "efi_fake_mem_mirror=2G@4G,2G@0x10a000",
> > the original (firmware provided) EFI memmap will be updated so that
> > the specified memory regions have EFI_MEMORY_MORE_RELIABLE attribute:
> >
> >  
> >    efi: mem00: [Boot Data  |   ||  |  |  |   |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
> >    efi: mem01: [Loader Data|   ||  |  |  |   |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
> >    ...
> >    efi: mem35: [Boot Data  |   ||  |  |  |   |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
> >    efi: mem36: [Conventional Memory|   ||  |  |  |   |WB|WT|WC|UC] range=[0x0001-0x0020a000) (129536MB)
> >    efi: mem37: [Reserved   |RUN||  |  |  |   |  |  |  |UC] range=[0x6000-0x9000) (768MB)
> >
> >  
> >    efi: mem00: [Boot Data  |   ||  |  |  |   |WB|WT|WC|UC] range=[0x-0x1000) (0MB)
> >    efi: mem01: [Loader Data|   ||  |  |  |   |WB|WT|WC|UC] range=[0x1000-0x2000) (0MB)
> >    ...
> >    efi: mem35: [Boot Data  |   ||  |  |  |   |WB|WT|WC|UC] range=[0x47ee6000-0x48014000) (1MB)
> >    efi: mem36: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x0001-0x00018000) (2048MB)
> >    efi: mem37: [Conventional Memory|   ||  |  |  |   |WB|WT|WC|UC] range=[0x00018000-0x0010a000) (61952MB)
> >    efi: mem38: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x0010a000-0x00112000) (2048MB)
> >    efi: mem39: [Conventional Memory|   ||  |  |  |   |WB|WT|WC|UC] range=[0x00112000-0x0020a000) (63488MB)
> >    efi: mem40: [Reserved   |RUN||  |  |  |   |  |  |  |UC] range=[0x6000-0x9000) (768MB)
> >
> > And you will find that the following message is output:
> >
> >efi: Memory: 4096M/131455M mirrored memory
> >
> > Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
> > ---
> >  Documentation/kernel-parameters.txt |   8 ++
> >  arch/x86/include/asm/efi.h  |   2 +
> >  arch/x86/kernel/setup.c |   4 +-
> >  arch/x86/platform/efi/efi.c |   2 +-
> >  arch/x86/platform/efi/quirks.c  | 169 
> > 
> >  5 files changed, 183 insertions(+), 2 deletions(-)
> 
> [...]
> 
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 1c7380d..5c785e1 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -18,6 +18,10 @@
> >
> 
> The quirks file isn't intended to be used for this kind of feature.
> It's very much a repository for workarounds for quirky firmware, i.e.
> known bugs.
> 
> Instead, how about putting all this into a new fake_mem.c file? Going
> further than that, there's nothing that I can see that looks
> particularly x86-specific, so how about sticking all this in
> drivers/firmware/efi/fake_mem.c so that the arm64 folks can make use
> of it if/when they want to start playing around with
> EFI_MEMORY_MORE_RELIABLE?
> 
> >  static efi_char16_t efi_dummy_name[6] = { 'D', 'U', 'M', 'M', 'Y', 0 };
> >
> > +#define EFI_MAX_FAKE_MIRROR 8
> > +static struct range fake_mirrors[EFI_MAX_FAKE_MIRROR];
> > +static int num_fake_mirror;
> > +
> >  static bool efi_no_storage_paranoia;
> >
> >  

[PATCH v2 1/3] efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()

2015-08-26 Thread Taku Izumi
UEFI spec 2.5 introduces new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE. This patch adds this new attribute
support to efi_md_typeattr_format().
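
As a rough illustration of what the attribute half of efi_md_typeattr_format() produces once this patch is applied, here is a minimal user-space sketch. The helper name format_attr() is hypothetical and not part of the patch; the bit values are the ones defined by the UEFI specification, and the field order follows the snprintf() call in the diff below.

```c
#include <stdint.h>
#include <stdio.h>

/* Memory attribute bits as defined by the UEFI specification. */
#define EFI_MEMORY_UC            0x0000000000000001ULL
#define EFI_MEMORY_WC            0x0000000000000002ULL
#define EFI_MEMORY_WT            0x0000000000000004ULL
#define EFI_MEMORY_WB            0x0000000000000008ULL
#define EFI_MEMORY_UCE           0x0000000000000010ULL
#define EFI_MEMORY_WP            0x0000000000001000ULL
#define EFI_MEMORY_RP            0x0000000000002000ULL
#define EFI_MEMORY_XP            0x0000000000004000ULL
#define EFI_MEMORY_MORE_RELIABLE 0x0000000000010000ULL
#define EFI_MEMORY_RUNTIME       0x8000000000000000ULL

/*
 * Hypothetical stand-in for the attribute part of efi_md_typeattr_format().
 * Known bits are printed as fixed-width abbreviations; if any unknown bit
 * is set, the raw value is printed instead.
 */
static int format_attr(char *buf, size_t size, uint64_t attr)
{
    const uint64_t known = EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT |
                           EFI_MEMORY_WB | EFI_MEMORY_UCE | EFI_MEMORY_WP |
                           EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME |
                           EFI_MEMORY_MORE_RELIABLE;

    if (attr & ~known)
        return snprintf(buf, size, "|attr=0x%016llx]",
                        (unsigned long long)attr);

    return snprintf(buf, size, "|%3s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
                    attr & EFI_MEMORY_RUNTIME       ? "RUN" : "",
                    attr & EFI_MEMORY_MORE_RELIABLE ? "MR"  : "",
                    attr & EFI_MEMORY_XP            ? "XP"  : "",
                    attr & EFI_MEMORY_RP            ? "RP"  : "",
                    attr & EFI_MEMORY_WP            ? "WP"  : "",
                    attr & EFI_MEMORY_UCE           ? "UCE" : "",
                    attr & EFI_MEMORY_WB            ? "WB"  : "",
                    attr & EFI_MEMORY_WT            ? "WT"  : "",
                    attr & EFI_MEMORY_WC            ? "WC"  : "",
                    attr & EFI_MEMORY_UC            ? "UC"  : "");
}
```

Without EFI_MEMORY_MORE_RELIABLE in the mask of known bits, a mirrored region would fall into the raw "attr=0x..." branch, which is why the patch touches both the mask and the format string.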

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/efi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d6144e3..8124078 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -589,12 +589,14 @@ char * __init efi_md_typeattr_format(char *buf, size_t size,
 	attr = md->attribute;
 	if (attr & ~(EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT |
 		     EFI_MEMORY_WB | EFI_MEMORY_UCE | EFI_MEMORY_WP |
-		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME))
+		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME |
+		     EFI_MEMORY_MORE_RELIABLE))
 		snprintf(pos, size, "|attr=0x%016llx]",
 			 (unsigned long long)attr);
 	else
-		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
+		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
 			 attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
+			 attr & EFI_MEMORY_MORE_RELIABLE ? "MR" : "",
 			 attr & EFI_MEMORY_XP      ? "XP"  : "",
 			 attr & EFI_MEMORY_RP      ? "RP"  : "",
 			 attr & EFI_MEMORY_WP      ? "WP"  : "",
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2 0/3] Introduce "efi_fake_mem_mirror" boot option

2015-08-26 Thread Taku Izumi
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE, which indicates which memory ranges are
mirrored. The Linux kernel can now recognize which memory ranges are
mirrored by handling the EFI_MEMORY_MORE_RELIABLE attribute.
However, testing this feature necessitates machines whose firmware is
compliant with UEFI spec 2.5.

This patchset introduces a new boot option named "efi_fake_mem_mirror".
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging the Memory Address Range
Mirroring feature.
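
To make the option syntax concrete, here is a minimal user-space sketch of how a "nn[KMG]@ss[KMG]" list could be turned into address ranges. It is not the code from this patchset: parse_size(), parse_fake_mirror() and struct mirror_range are hypothetical names, and parse_size() only approximates what the kernel's memparse() helper does.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_FAKE_MIRROR 8   /* same limit as EFI_MAX_FAKE_MIRROR in the v1 patch */

/* Hypothetical type: one fake mirror region, [start, end). */
struct mirror_range {
    uint64_t start;
    uint64_t end;
};

/* Parse a number with an optional K/M/G suffix (rough memparse() analogue). */
static uint64_t parse_size(const char *s, char **endp)
{
    uint64_t v = strtoull(s, endp, 0);

    if (*endp == s)             /* no digits at all */
        return 0;

    switch (**endp) {
    case 'G': case 'g':
        v <<= 10;               /* fall through */
    case 'M': case 'm':
        v <<= 10;               /* fall through */
    case 'K': case 'k':
        v <<= 10;
        (*endp)++;
        break;
    default:
        break;
    }
    return v;
}

/*
 * Parse "nn[KMG]@ss[KMG][,nn[KMG]@ss[KMG],..]" into out[].
 * Returns the number of ranges parsed, or -1 on a syntax error.
 */
static int parse_fake_mirror(const char *arg, struct mirror_range *out, int max)
{
    const char *p = arg;
    int n = 0;

    while (*p && n < max) {
        char *end;
        uint64_t size, start;

        size = parse_size(p, &end);
        if (end == p || *end != '@')
            return -1;
        p = end + 1;

        start = parse_size(p, &end);
        if (end == p)
            return -1;

        out[n].start = start;
        out[n].end = start + size;   /* region runs from ss to ss+nn */
        n++;

        p = end;
        if (*p == ',')
            p++;
        else if (*p != '\0')
            return -1;
    }
    return n;
}
```

With this sketch, "2G@4G" yields the range [0x100000000, 0x180000000), i.e. the region runs from ss to ss+nn as described in the documentation hunk of patch 3/3.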

v1 -> v2:
 - change abbreviation of EFI_MEMORY_MORE_RELIABLE from "RELY" to "MR"
 - add patch (2/3) for changing abbreviation of EFI_MEMORY_RUNTIME
 - migrate some code from arch/x86/platform/efi/quirks to
   drivers/firmware/efi/fake_mem.c and create config EFI_FAKE_MEMMAP

Taku Izumi (3):
  efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
  efi: Change abbreviation of EFI_MEMORY_RUNTIME from "RUN" to "RT"
  x86, efi: Add "efi_fake_mem_mirror" boot option

 Documentation/kernel-parameters.txt |   8 ++
 arch/x86/include/asm/efi.h  |   1 +
 arch/x86/kernel/setup.c |   4 +-
 arch/x86/platform/efi/efi.c |   2 +-
 drivers/firmware/efi/Kconfig|  12 +++
 drivers/firmware/efi/Makefile   |   1 +
 drivers/firmware/efi/efi.c  |   8 +-
 drivers/firmware/efi/fake_mem.c | 204 
 include/linux/efi.h |   6 ++
 9 files changed, 241 insertions(+), 5 deletions(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

-- 
1.8.3.1



[PATCH v2 2/3] efi: Change abbreviation of EFI_MEMORY_RUNTIME from "RUN" to "RT"

2015-08-26 Thread Taku Izumi
Currently efi_md_typeattr_format() outputs "RUN" if the passed EFI memory
descriptor has the EFI_MEMORY_RUNTIME attribute. But "RT" is preferred
because it is shorter and clearer.

This patch changes abbreviation of EFI_MEMORY_RUNTIME from "RUN"
to "RT".

Suggested-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/efi.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 8124078..25b6477 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -594,8 +594,8 @@ char * __init efi_md_typeattr_format(char *buf, size_t size,
 		snprintf(pos, size, "|attr=0x%016llx]",
 			 (unsigned long long)attr);
 	else
-		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
-			 attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
+		snprintf(pos, size, "|%2s|%2s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
+			 attr & EFI_MEMORY_RUNTIME ? "RT" : "",
 			 attr & EFI_MEMORY_MORE_RELIABLE ? "MR" : "",
 			 attr & EFI_MEMORY_XP      ? "XP"  : "",
 			 attr & EFI_MEMORY_RP      ? "RP"  : "",
-- 
1.8.3.1



[PATCH v2 3/3] x86, efi: Add "efi_fake_mem_mirror" boot option

2015-08-26 Thread Taku Izumi
This patch introduces a new boot option named "efi_fake_mem_mirror".
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging the Address Range
Mirroring feature.

For example, if you specify "efi_fake_mem_mirror=2G@4G,2G@0x10a0000000",
the original (firmware provided) EFI memmap will be updated so that
the specified memory regions have EFI_MEMORY_MORE_RELIABLE attribute:

 <original EFI memmap>
   efi: mem00: [Boot Data          |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x0000000000001000) (0MB)
   efi: mem01: [Loader Data        |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000002000) (0MB)
   ...
   efi: mem35: [Boot Data          |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000047ee6000-0x0000000048014000) (1MB)
   efi: mem36: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB)
   efi: mem37: [Reserved           |RT|  |  |  |  |   |  |  |  |UC] range=[0x0000000060000000-0x0000000090000000) (768MB)

 <updated EFI memmap>
   efi: mem00: [Boot Data          |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x0000000000001000) (0MB)
   efi: mem01: [Loader Data        |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000002000) (0MB)
   ...
   efi: mem35: [Boot Data          |  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000047ee6000-0x0000000048014000) (1MB)
   efi: mem36: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB)
   efi: mem37: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB)
   efi: mem38: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB)
   efi: mem39: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB)
   efi: mem40: [Reserved           |RT|  |  |  |  |   |  |  |  |UC] range=[0x0000000060000000-0x0000000090000000) (768MB)

And you will find that the following message is output:

   efi: Memory: 4096M/131455M mirrored memory
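
The way one large descriptor ends up as four entries can be sketched in user space as follows. This is only an illustration under stated assumptions, not the logic of fake_mem.c: struct fake_range, struct piece and split_desc() are hypothetical names, the mirror ranges are assumed to be sorted and non-overlapping, and the second range is assumed to start at 0x10a0000000, the address implied by the region sizes in the dump above.

```c
#include <stdint.h>

/* Hypothetical types for this sketch; ranges are half-open, [start, end). */
struct fake_range {
    uint64_t start;
    uint64_t end;
};

struct piece {
    uint64_t start;
    uint64_t end;
    int mirrored;       /* 1 if the piece would carry EFI_MEMORY_MORE_RELIABLE */
};

/*
 * Split one descriptor [start, end) along the given mirror ranges.
 * Assumes mr[] is sorted by start address and does not overlap itself.
 * Returns the number of pieces written to out[], or -1 if out[] is too small.
 */
static int split_desc(uint64_t start, uint64_t end,
                      const struct fake_range *mr, int nr_mr,
                      struct piece *out, int max_out)
{
    uint64_t cur = start;
    int n = 0;

    for (int i = 0; i < nr_mr && cur < end; i++) {
        /* Clamp the mirror range to the part of the descriptor still left. */
        uint64_t s = mr[i].start > cur ? mr[i].start : cur;
        uint64_t e = mr[i].end < end ? mr[i].end : end;

        if (s >= e)
            continue;               /* no overlap with this descriptor */

        if (s > cur) {              /* unmirrored gap before the range */
            if (n == max_out)
                return -1;
            out[n++] = (struct piece){ cur, s, 0 };
        }
        if (n == max_out)
            return -1;
        out[n++] = (struct piece){ s, e, 1 };
        cur = e;
    }

    if (cur < end) {                /* unmirrored tail */
        if (n == max_out)
            return -1;
        out[n++] = (struct piece){ cur, end, 0 };
    }
    return n;
}
```

Applied to the 129536MB Conventional Memory descriptor with the two 2G ranges from the example, this gives pieces of 2048MB (mirrored), 61952MB, 2048MB (mirrored) and 63488MB, which adds up to the 4096M of mirrored memory reported in the message above.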

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |   8 ++
 arch/x86/include/asm/efi.h  |   1 +
 arch/x86/kernel/setup.c |   4 +-
 arch/x86/platform/efi/efi.c |   2 +-
 drivers/firmware/efi/Kconfig|  12 +++
 drivers/firmware/efi/Makefile   |   1 +
 drivers/firmware/efi/fake_mem.c | 204 
 include/linux/efi.h |   6 ++
 8 files changed, 236 insertions(+), 2 deletions(-)
 create mode 100644 drivers/firmware/efi/fake_mem.c

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 1d6f045..0efded6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1092,6 +1092,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			you are really sure that your UEFI does sane gc and
 			fulfills the spec otherwise your board may brick.
 
+	efi_fake_mem_mirror=nn[KMG]@ss[KMG][,nn[KMG]@ss[KMG],..] [EFI; X86]
+			Mark specific memory as mirrored memory and update
+			EFI memory map.
+			Region of memory to be marked is from ss to ss+nn.
+			Using this parameter you can do debugging of Address
+			Range Mirroring feature even if your box doesn't support
+			it.
+
 	eisa_irq_edge=	[PARISC,HW]
 			See header of drivers/parisc/eisa.c.
 
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 155162e..479fd51 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -93,6 +93,7 @@ extern void __init efi_set_executable(efi_memory_desc_t *md, bool executable);
 extern int __init efi_memblock_x86_reserve_range(void);
 extern pgd_t * __init efi_call_phys_prolog(void);
 extern void __init efi_call_phys_epilog(pgd_t *save_pgd);
+extern void __init print_efi_memmap(void);
 extern void __init efi_unmap_memmap(void);
 extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 80f874b..e3ed628 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1104,8 +1104,10 @@ void __init setup_arch(char **cmdline_p)
 	memblock_set_current_limit(ISA_END_ADDRESS);
 	memblock_x86_fill();
 
-	if (efi_enabled(EFI_BOOT))
+	if (efi_enabled(EFI_BOOT)) {
+		efi_fake_memmap();
 		efi_find_mirror();
+	}
 
 	/*
 	 * The EFI specification says that boot service code won't be called
diff --git a/arch/x86/platform/efi/efi.c b/arch/x86/platform/efi/efi.c
index e4308fe..eee8068 100644
--- a/a

RE: [PATCH 2/2] x86, efi: Add "efi_fake_mem_mirror" boot option

2015-08-26 Thread Izumi, Taku
Dear Matt,

Thank you for reviewing.

I have updated my patchset.
I would be happy if you could review the new version.

Sincerely,
Taku Izumi

> -----Original Message-----
> From: Matt Fleming [mailto:m...@codeblueprint.co.uk]
> Sent: Wednesday, August 26, 2015 8:46 AM
> To: Izumi, Taku/泉 拓
> Cc: linux-kernel@vger.kernel.org; linux-...@vger.kernel.org; x...@kernel.org;
> matt.flem...@intel.com;
> t...@linutronix.de; mi...@redhat.com; h...@zytor.com; tony.l...@intel.com;
> qiuxi...@huawei.com; Kamezawa, Hiroyuki/亀澤 寛之
> Subject: Re: [PATCH 2/2] x86, efi: Add "efi_fake_mem_mirror" boot option
>
> On Fri, 21 Aug, at 02:16:00AM, Taku Izumi wrote:
> > This patch introduces new boot option named "efi_fake_mem_mirror".
> > By specifying this parameter, you can mark specific memory as
> > mirrored memory. This is useful for debugging of Address Range
> > Mirroring feature.
> >
> > For example, if you specify "efi_fake_mem_mirror=2G@4G,2G@0x10a0000000",
> > the original (firmware provided) EFI memmap will be updated so that
> > the specified memory regions have EFI_MEMORY_MORE_RELIABLE attribute:
> >
> >  <original EFI memmap>
> >    efi: mem00: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x0000000000001000) (0MB)
> >    efi: mem01: [Loader Data        |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000002000) (0MB)
> >    ...
> >    efi: mem35: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000047ee6000-0x0000000048014000) (1MB)
> >    efi: mem36: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB)
> >    efi: mem37: [Reserved           |RUN|    |  |  |  |   |  |  |  |UC] range=[0x0000000060000000-0x0000000090000000) (768MB)
> >
> >  <updated EFI memmap>
> >    efi: mem00: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x0000000000001000) (0MB)
> >    efi: mem01: [Loader Data        |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000002000) (0MB)
> >    ...
> >    efi: mem35: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000047ee6000-0x0000000048014000) (1MB)
> >    efi: mem36: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB)
> >    efi: mem37: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB)
> >    efi: mem38: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB)
> >    efi: mem39: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB)
> >    efi: mem40: [Reserved           |RUN|    |  |  |  |   |  |  |  |UC] range=[0x0000000060000000-0x0000000090000000) (768MB)
> >
> > And you will find that the following message is output:
> >
> >    efi: Memory: 4096M/131455M mirrored memory
> >
> > Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
> > ---
> >  Documentation/kernel-parameters.txt |   8 ++
> >  arch/x86/include/asm/efi.h          |   2 +
> >  arch/x86/kernel/setup.c             |   4 +-
> >  arch/x86/platform/efi/efi.c         |   2 +-
> >  arch/x86/platform/efi/quirks.c      | 169
> >  5 files changed, 183 insertions(+), 2 deletions(-)
>
> [...]
>
> > diff --git a/arch/x86/platform/efi/quirks.c b/arch/x86/platform/efi/quirks.c
> > index 1c7380d..5c785e1 100644
> > --- a/arch/x86/platform/efi/quirks.c
> > +++ b/arch/x86/platform/efi/quirks.c
> > @@ -18,6 +18,10 @@
> >
>
> The quirks file isn't intended to be used for this kind of feature.
> It's very much a repository for workarounds for quirky firmware, i.e.
> known bugs.
>
> Instead, how about putting all this into a new fake_mem.c file? Going
> further than that, there's nothing that I can see that looks
> particularly x86-specific, so how about sticking all this in
> drivers/firmware/efi/fake_mem.c so that the arm64 folks can make use
> of it if/when they want to start playing around with
> EFI_MEMORY_MORE_RELIABLE?
>
> >  static efi_char16_t efi_dummy_name[6] = { 'D', 'U', 'M', 'M', 'Y', 0 };
> >
> > +#define EFI_MAX_FAKE_MIRROR 8
> > +static struct range fake_mirrors[EFI_MAX_FAKE_MIRROR];
> > +static int num_fake_mirror;
> > +
> >  static bool efi_no_storage_paranoia;
> >
> >  /*
> > @@ -288,3 +292,168 @@ bool efi_poweroff_required(void)
> >  {
> >  	return !!acpi_gbl_reduced_hardware;
> >  }
> > +
> > +void __init efi_fake_memmap(void)
> > +{
> > +	efi_memory_desc_t *md;
> > +	void *p, *q;
> > +	int i;
> > +	int nr_map = memmap.nr_map;
> > +	u64 start, end, m_start, m_end;
> > +	u64 new_memmap_phy;
> > +	void *new_memmap;
> > +
> > +	if (!num_fake_mirror)
> > +		return;
> > +
> > +	/* count up the number of EFI memory descriptor */
> > +	for (p = memmap.map; p < memmap.map_end; p += memmap.desc_size) {
> > +		md = p;
> > +		start = md->phys_addr;
> > +		end = start + (md->num_pages << EFI_PAGE_SHIFT) - 1;
> > +
> > +		for (i = 0; i < num_fake_mirror; i

[PATCH v2][RESEND] perf, x86: Fix multi-segment problem of perf_event_intel_uncore

2015-08-26 Thread Taku Izumi
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  
  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case the relation between bus number and physical id may be broken
because uncore_pcibus_to_physid doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
uncore_pcibus_to_physid array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.
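
The idea of a segment-aware map can be sketched in user space as below. This is not the patch's implementation: the patch uses the kernel's list_head and a raw spinlock, while this sketch uses a plain singly linked list without locking, and pci2phy_find(), pci2phy_set() and pci2phy_lookup() are hypothetical helper names.

```c
#include <stdlib.h>

/* One bus-to-physical-id table per PCI segment (domain). */
struct pci2phy_map {
    struct pci2phy_map *next;
    int segment;
    int pbus_to_physid[256];
};

static struct pci2phy_map *pci2phy_maps;

/* Find the table for a segment, optionally creating it with all ids = -1. */
static struct pci2phy_map *pci2phy_find(int segment, int create)
{
    struct pci2phy_map *map;

    for (map = pci2phy_maps; map; map = map->next)
        if (map->segment == segment)
            return map;

    if (!create)
        return NULL;

    map = malloc(sizeof(*map));
    if (!map)
        return NULL;

    map->segment = segment;
    for (int i = 0; i < 256; i++)
        map->pbus_to_physid[i] = -1;
    map->next = pci2phy_maps;
    pci2phy_maps = map;
    return map;
}

/* Record the physical id for (segment, bus). Returns 0, or -1 on allocation failure. */
static int pci2phy_set(int segment, unsigned char bus, int phys_id)
{
    struct pci2phy_map *map = pci2phy_find(segment, 1);

    if (!map)
        return -1;
    map->pbus_to_physid[bus] = phys_id;
    return 0;
}

/* Look up the physical id for (segment, bus); -1 if unknown. */
static int pci2phy_lookup(int segment, unsigned char bus)
{
    struct pci2phy_map *map = pci2phy_find(segment, 0);

    return map ? map->pbus_to_physid[bus] : -1;
}
```

Because the lookup key is the pair (segment, bus) rather than the bus number alone, buses 0000:ff and 0001:ff no longer share an entry.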

 v1 -> v2:
   - Extract method named uncore_pcibus_to_physid to avoid repetition of
     retrieving phys_id code

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c  | 25 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h  | 11 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 23 +-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 53 --
 4 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 21b5e38..0ed6f2b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,23 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
 	EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	int phys_id = -1;
+	struct pci2phy_map *map;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +827,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id
 	int phys_id;
 	bool first_box = false;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,9 +874,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
 	struct intel_uncore_box *box = pci_get_drvdata(pdev);
 	struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id;
 	bool last_box = false;
 
+	phys_id = uncore_pcibus_to_physid(pdev->bus);
 	box = pci_get_drvdata(pdev);
 	if (!box) {
 		for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 0f77f0a..6c96ee9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,14 @@ struct uncore_event_desc {
 	const char *config;
 };
 
+struct pci2phy_map {
+	struct list_head list;
+	int segment;
+	int pbus_to_physid[256];
+};
+
+int uncore_pcibus_to_physid(struct pci_bus *bus);
+
 ssize_t uncore_event_show(struct kobject *kobj,
 			  struct kobj_attribute *attr, char *buf);
 
@@ -317,7 +325,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, int idx);
 extern struct intel_uncore_type **uncore_msr_uncores;
 extern struct intel_uncore_type **uncore_pci_uncores;
 extern struct pci_driver *uncore_pci_driver;
-extern int uncore_pcibus_to_physid[256];
+extern raw_spinlock_t pci2phy_map_lock;
+extern struct list_head pci2phy_map_head;
 extern struct pci_dev *uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 extern struct event_constraint uncore_constraint_empty;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
index b005a78..ccbc817 100644
--- a/arch/x86/kernel/cpu

[PATCH 0/2][RFC] Introduce "efi_fake_mem_mirror" boot option

2015-08-20 Thread Taku Izumi
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE, which indicates which memory ranges are
mirrored. The Linux kernel can now recognize which memory ranges are
mirrored by handling the EFI_MEMORY_MORE_RELIABLE attribute.
However, testing this feature necessitates machines whose firmware is
compliant with UEFI spec 2.5.

This patchset introduces a new boot option named "efi_fake_mem_mirror".
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging the Memory Address Range
Mirroring feature.

Taku Izumi (2):
  efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
  x86, efi: Add "efi_fake_mem_mirror" boot option

 Documentation/kernel-parameters.txt |   8 ++
 arch/x86/include/asm/efi.h  |   2 +
 arch/x86/kernel/setup.c |   4 +-
 arch/x86/platform/efi/efi.c |   2 +-
 arch/x86/platform/efi/quirks.c  | 169 
 drivers/firmware/efi/efi.c  |   6 +-
 6 files changed, 187 insertions(+), 4 deletions(-)

-- 
1.8.3.1



[PATCH 1/2] efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()

2015-08-20 Thread Taku Izumi
UEFI spec 2.5 introduces new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE. This patch adds this new attribute
support to efi_md_typeattr_format().

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 drivers/firmware/efi/efi.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index d6144e3..aadc1c4 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -589,12 +589,14 @@ char * __init efi_md_typeattr_format(char *buf, size_t size,
 	attr = md->attribute;
 	if (attr & ~(EFI_MEMORY_UC | EFI_MEMORY_WC | EFI_MEMORY_WT |
 		     EFI_MEMORY_WB | EFI_MEMORY_UCE | EFI_MEMORY_WP |
-		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME))
+		     EFI_MEMORY_RP | EFI_MEMORY_XP | EFI_MEMORY_RUNTIME |
+		     EFI_MEMORY_MORE_RELIABLE))
 		snprintf(pos, size, "|attr=0x%016llx]",
 			 (unsigned long long)attr);
 	else
-		snprintf(pos, size, "|%3s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
+		snprintf(pos, size, "|%3s|%4s|%2s|%2s|%2s|%3s|%2s|%2s|%2s|%2s]",
 			 attr & EFI_MEMORY_RUNTIME ? "RUN" : "",
+			 attr & EFI_MEMORY_MORE_RELIABLE ? "RELY" : "",
 			 attr & EFI_MEMORY_XP      ? "XP"  : "",
 			 attr & EFI_MEMORY_RP      ? "RP"  : "",
 			 attr & EFI_MEMORY_WP      ? "WP"  : "",
-- 
1.8.3.1



[PATCH 2/2] x86, efi: Add "efi_fake_mem_mirror" boot option

2015-08-20 Thread Taku Izumi
This patch introduces new boot option named "efi_fake_mem_mirror".
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging of Address Range
Mirroring feature.

For example, if you specify "efi_fake_mem_mirror=2G@4G,2G@0x10a0000000",
the original (firmware provided) EFI memmap will be updated so that
the specified memory regions have EFI_MEMORY_MORE_RELIABLE attribute:

 <original EFI memmap>
   efi: mem00: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x0000000000001000) (0MB)
   efi: mem01: [Loader Data        |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000002000) (0MB)
   ...
   efi: mem35: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000047ee6000-0x0000000048014000) (1MB)
   efi: mem36: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB)
   efi: mem37: [Reserved           |RUN|    |  |  |  |   |  |  |  |UC] range=[0x0000000060000000-0x0000000090000000) (768MB)

 <updated EFI memmap>
   efi: mem00: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000000000-0x0000000000001000) (0MB)
   efi: mem01: [Loader Data        |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000000001000-0x0000000000002000) (0MB)
   ...
   efi: mem35: [Boot Data          |   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000047ee6000-0x0000000048014000) (1MB)
   efi: mem36: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB)
   efi: mem37: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB)
   efi: mem38: [Conventional Memory|   |RELY|  |  |  |   |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB)
   efi: mem39: [Conventional Memory|   |    |  |  |  |   |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB)
   efi: mem40: [Reserved           |RUN|    |  |  |  |   |  |  |  |UC] range=[0x0000000060000000-0x0000000090000000) (768MB)

And you will find that the following message is output:

   efi: Memory: 4096M/131455M mirrored memory
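
To make the option syntax concrete, here is a rough user-space sketch of how a "nn[KMG]@ss[KMG]" list could be turned into address ranges. It is not the code from this patch: `struct fake_range`, `parse_size` and `parse_fake_mirror` are hypothetical names, and the suffix handling is only assumed to behave like the kernel's usual size parsing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical representation of one requested region: [start, end). */
struct fake_range {
    uint64_t start;
    uint64_t end;
};

/* Parse a number with an optional K/M/G suffix; advance *p past it. */
static int parse_size(const char **p, uint64_t *out)
{
    char *end;
    uint64_t v = strtoull(*p, &end, 0); /* base 0: accepts decimal and 0x hex */

    if (end == *p)
        return -1; /* no digits at all */

    switch (*end) {
    case 'G': case 'g':
        v <<= 10; /* fall through */
    case 'M': case 'm':
        v <<= 10; /* fall through */
    case 'K': case 'k':
        v <<= 10;
        end++;
        break;
    default:
        break;
    }

    *p = end;
    *out = v;
    return 0;
}

/*
 * Parse "nn[KMG]@ss[KMG][,nn[KMG]@ss[KMG],..]" into out[].
 * Returns the number of ranges, or -1 on a syntax error or overflow of out[].
 */
int parse_fake_mirror(const char *arg, struct fake_range *out, int max)
{
    const char *p = arg;
    int n = 0;

    assert(arg != NULL && out != NULL);

    while (*p) {
        uint64_t size, start;

        if (n == max)
            return -1;
        if (parse_size(&p, &size) < 0 || *p != '@')
            return -1;
        p++; /* skip '@' */
        if (parse_size(&p, &start) < 0)
            return -1;

        out[n].start = start;
        out[n].end = start + size;
        n++;

        if (*p == ',')
            p++;
        else if (*p)
            return -1;
    }
    return n;
}
```

Feeding it the example string from this message yields the two ranges [0x100000000, 0x180000000) and [0x10a0000000, 0x1120000000), which are the boundaries of the two "RELY" entries in the updated memmap.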

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 Documentation/kernel-parameters.txt |   8 ++
 arch/x86/include/asm/efi.h  |   2 +
 arch/x86/kernel/setup.c |   4 +-
 arch/x86/platform/efi/efi.c |   2 +-
 arch/x86/platform/efi/quirks.c  | 169 
 5 files changed, 183 insertions(+), 2 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index 1d6f045..0efded6 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1092,6 +1092,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			you are really sure that your UEFI does sane gc and
 			fulfills the spec otherwise your board may brick.
 
+	efi_fake_mem_mirror=nn[KMG]@ss[KMG][,nn[KMG]@ss[KMG],..] [EFI; X86]
+			Mark specific memory as mirrored memory and update
+			EFI memory map.
+			Region of memory to be marked is from ss to ss+nn.
+			Using this parameter you can do debugging of Address
+			Range Mirroring feature even if your box doesn't support
+			it.
+
 	eisa_irq_edge=	[PARISC,HW]
 			See header of drivers/parisc/eisa.c.
 
diff --git a/arch/x86/include/asm/efi.h b/arch/x86/include/asm/efi.h
index 155162e..50e53cc 100644
--- a/arch/x86/include/asm/efi.h
+++ b/arch/x86/include/asm/efi.h
@@ -93,6 +93,7 @@ extern void __init efi_set_executable(efi_memory_desc_t *md, 
bool executable);
 extern int __init efi_memblock_x86_reserve_range(void);
 extern pgd_t * __init efi_call_phys_prolog(void);
 extern void __init efi_call_phys_epilog(pgd_t *save_pgd);
+extern void __init print_efi_memmap(void);
 extern void __init efi_unmap_memmap(void);
 extern void __init efi_memory_uc(u64 addr, unsigned long size);
 extern void __init efi_map_region(efi_memory_desc_t *md);
@@ -107,6 +108,7 @@ extern void __init efi_dump_pagetable(void);
 extern void __init efi_apply_memmap_quirks(void);
 extern int __init efi_reuse_config(u64 tables, int nr_tables);
 extern void efi_delete_dummy_variable(void);
+extern void __init efi_fake_memmap(void);
 
 struct efi_setup_data {
u64 fw_vendor;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 80f874b..e3ed628 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1104,8 +1104,10 @@ void __init setup_arch(char **cmdline_p)
memblock_set_current_limit(ISA_END_ADDRESS);
memblock_x86_fill();
 
-   if (efi_enabled(EFI_BOOT))
+   if (efi_enabled(EFI_BOOT)) {
+   efi_fake_memmap();
efi_find_mirror();
+   }
 
/*
 * The EF

[PATCH 0/2][RFC] Introduce efi_fake_mem_mirror boot option

2015-08-20 Thread Taku Izumi
UEFI spec 2.5 introduces a new Memory Attribute Definition named
EFI_MEMORY_MORE_RELIABLE, which indicates which memory ranges are
mirrored. The Linux kernel can now recognize mirrored memory ranges
by handling the EFI_MEMORY_MORE_RELIABLE attribute.
However, testing this feature requires machines with UEFI spec 2.5
compliant firmware.

This patchset introduces a new boot option named "efi_fake_mem_mirror".
By specifying this parameter, you can mark specific memory as
mirrored memory. This is useful for debugging the Address Range
Mirroring feature.
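
As an illustration of what "recognizing mirrored ranges" amounts to, the sketch below walks a list of memory descriptors and adds up the bytes that carry EFI_MEMORY_MORE_RELIABLE. It is a simplified user-space model under stated assumptions: `struct mem_desc`, `struct mirror_stats` and `count_mirrored` are hypothetical names, and the kernel's own accounting may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EFI_PAGE_SHIFT           12                    /* EFI pages are 4 KiB */
#define EFI_MEMORY_MORE_RELIABLE 0x0000000000010000ULL /* attribute bit 16 */

/* Hypothetical, trimmed-down memory descriptor. */
struct mem_desc {
    uint64_t phys_addr;
    uint64_t num_pages;
    uint64_t attribute;
};

struct mirror_stats {
    uint64_t total_bytes;
    uint64_t mirrored_bytes;
};

/* Sum up all descriptors and, separately, the ones marked more reliable. */
struct mirror_stats count_mirrored(const struct mem_desc *map, size_t n)
{
    struct mirror_stats s = { 0, 0 };

    assert(map != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        uint64_t size = map[i].num_pages << EFI_PAGE_SHIFT;

        s.total_bytes += size;
        if (map[i].attribute & EFI_MEMORY_MORE_RELIABLE)
            s.mirrored_bytes += size;
    }
    return s;
}
```

With a fake-mirror option, descriptors that the firmware never marked can be given this attribute before such a walk, so the rest of the mirroring code can be exercised on ordinary hardware.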

Taku Izumi (2):
  efi: Add EFI_MEMORY_MORE_RELIABLE support to efi_md_typeattr_format()
  x86, efi: Add efi_fake_mem_mirror boot option

 Documentation/kernel-parameters.txt |   8 ++
 arch/x86/include/asm/efi.h  |   2 +
 arch/x86/kernel/setup.c |   4 +-
 arch/x86/platform/efi/efi.c |   2 +-
 arch/x86/platform/efi/quirks.c  | 169 
 drivers/firmware/efi/efi.c  |   6 +-
 6 files changed, 187 insertions(+), 4 deletions(-)

-- 
1.8.3.1



[PATCH v2] perf: Fix multi-segment problem of perf_event_intel_uncore

2015-08-04 Thread Taku Izumi
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.

 v1->v2:
  - Extract a method named uncore_pcibus_to_physid to avoid repetition of
    the phys_id retrieval code
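
The idea of a per-segment map can be modelled outside the kernel roughly as follows. This is a single-threaded sketch only: `struct seg_map`, `seg_map_get` and `seg_map_lookup` are hypothetical names, a plain singly linked list stands in for the kernel's list_head, and the locking done with pci2phy_map_lock in the patch is deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>

/* One map per PCI segment; each holds a bus-number to physical-id table. */
struct seg_map {
    struct seg_map *next;
    int segment;
    int bus_to_physid[256];
};

/* Find the map for a segment, creating it (all entries -1) if missing. */
struct seg_map *seg_map_get(struct seg_map **head, int segment)
{
    struct seg_map *m;

    for (m = *head; m; m = m->next)
        if (m->segment == segment)
            return m;

    m = malloc(sizeof(*m));
    if (!m)
        return NULL;

    m->segment = segment;
    for (int i = 0; i < 256; i++)
        m->bus_to_physid[i] = -1;
    m->next = *head;
    *head = m;
    return m;
}

/* Look up the physical id for (segment, bus); -1 if nothing is recorded. */
int seg_map_lookup(const struct seg_map *head, int segment, int bus)
{
    assert(bus >= 0 && bus < 256);

    for (const struct seg_map *m = head; m; m = m->next)
        if (m->segment == segment)
            return m->bus_to_physid[bus];
    return -1;
}
```

Because the segment selects the table first, bus ff in segment 0000 and bus ff in segment 0001 no longer share an entry. Funnelling every caller through one lookup function is the same design choice as the v2 change noted above.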

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c  | 25 --
 arch/x86/kernel/cpu/perf_event_intel_uncore.h  | 11 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 23 +-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 53 --
 4 files changed, 94 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c 
b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 21b5e38..0ed6f2b 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev 
*uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -20,6 +21,23 @@ static struct event_constraint uncore_constraint_fixed =
 struct event_constraint uncore_constraint_empty =
EVENT_CONSTRAINT(0, 0, 0);
 
+int uncore_pcibus_to_physid(struct pci_bus *bus)
+{
+	int phys_id = -1;
+	struct pci2phy_map *map;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(bus)) {
+			phys_id = map->pbus_to_physid[bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
+
+	return phys_id;
+}
+
 ssize_t uncore_event_show(struct kobject *kobj,
  struct kobj_attribute *attr, char *buf)
 {
@@ -809,7 +827,7 @@ static int uncore_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id
int phys_id;
bool first_box = false;
 
-   phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+   phys_id = uncore_pcibus_to_physid(pdev->bus);
if (phys_id < 0)
return -ENODEV;
 
@@ -856,9 +874,10 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
struct intel_uncore_box *box = pci_get_drvdata(pdev);
struct intel_uncore_pmu *pmu;
-   int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+   int i, cpu, phys_id;
bool last_box = false;
 
+   phys_id = uncore_pcibus_to_physid(pdev->bus);
box = pci_get_drvdata(pdev);
if (!box) {
for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h 
b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 0f77f0a..6c96ee9 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,14 @@ struct uncore_event_desc {
const char *config;
 };
 
+struct pci2phy_map {
+   struct list_head list;
+   int segment;
+   int pbus_to_physid[256];
+};
+
+int uncore_pcibus_to_physid(struct pci_bus *bus);
+
 ssize_t uncore_event_show(struct kobject *kobj,
  struct kobj_attribute *attr, char *buf);
 
@@ -317,7 +325,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, 
int idx);
 extern struct intel_uncore_type **uncore_msr_uncores;
 extern struct intel_uncore_type **uncore_pci_uncores;
 extern struct pci_driver *uncore_pci_driver;
-extern int uncore_pcibus_to_physid[256];
+extern raw_spinlock_t pci2phy_map_lock;
+extern struct list_head pci2phy_map_head;
 extern struct pci_dev 
*uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 extern struct event_constraint uncore_constraint_empty;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c 
b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
index b005a78..ccbc817 10064

[PATCH] perf: Fix multi-segment problem of perf_event_intel_uncore

2015-06-30 Thread Taku Izumi
In a multi-segment system, uncore devices may belong to buses whose segment
number is other than 0.

  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...

In that case the relation between bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take the PCI segment into account.
For example, buses 0000:ff and 0001:ff use the same entry of the
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing a segment-aware pci2phy_map instead.
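
The collision described above is easy to reproduce in a few lines. The sketch below is a toy model of the old bus-only table, with hypothetical names (`flat_map_init`, `flat_map_set`, `flat_map_get`); it ignores the segment in the same way the original 256-entry array did.

```c
#include <assert.h>

/* Toy model of the old table: indexed by bus number only. */
static int flat_map[256];

void flat_map_init(void)
{
    for (int i = 0; i < 256; i++)
        flat_map[i] = -1;
}

/* The segment is accepted but cannot be used: there is no room for it. */
void flat_map_set(int segment, int bus, int physid)
{
    (void)segment;
    assert(bus >= 0 && bus < 256);
    flat_map[bus] = physid;
}

int flat_map_get(int segment, int bus)
{
    (void)segment;
    assert(bus >= 0 && bus < 256);
    return flat_map[bus];
}
```

Recording physical id 0 for bus ff in segment 0 and then physical id 3 for bus ff in segment 1 leaves only the second value in the table, so a later lookup for segment 0 returns the wrong socket.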

Signed-off-by: Taku Izumi <izumi.t...@jp.fujitsu.com>
---
 arch/x86/kernel/cpu/perf_event_intel_uncore.c  | 27 +++---
 arch/x86/kernel/cpu/perf_event_intel_uncore.h  |  9 -
 arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c  | 23 +++-
 .../x86/kernel/cpu/perf_event_intel_uncore_snbep.c | 41 ++
 4 files changed, 87 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c 
b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index 21b5e38..78c8686 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -7,7 +7,8 @@ struct intel_uncore_type **uncore_pci_uncores = empty_uncore;
 static bool pcidrv_registered;
 struct pci_driver *uncore_pci_driver;
 /* pci bus to socket mapping */
-int uncore_pcibus_to_physid[256] = { [0 ... 255] = -1, };
+DEFINE_RAW_SPINLOCK(pci2phy_map_lock);
+struct list_head pci2phy_map_head = LIST_HEAD_INIT(pci2phy_map_head);
 struct pci_dev 
*uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 
 static DEFINE_RAW_SPINLOCK(uncore_box_lock);
@@ -806,10 +807,18 @@ static int uncore_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id
struct intel_uncore_pmu *pmu;
struct intel_uncore_box *box;
struct intel_uncore_type *type;
-	int phys_id;
+	int phys_id = -1;
 	bool first_box = false;
+	struct pci2phy_map *map;
 
-	phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(pdev->bus)) {
+			phys_id = map->pbus_to_physid[pdev->bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
 	if (phys_id < 0)
 		return -ENODEV;
 
@@ -856,8 +865,18 @@ static void uncore_pci_remove(struct pci_dev *pdev)
 {
struct intel_uncore_box *box = pci_get_drvdata(pdev);
struct intel_uncore_pmu *pmu;
-	int i, cpu, phys_id = uncore_pcibus_to_physid[pdev->bus->number];
+	int i, cpu, phys_id = -1;
 	bool last_box = false;
+	struct pci2phy_map *map;
+
+	raw_spin_lock(&pci2phy_map_lock);
+	list_for_each_entry(map, &pci2phy_map_head, list) {
+		if (map->segment == pci_domain_nr(pdev->bus)) {
+			phys_id = map->pbus_to_physid[pdev->bus->number];
+			break;
+		}
+	}
+	raw_spin_unlock(&pci2phy_map_lock);
 
box = pci_get_drvdata(pdev);
if (!box) {
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h 
b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
index 0f77f0a..0fb2a23 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
@@ -117,6 +117,12 @@ struct uncore_event_desc {
const char *config;
 };
 
+struct pci2phy_map {
+   struct list_head list;
+   int segment;
+   int pbus_to_physid[256];
+};
+
 ssize_t uncore_event_show(struct kobject *kobj,
  struct kobj_attribute *attr, char *buf);
 
@@ -317,7 +323,8 @@ u64 uncore_shared_reg_config(struct intel_uncore_box *box, 
int idx);
 extern struct intel_uncore_type **uncore_msr_uncores;
 extern struct intel_uncore_type **uncore_pci_uncores;
 extern struct pci_driver *uncore_pci_driver;
-extern int uncore_pcibus_to_physid[256];
+extern raw_spinlock_t pci2phy_map_lock;
+extern struct list_head pci2phy_map_head;
 extern struct pci_dev 
*uncore_extra_pci_dev[UNCORE_SOCKET_MAX][UNCORE_EXTRA_PCI_DEV_MAX];
 extern struct event_constraint uncore_constraint_empty;
 
diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c 
b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
index b005a78..ccbc817 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore_snb.c
@@ -402,14 +402,35 @@ static int snb_pci2phy_map_init(int devid)
 {
   

RE: [RFC PATCH 0/2 shit_A shit_B] workqueue: fix wq_numa bug

2015-01-22 Thread Izumi, Taku

> This patches are un-changloged, un-compiled, un-booted, un-tested,
> they are just shits, I even hope them un-sent or blocked.
> 
> The patches include two -solutions-:
> 
> Shit_A:
>   workqueue: reset pool->node and unhash the pool when the node is
> offline
>   update wq_numa when cpu_present_mask changed
> 
>  kernel/workqueue.c | 107 
> +
>  1 file changed, 84 insertions(+), 23 deletions(-)
> 
> 
> Shit_B:
>   workqueue: reset pool->node and unhash the pool when the node is
> offline
>   workqueue: remove wq_numa_possible_cpumask
>   workqueue: directly update attrs of pools when cpu hot[un]plug
> 
>  kernel/workqueue.c | 135 
> +++--
>  1 file changed, 101 insertions(+), 34 deletions(-)
> 

  I tried your patchsets.
  linux-3.18.3 + Shit_A:

Build OK.
I tried to reproduce the problem that Ishimatsu had reported, but it
did not occur.
It seems that your patch fixes this problem.

  linux-3.18.3 + Shit_B:

Build OK, but I encountered a kernel panic at boot time.

[0.189000] BUG: unable to handle kernel NULL pointer dereference at 0008
[0.189000] IP: [<8131ef96>] __list_add+0x16/0xc0
[0.189000] PGD 0
[0.189000] Oops:  [#1] SMP
[0.189000] Modules linked in:
[0.189000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.18.3+ #3
[0.189000] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.81 12/03/2014
[0.189000] task: 880869678000 ti: 880869664000 task.ti: 880869664000
[0.189000] RIP: 0010:[<8131ef96>]  [<8131ef96>] __list_add+0x16/0xc0
[0.189000] RSP: :880869667be8  EFLAGS: 00010296
[0.189000] RAX: 88087f83cda8 RBX: 88087f83cd80 RCX: 
[0.189000] RDX:  RSI: 88086912bb98 RDI: 88087f83cd80
[0.189000] RBP: 880869667c08 R08:  R09: 88087f807480
[0.189000] R10: 810911b6 R11: 810956ac R12: 
[0.189000] R13: 88086912bb98 R14: 0400 R15: 0400
[0.189000] FS:  () GS:88087fc0() knlGS:
[0.189000] CS:  0010 DS:  ES:  CR0: 80050033
[0.189000] CR2: 0008 CR3: 01998000 CR4: 001407f0
[0.189000] Stack:
[0.189000]  000a 88086912b800 88087f83cd00 88087f80c000
[0.189000]  880869667c48 810912c8 880869667c28 88087f803f00
[0.189000]  fff4 88086964b760 88086964b6a0 88086964b740
[0.189000] Call Trace:
[0.189000]  [<810912c8>] alloc_unbound_pwq+0x298/0x3b0
[0.189000]  [<81091ce8>] apply_workqueue_attrs+0x158/0x4c0
[0.189000]  [<81092424>] __alloc_workqueue_key+0x174/0x5b0
[0.189000]  [<813052a6>] ? alloc_cpumask_var_node+0x56/0x80
[0.189000]  [<81b21573>] init_workqueues+0x33d/0x40f
[0.189000]  [<81b21236>] ? ftrace_define_fields_workqueue_execute_start+0x6a/0x6a
[0.189000]  [<81002144>] do_one_initcall+0xd4/0x210
[0.189000]  [<81b12f4d>] ? native_smp_prepare_cpus+0x34d/0x352
[0.189000]  [<81b0026d>] kernel_init_freeable+0xf5/0x23c
[0.189000]  [<81653370>] ? rest_init+0x80/0x80
[0.189000]  [<8165337e>] kernel_init+0xe/0xf0
[0.189000]  [<8166bcfc>] ret_from_fork+0x7c/0xb0
[0.189000]  [<81653370>] ? rest_init+0x80/0x80
[0.189000] Code: ff b8 f4 ff ff ff e9 3b ff ff ff b8 f4 ff ff ff e9 31 ff ff ff 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 48 89 fb 48 83 ec 08 <4c> 8b 42 08 49 39 f0 75 2e 4d 8b 45 00 4d 39 c4 75 6c 4c 39 e3
[0.189000] RIP  [<8131ef96>] __list_add+0x16/0xc0
[0.189000]  RSP <880869667be8>
[0.189000] CR2: 0008
[0.189000] ---[ end trace 58feee6875cf67cf ]---
[0.189000] Kernel panic - not syncing: Fatal exception
[0.189000] ---[ end Kernel panic - not syncing: Fatal exception
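One possible reading of the trace above, offered as a sketch and not as a confirmed diagnosis: the faulting bytes `<4c> 8b 42 08` appear to decode to a load from offset 8 of RDX, and CR2 is 8, which would be consistent with `__list_add()` being handed a NULL `next` pointer, for example from a list head that was zeroed but never initialized. The user-space model below copies the two-pointer layout of the kernel's `struct list_head`; `init_list_head()`, `list_add_checked()` and the two `demo_*()` functions are hypothetical helpers written for this illustration and are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Same two-pointer layout as the kernel's struct list_head. */
struct list_head {
	struct list_head *next;
	struct list_head *prev;
};

/* An empty list points at itself in both directions. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/*
 * Hypothetical checked variant of list_add(): returns -1 at the point
 * where the real __list_add() would read next->prev through NULL.
 */
static int list_add_checked(struct list_head *entry, struct list_head *head)
{
	struct list_head *next;

	assert(entry != NULL && head != NULL);
	next = head->next;
	if (next == NULL)
		return -1;	/* next->prev would fault at address 8 on LP64 */

	next->prev = entry;
	entry->next = next;
	entry->prev = head;
	head->next = entry;
	return 0;
}

/* Adding to a zeroed, never-initialized head is rejected. */
static int demo_zeroed_head(void)
{
	struct list_head zeroed = { NULL, NULL };
	struct list_head node;

	return list_add_checked(&node, &zeroed);
}

/* Adding to a properly initialized head succeeds. */
static int demo_initialized_head(void)
{
	struct list_head head, node;

	init_list_head(&head);
	return list_add_checked(&node, &head);
}
```

On a 64-bit build `prev` sits at offset 8 inside the structure, which is why a NULL `next` would show up as a fault at address 8 and not at address 0.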

   
  Sincerely,
  Taku Izumi


> Patch 1 of both solutions is the same: reset pool->node and unhash the pool.
> It was suggested by TJ, and I found it is a good leading step for fixing the bug.
> 
> The other patches handle wq_numa_possible_cpumask, where the solutions
> diverge.
> 
> Solution_A uses present_mask rather than possible_cpumask. It adds
> wq_numa_notify_cpu_present_set/cleared() for notifications of
> changes to cpu_present_mask.  But such notifications do not exist
> right now, so I fake one (wq_numa_check_present_cpumask_changes())
> to imitate them.  I hope the memory people add a real one.
> 
> Solution_B uses online_mask rather than possible_cpumask.
> This solution removes more coupling between numa_code and workqueue;
> it just depends on cpumask_of_node(node).
> 
> Patch2_of_Solution_B removes wq_numa_possible_cpumask and adds
> overhead on cpu hot[un]plug; Patch3 reduces this overhead.
> 
> Thanks,
> Lai
> 
> 
> Reported-by: Yasuaki Ishimatsu 
> Cc: Tejun Heo 
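The difference between the two quoted solutions can be illustrated with plain bitmasks. This is a user-space sketch under stated assumptions, not the workqueue code: a CPU mask is modelled as a 64-bit integer, and `pool_cpumask()` and `mask_weight()` are hypothetical helpers. The idea shown is only that the CPU set a per-node pool may use is the node's CPU set intersected with whichever global mask the solution relies on (the present mask for Solution_A, the online mask for Solution_B).

```c
#include <stdint.h>

/* Model only: one bit per CPU, bit n set means CPU n is in the set. */
typedef uint64_t cpumask_model_t;

/* CPUs of a node that are also in the chosen global mask. */
static cpumask_model_t pool_cpumask(cpumask_model_t node_cpus,
				    cpumask_model_t global_mask)
{
	return node_cpus & global_mask;
}

/* Number of CPUs in a mask (clears the lowest set bit each round). */
static int mask_weight(cpumask_model_t mask)
{
	int n = 0;

	while (mask) {
		mask &= mask - 1;
		n++;
	}
	return n;
}
```

For example, if a node spans CPUs 4-7 (`0xF0`), CPUs 0-5 are present (`0x3F`) and CPUs 0-4 are online (`0x1F`), the present-based mask is `0x30` and the online-based mask is `0x10`.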


Re: [PATCH 2/3 v2] Do not use acpi_device to find pci root bridge in _init code.

2012-10-14 Thread Taku Izumi
On Fri, 12 Oct 2012 20:34:20 +0800
Tang Chen  wrote:

> When the kernel is being initialized and some hardware has not been added
> to the system, there won't be acpi_device structs for these devices. But
> acpi_is_root_bridge() depends on the acpi_device struct. As a result, none
> of the not-yet-added root bridges will be judged as a root bridge in
> find_root_bridges(). Furthermore, no handle_hotplug_event_root()
> notifier will be installed for them.
> 
> This patch introduces a new API to find all root bridges in the system by
> getting the HID directly from the ACPI namespace, without depending on the
> acpi_device struct.

  How about squashing patch #2 into patch #1?
  In my mind, the caller and the callee should be introduced in the same patch.

  Best regards,
  Taku Izumi

> Signed-off-by: Tang Chen 
> Signed-off-by: Liu Jiang 
> ---
>  drivers/acpi/pci_root.c |   19 +++
>  1 files changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> index 6151d83..582eb11 100644
> --- a/drivers/acpi/pci_root.c
> +++ b/drivers/acpi/pci_root.c
> @@ -129,20 +129,23 @@ EXPORT_SYMBOL_GPL(acpi_get_pci_rootbridge_handle);
>   * acpi_is_root_bridge - determine whether an ACPI CA node is a PCI root 
> bridge
>   * @handle - the ACPI CA node in question.
>   *
> - * Note: we could make this API take a struct acpi_device * instead, but
> - * for now, it's more convenient to operate on an acpi_handle.
> + * Note: If a device is not added to the system yet, there won't be an
> + * acpi_device struct for it. So do not get HID and CID from acpi_device,
> + * get them from ACPI namespace directly.
>   */
>  int acpi_is_root_bridge(acpi_handle handle)
>  {
> - int ret;
> - struct acpi_device *device;
> + struct acpi_device_info *info;
> + acpi_status status;
>  
> - ret = acpi_bus_get_device(handle, &device);
> - if (ret)
> + status = acpi_get_object_info(handle, &info);
> + if (ACPI_FAILURE(status)) {
> + printk(KERN_ERR PREFIX "%s: Error reading"
> +"device info\n", __func__);
>   return 0;
> + }
>  
> - ret = acpi_match_device_ids(device, root_device_ids);
> - if (ret)
> + if (acpi_match_object_info_ids(info, root_device_ids))
>   return 0;
>   else
>   return 1;
> -- 
> 1.7.1
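The idea in the quoted patch, matching on IDs read from the namespace object instead of on a `struct acpi_device`, can be sketched in user space as follows. `struct object_info`, `id_in_table()`, `match_object_info_ids()` and `is_root_bridge()` are simplified stand-ins invented for this illustration; they are not the ACPICA `struct acpi_device_info` or the patch's `acpi_match_object_info_ids()`. `"PNP0A03"` is the PCI host bridge ID, and the sketch keeps the convention visible in the patch that the match helper returns 0 on a match.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CIDS 4

/* Simplified stand-in for the ID fields of an ACPI object. */
struct object_info {
	const char *hid;		/* _HID, may be NULL */
	const char *cids[MAX_CIDS];	/* _CID list, NULL-terminated unless full */
};

/* IDs that identify a PCI root bridge in this sketch. */
static const char *const root_ids[] = { "PNP0A03", NULL };

/* Returns 1 if id appears in the NULL-terminated table, else 0. */
static int id_in_table(const char *id, const char *const *table)
{
	if (id == NULL)
		return 0;
	for (; *table != NULL; table++)
		if (strcmp(id, *table) == 0)
			return 1;
	return 0;
}

/* Returns 0 if the HID or any CID matches the table, -1 otherwise. */
static int match_object_info_ids(const struct object_info *info,
				 const char *const *table)
{
	size_t i;

	if (id_in_table(info->hid, table))
		return 0;
	for (i = 0; i < MAX_CIDS && info->cids[i] != NULL; i++)
		if (id_in_table(info->cids[i], table))
			return 0;
	return -1;
}

/* Returns 1 for a root bridge, 0 otherwise, like acpi_is_root_bridge(). */
static int is_root_bridge(const struct object_info *info)
{
	return match_object_info_ids(info, root_ids) == 0;
}
```

A PCI Express root bridge typically reports `PNP0A08` as its HID with `PNP0A03` as a compatible ID, so checking the CID list matters. As far as I know, the buffer returned by `acpi_get_object_info()` is allocated for the caller, so real code would presumably also need to free `info` before returning.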
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


-- 
Taku Izumi 

Re: [RFC PATCH 2/3] ACPIHP: ACPI system device hotplug slot enumerator

2012-08-03 Thread Taku Izumi
andled by acpiphp or pciehp drivers.
> +  */
> + if (type == ACPIHP_DEV_TYPE_HOST_BRIDGE)
> + return AE_CTRL_DEPTH;
> +
> + return AE_OK;
> +}
> +
> +/*
> + * Get types of child devices connected to this slot.
> + * We only care about CPU, memory, PCI host bridge and CONTAINER here.
> + * Values used here must be in consistence with acpihp_enum_get_slot_type().
> + */
> +static acpi_status __init
> +acpihp_enum_get_dev_type(acpi_handle handle, u32 lvl, void *context, void 
> **rv)
> +{
> + acpi_status status = AE_OK;
> + enum acpihp_dev_type type;
> + u32 *tp = (u32 *)rv;
> +
> + if (!acpihp_dev_get_type(handle, &type)) {
> + switch (type) {
> + case ACPIHP_DEV_TYPE_CPU:
> + *tp |= 0x0001;
> + status = AE_CTRL_DEPTH;
> + break;
> + case ACPIHP_DEV_TYPE_MEM:
> + *tp |= 0x0002;
> + status = AE_CTRL_DEPTH;
> + break;
> + case ACPIHP_DEV_TYPE_HOST_BRIDGE:
> + *tp |= 0x0004;
> + status = AE_CTRL_DEPTH;
> + break;
> + case ACPIHP_DEV_TYPE_CONTAINER:
> + *tp |= 0x0008;
> + break;
> + default:
> + break;
> + }
> + }
> +
> + return status;
> +}
> +
> +/*
> + * Guess type of a hotplug slot according to child devices connecting to it.
> + */
> +static enum acpihp_slot_type __init acpihp_enum_get_slot_type(u32 dev_types)
> +{
> + BUG_ON(dev_types > 15);
> +
> + switch (dev_types) {
> + case 0:
> + /* Generic CONTAINER */
> + return ACPIHP_SLOT_TYPE_COMMON;
> + case 1:
> + /* Physical processor with logical CPUs */
> + return ACPIHP_SLOT_TYPE_CPU;
> + case 2:
> + /* Memory board/box with memory devices */
> + return ACPIHP_SLOT_TYPE_MEM;
> + case 3:
> + /* Physical processor with CPUs and memory controllers */
> + return ACPIHP_SLOT_TYPE_CPU;
> + case 4:
> + /* IO eXtension board/box with IO host bridges */
> + return ACPIHP_SLOT_TYPE_IOX;
> + case 7:
> + /* Physical processor with CPUs, IO host bridges and MCs. */
> + return ACPIHP_SLOT_TYPE_CPU;


   Why is this case ACPIHP_SLOT_TYPE_CPU?
   I think this case should be ACPIHP_SLOT_TYPE_COMMON or something else.
   By the way, how about simplifying the slot type categories?
   Do we need to differentiate cases 7, 8, 9, 11 and 15?
 
   Best regards,
   Taku Izumi


> + case 8:
> + /* Generic CONTAINER */
> + return ACPIHP_SLOT_TYPE_COMMON;
> + case 9:
> + /* System board with physical processors */
> + return ACPIHP_SLOT_TYPE_SYSTEM_BOARD;
> + case 11:
> + /* System board with physical processors and memory */
> + return ACPIHP_SLOT_TYPE_SYSTEM_BOARD;
> + case 15:
> + /* Node with processor, memory and IO host bridge */
> + return ACPIHP_SLOT_TYPE_NODE;
> + default:
> + return ACPIHP_SLOT_TYPE_UNKNOWN;
> + }
> +}
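To make the question about the slot type categories easier to follow, here is a user-space sketch of the quoted mapping. It assumes the bit assignment shown in acpihp_enum_get_dev_type() (CPU = 0x1, memory = 0x2, host bridge = 0x4, container = 0x8); the shortened enum names, `slot_type_from_mask()` and `count_unknown_masks()` are hypothetical names for this illustration only.

```c
/* Shortened stand-ins for the ACPIHP_SLOT_TYPE_* values in the patch. */
enum slot_type {
	SLOT_COMMON,
	SLOT_CPU,
	SLOT_MEM,
	SLOT_IOX,
	SLOT_SYSTEM_BOARD,
	SLOT_NODE,
	SLOT_UNKNOWN
};

/* Bits accumulated while walking the child devices of a slot. */
#define DEV_CPU		0x1u
#define DEV_MEM		0x2u
#define DEV_HOST_BRIDGE	0x4u
#define DEV_CONTAINER	0x8u

/* Same case grouping as the quoted switch statement. */
static enum slot_type slot_type_from_mask(unsigned int dev_types)
{
	switch (dev_types) {
	case 0:
	case DEV_CONTAINER:				/* 8 */
		return SLOT_COMMON;
	case DEV_CPU:					/* 1 */
	case DEV_CPU | DEV_MEM:				/* 3 */
	case DEV_CPU | DEV_MEM | DEV_HOST_BRIDGE:	/* 7 */
		return SLOT_CPU;
	case DEV_MEM:					/* 2 */
		return SLOT_MEM;
	case DEV_HOST_BRIDGE:				/* 4 */
		return SLOT_IOX;
	case DEV_CONTAINER | DEV_CPU:			/* 9 */
	case DEV_CONTAINER | DEV_CPU | DEV_MEM:		/* 11 */
		return SLOT_SYSTEM_BOARD;
	case DEV_CONTAINER | DEV_CPU | DEV_MEM | DEV_HOST_BRIDGE: /* 15 */
		return SLOT_NODE;
	default:
		return SLOT_UNKNOWN;
	}
}

/* Counts how many of the 16 possible masks fall through to UNKNOWN. */
static unsigned int count_unknown_masks(void)
{
	unsigned int mask, n = 0;

	for (mask = 0; mask <= 15; mask++)
		if (slot_type_from_mask(mask) == SLOT_UNKNOWN)
			n++;
	return n;
}
```

Written this way, it is visible that the combinations 5, 6, 10, 12, 13 and 14 have no category at all, which may be one more argument for simplifying the classification.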
> +
> +/*
> + * Guess type of a hotplug slot according to the device type of the
> + * corresponding ACPI object itself.
> + */
> +static enum acpihp_slot_type __init
> +acpihp_enum_check_slot_self(struct acpihp_slot *slot)
> +{
> + enum acpihp_dev_type type;
> +
> + if (acpihp_dev_get_type(slot->handle, &type))
> + return ACPIHP_SLOT_TYPE_UNKNOWN;
> +
> + switch (type) {
> + case ACPIHP_DEV_TYPE_CPU:
> + /* Logical CPU used in virtualization environment */
> + return ACPIHP_SLOT_TYPE_CPU;
> + case ACPIHP_DEV_TYPE_MEM:
> + /* Memory board with single memory device */
> + return ACPIHP_SLOT_TYPE_MEM;
> + case ACPIHP_DEV_TYPE_HOST_BRIDGE:
> + /* IO eXtension board/box with single IO host bridge */
> + return ACPIHP_SLOT_TYPE_IOX;
> + default:
> + return ACPIHP_SLOT_TYPE_UNKNOWN;
> + }
> +}
> +
> +static int __init acpihp_enum_generate_slot_name(struct acpihp_slot *slot)
> +{
> + int found = 0;
> + struct list_head *list;
> + struct acpihp_slot_id *slot_id;
> + unsigned long long uid;
> +
> + /* Respect firmware settings if _UID return an integer. */
> + if (ACPI_SUCCESS(acpi_evaluate_integer(slot->handle, METHOD_NAME__UID,
> +NULL, &uid)))
> + goto set_name;
> +
> + if (slot->parent)
> + list = &slot->parent->slot_id_list;
> + else
> + list = &slot_id_list;
> +
> + list_for_each_entry(slot_id, list, node)
> + if (slot_id->type == slot->type) {
> + found = 1;
> + break;
> + }
> + if (!found) {
> + slot_id = kzalloc(sizeof(struct acpihp_slot_id), GFP_KERNEL);
> + if (!slot_id) {
> + ACPIHP_DEBUG("fails to allocate slot instance ID.\n");
> + return -ENOMEM;
> + }
> + slot_id->type = slot->type;
> + list_add_tail(&slot_id->node, list);
> + }
> +
> + uid = slot_id->instance_id++;
> +
> +set_name:
> + snprintf(slot->name, sizeof(slot->name) - 1, "%s%02llx",
> +  acpihp_get_slot_type_name(slot->type), uid);
> + dev_set_name(&slot->dev, "%s", slot->name);
> +
> + return 0;
> +}
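The naming scheme in the quoted function (a type prefix followed by the instance number as at least two hex digits) can be tried out in user space with the sketch below. `format_slot_name()` is a hypothetical helper, and the `"CPU"` prefix used in the example is an assumption, because the strings returned by acpihp_get_slot_type_name() are not shown in this part of the patch.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Builds "<type_name><uid as at least two lowercase hex digits>" into buf.
 * Returns the length the full name needs, as snprintf() does.
 */
static int format_slot_name(char *buf, size_t len, const char *type_name,
			    unsigned long long uid)
{
	return snprintf(buf, len, "%s%02llx", type_name, uid);
}
```

With these assumptions, instance 10 of type "CPU" is named "CPU0a". Note that snprintf() always NUL-terminates within the size it is given, so passing the full buffer size should be enough here; the `- 1` in the quoted code only leaves the last byte unused.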
> +
> +/*
> + * Generate a meaningful name for the slot according to devices connecting
> + * to this slot
> + */
> +static int __init acpihp_enum_rename_slot(struct acpihp_slot *slot)
> +{
> + u32 child_types = 0;
> +
> + slot->type = acpihp_enum_check_slot_self(slot);
> + if (slot->type == ACPIHP_SLOT_TYPE_UNKNOWN) {
> + acpi_walk_namespace(ACPI_TYPE_DEVICE, slot->handle,
> + ACPI_UINT32_MAX, acpihp_enum_get_dev_type,
> + NULL, NULL, (void **)&child_types);
> + acpi_walk_namespace

Re: [RFC PATCH 01/14] PCI: add pcie_flags into struct pci_dev to cache PCIe capabilities register

2012-07-11 Thread Taku Izumi
On Tue, 10 Jul 2012 23:54:02 +0800
Jiang Liu  wrote:

> From: Yijing Wang 
> 
> From: Yijing Wang 
> 
> Since PCI Express Capabilities Register is read only, cache its value
> into struct pci_dev to avoid repeatedly calling pci_read_config_*().
> 
> Signed-off-by: Yijing Wang 
> Signed-off-by: Jiang Liu 
> ---
>  drivers/pci/probe.c |1 +
>  include/linux/pci.h |1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 6c143b4..65e82e3 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -929,6 +929,7 @@ void set_pcie_port_type(struct pci_dev *pdev)
>   pdev->is_pcie = 1;
>   pdev->pcie_cap = pos;
>   pci_read_config_word(pdev, pos + PCI_EXP_FLAGS, &reg16);
> + pdev->pcie_flags = reg16;
>   pdev->pcie_type = (reg16 & PCI_EXP_FLAGS_TYPE) >> 4;
>   pci_read_config_word(pdev, pos + PCI_EXP_DEVCAP, &reg16);
>   pdev->pcie_mpss = reg16 & PCI_EXP_DEVCAP_PAYLOAD;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 5faa831..f4a7ad6 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -258,6 +258,7 @@ struct pci_dev {
>   u8  pcie_mpss:3;/* PCI-E Max Payload Size Supported */
>   u8  rom_base_reg;   /* which config register controls the 
> ROM */
>   u8  pin;/* which interrupt pin this device uses 
> */
> + u16 pcie_flags; /* cached PCI-E Capabilities Register */

 "xxx_flags" sounds like a set of bit flags. This variable stores the value of
 the PCIe capabilities register, doesn't it? How about "pcie_cap_reg"?

>  
>   struct pci_driver *driver;  /* which driver has allocated this 
> device */
>   u64 dma_mask;   /* Mask of the bits of bus address this
> -- 
> 1.7.9.5
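As an illustration of what caching the register buys, the user-space sketch below decodes a few fields from a cached 16-bit value without any further config-space reads. The mask values follow the PCI_EXP_FLAGS_* definitions in the kernel's pci_regs.h as I remember them; `struct pcie_dev_model` and the three accessor functions are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Field masks of the PCI Express Capabilities Register. */
#define PCI_EXP_FLAGS_VERS	0x000f	/* capability version */
#define PCI_EXP_FLAGS_TYPE	0x00f0	/* device/port type */
#define PCI_EXP_FLAGS_SLOT	0x0100	/* slot implemented */
#define PCI_EXP_TYPE_ROOT_PORT	0x4	/* root port */

/* Minimal model of a device that caches the register once at probe time. */
struct pcie_dev_model {
	uint16_t pcie_flags;	/* cached PCI Express Capabilities Register */
};

/* Capability version, taken from the cached value. */
static unsigned int pcie_cap_version(const struct pcie_dev_model *dev)
{
	return dev->pcie_flags & PCI_EXP_FLAGS_VERS;
}

/* Device/port type, same computation as in set_pcie_port_type(). */
static unsigned int pcie_port_type(const struct pcie_dev_model *dev)
{
	return (dev->pcie_flags & PCI_EXP_FLAGS_TYPE) >> 4;
}

/* Non-zero if the port has a slot implemented. */
static int pcie_slot_implemented(const struct pcie_dev_model *dev)
{
	return (dev->pcie_flags & PCI_EXP_FLAGS_SLOT) != 0;
}
```

For a cached value of 0x0142 the helpers report version 2, a root port (type 4) and an implemented slot. Since the cached value holds several multi-bit fields and not only single-bit flags, a name that says "register" may describe it better.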
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
Taku Izumi 

Re: [PATCH][BUG] Fix the graphic corruption issue on IA64 machines

2007-06-28 Thread izumi
Hi,

As a result of the discussion with Pete Zaitcev ([EMAIL PROTECTED]),
I have re-created the patch. The attached patch is the revised version.

He pointed out that the former patch might violate existing assumptions and
was not safe. Concretely, he was concerned that an unexpected
problem might arise somewhere if "blank_state", which is intended to
reflect the state of the timer, was shuffled around.

This revised patch reflects his comments. I confirmed that it also fixes
the problem.

Regards,
Taku Izumi <[EMAIL PROTECTED]>
Fix the graphic corruption issue on IA64 machines.
The VGA console driver can misunderstand the current mode (text/graphics) under
the "disable console blanking" setting. When console blanking is disabled
(blankinterval=0), do_unblank_screen() returns without changing "blank_state",
and when "blank_state" is "blank_off", do_blank_screen() returns without
invoking sw->con_blank(). That's why the VGA console driver can misunderstand
the current mode.

Signed-off-by: Nobuhiro Tachino <[EMAIL PROTECTED]>
Signed-off-by: Pete Zaitcev <[EMAIL PROTECTED]>
Signed-off-by: Taku Izumi <[EMAIL PROTECTED]>
---
 drivers/char/vt.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)


Index: linux-2.6.22/drivers/char/vt.c
===================================================================
--- linux-2.6.22.org/drivers/char/vt.c  2007-06-27 11:40:03.0 -0400
+++ linux-2.6.22/drivers/char/vt.c  2007-06-27 11:24:32.0 -0400
@@ -3491,9 +3491,6 @@ void do_blank_screen(int entering_gfx)
}
return;
}
-   if (blank_state != blank_normal_wait)
-   return;
-   blank_state = blank_off;
 
/* entering graphics mode? */
if (entering_gfx) {
@@ -3501,10 +3498,15 @@ void do_blank_screen(int entering_gfx)
save_screen(vc);
vc->vc_sw->con_blank(vc, -1, 1);
console_blanked = fg_console + 1;
+   blank_state = blank_off;
set_origin(vc);
return;
}
 
+   if (blank_state != blank_normal_wait)
+   return;
+   blank_state = blank_off;
+
/* don't blank graphics */
if (vc->vc_mode != KD_TEXT) {
console_blanked = fg_console + 1;
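The effect of moving the blank_state check can be modelled in user space. The sketch below is a deliberately reduced model and not the vt.c code: it leaves out the console_blanked handling at the top of do_blank_screen() and the normal blanking path, and `struct console_model`, `do_blank_screen_old()` and `do_blank_screen_new()` are hypothetical names. It only shows the ordering change made by the patch.

```c
/* Same state names as blank_state in drivers/char/vt.c. */
enum blank_state_model { blank_off, blank_normal_wait, blank_vesa_wait };

struct console_model {
	enum blank_state_model blank_state;
	int con_blank_calls;	/* how often sw->con_blank() would be called */
};

/* Ordering before the patch: the state check comes first. */
static void do_blank_screen_old(struct console_model *c, int entering_gfx)
{
	if (c->blank_state != blank_normal_wait)
		return;		/* graphics mode entry is silently skipped */
	c->blank_state = blank_off;

	if (entering_gfx) {
		c->con_blank_calls++;
		return;
	}
	/* normal blanking path omitted in this model */
}

/* Ordering after the patch: entering graphics mode is handled first. */
static void do_blank_screen_new(struct console_model *c, int entering_gfx)
{
	if (entering_gfx) {
		c->con_blank_calls++;
		c->blank_state = blank_off;
		return;
	}

	if (c->blank_state != blank_normal_wait)
		return;
	c->blank_state = blank_off;
	/* normal blanking path omitted in this model */
}
```

With blankinterval=0 the state stays at blank_off, so in this model the old ordering never reaches con_blank() when graphics mode is entered, while the new ordering calls it once.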

