RE: [PATCH v1 1/2] dt-bindings: mmc: sdhci-of-aspeed: Add power-gpio and power-switch-gpio

2021-04-12 Thread Milton Miller II



-"openbmc"  wrote: 
-

>To: Rob Herring 
>From: Steven Lee 
>Sent by: "openbmc" 
>Date: 04/12/2021 08:31PM
>Cc: "open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS"
>, Ulf Hansson ,
>Ryan Chen , "moderated list:ASPEED SD/MMC
>DRIVER" , Andrew Jeffery
>, "open list:ASPEED SD/MMC DRIVER"
>, "moderated list:ASPEED SD/MMC DRIVER"
>, Ryan Chen ,
>Adrian Hunter , open list
>, Chin-Ting Kuo
>, "moderated list:ARM/ASPEED MACHINE
>SUPPORT" 
>Subject: [EXTERNAL] Re: [PATCH v1 1/2] dt-bindings: mmc:
>sdhci-of-aspeed: Add power-gpio and power-switch-gpio
>
>The 04/10/2021 02:41, Rob Herring wrote:
>> On Thu, Apr 08, 2021 at 09:52:17AM +0800, Steven Lee wrote:
>> > AST2600-A2 EVB provides the reference design for enabling SD bus
>power
>> > and toggling SD bus signal voltage by GPIO pins.
>> > Add the definition and example for power-gpio and
>power-switch-gpio
>> > properties.
>> > 
>> > In the reference design, GPIOV0 of AST2600-A2 EVB is connected to
>power
>> > load switch that providing 3.3v to SD1 bus vdd. GPIOV1 is
>connected to
>> > a 1.8v and a 3.3v power load switch that providing signal voltage
>to
>> > SD1 bus.
>> > If GPIOV0 is active high, SD1 bus is enabled. Otherwise, SD1 bus
>is
>> > disabled.
>> > If GPIOV1 is active high, 3.3v power load switch is enabled, SD1
>signal
>> > voltage is 3.3v. Otherwise, 1.8v power load switch will be
>enabled, SD1
>> > signal voltage becomes 1.8v.
>> > 
>> > AST2600-A2 EVB also support toggling signal voltage for SD2 bus.
>> > The design is the same as SD1 bus. It uses GPIOV2 as power-gpio
>and GPIOV3
>> > as power-switch-gpio.
>> > 
>> > Signed-off-by: Steven Lee 
>> > ---
>> >  .../devicetree/bindings/mmc/aspeed,sdhci.yaml | 25
>+++
>> >  1 file changed, 25 insertions(+)
>> > 
>> > diff --git
>a/Documentation/devicetree/bindings/mmc/aspeed,sdhci.yaml
>b/Documentation/devicetree/bindings/mmc/aspeed,sdhci.yaml
>> > index 987b287f3bff..515a74614f3c 100644
>> > --- a/Documentation/devicetree/bindings/mmc/aspeed,sdhci.yaml
>> > +++ b/Documentation/devicetree/bindings/mmc/aspeed,sdhci.yaml
>> > @@ -37,6 +37,14 @@ properties:
>> >clocks:
>> >  maxItems: 1
>> >  description: The SD/SDIO controller clock gate
>> > +  power-gpio:
>> 
>> '-gpios' is the preferred form even if just 1.
>> 
>
>Thanks for reviewing, I will change the name.

Is this a clock gate or a power-on gpio?


>
>> > +description:
>> > +  The GPIO for enabling/disabling SD bus power.
>> > +maxItems: 1
>> 
>> blank line
>> 
>
>I will remove the blank line.
>
>> > +  power-switch-gpio:
>> > +description:
>> > +  The GPIO for toggling the signal voltage between 3.3v and
>1.8v.

Which way does it toggle for which voltage?

Oh, you said in the change log but not in the binding.

But please, use gpio-controlled regulators as Ulf suggested; they are
already used by other mmc controllers upstream.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/
Documentation/devicetree/bindings/regulator/gpio-regulator.yaml

Ulf> Please do not model these as GPIO pins like this. Instead, it's better
Ulf> to model them as gpio regulators, since the mmc core manages them as
Ulf> regulators.
Ulf> 
Ulf> We have a vmmc regulator (corresponding to vdd) and a vqmmc regulator
Ulf> (corresponding the signal-voltage level). These are also described in
Ulf> the common mmc DT bindings, see
Ulf> Documentation/devicetree/bindings/mmc/mmc-controller.yaml
Ulf> .
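
To make that concrete, here is a minimal C sketch of the host-driver side
once vmmc/vqmmc are modelled as (gpio-)regulators.  The example_* functions
are hypothetical and this is not the actual aspeed-sdhci change; the
mmc_regulator_* helpers are the mmc core API Ulf refers to:

/* Hypothetical sketch only: core-managed regulators replace the
 * hand-rolled power-gpio / power-switch-gpio handling. */
#include <linux/err.h>
#include <linux/mmc/host.h>
#include <linux/regulator/consumer.h>

static int example_sdhci_get_regulators(struct mmc_host *mmc)
{
        /* Fills mmc->supply.vmmc and mmc->supply.vqmmc from the DT node. */
        return mmc_regulator_get_supply(mmc);
}

static void example_sdhci_set_power(struct mmc_host *mmc, struct mmc_ios *ios)
{
        /* Bus power (vdd) follows the vmmc regulator ... */
        if (!IS_ERR(mmc->supply.vmmc))
                mmc_regulator_set_ocr(mmc, mmc->supply.vmmc, ios->vdd);
}

static int example_sdhci_switch_voltage(struct mmc_host *mmc,
                                        struct mmc_ios *ios)
{
        /* ... and the 1.8V/3.3V signalling switch follows vqmmc. */
        return mmc_regulator_set_vqmmc(mmc, ios);
}

With that, the device tree only needs the standard vmmc-supply and
vqmmc-supply properties pointing at the gpio regulators.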

milton

>> > +maxItems: 1
>> >  
>> >  patternProperties:
>> >"^sdhci@[0-9a-f]+$":
>> > @@ -61,6 +69,14 @@ patternProperties:
>> >sdhci,auto-cmd12:
>> >  type: boolean
>> >  description: Specifies that controller should use auto
>CMD12
>> > +  power-gpio:
>> > +description:
>> > +  The GPIO for enabling/disabling SD bus power.
>> > +maxItems: 1
>> > +  power-switch-gpio:
>> > +description:
>> > +  The GPIO for toggling the signal voltage between 3.3v
>and 1.8v.
>> > +maxItems: 1
>> >  required:
>> >- compatible
>> >- reg
>> > @@ -80,6 +96,7 @@ required:
>> >  examples:
>> >- |
>> >  #include 
>> > +#include 
>> >  sdc@1e74 {
>> >  compatible = "aspeed,ast2500-sd-controller";
>> >  reg = <0x1e74 0x100>;
>> > @@ -94,6 +111,10 @@ examples:
>> >  interrupts = <26>;
>> >  sdhci,auto-cmd12;
>> >  clocks = < ASPEED_CLK_SDIO>;
>> > +power-gpio = < ASPEED_GPIO(V, 0)
>> > + GPIO_ACTIVE_HIGH>;
>> > +power-switch-gpio = < ASPEED_GPIO(V,
>1)
>> > + GPIO_ACTIVE_HIGH>;
>> >  };
>> >  
>> >  sdhci1: sdhci@200 {
>> > @@ -102,5 +123,9 @@ examples:
>> >  interrupts = <26>;
>> >  sdhci,auto-cmd12;
>> >  clocks = < 

Re: [PATCH v4] hwmon:(adm1275) Enable adm1278 ADM1278_TEMP1_EN

2020-06-19 Thread Milton Miller II
On 06/19/2020, about 12:46PM in some timezone, Manikandan Elumalai wrote:

>The adm1278 temp attribute need it for openbmc platform .
>This feature not enabled by default, so PMON_CONFIG needs to enable
>it.
>
>Signed-off-by: Manikandan Elumalai 
>---
>---v4 -Reported-by: kernel test robot 

The above tag should be adjacent to the Signed-off-by.

>---v3 -fix invalid signed-off.
>---   -removed checkpath warnings.
>---   -write ADM1278_TEMP1_EN and ADM1278_VOUT_EN conf in single
>line operation.
>---v2 -add Signed-off-by.
>---   -removed ADM1278_TEMP1_EN check.
>---

The canonical format is to have a line of 3 dashes, then the trailing changelog.

Please read the documentation at 

https://www.kernel.org/doc/html/latest/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#the-canonical-patch-format
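
For illustration only, a mocked-up layout (not your actual patch) looks
roughly like this:

  Subject: [PATCH v4] hwmon:(adm1275) Enable adm1278 ADM1278_TEMP1_EN

  <description of the change>

  Reported-by: kernel test robot <...>
  Signed-off-by: Manikandan Elumalai <...>
  ---
  v4: move the Reported-by tag next to the Signed-off-by
  v3: fix invalid signed-off, remove checkpatch warnings, write
      ADM1278_TEMP1_EN and ADM1278_VOUT_EN in a single operation
  v2: add Signed-off-by, remove ADM1278_TEMP1_EN check

  <diffstat and diff follow here>

Everything above the "---" line goes into the git history; the per-version
notes below it are dropped when the patch is applied.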

milton



Re: [PATCH v2 2/8] [PATCH 2/8] Documentations: dt-bindings: Add a document of PECI adapter driver for Aspeed AST24xx/25xx SoCs

2018-03-09 Thread Milton Miller II
About  03/07/2018 04:12PM in some time zone, Pavel Machek wrote:
>Subject: Re: [PATCH v2 2/8] [PATCH 2/8] Documentations: dt-bindings:
>Add a document of PECI adapter driver for Aspeed AST24xx/25xx SoCs
>
>Hi!
>
>> >Are these SoCs x86-based?
>> 
>> Yes, these are ARM SoCs. Please see Andrew's answer as well.
>
>Understood, thanks.
>
>> >>+  Read sampling point selection. The whole period of a bit time
>will be
>> >>+  divided into 16 time frames. This value will determine which
>time frame
>> >>+  this controller will sample PECI signal for data read back.
>Usually in
>> >>+  the middle of a bit time is the best.
>> >
>> >English? "This value will determine when this controller"?
>> >
>> 
>> Could I change it like below?:
>> 
>> "This value will determine in which time frame this controller
>samples PECI
>> signal for data read back"
>
>I guess... I'm not native speaker, I guess this could be improved
>some
>more.
>

I agree this wording is still confusing. 

The problem is that the key subject, the time of the sampling, is in the 
descriptive clause "in which time frame".

"This value will determine the time frame in which the controller will sample"

or perhaps phrase it as saving a specific sample from the over-clock, or a 
phase of the clock.

>Best regards,
>   Pavel
>
>-- 
>(english) http://www.livejournal.com/~pavelmachek
>(cesky, pictures)
>http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
>

milton
--
Speaking for myself not IBM.



Re: [PATCH v2 01/31] af9005: don't do DMA on stack

2016-10-12 Thread Milton Miller II
On Tue, 11 Oct 2016 07:09:16 -0300, Mauro wrote:
>  struct af9005_device_state {
> 
>   u8 sequence;
> 
>   int led_state;
> 
> + unsigned char data[256];
> 
> + struct mutex data_mutex;
> 
>  };


This will not work on DMA incoherent architectures.  When the data 
cache is invalidated at the end of the DMA, the stores to the mutex 
will be thrown away and the system will not be happy, as changes to 
the mutex are lost.

The data allocation needs to be a separate kmalloc, as is
somewhat obtusely documented around line 226 of DMA-API.txt[1].

A separate kmalloc will be aligned to be in separate cache lines.

Excerpt from DMA-API.txt:
 Warnings:  Memory coherency operates at a granularity called the cache line 
width.  In order for memory mapped by this API to operate correctly, the mapped 
region must begin exactly on a cache line boundary and end exactly on one (to 
prevent two separately mapped regions from sharing a single cache line).  Since 
the cache line size may not be known at compile time, the API will not enforce 
this requirement.  Therefore, it is recommended that driver writers who don't 
take special care to determine the cache line size at run time only map virtual 
regions that begin and end on page boundaries (which are guaranteed also to be 
cache line boundaries). 

[1]https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/DMA-API.txt#n226
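
Something along these lines would do it (a sketch of the suggestion only;
af9005_state_init is a made-up helper, not code from the driver):

#include <linux/errno.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/types.h>

struct af9005_device_state {
        u8 sequence;
        int led_state;
        struct mutex data_mutex;
        unsigned char *data;    /* separate allocation used for DMA */
};

static int af9005_state_init(struct af9005_device_state *st)
{
        mutex_init(&st->data_mutex);
        /* A separate kmalloc puts the buffer in its own cache line(s),
         * so invalidating it after DMA cannot clobber the mutex. */
        st->data = kmalloc(256, GFP_KERNEL);
        if (!st->data)
                return -ENOMEM;
        return 0;
}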

milton

PS: I normally read lkml from various web archives, sorry for
loss of threading and cc.



Re: [RFC] netconsole.txt: "nc" needs "-p" to specify the listening port

2012-07-29 Thread Milton Miller
[adding Rob as Doc maintainer]

On Sat, 28 Jul 2012 about 11:08:16 -, Dirk Gouders wrote:
> Borislav Petkov  writes:
> 
> > On Fri, Jul 27, 2012 at 11:24:53AM +0200, Dirk Gouders wrote:
> >> Cong Wang  writes:
> >> 
> >> > On Fri, Jul 27, 2012 at 2:35 PM, Dirk Gouders
> >> >  wrote:
> >> >> Hi Jesse,
> >> >>
> >> >> I would like to ask you to check if the documentation of "nc" in
> >> >> netconsole.txt is still correct.  I tried two different netcat packages
> >> >> and both require "-p" to specify the listening port.  I am wondering if
> >> >> that changed after the use of "nc" has been documented.
> >> >
> >> > On Fedora 16, `nc -u -l <port number>` works fine.
> >> 
> >> Thanks for checking that.
> >> 
> >> If the information I found is correct, Fedora uses OpenBSD's nc
> >> codebase.  The two netcat packages I tested on a Gentoo system differ in
> >> requiring the -p switch for the port specification.
> >
> > So say exactly that in the doc: that the *BSD's version of nc doesn't
> > need the port number specified with '-p' and you're covered.
> OK, I tried that in the attached patch.
> I'm not sure if every exeption needs to/should be documented, though.
> 
> >From 3cdeac3e814471053129145c5fa8391acb365fd8 Mon Sep 17 00:00:00 2001
> From: Dirk Gouders 
> Date: Sat, 28 Jul 2012 12:32:49 +0200
> Subject: [PATCH] netconsole.txt: non-BSD versions of nc(1) require '-p'
>  switch
> 
> Gentoo for example uses non-BSD versions of nc(1) which require
> the '-p' switch to specify the listening port.
> 
> ---
> Documentation/networking/netconsole.txt |3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/Documentation/networking/netconsole.txt 
> b/Documentation/networking/netconsole.txt
> index 8d02207..9a362f8 100644
> --- a/Documentation/networking/netconsole.txt
> +++ b/Documentation/networking/netconsole.txt
> @@ -52,7 +52,8 @@ initialized and attempts to bring up the supplied dev at 
> the supplied
>  address.
>  
>  The remote host can run either 'netcat -u -l -p <port>',

So the above line shows usage with -p

> -'nc -l -u <port>' or syslogd.
> +'nc -l -u <port>' (BSD version of nc(1) e.g. Fedora),

Now you add a comment about BSD and say Fedora, which is not obviously
BSD (this is Documentation; reading the git history for clarification
is not appropriate).

> +'nc -l -u -p <port>' or syslogd.

And now you add the original -p which you probably skipped over
since it was on the previous line?


>  
>  Dynamic reconfiguration:
>  

milton
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: PS3: trouble with SPARSEMEM_VMEMMAP and kexec

2007-12-03 Thread Milton Miller

On Dec 2, 2007, at 9:59 PM, Geoff Levand wrote:


Hi.

I'm finding that recently kexec'ed kernels on PS3 will
panic on startup.  It seems the trouble was introduced
with the ppc64 SPARSEMEM_VMEMMAP support.  The problem
is the same when starting either new or old kernels:

2.6.24 -> 2.6.23 ok
2.6.24 -> 2.6.23 panic
2.6.24 -> 2.6.24 panic


I'm not sure I completely follow this.  What is the difference between 
1 and 2?  Also, you are talking about starting with kexec, but I 
don't see how that fits with the failure you have below.  In other words, 
there may be more than one failure.  But I can talk a bit about the 
scope of the problem in the current traceback.




These are the commits that seem to introduce the problem:

  d29eff7bca60c9ee401d691d4562a4abca8de543 ppc64: SPARSEMEM_VMEMMAP 
suppor
  8f6aac419bd590f535fb110875a51f7db2b62b5b Generic Virtual Memmap 
support for SPARSEMEM



Below is a startup dump.  Any help in finding the problem
would be appreciated.

-Geoff



ps3_mm_add_memory:317: start_addr 74032000h, start_pfn 74032h, 
nr_pages 17000h

<4>swapper: page allocation failure. order:12, mode:0x80d0
Call Trace:
[c6047820] [c000e700] .show_stack+0x68/0x1b0 
(unreliable)

[c60478c0] [c0089eb4] .__alloc_pages+0x358/0x3ac
[c60479b0] [c00a3964] .vmemmap_alloc_block+0x6c/0xf4
[c6047a40] [c0026544] .vmemmap_populate+0x74/0x100
[c6047ae0] [c00a385c] 
.sparse_mem_map_populate+0x38/0x5c
[c6047b70] [c00a36e4] 
.sparse_add_one_section+0x64/0x128

[c6047c20] [c00aa74c] .__add_pages+0xac/0x18c
[c6047cd0] [c0025fd4] .arch_add_memory+0x44/0x60
[c6047d60] [c00aa5b0] .add_memory+0xd4/0x124
[c6047e00] [c0452544] .ps3_mm_add_memory+0x8c/0x108
[c6047ea0] [c04417c4] .kernel_init+0x1f4/0x3b8
[c6047f90] [c0021d88] .kernel_thread+0x4c/0x68
Mem-info:
DMA per-cpu:
CPU0: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   
3 usd:   0
CPU1: Hot: hi:   42, btch:   7 usd:   0   Cold: hi:   14, btch:   
3 usd:   0

Active:0 inactive:0 dirty:0 writeback:0 unstable:0
 free:18094 slab:122 mapped:0 pagetables:0 bounce:0
DMA free:72376kB min:0kB low:0kB high:0kB active:0kB inactive:0kB 
present:129280kB pages_scanned:0 all_unreclaimable? no

lowmem_reserve[]: 0 0 0
DMA: 8*4kB 5*8kB 5*16kB 7*32kB 3*64kB 5*128kB 4*256kB 3*512kB 5*1024kB 
3*2048kB 4*4096kB 5*8192kB 0*16384kB = 72376kB

Swap cache: add 0, delete 0, find 0/0, race 0+0
Free swap  = 0kB
Total swap = 0kB
Free swap:0kB
32768 pages of RAM
10403 reserved pages
0 pages shared
0 pages swap cached


The kernel is using 16MB pages for the linear mapping and, since it's in 
the same region, the sparse virtual memmap.  PS3 uses hotplug for almost 
all of its memory.  In this case, it's trying to allocate an additional 
page to cover a new region of the memory map; the failing order:12 
allocation is 4096 4kB pages, i.e. a single 16MB chunk.  However, the 
initial 128 MB is fragmented: we have 8MB chunks but no 16MB ones (note 
the 0*16384kB in the buddy dump above).


<1>Unable to handle kernel paging request for data at address 
0xcf0001960b10

<1>Faulting instruction address: 0xc0087340
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2 PS3
Modules linked in:
NIP: c0087340 LR: c008733c CTR: 
REGS: c6047900 TRAP: 0300   Not tainted  
(2.6.24-rc3-ps3-linux-dev-g91428d55-dirty)

MSR: 80008032   CR: 2200  XER: 
DAR: cf0001960b10, DSISR: 4200
TASK = c6041080[1] 'swapper' THREAD: c6044000 CPU: 1
<6>GPR00:  c6047b80 c052b410 
c6001b40
<6>GPR04: 0001 0003 0008 

<6>GPR08: 0002 cf0001960b08 c6051240 
0003
<6>GPR12: 0003 c0484080 100d 
00bc5000
<6>GPR16: 07fff000 0001 100a 
100d
<6>GPR20:  100df628 100df458 
100df678
<6>GPR24: 00740336 c0492c00  
0001
<6>GPR28: 000740325000 000740324924 c04ce9a8 
cf0001960ae0

NIP [c0087340] .memmap_init_zone+0xf0/0x134
LR [c008733c] .memmap_init_zone+0xec/0x134
Call Trace:
[c6047b80] [c01da530] .add_memory_block+0xd8/0x108 
(unreliable)

[c6047c20] [c00aa7ac] .__add_pages+0x10c/0x18c
[c6047cd0] [c0025fd4] .arch_add_memory+0x44/0x60
[c6047d60] [c00aa5b0] .add_memory+0xd4/0x124
[c6047e00] [c0452544] .ps3_mm_add_memory+0x8c/0x108
[c6047ea0] [c04417c4] .kernel_init+0x1f4/0x3b8
[c6047f90] [c0021d88] .kernel_thread+0x4c/0x68
Instruction dump:
901f000c 38000400 7d20f8a8 7d290378 7d20f9ad 40a2fff4 7ba00521 7fe3fb78
3882 41820008 4b0d 393f0028  f93f0028 3bbd0001 
3bff0038

<0>Kernel panic - not 


[PATCH] Fix sched_domain sysctl registration again

2007-10-20 Thread Milton Miller
commit 029190c515f15f512ac85de8fc686d4dbd0ae731 (cpuset
sched_load_balance flag) was not tested with SCHED_DEBUG enabled: as
committed it dereferences NULL when used, and it reordered the sysctl
registration so that it never shows any domains or their tunables.

Fixes:

1) restore arch_init_sched_domains ordering
we can't walk the domains before we build them

presently we register cpus with empty directories (no domain
directories or files).

2) make unregister_sched_domain_sysctl do nothing when already unregistered
detach_destroy_domains is now called one set of cpus at a time
unregister_sysctl_table dereferences NULL if called with a null.

While the function would always dereference null if called
twice, in the previous code it was always called once and then
was followed by a register.  So only the hidden bug of the
sysctl_root_table not being allocated followed by an attempt to
free it would have shown the error.

3) always call unregister and register in partition_sched_domains
The code is "smart" about unregistering only needed domains.
Since we aren't guaranteed any calls to unregister, always 
unregister.   Without calling register on the way out we
will not have a table or any sysctl tree.

4) warn if register is called without unregistering
The previous table memory is lost, leaving pointers to the
later freed memory in sysctl and leaking the memory of the
tables.


Signed-off-by: Milton Miller <[EMAIL PROTECTED]>
--- 
[resend with From set]
against e8b8c977734193adedf2b0f607d6252c78e86394

Before this patch on a 2-core 4-thread box compiled for SMT and NUMA,
the domains appear empty (there are actually 3 levels per cpu).  And as
soon as a second set of domains is detached, a null pointer is dereferenced
(the "(unreliable)" entry in the trace below is stack garbage):

bu19a:~# ls -R /proc/sys/kernel/sched_domain/
/proc/sys/kernel/sched_domain/:
cpu0  cpu1  cpu2  cpu3

/proc/sys/kernel/sched_domain/cpu0:

/proc/sys/kernel/sched_domain/cpu1:

/proc/sys/kernel/sched_domain/cpu2:

/proc/sys/kernel/sched_domain/cpu3:

bu19a:~# mkdir /dev/cpuset
bu19a:~# mount -tcpuset cpuset /dev/cpuset/
bu19a:~# cd /dev/cpuset/
bu19a:/dev/cpuset# echo 0 > sched_load_balance 
bu19a:/dev/cpuset# mkdir one
bu19a:/dev/cpuset# echo 1 > one/cpus   
bu19a:/dev/cpuset# echo 0 > one/sched_load_balance 
Unable to handle kernel paging request for data at address 0x0018
Faulting instruction address: 0xc006b608
NIP: c006b608 LR: c006b604 CTR: 
REGS: c00018d973f0 TRAP: 0300   Not tainted  (2.6.23-bml)
MSR: 90009032   CR: 28242442  XER: 
DAR: 0018, DSISR: 4000
TASK = c0001912e340[1987] 'bash' THREAD: c00018d94000 CPU: 2
..
NIP [c006b608] .unregister_sysctl_table+0x38/0x110
LR [c006b604] .unregister_sysctl_table+0x34/0x110
Call Trace:
[c00018d97670] [c7017270] 0xc7017270 (unreliable)
[c00018d97720] [c0058710] .detach_destroy_domains+0x30/0xb0
[c00018d977b0] [c005cf1c] .partition_sched_domains+0x1bc/0x230
[c00018d97870] [c009fdc4] .rebuild_sched_domains+0xb4/0x4c0
[c00018d97970] [c00a02e8] .update_flag+0x118/0x170
[c00018d97a80] [c00a1768] .cpuset_common_file_write+0x568/0x820
[c00018d97c00] [c009d95c] .cgroup_file_write+0x7c/0x180
[c00018d97cf0] [c00e76b8] .vfs_write+0xe8/0x1b0
[c00018d97d90] [c00e810c] .sys_write+0x4c/0x90
[c00018d97e30] [c000852c] syscall_exit+0x0/0x40



diff --git a/kernel/sched.c b/kernel/sched.c
index afe76ec..2af50ec 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5463,11 +5463,12 @@ static void register_sched_domain_sysctl(void)
struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1);
char buf[32];
 
+   WARN_ON(sd_ctl_dir[0].child);
+   sd_ctl_dir[0].child = entry;
+
if (entry == NULL)
return;
 
-   sd_ctl_dir[0].child = entry;
-
for_each_online_cpu(i) {
snprintf(buf, 32, "cpu%d", i);
entry->procname = kstrdup(buf, GFP_KERNEL);
@@ -5475,14 +5476,19 @@ static void register_sched_domain_sysctl(void)
entry->child = sd_alloc_ctl_cpu_table(i);
entry++;
}
+
+   WARN_ON(sd_sysctl_header);
sd_sysctl_header = register_sysctl_table(sd_ctl_root);
 }
 
+/* may be called multiple times per register */
 static void unregister_sched_domain_sysctl(void)
 {
-   unregister_sysctl_table(sd_sysctl_header);
+   if (sd_sysctl_header)
+   unregister_sysctl_table(sd_sysctl_header);
sd_sysctl_header = NULL;
-   sd_free_ctl_entry(&sd_ctl_dir[0].child);
+   if (sd_ctl_dir[0].child)
+   sd_free_ctl_entry(&sd_ctl_dir[0].child);
 }
 #else
 static void register_sched

[PATCH] restore arch/ppc/boot cflags

2007-10-20 Thread Milton Miller
Commit 9a39e273d4df0560c724c5fe71f6314a0583ca2b removed the boot directory
addition to CFLAGS that was being used by the subdirectory builds.  For the
other files, that patch set EXTRA_CFLAGS, but Makefile.build explicitly
sets that to empty, as it is meant for a single directory only.
Append to KBUILD_CFLAGS instead.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>
---
The commit also changed xtensa to export EXTRA_CFLAGS from its boot
directory, that needs to be fixed too.

from ARCH=ppc prep_defconfig:

/data/home/miltonm/work.git/arch/ppc/boot/of1275/write.c:11:20: error: 
of1275.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/read.c:11:20:/data/home/miltonm/work.git/arch/ppc/boot/of1275/ofstdio.c:11:20:
  error: error: of1275.h: No such file or directoryof1275.h: No such file or 
directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/release.c:11:20: error: 
of1275.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/ofinit.c:11:20: error: 
of1275.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/map.c:12:22: error: 
nonstdio.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/map.c:11:20: error: of1275.h: 
No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/getprop.c:11:20: error: 
of1275.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/common/ns16550.c:14:20: error: 
serial.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/common/ns16550.c:13:22: error: 
nonstdio.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/enter.c:11:20: error: 
of1275.h: No such file or directory
error: of1275.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/claim.c:12:22: error: 
nonstdio.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/claim.c:11:20: error: 
of1275.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/of1275/finddevice.c:11:20: error: 
of1275.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/common/bootinfo.c:14:22: error: 
nonstdio.h: No such file or directory
/data/home/miltonm/work.git/arch/ppc/boot/common/misc-common.c:18:22: error: 
nonstdio.h: No such file or directory
of1275.h: No such file or directory


diff --git a/arch/ppc/boot/Makefile b/arch/ppc/boot/Makefile
index 487dc66..52a0d38 100644
--- a/arch/ppc/boot/Makefile
+++ b/arch/ppc/boot/Makefile
@@ -13,6 +13,7 @@
 # modified by Cort ([EMAIL PROTECTED])
 #
 
+KBUILD_CFLAGS  += -fno-builtin -D__BOOTER__ -Iarch/$(ARCH)/boot/include
 HOSTCFLAGS += -Iarch/$(ARCH)/boot/include
 
 BOOT_TARGETS   = zImage zImage.initrd znetboot znetboot.initrd
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched domain sysctl: free kstrdup allocations

2007-10-15 Thread Milton Miller
The procnames for the cpu and domain were allocated via kstrdup and so
should also be freed.   The names for the files are static, but we
can differentiate them by the presence of the proc_handler.  If a
kstrdup (of < 32 characters) fails the sysctl code will not see the
procname or remaining table entries, but any child tables and names
will be reclaimed upon free.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>
--- 

Hi Ingo.

It occurred to me this morning that the procname field was dynamically
allocated and needed to be freed.  I started to put in break statements
when allocation failed but it was approaching 50% error handling code.

I came up with this alternative of looping while entry->mode is set and
checking proc_handler instead of ->table.  Alternatively, the string
version of the domain name and cpu number could be stored in the structs.

I verified by compiling with CONFIG_DEBUG_SLAB and checking the allocation
counts after taking a cpuset exclusive and back.

milton

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-15 12:21:38.0 -0500
+++ kernel/kernel/sched.c   2007-10-15 12:22:12.0 -0500
@@ -5290,11 +5290,20 @@ static struct ctl_table *sd_alloc_ctl_en
 
 static void sd_free_ctl_entry(struct ctl_table **tablep)
 {
-   struct ctl_table *entry = *tablep;
+   struct ctl_table *entry;
 
-   for (entry = *tablep; entry->procname; entry++)
+   /*
+* In the intermediate directories, both the child directory and
+* procname are dynamically allocated and could fail but the mode
+* will always be set.  In the lowest directory the names are
+* static strings and all have proc handlers.
+*/
+   for (entry = *tablep; entry->mode; entry++) {
if (entry->child)
sd_free_ctl_entry(&entry->child);
+   if (entry->proc_handler == NULL)
+   kfree(entry->procname);
+   }
 
kfree(*tablep);
*tablep = NULL;
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



[PATCH] sched domain debug sysctl fixes

2007-10-12 Thread Milton Miller

I forgot to put the sequence numbers in the patch subjects (I did
in the message ids).

Here is the order I diffed them:

sched domain sysctl: register online cpus
sched domain sysctl: free and rebuild when changing domains
sched.c: use kcalloc
sched domain sysctl: don't bug on alloc failure
sched domain sysctl: sysctl tables can have no holes

milton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched domain sysctl: don't bug on alloc failure

2007-10-12 Thread Milton Miller
Now that we are calling this at runtime, a more relaxed error path is
suggested.  If an allocation fails, we just register the partial table,
which will show empty directories.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-12 03:59:02.0 -0500
+++ kernel/kernel/sched.c   2007-10-12 03:59:07.0 -0500
@@ -5285,8 +5285,6 @@ static struct ctl_table *sd_alloc_ctl_en
struct ctl_table *entry =
kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL);
 
-   BUG_ON(!entry);
-
return entry;
 }
 
@@ -5319,6 +5317,9 @@ sd_alloc_ctl_domain_table(struct sched_d
 {
struct ctl_table *table = sd_alloc_ctl_entry(14);
 
+   if (table == NULL)
+   return NULL;
+
set_table_entry(&table[0], "min_interval", &sd->min_interval,
sizeof(long), 0644, proc_doulongvec_minmax);
set_table_entry(&table[1], "max_interval", &sd->max_interval,
@@ -5356,6 +5357,8 @@ static ctl_table *sd_alloc_ctl_cpu_table
for_each_domain(cpu, sd)
domain_num++;
entry = table = sd_alloc_ctl_entry(domain_num + 1);
+   if (table == NULL)
+   return NULL;
 
i = 0;
for_each_domain(cpu, sd) {
@@ -5376,6 +5379,9 @@ static void register_sched_domain_sysctl
struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1);
char buf[32];
 
+   if (entry == NULL)
+   return;
+
sd_ctl_dir[0].child = entry;
 
for_each_online_cpu(i) {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched domain sysctl: sysctl tables can have no holes

2007-10-12 Thread Milton Miller
The register_sysctl_table stops on the first entry without a
procname and ctl_name, so we need to fill in consecutive entries.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-12 03:59:07.0 -0500
+++ kernel/kernel/sched.c   2007-10-12 03:59:16.0 -0500
@@ -5315,7 +5315,7 @@ set_table_entry(struct ctl_table *entry,
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
-   struct ctl_table *table = sd_alloc_ctl_entry(14);
+   struct ctl_table *table = sd_alloc_ctl_entry(12);
 
if (table == NULL)
return NULL;
@@ -5338,11 +5338,12 @@ sd_alloc_ctl_domain_table(struct sched_d
sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct,
sizeof(int), 0644, proc_dointvec_minmax);
-   set_table_entry(&table[10], "cache_nice_tries",
+   set_table_entry(&table[9], "cache_nice_tries",
&sd->cache_nice_tries,
sizeof(int), 0644, proc_dointvec_minmax);
-   set_table_entry(&table[12], "flags", &sd->flags,
+   set_table_entry(&table[10], "flags", &sd->flags,
sizeof(int), 0644, proc_dointvec_minmax);
+   /* &table[11] is terminator */
 
return table;
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched domain sysctl: register online cpus

2007-10-12 Thread Milton Miller
init_sched_domain_sysctl was walking cpus 0-n and referencing per_cpu
variables.  If the cpus_possible mask is not contiguous this will result
in a crash referencing unallocated data.  If the online mask is not
contiguous then we would show offline cpus and miss online ones.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-11 23:48:10.0 -0500
+++ kernel/kernel/sched.c   2007-10-11 23:55:51.0 -0500
@@ -5367,11 +5367,12 @@ static void init_sched_domain_sysctl(voi
 
sd_ctl_dir[0].child = entry;
 
-   for (i = 0; i < cpu_num; i++, entry++) {
+   for_each_online_cpu(i) {
snprintf(buf, 32, "cpu%d", i);
entry->procname = kstrdup(buf, GFP_KERNEL);
entry->mode = 0555;
entry->child = sd_alloc_ctl_cpu_table(i);
+   entry++;
}
sd_sysctl_header = register_sysctl_table(sd_ctl_root);
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched.c: use kcalloc

2007-10-12 Thread Milton Miller
kcalloc checks for n * sizeof(element) overflows and it zeros.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>


Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-12 03:58:35.0 -0500
+++ kernel/kernel/sched.c   2007-10-12 03:58:38.0 -0500
@@ -5283,10 +5283,9 @@ static struct ctl_table sd_ctl_root[] = 
 static struct ctl_table *sd_alloc_ctl_entry(int n)
 {
struct ctl_table *entry =
-   kmalloc(n * sizeof(struct ctl_table), GFP_KERNEL);
+   kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL);
 
BUG_ON(!entry);
-   memset(entry, 0, n * sizeof(struct ctl_table));
 
return entry;
 }
@@ -6080,7 +6079,7 @@ static int build_sched_domains(const cpu
/*
 * Allocate the per-node list of sched groups
 */
-   sched_group_nodes = kzalloc(sizeof(struct sched_group *)*MAX_NUMNODES,
+   sched_group_nodes = kcalloc(MAX_NUMNODES, sizeof(struct sched_group *),
   GFP_KERNEL);
if (!sched_group_nodes) {
printk(KERN_WARNING "Can not alloc sched group node list\n");
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched domain sysctl: free and rebuild when changing domains

2007-10-12 Thread Milton Miller
Unregister and free the sysctl table before destroying domains, then
rebuild and register after creating the new domains.  This prevents the
sysctl table from pointing to freed memory for root to write.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-12 03:55:19.0 -0500
+++ kernel/kernel/sched.c   2007-10-12 03:58:35.0 -0500
@@ -5291,6 +5291,18 @@ static struct ctl_table *sd_alloc_ctl_en
return entry;
 }
 
+static void sd_free_ctl_entry(struct ctl_table **tablep)
+{
+   struct ctl_table *entry = *tablep;
+
+   for (entry = *tablep; entry->procname; entry++)
+   if (entry->child)
+   sd_free_ctl_entry(&entry->child);
+
+   kfree(*tablep);
+   *tablep = NULL;
+}
+
 static void
 set_table_entry(struct ctl_table *entry,
const char *procname, void *data, int maxlen,
@@ -5359,7 +5371,7 @@ static ctl_table *sd_alloc_ctl_cpu_table
 }
 
 static struct ctl_table_header *sd_sysctl_header;
-static void init_sched_domain_sysctl(void)
+static void register_sched_domain_sysctl(void)
 {
int i, cpu_num = num_online_cpus();
struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1);
@@ -5376,8 +5388,18 @@ static void init_sched_domain_sysctl(voi
}
sd_sysctl_header = register_sysctl_table(sd_ctl_root);
 }
+
+static void unregister_sched_domain_sysctl(void)
+{
+   unregister_sysctl_table(sd_sysctl_header);
+   sd_sysctl_header = NULL;
+   sd_free_ctl_entry(&sd_ctl_dir[0].child);
+}
 #else
-static void init_sched_domain_sysctl(void)
+static void register_sched_domain_sysctl(void)
+{
+}
+static void unregister_sched_domain_sysctl(void)
 {
 }
 #endif
@@ -6311,6 +6333,8 @@ static int arch_init_sched_domains(const
 
err = build_sched_domains(&cpu_default_map);
 
+   register_sched_domain_sysctl();
+
return err;
 }
 
@@ -6327,6 +6351,8 @@ static void detach_destroy_domains(const
 {
int i;
 
+   unregister_sched_domain_sysctl();
+
for_each_cpu_mask(i, *cpu_map)
cpu_attach_domain(NULL, i);
synchronize_sched();
@@ -6357,6 +6383,8 @@ int partition_sched_domains(cpumask_t *p
if (!err && !cpus_empty(*partition2))
err = build_sched_domains(partition2);
 
+   register_sched_domain_sysctl();
+
return err;
 }
 
@@ -6488,8 +6516,6 @@ void __init sched_init_smp(void)
/* XXX: Theoretical race here - CPU may be hotplugged now */
hotcpu_notifier(update_sched_domains, 0);
 
-   init_sched_domain_sysctl();
-
/* Move init over to a non-isolated CPU */
if (set_cpus_allowed(current, non_isolated_cpus) < 0)
BUG();
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched domain debug sysctl fixes

2007-10-12 Thread Milton Miller
The following 5 patches provide fixes for the sysctl code added under
CONFIG_SCHED_DEBUG.  After debugging a boot failure caused by dereferencing
unallocated per_cpu data for an impossible cpu, I found the other issues by
inspection.

Following these fixes, /proc/sys/kernel/sched_domain/ should work with
exclusive cpusets, cpu hotplug, offline and not present cpus, and will
show all 11 files.

milton
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] sched domain sysctl: sysctl tables can have no holes

2007-10-12 Thread Milton Miller
The register_sysctl_table stops on the first entry without a
procname and ctl_name, so we need to fill in consecutive entries.

Signed-off-by: Milton Miller [EMAIL PROTECTED]

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-12 03:59:07.0 -0500
+++ kernel/kernel/sched.c   2007-10-12 03:59:16.0 -0500
@@ -5315,7 +5315,7 @@ set_table_entry(struct ctl_table *entry,
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
-   struct ctl_table *table = sd_alloc_ctl_entry(14);
+   struct ctl_table *table = sd_alloc_ctl_entry(12);
 
if (table == NULL)
return NULL;
@@ -5338,11 +5338,12 @@ sd_alloc_ctl_domain_table(struct sched_d
sizeof(int), 0644, proc_dointvec_minmax);
set_table_entry(table[8], imbalance_pct, sd-imbalance_pct,
sizeof(int), 0644, proc_dointvec_minmax);
-   set_table_entry(table[10], cache_nice_tries,
+   set_table_entry(table[9], cache_nice_tries,
sd-cache_nice_tries,
sizeof(int), 0644, proc_dointvec_minmax);
-   set_table_entry(table[12], flags, sd-flags,
+   set_table_entry(table[10], flags, sd-flags,
sizeof(int), 0644, proc_dointvec_minmax);
+   /* table[11] is terminator */
 
return table;
 }


[PATCH] sched domain sysctl: register online cpus

2007-10-12 Thread Milton Miller
init_sched_domain_sysctl was walking cpus 0-n and referencing per_cpu
variables.  If the cpus_possible mask is not contiguous this will result
in a crash referencing unallocated data.  If the online mask is not
contiguous then we would show offline cpus and miss online ones.
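
A userspace sketch of the difference (illustrative only, not the kernel's
cpumask API): with a sparse online mask, indexing 0..num_online-1 touches
offline slots and misses online ones, while walking the mask itself
visits exactly the online cpus.

#include <stdio.h>

#define NR_CPUS 8

int main(void)
{
	/* Sparse online mask: cpus 0, 2 and 5 are online, the rest are not. */
	int online[NR_CPUS] = { 1, 0, 1, 0, 0, 1, 0, 0 };
	int num_online = 3, i;

	/* Broken: walking 0..num_online-1 touches cpu 1 (offline) and misses cpu 5. */
	for (i = 0; i < num_online; i++)
		printf("broken loop visits cpu%d (online=%d)\n", i, online[i]);

	/* Analogue of for_each_online_cpu(): visit exactly the online cpus. */
	for (i = 0; i < NR_CPUS; i++)
		if (online[i])
			printf("mask walk visits cpu%d\n", i);
	return 0;
}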

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-11 23:48:10.0 -0500
+++ kernel/kernel/sched.c   2007-10-11 23:55:51.0 -0500
@@ -5367,11 +5367,12 @@ static void init_sched_domain_sysctl(voi
 
sd_ctl_dir[0].child = entry;
 
-   for (i = 0; i < cpu_num; i++, entry++) {
+   for_each_online_cpu(i) {
snprintf(buf, 32, "cpu%d", i);
entry->procname = kstrdup(buf, GFP_KERNEL);
entry->mode = 0555;
entry->child = sd_alloc_ctl_cpu_table(i);
+   entry++;
}
sd_sysctl_header = register_sysctl_table(sd_ctl_root);
 }


[PATCH] sched.c: use kcalloc

2007-10-12 Thread Milton Miller
kcalloc checks for n * sizeof(element) overflows and it zeros.
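
For reference, a userspace analogue of the two properties the changelog
relies on (illustrative only; calloc stands in for kcalloc): the n * size
multiplication is overflow-checked and the memory comes back zeroed, so
the open-coded allocate-then-memset pattern can be dropped.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct item { long a, b; };

int main(void)
{
	size_t n = 4;

	/* Open-coded form: the multiplication can overflow silently, and the
	 * caller must remember the memset (the pattern the patch removes). */
	struct item *v1 = malloc(n * sizeof(*v1));
	if (v1)
		memset(v1, 0, n * sizeof(*v1));

	/* calloc (like kcalloc in the kernel) checks n * size for overflow
	 * and returns zeroed memory in one call. */
	struct item *v2 = calloc(n, sizeof(*v2));

	printf("v1=%p v2=%p\n", (void *)v1, (void *)v2);
	free(v1);
	free(v2);
	return 0;
}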

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>


Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-12 03:58:35.0 -0500
+++ kernel/kernel/sched.c   2007-10-12 03:58:38.0 -0500
@@ -5283,10 +5283,9 @@ static struct ctl_table sd_ctl_root[] = 
 static struct ctl_table *sd_alloc_ctl_entry(int n)
 {
struct ctl_table *entry =
-   kmalloc(n * sizeof(struct ctl_table), GFP_KERNEL);
+   kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL);
 
BUG_ON(!entry);
-   memset(entry, 0, n * sizeof(struct ctl_table));
 
return entry;
 }
@@ -6080,7 +6079,7 @@ static int build_sched_domains(const cpu
/*
 * Allocate the per-node list of sched groups
 */
-   sched_group_nodes = kzalloc(sizeof(struct sched_group *)*MAX_NUMNODES,
+   sched_group_nodes = kcalloc(MAX_NUMNODES, sizeof(struct sched_group *),
   GFP_KERNEL);
if (!sched_group_nodes) {
printk(KERN_WARNING "Can not alloc sched group node list\n");


[PATCH] sched domain sysctl: free and rebuild when changing domains

2007-10-12 Thread Milton Miller
Unregister and free the sysctl table before destroying domains, then
rebuild and register after creating the new domains.  This prevents the
sysctl table from pointing to freed memory for root to write.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-12 03:55:19.0 -0500
+++ kernel/kernel/sched.c   2007-10-12 03:58:35.0 -0500
@@ -5291,6 +5291,18 @@ static struct ctl_table *sd_alloc_ctl_en
return entry;
 }
 
+static void sd_free_ctl_entry(struct ctl_table **tablep)
+{
+   struct ctl_table *entry = *tablep;
+
+   for (entry = *tablep; entry->procname; entry++)
+   if (entry->child)
+   sd_free_ctl_entry(&entry->child);
+
+   kfree(*tablep);
+   *tablep = NULL;
+}
+
 static void
 set_table_entry(struct ctl_table *entry,
const char *procname, void *data, int maxlen,
@@ -5359,7 +5371,7 @@ static ctl_table *sd_alloc_ctl_cpu_table
 }
 
 static struct ctl_table_header *sd_sysctl_header;
-static void init_sched_domain_sysctl(void)
+static void register_sched_domain_sysctl(void)
 {
int i, cpu_num = num_online_cpus();
struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1);
@@ -5376,8 +5388,18 @@ static void init_sched_domain_sysctl(voi
}
sd_sysctl_header = register_sysctl_table(sd_ctl_root);
 }
+
+static void unregister_sched_domain_sysctl(void)
+{
+   unregister_sysctl_table(sd_sysctl_header);
+   sd_sysctl_header = NULL;
+   sd_free_ctl_entry(&sd_ctl_dir[0].child);
+}
 #else
-static void init_sched_domain_sysctl(void)
+static void register_sched_domain_sysctl(void)
+{
+}
+static void unregister_sched_domain_sysctl(void)
 {
 }
 #endif
@@ -6311,6 +6333,8 @@ static int arch_init_sched_domains(const
 
err = build_sched_domains(&cpu_default_map);
 
+   register_sched_domain_sysctl();
+
return err;
 }
 
@@ -6327,6 +6351,8 @@ static void detach_destroy_domains(const
 {
int i;
 
+   unregister_sched_domain_sysctl();
+
for_each_cpu_mask(i, *cpu_map)
cpu_attach_domain(NULL, i);
synchronize_sched();
@@ -6357,6 +6383,8 @@ int partition_sched_domains(cpumask_t *p
if (!err && !cpus_empty(*partition2))
err = build_sched_domains(partition2);
 
+   register_sched_domain_sysctl();
+
return err;
 }
 
@@ -6488,8 +6516,6 @@ void __init sched_init_smp(void)
/* XXX: Theoretical race here - CPU may be hotplugged now */
hotcpu_notifier(update_sched_domains, 0);
 
-   init_sched_domain_sysctl();
-
/* Move init over to a non-isolated CPU */
if (set_cpus_allowed(current, non_isolated_cpus) < 0)
BUG();


[PATCH] sched domain sysctl: don't bug on alloc failure

2007-10-12 Thread Milton Miller
Now that we are calling this at runtime, a more relaxed error path is
suggested.  If an allocation fails, we just register the partial table,
which will show empty directories.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-12 03:59:02.0 -0500
+++ kernel/kernel/sched.c   2007-10-12 03:59:07.0 -0500
@@ -5285,8 +5285,6 @@ static struct ctl_table *sd_alloc_ctl_en
struct ctl_table *entry =
kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL);
 
-   BUG_ON(!entry);
-
return entry;
 }
 
@@ -5319,6 +5317,9 @@ sd_alloc_ctl_domain_table(struct sched_d
 {
struct ctl_table *table = sd_alloc_ctl_entry(14);
 
+   if (table == NULL)
+   return NULL;
+
set_table_entry(&table[0], "min_interval", &sd->min_interval,
sizeof(long), 0644, proc_doulongvec_minmax);
set_table_entry(&table[1], "max_interval", &sd->max_interval,
@@ -5356,6 +5357,8 @@ static ctl_table *sd_alloc_ctl_cpu_table
for_each_domain(cpu, sd)
domain_num++;
entry = table = sd_alloc_ctl_entry(domain_num + 1);
+   if (table == NULL)
+   return NULL;
 
i = 0;
for_each_domain(cpu, sd) {
@@ -5376,6 +5379,9 @@ static void register_sched_domain_sysctl
struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1);
char buf[32];
 
+   if (entry == NULL)
+   return;
+
sd_ctl_dir[0].child = entry;
 
for_each_online_cpu(i) {


[PATCH] sched domain debug sysctl fixes

2007-10-12 Thread Milton Miller

I forgot to put the sequence numbers in the patch subjects (I did
in the message ids).

Here is the order I diffed them:

sched domain sysctl: register online cpus
sched domain sysctl: free and rebuild when changing domains
sched.c: use kcalloc
sched domain sysctl: don't bug on alloc failure
sched domain sysctl: sysctl tables can have no holes

milton


Re: [PATCH v2] powerpc: don't enable cpu hotplug on mpic-based pseries

2007-10-10 Thread Milton Miller

Don't allow cpu hotplug on pSeries systems lacking XICS interrupt controller,
since current code is hardcoded to call xics routines.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>
--

Olof's patch searched the device-tree again, looking for an mpic.   This
code instead checks that we found an xics the first time by checking the
init function.

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 9711eb0..20f010a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -262,6 +262,12 @@ static int __init pseries_cpu_hotplug_init(void)
return 0;
}
 
+   if (ppc_md.init_IRQ != xics_init_IRQ) {
+   printk(KERN_INFO "pSeries CPU Hotplug only supported on xics "
+   "interrupt controllers - disabling");
+   return 0;
+   }
+
ppc_md.cpu_die = pseries_mach_cpu_die;
smp_ops->cpu_disable = pseries_cpu_disable;
smp_ops->cpu_die = pseries_cpu_die;


Re: [stable] [patch 09/12] Fix SMP poweroff hangs

2007-10-10 Thread Milton Miller

On Tue, Oct 9 2007 Thomas Gleixner wrote:
>On Tue, 9 Oct 2007, Linus Torvalds wrote:
>>On Wed, 10 Oct 2007, Thomas Gleixner wrote:
>>>
>>> Wrapping it into a #ifdef CONFIG_X86 would be sufficient.
>>
>> Well, the ppc oops seems to be a ppc bug regardless.
>>
>> If CPU_HOTPLUG isn't defined, the thing does nothing. And if it is
>> defined, I don't see why/how ppc can validly oops. So I think the first
>> thing to do is to try to figure out why it oopses, not to disable it for
>> ppc.
>
>Fair enough. OTOH for the affected PPC users it's a regression and that's
>what I'm concerned of.
>
>tglx

I agree this exposes a ppc arch bug (and the reason has been diagnosed),
but might not other archs have similar problems?  The patch ties
power-off code to suspend and hibernate cpu hotplug code.

Why not put this ACPI requirement in the ACPI code?   There is already
a hook called at the same place this function was called that is only
used for power off.

Also, the code is compiled for CONFIG_PM_SLEEP_SMP, which is not
CONFIG_HOTPLUG_CPU; it requires the selection of either CONFIG_HIBERNATION
or CONFIG_SUSPEND.  I confirmed that both of these can be disabled
on i386 without disabling acpi.  In such a kernel the existing patch
is insufficient.


 drivers/acpi/sleep/main.c |2 ++
 kernel/sys.c  |1 -
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/acpi/sleep/main.c b/drivers/acpi/sleep/main.c
index 2cbb9aa..79589bd 100644
--- a/drivers/acpi/sleep/main.c
+++ b/drivers/acpi/sleep/main.c
@@ -15,6 +15,7 @@
 #include <linux/dmi.h>
 #include <linux/device.h>
 #include <linux/suspend.h>
+#include <linux/cpu.h>
 
 #include <asm/io.h>
 
@@ -382,6 +383,7 @@ static void acpi_power_off_prepare(void)
 {
/* Prepare to power off the system */
acpi_sleep_prepare(ACPI_STATE_S5);
+   disable_nonboot_cpus();
 }
 
 static void acpi_power_off(void)
diff --git a/kernel/sys.c b/kernel/sys.c
index 8ae2e63..138f5c8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -879,7 +879,6 @@ void kernel_power_off(void)
kernel_shutdown_prepare(SYSTEM_POWER_OFF);
if (pm_power_off_prepare)
pm_power_off_prepare();
-   disable_nonboot_cpus();
sysdev_shutdown();
printk(KERN_EMERG "Power down.\n");
machine_power_off();


[PATCH Kbuild] Call only one make with all targets for O=

2007-09-21 Thread Milton Miller
Change the invocations of make in the output directory Makefile and the
main Makefile for separate object trees to pass all goals to one $(MAKE)
via a new phony target "sub-make" and the existing target _all.

When compiling with separate object directories, a separate make is called
in the context of another directory (from the output directory the main
Makefile is called, the Makefile is then restarted with current directory
set to the object tree).  Before this patch, when multiple make command
goals are specified, each target results in a separate make invocation.
With make -j, these invocations may run in parallel, resulting in multiple
commands running in the same directory clobbering each other's results.

I did not try to address make -j for mixed dot-config and no-dot-config
targets.  Because the order does matter, a solution was not obvious.
Perhaps a simple check for MAKEFLAGS having -j and refusing to run would
be appropriate.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>
---
I chose @: as the phony command after the sub-make target does all the
work; is there a better alternative?  It looks like :; is used for
Makefile.

Index: kernel/Makefile
===
--- kernel.orig/Makefile2007-09-19 01:55:45.0 -0500
+++ kernel/Makefile 2007-09-19 02:01:16.0 -0500
@@ -116,12 +116,16 @@ KBUILD_OUTPUT := $(shell cd $(KBUILD_OUT
 $(if $(KBUILD_OUTPUT),, \
  $(error output directory "$(saved-output)" does not exist))
 
-PHONY += $(MAKECMDGOALS)
+PHONY += $(MAKECMDGOALS) sub-make
 
-$(filter-out _all,$(MAKECMDGOALS)) _all:
+$(filter-out _all sub-make,$(MAKECMDGOALS)) _all: sub-make
+   @:
+
+sub-make: FORCE
$(if $(KBUILD_VERBOSE:1=),@)$(MAKE) -C $(KBUILD_OUTPUT) \
KBUILD_SRC=$(CURDIR) \
-   KBUILD_EXTMOD="$(KBUILD_EXTMOD)" -f $(CURDIR)/Makefile $@
+   KBUILD_EXTMOD="$(KBUILD_EXTMOD)" -f $(CURDIR)/Makefile \
+   $(filter-out _all sub-make,$(MAKECMDGOALS))
 
 # Leave processing to above invocation of make
 skip-makefile := 1
Index: kernel/scripts/mkmakefile
===
--- kernel.orig/scripts/mkmakefile  2007-09-19 01:55:45.0 -0500
+++ kernel/scripts/mkmakefile   2007-09-19 02:01:53.0 -0500
@@ -26,11 +26,13 @@ MAKEFLAGS += --no-print-directory
 
 .PHONY: all \$(MAKECMDGOALS)
 
+all:= \$(filter-out all Makefile,\$(MAKECMDGOALS))
+
 all:
-   \$(MAKE) -C \$(KERNELSRC) O=\$(KERNELOUTPUT)
+   \$(MAKE) -C \$(KERNELSRC) O=\$(KERNELOUTPUT) \$(all)
 
 Makefile:;
 
-\$(filter-out all Makefile,\$(MAKECMDGOALS)) %/:
-   \$(MAKE) -C \$(KERNELSRC) O=\$(KERNELOUTPUT) \$@
+\$(all) %/: all
+   @:
 EOF


Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller


On Jul 20, 2007, at 1:17 PM, Alan Stern wrote:


On Fri, 20 Jul 2007, Milton Miller wrote:


On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:

On Fri, 20 Jul 2007, Milton Miller wrote:

We can't do this unless we have frozen tasks (this way, or another)
before
carrying out the entire operation.


What can't we do?   We've already worked with the drivers to quiesce the
hardware and put any information to resume the device in ram.  Now we
ask them to put their device in low power mode so we can go to sleep.

Even if we schedule, the only thing userspace could touch is memory.


Userspace can submit I/O requests.  Someone will have to audit every
driver to make sure that such I/O requests don't cause a quiesced
device to become active.  If the device is active, it will make the
memory snapshot inconsistent with the on-device data.


If a driver is waking a device between the time it was told by
hibernation "suspend all operations and save your device state to ram"
and "resume your device" then it is a buggy driver.


That's exactly my point.  As far as I know nobody has done a survey,
but I bet you'd find _many_ drivers are buggy either in this way or the
converse (forcing an I/O request to fail immediately instead of waiting
until the suspend is over when it could succeed).  They have this bug
because they were written -- those which include any suspend/resume
support at all -- under the assumption that they could rely on the
freezer.

And that's why Rafael said "We can't do this unless we have frozen
tasks (this way, or another) before carrying out the entire operation."
Until the drivers are fixed -- which seems like a tremendous job --
none of this will work.


So this is in the way of removing the freezer ... but as we are not 
relying on doing any io other than suspend device operation, save state 
to ram, then later put device in low power mode for s3 and/or s4, and 
finally restore and resume to running.



I argue the process can make the io request after we write to disk, we
just can't service it.  If we are suspended it will go to the request
queue, and eventually the process will wait for normal throttling
mechanisms until the driver is woken up.


Many drivers don't have request queues.  Even for the ones that do,
there are I/O pathways that bypass the queue (think of ioctl or sysfs).


So it's not a flag in make_request, fine.


Actually, my point was more "what kernel services do the drivers need
to transition from quiesced to low power for acpi S4 or
suspend-to-ram"?  We can't give them allocate-memory (but we give them
a call "we are going to suspend" when they can), but does "run this
tasklet" help?  What timer facilities are needed?


Some drivers need the ability to schedule.  Some will need the ability
to allocate memory (although GFP_ATOMIC is probably sufficient).  Some
will need timers to run.


Can they allocate the memory in advance?  (Call them when we know we 
want to suspend, they make the allocations they will need; we later 
call them again to release the allocations).


If you need timers, you probably want some scheduling?


Do we need to differentiate init (por by bios) and resume from quiesced
(for reboot, kexec start/resume)?  I hope not.


Yes we do.


can you elaborate?   Note I was not asking resume-from-low-power vs 
init-from-por.  We still get that distinction.


How do these drivers work today when we kexec?

The reason I'm asking is it's hard to tell the first kernel what 
happened.  We can say "we powered off, and we were restarted", but it 
becomes much harder when each device may or may not have a driver in 
the save kernel if we have to differentiate for each device if it was 
initialized and later quiesced by the jump kernel during save or never 
touched.  And we need to tell the resume from hibernate code "i touched 
it" "no i didn't" and "we resumed from s4" "no it was from s5".


This is why I've been proposing that we don't create the suspend image 
with devices in the low power state, but only in a quiesced state 
similar to the initial state.



I'm proposing a sequence like:

(1) start allocating pinned memory to reduce saved image size
(2) allocate and map blocks to save maximum image (we know how much ram 
is not in 1, so the max size)
(3) tell drivers we are going to suspend.   userspace is still running, 
swaping still active, etc.  now is the time to allocate memory to save 
device state.
(4) do what we want to slow down userspace making requests (ie run 
freezer today)
(5) call drivers while still scheduling with interrupts "save 
opportunity".  From this point, any new request should be queued or 
the process put on a wait queue.

(6) suspend timers, turn off interrupts
(7) call drivers with interrupts off (final save)
(8) jump to other kernel to save the image
(9) call drivers to transition to low power
(10) finish operation

Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller

On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:

On Fri, 20 Jul 2007, Milton Miller wrote:

We can't do this unless we have frozen tasks (this way, or another)
before
carrying out the entire operation.


What can't we do?   We've already worked with the drivers to quiesce the

hardware and put any information to resume the device in ram.  Now we
ask them to put their device in low power mode so we can go to sleep.
Even if we schedule, the only thing userspace could touch is memory.


Userspace can submit I/O requests.  Someone will have to audit every
driver to make sure that such I/O requests don't cause a quiesced
device to become active.  If the device is active, it will make the
memory snapshot inconsistent with the on-device data.


If a driver is waking a device between the time it was told by 
hibernation "suspend all operations and save your device state to ram" 
and "resume your device" then it is a buggy driver.


I argue the process can make the io request after we write to disk, we 
just can't service it.  If we are suspended it will go to the request 
queue, and eventually the process will wait for normal throttling 
mechanisms until the driver is woken up.


It may mean the driver has to set a flag so that it knows it had an 
io request arrive while it was suspended and needs to wake the queue 
during its resume function.
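
A minimal sketch of that pattern, with invented types and names rather
than any real driver API: the driver marks itself suspended, defers work
that arrives in the meantime, and kicks the queue from its resume
function.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical driver state; none of these names come from a real kernel API. */
struct fake_driver {
	bool suspended;
	bool work_arrived_while_suspended;
	int queued;
};

static void submit_request(struct fake_driver *drv)
{
	drv->queued++;
	if (drv->suspended) {
		/* Don't touch the hardware; just remember someone is waiting. */
		drv->work_arrived_while_suspended = true;
		return;
	}
	printf("request started immediately\n");
	drv->queued--;
}

static void driver_suspend(struct fake_driver *drv)
{
	drv->suspended = true;	/* saving device state to ram would happen here */
}

static void driver_resume(struct fake_driver *drv)
{
	drv->suspended = false;
	if (drv->work_arrived_while_suspended) {
		printf("resume kicks the queue: %d deferred request(s)\n", drv->queued);
		drv->queued = 0;
		drv->work_arrived_while_suspended = false;
	}
}

int main(void)
{
	struct fake_driver drv = { 0 };

	driver_suspend(&drv);
	submit_request(&drv);	/* deferred, not serviced */
	driver_resume(&drv);	/* queue woken here */
	return 0;
}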



Actually, my point was more "what kernel services do the drivers need 
to transition from quiesced to low power for acpi S4 or 
suspend-to-ram"?  We can't give them allocate-memory (but we give them 
a call "we are going to suspend" when they can), but does "run this 
tasklet" help?  What timer facilities are needed?


Do we need to differentate init (por by bios) and resume from quiesced 
(for reboot, kexec start/resume)?  I hope not.


milton



Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller

On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote:

On Friday, 20 July 2007 01:07, [EMAIL PROTECTED] wrote:

On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:

On Thursday, 19 July 2007 17:46, Milton Miller wrote:

The currently identified problems under discussion include:
(1) how to interact with acpi to enter into S4.
(2) how to identify which memory needs to be saved
(3) how to communicate where to save the memory
(4) what state should devices be in when switching kernels
(5) the complicated setup required with the current patch
(6) what code restores the image


(7) how to avoid corrupting filesystems mounted by the hibernated 
kernel


I didn't realize this was a discussion item. I thought the options 
were

clear, for some filesystem types you can mount them read-only, but for
ext3 (and possibly other less common ones) you just plain cannot touch
them.


That's correct.  And since you cannot touch ext3, you need either to 
assume
that you won't touch filesystems at all, or to have a code to 
recognize the

filesystem you're dealing with.


Or add a small bit of infrastructure that errors writes at make_request 
if you don't have a magic "i am a direct block device write from 
userspace" flag on the bio.


The hibernate may fail, but you don't corrupt the media.

If you don't get the image out, resume back to the "this is resume" 
instead of the power-down path.



(2) Upon start-up (by which I mean what happens after the user has
pressed
the power button or something like that):
  * check if the image is present (and valid) _without_ enabling 
ACPI

(we don't
do that now, but I see no reason for not doing it in the new
framework)
  * if the image is present (and valid), load it
  * turn on ACPI (unless already turned on by the BIOS, that is)
  * execute the _BFS global control method
  * execute the _WAK global control method
  * continue
  Here, the first two things should be done by the image-loading
kernel, but
  the remaining operations have to be carried out by the restored
kernel.


Here I agree.

Here is my proposal.  Instead of trying to both write the image and
suspend, I think this all becomes much simpler if we limit the scope
the work of the second kernel.  Its purpose is to write the image.
After that its done.   The platform can be powered off if we are 
going

to S5.   However, to support suspend to ram and suspend to disk, we
return to the first kernel.


We can't do this unless we have frozen tasks (this way, or another) 
before
carrying out the entire operation.  In that case, however, the 
kexec-based
approach would have only one advantage over the current one.  
Namely, it

would allow us to create bigger images.


we all agree that tasks cannot run during the suspend-to-ram state, 
but

the disagreement is over what this means

at one extreme it could mean that you would need the full freezer as 
per

the current suspend projects.

at the other extreme it could mean that all that's needed is to 
invoke the
suspend-to-ram routine before anything else on the suspended kernel 
on the

return from the save and restore kernel.

we just need to figure out which it is (or if it's somewhere in 
between).


Well, I think that the "invoke the suspend-to-ram routine before 
anything else

on the suspended kernel" thing won't be easy to implement in practice.


Why?  You don't expect suspend-to-ram in drivers to be implemented?  We 
need more separation of the quiesce drivers from power-down devices?


Note that we are just talking about "suspend devices and put their 
state in ram", not actually invoking the platform to suspend to ram.


And I'm actually saying we free memory and maybe allocate disk blocks 
for the save before we suspend (see below).



Message-ID: <[EMAIL PROTECTED]>
On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:

(1) Filesystems mounted before the hibernation are untouchable


This is because some file systems do a fsck or other activity even 
when
mounted read only.  For the kexec case, however, this should be 
"file
systems mounted by the hibernated system must not be written".   As 
has
been mentioned in the past, we should be able to use something like 
dm

snapshot to allow fsck and the file system to see the cleaned copy
while not actually writing the media.


We can't _require_ users to use the dm snapshot in order for the 
hibernation

to work, sorry.

And by _reading_ from a filesystem you generally update metadata.


not if the filesystem is mounted read-only (except on ext3)


Well, if the filesystem in question is a journaling one and the 
hibernated

kernel has mounted this fs read-write, this seems to be tricky anyway.


Yes.  I would argue writing to existing blocks of a file (not through 
the filesystem, just getting their blocks from the file system) should 
be safe, but it occurs to me that may not be the case if your fsck and 
bmap move data blocks from some update log to the file system.


But we know the (maximum) image size.   So we could

Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller

On Jul 19, 2007, at 3:28 PM, Rafael J. Wysocki wrote:

On Thursday, 19 July 2007 17:46, Milton Miller wrote:

The currently identified problems under discussion include:
(1) how to interact with acpi to enter into S4.
(2) how to identify which memory needs to be saved
(3) how to communicate where to save the memory
(4) what state should devices be in when switching kernels
(5) the complicated setup required with the current patch
(6) what code restores the image


(7) how to avoid corrupting filesystems mounted by the hibernated 
kernel




Ok I talked on this too.


I'll now start with quotes from several articles in this thread and my
responses.

Message-ID: <[EMAIL PROTECTED]>
On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:

(1) Upon entering the sleep state, which IMO can be done _after_ the
image
has been saved:
  * figure out which devices can wake up
  * put devices into low power states (wake-up devices are placed in
the Dx
states compatible with the wake capability, the others are 
powered

off)
  * execute the _PTS global control method
  * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
  * execute the _GTS global control method
  * set the GPE enable registers corresponding to the wake-up 
devices)

  * make the platform enter S4 (there's a well defined procedure for
that)
  I think that this should be done by the image-saving kernel.


Message-ID: <[EMAIL PROTECTED]>
On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard
expressed his agreement with this block but also confusion on the 
other

blocks.


I strongly disagree.

(1) as has been pointed out, this requires the new kernel to 
understand

all io devices in the first kernel.
(2) it requires both kernels to talk to ACPI.   This is doomed to
failure.  How can the second kernel initialize ACPI?   The platform
thinks it has already been initialized.  Do we plan to always undo all
acpi initialization?


Good question.  I don't know.




(2) Upon start-up (by which I mean what happens after the user has
pressed
the power button or something like that):
  * check if the image is present (and valid) _without_ enabling ACPI
(we don't
do that now, but I see no reason for not doing it in the new
framework)
  * if the image is present (and valid), load it
  * turn on ACPI (unless already turned on by the BIOS, that is)
  * execute the _BFS global control method
  * execute the _WAK global control method
  * continue
  Here, the first two things should be done by the image-loading
kernel, but
  the remaining operations have to be carried out by the restored
kernel.


Here I agree.

Here is my proposal.  Instead of trying to both write the image and
suspend, I think this all becomes much simpler if we limit the scope
the work of the second kernel.  Its purpose is to write the image.
After that its done.   The platform can be powered off if we are going
to S5.   However, to support suspend to ram and suspend to disk, we
return to the first kernel.


We can't do this unless we have frozen tasks (this way, or another) 
before

carrying out the entire operation.


What can't we do?   We've already worked with the drivers to quiesce the 
hardware and put any information to resume the device in ram.  Now we 
ask them to put their device in low power mode so we can go to sleep.  
Even if we schedule, the only thing userspace could touch is memory.   
If we resume, they just run those computations again.



In that case, however, the kexec-based
approach would have only one advantage over the current one.  Namely, 
it

would allow us to create bigger images.


The advantage is we don't have to come up with a way to teach drivers 
"wake up to run these requests, but no other requests".  We don't have 
to figure out what we need to resume to allow them to process a 
request.



This means that the first kernel will need to know why it got resumed.
Was the system powered off, and this is the resume from the user?   Or
was it restarted because the image has been saved, and its now time to
actually suspend until woken up?  If you look at it, this is the same
interface we have with the magic arch_suspend hook -- did we just
suspend and its time to write the image, or did we just resume and its
time to wake everything up.

I think this can be easily solved by giving the image saving kernel 
two

resume points: one for the image has been written, and one for we
rebooted and have restored the image.  I'm not familiar with ACPI.
Perhaps we need a third to differentiate we read the image from S4
instead of from S5, but that information must be available to the OS
because it needs that to know if it should resume from hibernate.

By making the split at image save and restore we have several
advantages:

(1) the kernel always initializes with devices in the init or quiesced
but active state.

(2) the kernel always resumes with devices in the init or quiesced but
active state.

(3) the kjump save and restore kernel does not need to know how to
suspend

Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller

On Jul 19, 2007, at 12:31 PM, [EMAIL PROTECTED] wrote:

On Thu, 19 Jul 2007, Milton Miller wrote:
 (2) Upon start-up (by which I mean what happens after the user has 
pressed

 the power button or something like that):
   * check if the image is present (and valid) _without_ enabling 
ACPI (we

 don't
 do that now, but I see no reason for not doing it in the new
 framework)
   * if the image is present (and valid), load it
   * turn on ACPI (unless already turned on by the BIOS, that is)
   * execute the _BFS global control method
   * execute the _WAK global control method
   * continue
   Here, the first two things should be done by the image-loading
 kernel, but
   the remaining operations have to be carried out by the restored 
kernel.


Here I agree.

Here is my proposal.  Instead of trying to both write the image and 
suspend, I think this all becomes much simpler if we limit the scope 
the work of the second kernel.  Its purpose is to write the image.   
After that its done. The platform can be powered off if we are going 
to S5.   However, to support suspend to ram and suspend to disk, we 
return to the first kernel.


This means that the first kernel will need to know why it got 
resumed.  Was the system powered off, and this is the resume from the 
user?   Or was it restarted because the image has been saved, and its 
now time to actually suspend until woken up?  If you look at it, this 
is the same interface we have with the magic arch_suspend hook -- did 
we just suspend and its time to write the image, or did we just 
resume and its time to wake everything up.


I think this can be easily solved by giving the image saving kernel 
two resume points: one for the image has been written, and one for we 
rebooted and have restored the image.  I'm not familiar with ACPI.  
Perhaps we need a third to differentiate we read the image from S4 
instead of from S5, but that information must be available to the OS 
because it needs that to know if it should resume from hibernate.


are we sure that there are only 2-3 possible actions? or should this 
be made into a simple jump table so that it's extendable?


At 2 I don't think we need a jump table.   Even if we had a table, we 
have to identify what each entry means.  If we start getting more, then 
we can change from command line to table.
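
As a toy illustration only (the parameter name and values below are
invented, not an existing kernel option), two resume points could be
told apart with something as simple as a command-line token before a
table is ever needed:

#include <stdio.h>
#include <string.h>

/* Invented parameter values; not an existing kernel command-line option. */
enum resume_reason {
	RESUME_IMAGE_SAVED,	/* the kjump kernel finished writing the image */
	RESUME_IMAGE_RESTORED,	/* we rebooted and the image was loaded back */
	RESUME_UNKNOWN,
};

static enum resume_reason parse_resume_reason(const char *arg)
{
	if (strcmp(arg, "kjump_resume=saved") == 0)
		return RESUME_IMAGE_SAVED;
	if (strcmp(arg, "kjump_resume=restored") == 0)
		return RESUME_IMAGE_RESTORED;
	return RESUME_UNKNOWN;
}

int main(void)
{
	printf("%d\n", parse_resume_reason("kjump_resume=saved"));
	printf("%d\n", parse_resume_reason("kjump_resume=restored"));
	return 0;
}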



As noted in  the thread

Message-ID: <[EMAIL PROTECTED]>
Subject: [linux-pm] Re: hibernation/snapshot design
on Mon Jul  9 08:23:53 2007, Jeremy Maitin-Shepard wrote:


 (3) how to communicate where to save the memory


This is an interesting topic.  The suspended kernel has most IO and 
disk space.  It also knows how much space is to be occupied by the 
kernel.   So communicating a block map to the second kernel would be 
the obvious choice. But the second kernel must be able to find the 
image to restore it, and it must have drivers for the media.  Also, 
this is not feasible for storing to nfs.


I think we will end up with several methods.

One would be to supply a list of blocks, and implement a file system 
that reads the file by reading the scatter list from media.  The 
restore kernel then only needs to read an anchor, and can build upon 
that until the image is read into memory.  Or do this in userspace.
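
A hypothetical sketch of such an anchor-plus-scatter-list layout (all
names and fields below are invented, not an existing image format): the
restore kernel reads one well-known anchor block, which points at the
list of block runs that make up the image.

#include <stdint.h>
#include <stdio.h>

/* Invented structures; not an existing on-disk format. */
#define ANCHOR_MAGIC	0x4b4a4d50u	/* "KJMP" */

struct image_anchor {			/* lives at one well-known block */
	uint32_t magic;
	uint32_t block_size;
	uint64_t nr_blocks;		/* how many data blocks make up the image */
	uint64_t extent_list_block;	/* where the scatter list starts */
};

struct image_extent {			/* one run of consecutive blocks on the media */
	uint64_t start_block;
	uint64_t nr_blocks;
};

int main(void)
{
	struct image_anchor a = {
		.magic = ANCHOR_MAGIC,
		.block_size = 4096,
		.nr_blocks = 3,
		.extent_list_block = 100,
	};
	struct image_extent list[] = { { 200, 2 }, { 512, 1 } };

	printf("anchor: %llu blocks, list at block %llu\n",
	       (unsigned long long)a.nr_blocks,
	       (unsigned long long)a.extent_list_block);
	printf("first extent: start %llu len %llu\n",
	       (unsigned long long)list[0].start_block,
	       (unsigned long long)list[0].nr_blocks);
	return 0;
}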


I don't know how this compares to the current restore path.   I 
wasn't able to identify the code that creates the on disk structure 
in my 10 minute perusal of kernel/power/.


A second method will be to supply a device and file that will be 
mounted by the save kernel, then unmounted and restored.  This would 
require a partition that is not mounted or open by the suspended 
kernel (or use nfs or a similar protocol that is designed for 
multiple client concurrent access).


A third method would be to allocate a file with the first kernel, and 
make sure the blocks are flushed to disk.  The save and restore 
kernels map the file system using a snapshot device.  Writing would 
map the blocks and use the block offset to write to the real device 
using the method from the first option; reading could be done 
directly from the snapshot device.


The first and third option are dead on log based file systems (where 
the data is stored in the log).


remember that the save and restore kernel can access the memory of the 
suspending kernel, so as long as the data is in a known format and 
there is a pointer to the data in a known location, the save and 
restore kernel can retrieve the data from memory, there's no need to 
involve media.


I agree that the save kernel can read the list from the being-saved 
kernel.


However, when restoring, the being-saved (being-restored) kernel is not 
accessible, so the save list has to be stored as part of the image.



Simplifying kjump: the proposal for v3.

The current code is trying to use crash dump area as a safe, reserved 
area to run the second kernel.   However, that means that the kernel 
has to be linked specially to run in the reserved area.   I th

Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller

On Jul 19, 2007, at 12:31 PM, [EMAIL PROTECTED] wrote:

On Thu, 19 Jul 2007, Milton Miller wrote:
 (2) Upon start-up (by which I mean what happens after the user has 
pressed

 the power button or something like that):
   * check if the image is present (and valid) _without_ enabling 
ACPI (we

 don't
 do that now, but I see no reason for not doing it in the new
 framework)
   * if the image is present (and valid), load it
   * turn on ACPI (unless already turned on by the BIOS, that is)
   * execute the _BFS global control method
   * execute the _WAK global control method
   * continue
   Here, the first two things should be done by the image-loading
 kernel, but
   the remaining operations have to be carried out by the restored 
kernel.


Here I agree.

Here is my proposal.  Instead of trying to both write the image and 
suspend, I think this all becomes much simpler if we limit the scope 
the work of the second kernel.  Its purpose is to write the image.   
After that its done. The platform can be powered off if we are going 
to S5.   However, to support suspend to ram and suspend to disk, we 
return to the first kernel.


This means that the first kernel will need to know why it got 
resumed.  Was the system powered off, and this is the resume from the 
user?   Or was it restarted because the image has been saved, and its 
now time to actually suspend until woken up?  If you look at it, this 
is the same interface we have with the magic arch_suspend hook -- did 
we just suspend and its time to write the image, or did we just 
resume and its time to wake everything up.


I think this can be easily solved by giving the image saving kernel 
two resume points: one for the image has been written, and one for we 
rebooted and have restored the image.  I'm not familiar with ACPI.  
Perhaps we need a third to differentiate we read the image from S4 
instead of from S5, but that information must be available to the OS 
because it needs that to know if it should resume from hibernate.


are we sure that there are only 2-3 possible actions? or should this 
be made into a simple jump table so that it's extendable?


At 2 I don't think we need a jump table.   Even if we had a table, we 
have to identify what each entry means.  If we start getting more then 
we can change from command line to table.



As noted in  the thread

Message-ID: [EMAIL PROTECTED]
Subject: [linux-pm] Re: hibernation/snapshot design
on Mon Jul  9 08:23:53 2007, Jeremy Maitin-Shepard wrote:


 (3) how to communicate where to save the memory


This is an intresting topic.  The suspended kernel has most IO and 
disk space.  It also knows how much space is to be occupied by the 
kernel.   So communicating a block map to the second kernel would be 
the obvious choice. But the second kernel must be able to find the 
image to restore it, and it must have drivers for the media.  Also, 
this is not feasible for storing to nfs.


I think we will end up with several methods.

One would be supply a list of blocks, and implement a file system 
that reads the file by reading the scatter list from media.  The 
restore kernel then only needs to read an anchor, and can build upon 
that until the image is read into memory.  Or do this in userspace.


I don't know how this compares to the current restore path.   I 
wasn't able to identify the code that creates the on disk structure 
in my 10 minute perusal of kernel/power/.


A second method will be to supply a device and file that will be 
mounted by the save kernel, then unmounted and restored.  This would 
require a partition that is not mounted or open by the suspended 
kernel (or use nfs or a similar protocol that is designed for 
multiple client concurrent access).


A third method would be to allocate a file with the first kernel, and 
make sure the blocks are flushed to disk.  The save and restore 
kernels map the file system using a snapshot device.  Writing would 
map the blocks and use the block offset to write to the real device 
using the method from the first option; reading could be done 
directly from the snapshot device.


The first and third option are dead on log based file systems (where 
the data is stored in the log).


remember that the save and restore kernel can access the memory of the 
suspending kernel, so as long as the data is in a known format and 
there is a pointer to the data in a known location, the save and 
restore kernel can retreive the data from memory, there's no need to 
involve media.


I agree that the the save kernel can read the list from the being-saved 
kernel.


However, when restoring, the being-saved (being-restored) kernel is not 
accessable, so the save list has to be stored as part of the image.



Simplifying kjump: the proposal for v3.

The current code is trying to use crash dump area as a safe, reserved 
area to run the second kernel.   However, that means that the kernel 
has to be linked specially to run in the reserved area.   I think we

Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller

On Jul 19, 2007, at 3:28 PM, Rafael J. Wysocki wrote:

On Thursday, 19 July 2007 17:46, Milton Miller wrote:

The currently identified problems under discussion include:
(1) how to interact with acpi to enter into S4.
(2) how to identify which memory needs to be saved
(3) how to communicate where to save the memory
(4) what state should devices be in when switching kernels
(5) the complicated setup required with the current patch
(6) what code restores the image


(7) how to avoid corrupting filesystems mounted by the hibernated 
kernel




Ok I talked on this too.


I'll now start with quotes from several articles in this thread and my
responses.

Message-ID: [EMAIL PROTECTED]
On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:

(1) Upon entering the sleep state, which IMO can be done _after_ the
image
has been saved:
  * figure out which devices can wake up
  * put devices into low power states (wake-up devices are placed in
the Dx
states compatible with the wake capability, the others are 
powered

off)
  * execute the _PTS global control method
  * switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
  * execute the _GTS global control method
  * set the GPE enable registers corresponding to the wake-up 
devices)

  * make the platform enter S4 (there's a well defined procedure for
that)
  I think that this should be done by the image-saving kernel.


Message-ID: [EMAIL PROTECTED]
On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard
expressed his agreement with this block but also confusion on the 
other

blocks.


I strongly disagree.

(1) as has been pointed out, this requires the new kernel to 
understand

all io devices in the first kernel.
(2) it requires both kernels to talk to ACPI.   This is doomed to
failure.  How can the second kernel initialize ACPI?   The platform
thinks it has already been initialized.  Do we plan to always undo all
acpi initialization?


Good question.  I don't know.




(2) Upon start-up (by which I mean what happens after the user has
pressed
the power button or something like that):
  * check if the image is present (and valid) _without_ enabling ACPI
(we don't
do that now, but I see no reason for not doing it in the new
framework)
  * if the image is present (and valid), load it
  * turn on ACPI (unless already turned on by the BIOS, that is)
  * execute the _BFS global control method
  * execute the _WAK global control method
  * continue
  Here, the first two things should be done by the image-loading
kernel, but
  the remaining operations have to be carried out by the restored
kernel.


Here I agree.

Here is my proposal.  Instead of trying to both write the image and
suspend, I think this all becomes much simpler if we limit the scope
the work of the second kernel.  Its purpose is to write the image.
After that its done.   The platform can be powered off if we are going
to S5.   However, to support suspend to ram and suspend to disk, we
return to the first kernel.


We can't do this unless we have frozen tasks (this way, or another) 
before

carrying out the entire operation.


What can't we do?   We've already worked with the drivers to quesce the 
hardware and put any information to resume the device in ram.  Now we 
ask them to put their device in low power mode so we can go to sleep.  
Even if we schedule, the only thing userspace could touch is memory.   
If we resume, they just run those computations again.



In that case, however, the kexec-based
approach would have only one advantage over the current one.  Namely, 
it

would allow us to create bigger images.


The advantage is we don't have to come up with a way to teach drivers 
wake up to run these requests, but no other requests.  We don't have 
to figure out what we need to resume to allow them to process a 
request.



This means that the first kernel will need to know why it got resumed.
Was the system powered off, and this is the resume from the user?   Or
was it restarted because the image has been saved, and its now time to
actually suspend until woken up?  If you look at it, this is the same
interface we have with the magic arch_suspend hook -- did we just
suspend and its time to write the image, or did we just resume and its
time to wake everything up.

I think this can be easily solved by giving the image saving kernel 
two

resume points: one for the image has been written, and one for we
rebooted and have restored the image.  I'm not familiar with ACPI.
Perhaps we need a third to differentiate we read the image from S4
instead of from S5, but that information must be available to the OS
because it needs that to know if it should resume from hibernate.

By making the split at image save and restore we have several
advantages:

(1) the kernel always initializes with devices in the init or quiesced
but active state.

(2) the kernel always resumes with devices in the init or quiesced but
active state.

(3) the kjump save and restore kernel does not need to know how to
suspend

Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller

On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote:

On Friday, 20 July 2007 01:07, [EMAIL PROTECTED] wrote:

On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:

On Thursday, 19 July 2007 17:46, Milton Miller wrote:

The currently identified problems under discussion include:
(1) how to interact with acpi to enter into S4.
(2) how to identify which memory needs to be saved
(3) how to communicate where to save the memory
(4) what state should devices be in when switching kernels
(5) the complicated setup required with the current patch
(6) what code restores the image


(7) how to avoid corrupting filesystems mounted by the hibernated 
kernel


I didn't realize this was a discussion item. I thought the options 
were

clear, for some filesystem types you can mount them read-only, but for
ext3 (and possilby other less common ones) you just plain cannot touch
them.


That's correct.  And since you cannot thouch ext3, you need either to 
assume
that you won't touch filesystems at all, or to have a code to 
recognize the

filesystem you're dealing with.


Or add a small bit of infrastructure that errors writes at make_request 
if you don't have a magic i am a direct block device write from 
userspace flag on the bio.


The hibernate may fail, but you don't corrupt the media.

If you don't get the image out, resume back to the this is resume 
instead of the power-down path.



(2) Upon start-up (by which I mean what happens after the user has
pressed
the power button or something like that):
  * check if the image is present (and valid) _without_ enabling 
ACPI

(we don't
do that now, but I see no reason for not doing it in the new
framework)
  * if the image is present (and valid), load it
  * turn on ACPI (unless already turned on by the BIOS, that is)
  * execute the _BFS global control method
  * execute the _WAK global control method
  * continue
  Here, the first two things should be done by the image-loading
kernel, but
  the remaining operations have to be carried out by the restored
kernel.


Here I agree.

Here is my proposal.  Instead of trying to both write the image and
suspend, I think this all becomes much simpler if we limit the scope
the work of the second kernel.  Its purpose is to write the image.
After that its done.   The platform can be powered off if we are 
going

to S5.   However, to support suspend to ram and suspend to disk, we
return to the first kernel.


We can't do this unless we have frozen tasks (this way, or another) 
before
carrying out the entire operation.  In that case, however, the 
kexec-based
approach would have only one advantage over the current one.  
Namely, it

would allow us to create bigger images.


we all agree that tasks cannot run during the suspend-to-ram state, but
the disagreement is over what this means

at one extreme it could mean that you would need the full freezer as per
the current suspend projects.

at the other extreme it could mean that all that's needed is to invoke the
suspend-to-ram routine before anything else on the suspended kernel on the
return from the save and restore kernel.

we just need to figure out which it is (or if it's somewhere in between).


Well, I think that the "invoke the suspend-to-ram routine before anything
else on the suspended kernel" thing won't be easy to implement in practice.


Why?  You don't expect suspend-to-ram in drivers to be implemented?  We
need more separation of "quiesce drivers" from "power down devices"?


Note that we are just talking about suspending devices and putting their
state in ram, not actually invoking the platform to suspend to ram.


And I'm actually saying we free memory and maybe allocate disk blocks 
for the save before we suspend (see below).



Message-ID: [EMAIL PROTECTED]
On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:

(1) Filesystems mounted before the hibernation are untouchable


This is because some file systems do a fsck or other activity even when
mounted read only.  For the kexec case, however, this should be "file
systems mounted by the hibernated system must not be written".   As has
been mentioned in the past, we should be able to use something like dm
snapshot to allow fsck and the file system to see the cleaned copy
while not actually writing the media.


We can't _require_ users to use the dm snapshot in order for the hibernation
to work, sorry.

And by _reading_ from a filesystem you generally update metadata.


not if the filesystem is mounted read-only (except on ext3)


Well, if the filesystem in question is a journaling one and the hibernated
kernel has mounted this fs read-write, this seems to be tricky anyway.


Yes.  I would argue writing to existing blocks of a file (not through
the filesystem, just getting their blocks from the file system) should
be safe, but it occurs to me that may not be the case if your fsck and
bmap move data blocks from some update log to the file system.


But we know the (maximum) image size.   So we could

Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller

On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:

On Fri, 20 Jul 2007, Milton Miller wrote:

We can't do this unless we have frozen tasks (this way, or another)
before
carrying out the entire operation.


What can't we do?   We've already worked with the drivers to quiesce the
hardware and put any information to resume the device in ram.  Now we
ask them to put their device in low power mode so we can go to sleep.
Even if we schedule, the only thing userspace could touch is memory.


Userspace can submit I/O requests.  Someone will have to audit every
driver to make sure that such I/O requests don't cause a quiesced
device to become active.  If the device is active, it will make the
memory snapshot inconsistent with the on-device data.


If a driver is waking a device between the time it was told by
hibernation "suspend all operations and save your device state to ram"
and "resume your device", then it is a buggy driver.


I argue the process can make the I/O request after we write to disk; we
just can't service it.  If we are suspended it will go to the request
queue, and eventually the process will wait on the normal throttling
mechanisms until the driver is woken up.


It may mean the driver has to set a flag so that it knows it had an
I/O request arrive while it was suspended and needs to wake the queue
during its resume function.



Actually, my point was more: what kernel services do the drivers need
to transition from quiesced to low power for ACPI S4 or
suspend-to-ram?  We can't give them "allocate memory" (but we give them
a "we are going to suspend" call while they still can), but does "run this
tasklet" help?  What timer facilities are needed?


Do we need to differentiate init (POR by BIOS) and resume from quiesced
(for reboot, kexec start/resume)?  I hope not.


milton



Re: [linux-pm] Re: Hibernation considerations

2007-07-20 Thread Milton Miller


On Jul 20, 2007, at 1:17 PM, Alan Stern wrote:


On Fri, 20 Jul 2007, Milton Miller wrote:


On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:

On Fri, 20 Jul 2007, Milton Miller wrote:

We can't do this unless we have frozen tasks (this way, or another)
before
carrying out the entire operation.


What can't we do?   We've already worked with the drivers to quiesce the
hardware and put any information to resume the device in ram.  Now we
ask them to put their device in low power mode so we can go to sleep.

Even if we schedule, the only thing userspace could touch is memory.


Userspace can submit I/O requests.  Someone will have to audit every
driver to make sure that such I/O requests don't cause a quiesced
device to become active.  If the device is active, it will make the
memory snapshot inconsistent with the on-device data.


If a driver is waking a device between the time it was told by
hibernation "suspend all operations and save your device state to ram"
and "resume your device", then it is a buggy driver.


That's exactly my point.  As far as I know nobody has done a survey,
but I bet you'd find _many_ drivers are buggy either in this way or the
converse (forcing an I/O request to fail immediately instead of waiting
until the suspend is over when it could succeed).  They have this bug
because they were written -- those which include any suspend/resume
support at all -- under the assumption that they could rely on the
freezer.

And that's why Rafael said "We can't do this unless we have frozen
tasks (this way, or another) before carrying out the entire operation."
Until the drivers are fixed -- which seems like a tremendous job --
none of this will work.


So this is in the way of removing the freezer ... but we are not
relying on doing any I/O other than "suspend device operation, save state
to ram", then later "put device in low power mode" for S3 and/or S4, and
finally "restore and resume to running".



I argue the process can make the I/O request after we write to disk; we
just can't service it.  If we are suspended it will go to the request
queue, and eventually the process will wait on the normal throttling
mechanisms until the driver is woken up.


Many drivers don't have request queues.  Even for the ones that do,
there are I/O pathways that bypass the queue (think of ioctl or sysfs).


So it's not a flag in make_request, fine.


Actually, my point was more: what kernel services do the drivers need
to transition from quiesced to low power for ACPI S4 or
suspend-to-ram?  We can't give them "allocate memory" (but we give them
a "we are going to suspend" call while they still can), but does "run this
tasklet" help?  What timer facilities are needed?


Some drivers need the ability to schedule.  Some will need the ability
to allocate memory (although GFP_ATOMIC is probably sufficient).  Some
will need timers to run.


Can they allocate the memory in advance?  (Call them when we know we 
want to suspend, they make the allocations they will need; we later 
call them again to release the allocations).
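A rough sketch of that two-call idea (all names invented for illustration;
this is not an existing kernel interface):

#include <stdlib.h>

/* prepare() runs while normal allocation is still safe and reserves what
 * the later, atomic save step will need; complete() releases it once the
 * cycle finishes or is aborted. */
struct example_dev {
	size_t state_size;      /* how much state the device must stash */
	void  *saved_state;     /* buffer reserved ahead of time */
};

static int example_prepare(struct example_dev *dev)
{
	dev->saved_state = malloc(dev->state_size);
	return dev->saved_state ? 0 : -1;
}

static void example_complete(struct example_dev *dev)
{
	free(dev->saved_state);
	dev->saved_state = NULL;
}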


If you need timers, you probably want some scheduling?


Do we need to differentiate init (POR by BIOS) and resume from quiesced
(for reboot, kexec start/resume)?  I hope not.


Yes we do.


can you elaborate?   Note I was not asking resume-from-low-power vs
init-from-POR.  We still get that distinction.


How do these drivers work today when we kexec?

The reason I'm asking is it's hard to tell the first kernel what
happened.  We can say we powered off and we were restarted, but it
becomes much harder when each device may or may not have a driver in
the save kernel, if we have to differentiate for each device whether it was
initialized and later quiesced by the jump kernel during save or never
touched.  And we need to tell the resume-from-hibernate code "I touched
it" / "no I didn't" and "we resumed from S4" / "no, it was from S5".


This is why I've been proposing that we don't create the suspend image 
with devices in the low power state, but only in a quiesced state 
similar to the initial state.



I'm proposing a sequence like:

(1) start allocating pinned memory to reduce saved image size
(2) allocate and map blocks to save the maximum image (we know how much ram
is not in 1, so the max size)
(3) tell drivers we are going to suspend.  Userspace is still running,
swapping is still active, etc.  Now is the time to allocate memory to save
device state.
(4) do what we want to slow down userspace making requests (i.e. run the
freezer today)
(5) call drivers while still scheduling and with interrupts on (save
opportunity).  From this point, any new request should be queued or the
process put on a wait queue.
(6) suspend timers, turn off interrupts
(7) call drivers with interrupts off (final save)
(8) jump to other kernel to save the image
(9) call drivers to transition to low power
(10) finish operations to platform suspend on hibernate
(11) call drivers to resume, telling them if from suspend-to-ram or
suspend-to-disk, possibly in two stages

[PATCH] e100 rx: or s and el bits

2007-05-01 Thread Milton Miller
In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description
talks about emulating another driver by setting additional bits and
being unable to test when submitted.  Seeing the & operator used to
set more bits made me suspicious, and indeed the bits are defined
in positive logic:

cb_s  = 0x4000,
cb_el = 0x8000,

So ANDing those together would be 0.   I'm guessing they should
be ORed, but I don't have hardware here to test, much like the
committed patch.  In fact, I'll let someone else do the compile
test too.  I'll update the comment.

(It looks like the end-of-list and S bits would not be set
in the template nor cleared when linking in receive skbs, so
as long as the kernel can keep up with the 100Mb card we 
wouldn't see the adapter go off the linked list, possibly
explaining any successful use of this patch written against
2.6.14).

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>


--- linux-2.6/drivers/net/e100.c.orig   2007-05-01 05:19:17.0 -0500
+++ linux-2.6/drivers/net/e100.c2007-05-01 05:22:14.0 -0500
@@ -947,7 +947,7 @@ static void e100_get_defaults(struct nic
((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
 
/* Template for a freshly allocated RFD */
-   nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
+   nic->blank_rfd.command = cpu_to_le16(cb_el | cb_s);
nic->blank_rfd.rbd = 0x;
nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
 
@@ -1769,13 +1769,13 @@ static int e100_rx_alloc_skb(struct nic 
}
 
/* Link the RFD to end of RFA by linking previous RFD to
-* this one, and clearing EL bit of previous.  */
+* this one, and clearing EL and S bit of previous.  */
if(rx->prev->skb) {
struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
put_unaligned(cpu_to_le32(rx->dma_addr),
(u32 *)&prev_rfd->link);
wmb();
-   prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
+   prev_rfd->command &= ~cpu_to_le16(cb_el | cb_s);
pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
sizeof(struct rfd), PCI_DMA_TODEVICE);
}


Re: [patch 00/12] Syslets, Threadlets, generic AIO support, v5

2007-03-13 Thread Milton Miller

Anton Blanchard wrote:

Hi Ingo,


this is the v5 release of the syslet/threadlet subsystem:

   http://redhat.com/~mingo/syslet-patches/


Nice!



I too went and downloaded patches-v5 for review.

First off, one problem I noticed in sys_async_wait:

+   ah->events_left = min_wait_events - (kernel_ring_idx - user_ring_idx);

This completely misses the wraparound case of kernel_ring_idx <
user_ring_idx. I wonder if this is causing some of the benchmark 
problems?

(add max_ring_index if kernel < user).
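Something along these lines, as a sketch only (max_ring_idx stands in for
whatever the actual ring size is called in the patch):

/* Events the kernel has produced that user space has not yet consumed,
 * with the wraparound case handled explicitly. */
static unsigned long ring_events_pending(unsigned long kernel_ring_idx,
                                         unsigned long user_ring_idx,
                                         unsigned long max_ring_idx)
{
	unsigned long pending = kernel_ring_idx - user_ring_idx;

	/* the kernel index wrapped past the end of the ring before the
	 * user index did: bring the difference back into range */
	if (kernel_ring_idx < user_ring_idx)
		pending += max_ring_idx;

	return pending;
}

so the assignment above would become
ah->events_left = min_wait_events - ring_events_pending(...).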


I tried to port this to ppc64 but found a few problems:

The 64bit powerpc ABI has the concept of a TOC (r2) which is used for
per function data. This means this wont work:

[deleted]
I think we would want to change restore_ip to restore_function, and then
create a per arch helper, perhaps:

void set_user_context(struct task_struct *task, unsigned long stack,
  unsigned long function, unsigned long retval);

ppc64 could then grab the ip and r2 values from the function 
descriptor.
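A rough sketch of what the ppc64 side of such a helper could look like
(ELFv1 function descriptors are three doublewords; the struct name, error
handling, and register choices below are simplified for illustration, not
a tested implementation):

#include <linux/sched.h>
#include <linux/uaccess.h>

struct func_desc {
	unsigned long entry;	/* address of the first instruction */
	unsigned long toc;	/* r2 (TOC pointer) for the callee */
	unsigned long env;
};

void set_user_context(struct task_struct *task, unsigned long stack,
		      unsigned long function, unsigned long retval)
{
	struct pt_regs *regs = task_pt_regs(task);
	struct func_desc fd;

	/* 'function' is the user address of a descriptor, not of code */
	if (copy_from_user(&fd, (void __user *)function, sizeof(fd)))
		return;			/* error handling elided */

	regs->nip    = fd.entry;	/* resume execution here */
	regs->gpr[2] = fd.toc;		/* per-function TOC */
	regs->gpr[1] = stack;		/* new user stack pointer */
	regs->gpr[3] = retval;		/* return value register */
}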


The other issue involves the syscall table:

asmlinkage struct syslet_uatom __user *
sys_async_exec(struct syslet_uatom __user *uatom,
               struct async_head_user __user *ahu)
{
	return __sys_async_exec(uatom, ahu, sys_call_table, NR_syscalls);
}

This exposes the layout of the syscall table. Unfortunately it won't work
on ppc64. In arch/powerpc/kernel/systbl.S:

#define COMPAT_SYS(func)	.llong	.sys_##func,.compat_sys_##func

Both syscall tables are overlaid.

Anton


In addition, the entries in the table are not function pointers, they
are the actual code targets.   So we need an arch helper to invoke the
system call.

Here is another problem with your compat code.  Just telling user space
that everything is u64 and having the kernel retrieve pointers and ulong
doesn't work; you have to actually copy in u64 values and truncate them
down.  Your current code is broken on all 32-bit big endian kernels.
Actually, the check needs to be that the upper 32 bits are 0 or return
-EINVAL.

In addition, the compat syscall entry points assume that the arguments
have been truncated to compat_ulong values by the syscall entry path,
and that they only need to do sign extension (and/or pointer conversion
on s390 with its 31 bit pointers).  So all compat kernels are broken.

The two of these things together makes me think we want two copy
functions.  At that point we may as well define the struct uatom in
terms of ulong and compat_ulong for the compat_uatom.  That would lead
to two copies of exec_uatom, but the elimination of passing the syscall
table as an argument down.  The need_resched and signal check could
become part of the common next_uatom routine, although it would need to
know uatom+1 instead of doing the addition in itself.
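Roughly what the two views could look like (field names are placeholders,
not the actual syslet_uatom layout):

#include <linux/compat.h>
#include <linux/types.h>

struct uatom_native {			/* what the 64-bit kernel copies in */
	unsigned long		flags;
	unsigned long		nr;	/* syscall number */
	unsigned long __user	*ret_ptr;
	void __user		*next;
	unsigned long __user	*arg_ptr[6];
};

struct uatom_compat {			/* the layout 32-bit user space used */
	compat_ulong_t	flags;
	compat_ulong_t	nr;
	compat_uptr_t	ret_ptr;
	compat_uptr_t	next;
	compat_uptr_t	arg_ptr[6];
};

The compat copy routine would then widen each field (compat_ptr() for the
pointers) before handing off to the shared execution path.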

Other observations:

All the logic setting at and async_ready is a bit hard to follow.
After some analysis, t->at is only ever set to &t->__at and
async_ready is only set to the same at or NULL.  Both of these
should become flags, and at->task should be converted to
container_of.  Also, the name at is hard to grep / search for.

The stop flags are decoded with a case but are not densely encoded,
rather they are one hot.  We either need to error on multiple stop
bits being set, stop on each possible condition, or encode them
densely.

There is no check for flags being set that are not recognized.
If we ever add a flag for another stop condition this would
lead to incorrect execution by the kernel.

There are some syscalls that can return -EFAULT but later have
force_syscall_noerror.  We should create a stop on ERROR and
clear the force_noerror flag between syscalls.  The umem_add
syscall should add force_noerror if the put_user succeeds.

In copy_uatom, you call verify_read on the entire uatom.  This means
that the full-size struct has to be within the process address
limit, which violates your assertion that userspace doesn't need the
whole structure.  If we add the requirement that the space that would
be occupied by the complete atom has to exist, then we can copy the
whole struct uatom with copy_from_user and then copy the args with
get_user.  User space can still pack them more densely, and we can
still stop copying on a null arg pointer.  Actually, calling access_ok
then __get_user can be more expensive on some architectures because
they have to verify both start and length on access_ok but can only
verify start on get_user because they have unmapped areas between user
space and kernel space.  This would also mean that we don't check
arg_ptr for NULL without verifying that get_user actually worked.   
The gotos in exec_uatom are just a while loop with a break.

sys_umem_add should be in /lib under lib-y in the Makefile.
In fact declaring the function weak does not make it a weak
syscall implementation on some architectures.

Weak syscall aliases to sys_ni_syscall are needed for when
async support is not selected in Kconfig.
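The usual way to get those aliases is a cond_syscall() entry in
kernel/sys_ni.c, for example (entry-point names taken from the patch
under review):

/* resolve to sys_ni_syscall when the async code is not built in */
cond_syscall(sys_async_exec);
cond_syscall(sys_async_wait);
cond_syscall(sys_umem_add);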

The Documentation 


Re: Make sure we populate the initroot filesystem late enough

2007-02-26 Thread Milton Miller

On Feb 27, 2007, at 2:24 AM, David Woodhouse wrote:

On Sun, 2007-02-25 at 20:13 -0800, Linus Torvalds wrote:

On Sun, 25 Feb 2007, David Woodhouse wrote:

Can you try adding something like

memset(start, 0xf0, end - start);


Yeah, I did that before giving up on it for the day and going in search
of dinner. It changes the failure mode to a BUG() in
cache_free_debugcheck(), at line 2876 of mm/slab.c


Ok, that's just strange.


In this case I hadn't left the 'return' in free_initrd_mem(). I was
poisoning the pages and then returning them to the pool as usual.

If I poison the pages and _don't_ return them to the pool, it boots
fine. PageReserved is set on every page in the initrd region; total
page_count() is equal to the number of pages (which doesn't
_necessarily_ mean that page_count() for every page is equal to 1 but
it's a strong hint that that's the case).

Looking in /dev/mem after it boots, I see that my poison is still
present throughout the whole region.

One obvious thing to do would be to remove all the "__initdata" entries in
mm/slab.c..


This is biting us long before we call free_initmem().


 But I'd also like to see the full backtrace for the  BUG_ON(),
in case that gives any clues at all.


I'll see if I can find a camera.


It smells like the pages weren't actually reserved in the first place
and we were blithely allocating them. The only problem with that theory
is that the initrd doesn't seem to be getting corrupted -- and if we
were handing out its pages like that then surely _something_ would have
scribbled on it before we tried to read it.


Yeah, I don't think it's necessarily initrd itself, I'd be more inclined
to think that the reason you see this change with the initrd unpacking is
simply that it does a lot of allocations for the initrd files, so I think
it is only indirectly involved - just because it ends up being a slab
user.


Whatever happens, initrd as a 'slab user' is fine. The crashes happen
_later_, when someone else is using the memory which used to belong to
the initrd. In that 'BUG at slab.c:2876' I mentioned above, r3 was
within the initrd region. As I said, I'll try to find a camera.



Just a thought,

Any chance you are using one of the unusual code paths, like the bootloader
moving the initrd or using a kernel-crash region?


milton



Re: 2.6.20: Rebuild after trivial patch rebuilds way too much

2007-02-13 Thread Milton Miller
On Mon, 12 Feb 2007 12:38:15 -0500 (EST), Andrey Borzenkov wrote:
> Is it not too much for a trivial two-line patch to a single file?
> 
> {pts/1}% stg import
> ~/patch/Re_2_6_20_rc6_libata_PATA_ATAPI_CDROM_is_not_working.patch
> Importing
> patch "Re_2_6_20_rc6_libata_PATA_ATAPI_CDROM_is_not_working.patch"... done
> Now at patch "Re_2_6_20_rc6_libata_PATA_ATAPI_CDROM_is_not_working.patch"
> {pts/1}% make -C ~/src/linux-git O=$HOME/build/linux-2.6.20
> make: Entering directory `/home/bor/src/linux-git'
.. 120 lines of make ...


You can ask Kbuild why it's building with make V=2.

make V=2 -C ~/src/linux-git O=$HOME/build/linux-2.6.20

from scripts/Kbuild.include:
###
# why - tell why a a target got build
#   enabled by make V=2
#   Output (listed in the order they are checked):
#  (1) - due to target is PHONY
#  (2) - due to target missing
#  (3) - due to: file1.h file2.h
#  (4) - due to command line change
#  (5) - due to missing .cmd file
#  (6) - due to target not in $(targets)
# (1) PHONY targets are always build
# (2) No target, so we better build it
# (3) Prerequisite is newer than target
# (4) The command line stored in the file named dir/.target.cmd
# differed from actual command line. This happens when compiler
# options changes
# (5) No dir/.target.cmd file (used to store command line)
# (6) No dir/.target.cmd file and target not listed in $(targets)
# This is a good hint that there is a bug in the kbuild file


Regards,
milton


Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-11 Thread Milton Miller


On Feb 9, 2007, at 10:17 AM, Carl Love wrote:


On Thu, 2007-02-08 at 20:46 -0600, Milton Miller wrote:

 On Feb 8, 2007, at 4:51 PM, Carl Love wrote:


On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote:

On Thursday 08 February 2007 15:18, Milton Miller wrote:


1) sample rate setup

In the current patch, the user specifies a sample rate as a time interval.
The kernel is (a) calling cpufreq to get the current cpu frequency,
(b) converting the rate to a cycle count, (c) converting this to a 24 bit
LFSR count, an iterative algorithm (in this patch, starting from one of
256 values so a max of 2^16 or 64k iterations), (d) calculating a trace
unload interval.   In addition, a cpufreq notifier is registered to
recalculate on frequency changes.


No.  The user issues the command opcontrol --event:N  where N is the
number of events (cycles, l2 cache misses, instructions retired etc)
that are to elapse between collecting the samples.


So you are saying that b and c are primary, and a is used to calculate
a safe value for d.   All of the above work is done, just from a
different starting point?


There are two things: 1) setup the LFSR to control how often the hardware
puts samples into the trace buffer.  2) setup the kernel timer to read
the trace buffer (your d; d is a function of the cpu freq) and process
the samples.

(c is nonsense)



Well, it's calculating the rate from the count vs the count from the rate.



The kernel timer was set with the goal that the hardware trace buffer
would not get more than half full, to ensure we would not lose samples
even for the maximum rate that the hardware would be adding samples to
the trace buffer (user specified N=100,000).


That should be well commented.




The OProfile passes
the value N to the kernel via the variable ctr[i].count.  Where i is
the
performance counter entry for that event.


Ok, I haven't looked at the API closely.



Actually, the oprofile userspace fills out files in a file system to tell
the kernel what it needs to know.   The powerpc code defines the resources
needed to use the PMU hardware, which is (1) common mux selects, for
processors that need them, and (2) a set of directories, one for each of the
pmu counters, each of which contains the controls for that counter (such
as enable kernel space, enable user space, event timeout to interrupt or
sample collection, etc.).   The ctr[i].count is one of these files.




Specifically with SPU
profiling, we do not use performance counters because the CELL HW 
does

not allow the normal the PPU to read the SPU PC when a performance
counter interrupt occurs.  We are using some additional hw support in
the chip that allows us to periodically capture the SPU PC.  There is
an
LFSR hardware counter that can be started at an arbitrary LFSR value.
When the last LFSR value in the sequence is reached, a sample is 
taken
and stored in the trace buffer.  Hence, the value of N specified by 
the
user must get converted to the LFSR value that is N from the end of 
the

sequence.


Ok, so it's "load an arbitrary count and run to the max" vs "count and
compare".   A critical detail when computing the value to load, but the
net result is the same; the value for the count is hard to determine.


The above statement makes no sense to me.


I think I talked past you.  Your description of hardware vs mine was
different in that the counter always ends at a specified point and is
loaded with the variable count, where mine had it comparing to the
count as it incremented.

However, I now see that you were referring to the fact that what the
user specifies, the count, has to be converted to a LFSR value.  My point
is this can be done in the user-space oprofile code.  It already has
to look up magic numbers for setting up event selection muxes for other
hardware; adding a lfsr calculation is not beyond reason.  Nor is having
it provide two values in two files.



Determining the initial LFSR value that is N values from the last value
in the sequence is not easy to do.


Well, it's easy, it's just order(N).
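For example, an O(N) walk backwards from the hardware's terminal value
would do it; the taps and the shape of the step below are purely
illustrative, not the actual Cell polynomial:

#define LFSR_MASK 0xffffff	/* 24-bit register */

/* Undo one step of a Fibonacci LFSR whose forward step is
 *   next = ((v << 1) | (b23 ^ b22 ^ b21 ^ b16)) & LFSR_MASK
 * where bN is bit N of v (taps chosen only for illustration). */
static unsigned int lfsr_unstep(unsigned int v)
{
	unsigned int fb  = v & 1;		/* feedback bit shifted in last */
	unsigned int low = (v >> 1) & 0x7fffff;	/* bits 0..22 of the old value */
	unsigned int b22 = (low >> 22) & 1;
	unsigned int b21 = (low >> 21) & 1;
	unsigned int b16 = (low >> 16) & 1;
	unsigned int b23 = fb ^ b22 ^ b21 ^ b16; /* solve the feedback equation */

	return low | (b23 << 23);
}

/* The start value that is n steps before 'last' in the sequence. */
static unsigned int lfsr_n_before(unsigned int last, unsigned long n)
{
	while (n--)
		last = lfsr_unstep(last);
	return last;
}

User space (or the kernel) would call lfsr_n_before(terminal_value, N) with
whatever terminal value the hardware actually uses.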


The same clock that the processor is running at is used to
control the LFSR count.  Hence the LFSR counter increments once per 
CPU

clock cycle regardless of the CPU frequency or changes in the
frequency.
There is no calculation for the LFSR value that is a function of the
processor frequency.  There is no need to adjust the LFSR when the
processor frequency changes.


Oh, so the lfsr doesn't have to be recomputed, only the time
between unloads.


The LFSR value is computed ONCE when you start OProfile.  The value is
setup in the hardware once when OProfile starts.  The hardware will
always restart with the value given to it after it reaches the last
value in the sequence.  What you call the "time between unloads" is the
time at which you schedule the kernel routine to empty the trace 
buffer.

It is calculated once.  It would only need to be recomputed if the cpu
frequency changed.






The obvious problem


Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Milton Miller


On Feb 9, 2007, at 1:10 PM, Arnd Bergmann wrote:


On Friday 09 February 2007 19:47, Milton Miller wrote:

On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote:




Doing the translation in two stages in user space, as you
suggest here, definitely makes sense to me. I think it
can be done a little simpler though:

Why would you need the accurate dcookie information to be
provided by the kernel? The ELF loader is done in user
space, and the kernel only reproduces what it thinks that
came up with. If the kernel only gives the dcookie information
about the SPU ELF binary to the oprofile user space, then
that can easily recreate the same mapping.


Actually, I was trying to cover issues such as anonymous
memory.   If the kernel doesn't generate dcookies for
the load segments the information is not recoverable once
the task exits.  This would also allow the loader to create
an artificial elf header that covered both the base executable
and a dynamically linked one.

Other alternatives exist, such as a structure for context-id
that would have its own identifying magic and an array of elf
header pointers.


But _why_ do you want to solve that problem? we don't have
dynamically linked binaries and I really don't see why the loader
would want to create artificial elf headers...


I'm explaining how they could be handled.   Actually I think
the other proposal (an identified structure that points to
several elf headers) would be more appropriate.  As you point
out, if there are presently no dynamic libraries in use, it
doesn't have to be solved today.  I'm just trying to make
the code future-proof, or at least leave a clear path forward.




The kernel still needs to provide the overlay identifiers
though.


Yes, which means it needs to parse the header (or libspe
be enhanced to write the monitor info to a spufs file).


we thought about this in the past and discarded it because of
the complexity of an spufs interface that would handle this
correctly.


Not sure what would be difficult, and it would allow other
binary formats.   But parsing the headers in the kernel
means existing userspace doesn't have to be upgraded, so I
am not proposing this requirement.




yes, this sounds nice. But it does not at all help accuracy,
only performance, right?


It allows the user space to know when the sample was taken
and  be aware of the ambiguity.   If the sample rate is
high enough in relation to the overlay switch rate, user space
could decide to discard the ambiguous samples.


yes, good point.


This approach allows multiple objects by its nature.  A new
elf header could be constructed in memory that contained
the union of the elf objects' load segments, and the tools
will magically work.   Alternatively the object id could
point to a new structure, identified via a new header, that
points to other elf headers (easily differentiated by the
elf magic headers).   Other binary formats, including several
objects in an ar archive, could be supported.


Yes, that would be a new feature if the kernel passed dcookie
information for every section, but I doubt that it is worth
it. I have not seen any program that allows loading code
from more than one ELF file. In particular, the ELF format
on the SPU is currently lacking the relocation mechanisms
that you would need for resolving spu-side symbols at load
time


Actually, it could check all load segments, and only report
those where the dcookie changes (as opposed to the offset).


I'm not really following you here, but probably you misunderstood
my point as well.


I was thinking in terms of dynamic libraries, and totally skipped
your comment about the relocation info being missing.   My reply
point was that the table could be compressed to the current
entry if all hashed to the same vm area.




This seems to incur a run-time overhead on the SPU even if not
profiling, I would consider that not acceptable.


It definitely is overhead.  Which means it would have to be
optional, like gprof.


There is some work going on for another profiler independent
of the hardware feature that only relies on instrumenting the
spu executable for things like DMA transfers and overlay
changes.


Regardless, its beyond the current scope.

milton



Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Milton Miller


On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote:


On Thursday 08 February 2007 15:18, Milton Miller wrote:


The current patch specifically identifies that only single
elf objects are handled.  There is no code to handle dynamic
linked libraries or overlays.   Nor is there any method to
present samples that may have been collected during context
switch processing, they must be discarded.


I thought it already did handle overlays, what did I miss here?


It does, see my reply to Maynard.  Not sure what I was thinking
when I wrote this, possibly I was thinking of the inaccuracies.




My proposal is to change what is presented to user space.  Instead
of trying to translate the SPU address to the backing file
as the samples are recorded, store the samples as the SPU
context and address.  The context switch would record tid,
pid, object id as it does now.   In addition, if this is a
new object-id, the kernel would read elf headers as it does
today.  However, it would then proceed to provide accurate
dcookie information for each loader region and overlay.


Doing the translation in two stages in user space, as you
suggest here, definitely makes sense to me. I think it
can be done a little simpler though:

Why would you need the accurate dcookie information to be
provided by the kernel? The ELF loader is done in user
space, and the kernel only reproduces what it thinks that
came up with. If the kernel only gives the dcookie information
about the SPU ELF binary to the oprofile user space, then
that can easily recreate the same mapping.


Actually, I was trying to cover issues such as anonymous
memory.   If the kernel doesn't generate dcookies for
the load segments the information is not recoverable once
the task exits.  This would also allow the loader to create
an artificial elf header that covered both the base executable
and a dynamically linked one.

Other alternatives exist, such as a structure for context-id
that would have its own identifying magic and an array of elf
header pointers.




The kernel still needs to provide the overlay identifiers
though.


Yes, which means it needs to parse the header (or libspe
be enhanced to write the monitor info to a spufs file).




To identify which overlays are active (instead of the present
"read on use and search the list to translate" approach), the
kernel would record the location of the overlay identifiers
as it parsed the kernel, but would then read the identification
word and would record the present value as a sample from
a separate but related stream.   The kernel could maintain
the last value for each overlay and only send profile events
for the deltas.


right.


This approach trades translation lookup overhead for each
recorded sample for a burst of data on new context activation.
In addition it exposes the sample point of the overlay identifier
vs the address collection.  This allows the ambiguity to be
exposed to user space.   In addition, with the above proposed
kernel timer vs sample collection, user space could limit the
elapsed time between the address collection and the overlay
id check.


yes, this sounds nice. But it does not at all help accuracy,
only performance, right?


It allows the user space to know when the sample was taken
and  be aware of the ambiguity.   If the sample rate is
high enough in relation to the overlay switch rate, user space
could decide to discard the ambiguous samples.




This approach allows multiple objects by its nature.  A new
elf header could be constructed in memory that contained
the union of the elf objects' load segments, and the tools
will magically work.   Alternatively the object id could
point to a new structure, identified via a new header, that
points to other elf headers (easily differentiated by the
elf magic headers).   Other binary formats, including several
objects in an ar archive, could be supported.


Yes, that would be a new feature if the kernel passed dcookie
information for every section, but I doubt that it is worth
it. I have not seen any program that allows loading code
from more than one ELF file. In particular, the ELF format
on the SPU is currently lacking the relocation mechanisms
that you would need for resolving spu-side symbols at load
time


Actually, it could check all load segments, and only report
those where the dcookie changes (as opposed to the offset).




If better overlay identification is required, in theory the
overlay switch code could be augmented to record the switches
(DMA the reference time from PowerPC memory and record a
relative decrementer in the SPU); this is obviously a future
item.  But it is facilitated by having user space resolve the
SPU to source file translation.


This seems to incur a run-time overhead on the SPU even if not
profiling, I would consider that not acceptable.


It definitely is overhead.  Which means it would have to be
optional, like gprof.


milton


Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Milton Miller


On Feb 8, 2007, at 5:59 PM, Maynard Johnson wrote:


Milton,
Thank you for your comments.  Carl will reply to certain parts of your 
posting where he's more knowledgeable than I.  See my replies below.




Thanks for the pleasant tone and dialog.


Milton Miller wrote:

On Feb 6, 2007, at 5:02 PM, Carl Love wrote:

This is the first update to the patch previously posted by Maynard
Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL".





Data collected


The current patch starts tackling these translation issues for the
presently common case of a static self contained binary from a single
file, either single separate source file or embedded in the data of
the host application.   When creating the trace entry for a SPU
context switch, it records the application owner, pid, tid, and
dcookie of the main executable.   In addition, it looks up the
object-id as a virtual address and records the offset if it is non-zero,
or the dcookie of the object if it is zero.   The code then creates
a data structure by reading the elf headers from the user process
(at the address given by the object-id) and building a list of
SPU address to elf object offsets, as specified by the ELF loader
headers.   In addition to the elf loader section, it processes the
overlay headers and records the address, size, and magic number of
the overlay.

When the hardware trace entries are processed, each address is
looked up this structure and translated to the elf offset.  If
it is an overlay region, the overlay identify word is read and
the list is searched for the matching overlay.  The resulting
offset is sent to the oprofile system.

The current patch specifically identifies that only single
elf objects are handled.  There is no code to handle dynamic
linked libraries or overlays.   Nor is there any method to


Yes, we do handle overlays.  (Note: I'm looking into a bug
right now in our overlay support.)


I knew you handled overlays, and I did not mean to say that
you did not.   I am not sure how that got there.  I may have
been thinking of the kernel supplied context switch code
discussion, or how the code supplied dcookie or offset but
not both.  Actually, I might have been referring to the
fact that overlays are guessed rather than recorded.


present samples that may have been collected during context
switch processing, they must be discarded.


My proposal is to change what is presented to user space.  Instead
of trying to translate the SPU address to the backing file
as the samples are recorded, store the samples as the SPU
context and address.  The context switch would record tid,
pid, object id as it does now.   In addition, if this is a
new object-id, the kernel would read elf headers as it does
today.  However, it would then proceed to provide accurate
dcookie information for each loader region and overlay.  To
identify which overlays are active, (instead of the present
read on use and search the list to translate approach) the
kernel would record the location of the overlay identifiers
as it parsed the kernel, but would then read the identification
word and would record the present value as a sample from
a separate but related stream.   The kernel could maintain
the last value for each overlay and only send profile events
for the deltas.


Discussions on this topic in the past have resulted in the
current implementation precisely because we're able to record
the samples as file offsets, just as the userspace tools expect.


I was not part of the previous discussions, so please forgive me.


I haven't had time to check out how much this would impact the
userspace tools, but my gut feel is that it would be quite
significant.  If we were developing this module with a matching
newly-created userspace tool, I would be more inclined to agree
that this makes sense.


I have not yet studied the user space tool.   In fact, when I
made this proposal, I had not studied the kernel oprofile code
either, although I had read the concepts and discussion of the
event buffer when the base patch was added to the kernel.

I have read and now consider myself to have some understanding
of the kernel side.  I note that the user space tool calls itself
alpha and the kernel support experimental.   I only looked at
the user space enough to determine it is written in C++.

I would hope the tool would be modular enough to insert a data
transformation pass to do this conversion that the kernel is
doing.


But you give no rationale for your
proposal that justifies the change.  The current implementation
works, it has no impact on normal, non-profiling behavior,  and
the overhead during profiling is not noticeable.


I was proposing this for several of reasons.

One, there were expressed limitations  in the current
proposal.  There is a requirement that everything be linked
into one ELF object for it to be profiled seemed significant
to me.  This implies that shared libraries (well, dynamic
linked ones) have no pa

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Milton Miller


On Feb 8, 2007, at 5:59 PM, Maynard Johnson wrote:


Milton,
Thank you for your comments.  Carl will reply to certain parts of your 
posting where he's more knowledgeable than I.  See my replies below.




Thanks for the pleasant tone and dialog.


Milton Miller wrote:

On Feb 6, 2007, at 5:02 PM, Carl Love wrote:

This is the first update to the patch previously posted by Maynard
Johnson as PATCH 4/4. Add support to OProfile for profiling CELL.





Data collected


The current patch starts tackling these translation issues for the
presently common case of a static self contained binary from a single
file, either single separate source file or embedded in the data of
the host application.   When creating the trace entry for a SPU
context switch, it records the application owner, pid, tid, and
dcookie of the main executable.   In addition, it looks up the
object-id as a virtual address and records the offset if it is
non-zero, or the dcookie of the object if it is zero.   The code then creates
a data structure by reading the elf headers from the user process
(at the address given by the object-id) and building a list of
SPU address to elf object offsets, as specified by the ELF loader
headers.   In addition to the elf loader section, it processes the
overlay headers and records the address, size, and magic number of
the overlay.

When the hardware trace entries are processed, each address is
looked up in this structure and translated to the elf offset.  If
it is an overlay region, the overlay identifier word is read and
the list is searched for the matching overlay.  The resulting
offset is sent to the oprofile system.
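
As a rough illustration of the kind of per-context map being described
here -- all names and field layouts below are invented for the example,
not taken from the actual patch:

struct spu_vma_map {
	unsigned int vma;	/* start address in SPU local store */
	unsigned int size;	/* length of the region */
	unsigned int offset;	/* matching offset in the ELF file */
	unsigned int guard_ptr;	/* LS address of the overlay identifier word, 0 if none */
	unsigned int guard_val;	/* value meaning "this overlay is resident" */
	struct spu_vma_map *next;
};

/* Translate a sampled SPU program counter to an ELF file offset. */
static unsigned int
spu_addr_to_elf_offset(const struct spu_vma_map *map, unsigned int spu_pc,
		       const unsigned char *local_store)
{
	for (; map; map = map->next) {
		if (spu_pc < map->vma || spu_pc >= map->vma + map->size)
			continue;
		/* overlay region: check the identifier word in local store */
		if (map->guard_ptr &&
		    *(const unsigned int *)(local_store + map->guard_ptr)
				!= map->guard_val)
			continue;
		return map->offset + (spu_pc - map->vma);
	}
	return 0;	/* not found; the sample cannot be attributed */
}

The real code differs in detail; this is only meant to make the
two-step lookup (find the region, then check the overlay identity)
concrete.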

The current patch specifically identifies that only single
elf objects are handled.  There is no code to handle dynamically
linked libraries or overlays.   Nor is there any method to


Yes, we do handle overlays.  (Note: I'm looking into a bug
right now in our overlay support.)


I knew you handled overlays, and I did not mean to say that
you did not.   I am not sure how that got there.  I may have
been thinking of the kernel supplied context switch code
discussion, or how the code supplied dcookie or offset but
not both.  Actually, I might have been referring to the
fact that overlays are guessed rather than recorded.


present samples that may have been collected during context
switch processing; they must be discarded.


My proposal is to change what is presented to user space.  Instead
of trying to translate the SPU address to the backing file
as the samples are recorded, store the samples as the SPU
context and address.  The context switch would record tid,
pid, object id as it does now.   In addition, if this is a
new object-id, the kernel would read elf headers as it does
today.  However, it would then proceed to provide accurate
dcookie information for each loader region and overlay.  To
identify which overlays are active, (instead of the present
read on use and search the list to translate approach) the
kernel would record the location of the overlay identifiers
as it parsed the headers, but would then read the identification
word and would record the present value as a sample from
a separate but related stream.   The kernel could maintain
the last value for each overlay and only send profile events
for the deltas.
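
To make the proposal concrete, the records handed to user space might
look something like the sketch below.  These structures are purely
hypothetical -- nothing with these names exists in the patch or in
oprofile; they only show the idea of deferring the SPU-address-to-file
translation to user space.

struct spu_region_desc {
	unsigned long long dcookie;	/* backing file of this load segment */
	unsigned int file_offset;	/* where the segment starts in that file */
	unsigned int spu_vma;		/* where it sits in SPU local store */
	unsigned int size;
	unsigned int guard_ptr;		/* LS address of the identifier word, 0 if not an overlay */
};

/* Emitted once when an SPU context loads a new object-id. */
struct spu_ctx_switch_rec {
	unsigned int spu_num;
	unsigned int pid, tid;
	unsigned long long app_dcookie;		/* main PowerPC executable */
	unsigned int nr_regions;
	struct spu_region_desc region[];	/* one per load segment / overlay */
};

/* Emitted only when an overlay identifier word changes value (a delta). */
struct spu_overlay_delta_rec {
	unsigned int spu_num;
	unsigned int guard_ptr;		/* which overlay region */
	unsigned int guard_val;		/* newly observed identifier word */
};

/* Samples stay untranslated: just the context and the SPU address. */
struct spu_sample_rec {
	unsigned int spu_num;
	unsigned int spu_pc;
};

User space would replay the context-switch and overlay-delta records to
rebuild the map the kernel builds today, and apply it to the raw
samples after the fact.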


Discussions on this topic in the past have resulted in the
current implementation precisely because we're able to record
the samples as fileoffsets, just as the userspace tools expect.


I was not part of the previous discussions, so please forgive me.


I haven't had time to check out how much this would impact the
userspace tools, but my gut feel is that it would be quite
significant.  If we were developing this module with a matching
newly-created userspace tool, I would be more inclined to agree
that this makes sense.


I have not yet studied the user space tool.   In fact, when I
made this proposal, I had not studied the kernel oprofile code
either, although I had read the concepts and discussion of the
event buffer when the base patch was added to the kernel.

I have read and now consider myself to have some understanding
of the kernel side.  I note that the user space tool calls itself
alpha and the kernel support experimental.   I only looked at
the user space enough to determine it is written in C++.

I would hope the tool would be modular enough to insert a data
transformation pass to do this conversion that the kernel is
doing.


But you give no rationale for your
proposal that justifies the change.  The current implementation
works; it has no impact on normal, non-profiling behavior, and
the overhead during profiling is not noticeable.


I was proposing this for several reasons.

One, there were expressed limitations in the current
proposal.  The requirement that everything be linked
into one ELF object for it to be profiled seemed significant
to me.  This implies that shared libraries (well, dynamically
linked ones) have no path forward

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Milton Miller


On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote:


On Thursday 08 February 2007 15:18, Milton Miller wrote:


The current patch specifically identifies that only single
elf objects are handled.  There is no code to handle dynamically
linked libraries or overlays.   Nor is there any method to
present samples that may have been collected during context
switch processing; they must be discarded.


I thought it already did handle overlays, what did I miss here?


It does; see my reply to Maynard.  Not sure what I was thinking
when I wrote this; possibly I was thinking of the inaccuracies.




My proposal is to change what is presented to user space.  Instead
of trying to translate the SPU address to the backing file
as the samples are recorded, store the samples as the SPU
context and address.  The context switch would record tid,
pid, object id as it does now.   In addition, if this is a
new object-id, the kernel would read elf headers as it does
today.  However, it would then proceed to provide accurate
dcookie information for each loader region and overlay.


Doing the translation in two stages in user space, as you
suggest here, definitely makes sense to me. I think it
can be done a little simpler though:

Why would you need the accurate dcookie information to be
provided by the kernel? The ELF loader is done in user
space, and the kernel only reproduces what it thinks the loader
came up with. If the kernel only gives the dcookie information
about the SPU ELF binary to the oprofile user space, then
that can easily recreate the same mapping.


Actually, I was trying to cover issues such as anonymous
memory.   If the kernel doesn't generate dcookies for
the load segments, the information is not recoverable once
the task exits.  This would also allow the loader to create
an artificial elf header that covered both the base executable
and a dynamically linked one.

Other alternatives exist, such as a structure for context-id
that would have its own identifying magic and an array of elf
header pointers.
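
Something as small as the following would do -- again purely
illustrative; the magic value and field names are made up:

#define SPU_IMAGE_SET_MAGIC	0x53505543	/* arbitrary value for the example */

struct spu_image_set {
	unsigned int magic;	/* distinguishes this from a bare ELF header */
	unsigned int nr_images;
	const void *elf_hdr[4];	/* user addresses of the individual ELF images */
};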




The kernel still needs to provide the overlay identifiers
though.


Yes, which means it needs to parse the header (or libspe could
be enhanced to write the monitor info to a spufs file).




To identify which overlays are active, (instead of the present
read on use and search the list to translate approach) the
kernel would record the location of the overlay identifiers
as it parsed the headers, but would then read the identification
word and would record the present value as a sample from
a separate but related stream.   The kernel could maintain
the last value for each overlay and only send profile events
for the deltas.


right.


This approach trades translation lookup overhead for each
recorded sample for a burst of data on new context activation.
In addition it exposes the sample point of the overlay identifier
vs the address collection.  This allows the ambiguity to be
exposed to user space.   In addition, with the above proposed
kernel timer vs sample collection, user space could limit the
elapsed time between the address collection and the overlay
id check.


yes, this sounds nice. But it does not at all help accuracy,
only performance, right?


It allows the user space to know when the sample was taken
and  be aware of the ambiguity.   If the sample rate is
high enough in relation to the overlay switch rate, user space
could decide to discard the ambiguous samples.




This approach allows multiple objects by its nature.  A new
elf header could be constructed in memory that contained
the union of the elf objects load segments, and the tools
will magically work.   Alternatively the object id could
point to a new structure, identified via a new header, that
it points to other elf headers (easily differentiated by the
elf magic headers).   Other binary formats, including several
objects in an ar archive, could be supported.


Yes, that would be a new feature if the kernel passed dcookie
information for every section, but I doubt that it is worth
it. I have not seen any program that allows loading code
from more than one ELF file. In particular, the ELF format
on the SPU is currently lacking the relocation mechanisms
that you would need for resolving spu-side symbols at load
time.


Actually, it could check all load segments, and only report
those where the dcookie changes (as opposed to the offset).


If better overlay identification is required, in theory the
overlay switch code could be augmented to record the switches
(DMA reference time from the PowerPC memory and record a
relative decrementer in the SPU), this is obviously a future
item.  But it is facilitated by having user space resolve the
SPU to source file translation.


This seems to incur a run-time overhead on the SPU even if not
profiling; I would consider that not acceptable.


It definitely is overhead.  Which means it would have to be
optional, like gprof.


milton


Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-09 Thread Milton Miller


On Feb 9, 2007, at 1:10 PM, Arnd Bergmann wrote:


On Friday 09 February 2007 19:47, Milton Miller wrote:

On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote:




Doing the translation in two stages in user space, as you
suggest here, definitely makes sense to me. I think it
can be done a little simpler though:

Why would you need the accurate dcookie information to be
provided by the kernel? The ELF loader is done in user
space, and the kernel only reproduces what it thinks the loader
came up with. If the kernel only gives the dcookie information
about the SPU ELF binary to the oprofile user space, then
that can easily recreate the same mapping.


Actually, I was trying to cover issues such as anonymous
memory.   If the kernel doesn't generate dcookies for
the load segments, the information is not recoverable once
the task exits.  This would also allow the loader to create
an artificial elf header that covered both the base executable
and a dynamically linked one.

Other alternatives exist, such as a structure for context-id
that would have its own identifying magic and an array of elf
header pointers.


But _why_ do you want to solve that problem? we don't have
dynamically linked binaries and I really don't see why the loader
would want to create artificial elf headers...


I'm explaining how they could be handled.   Actually I think
the other proposal (an identified structure that points to
several elf headers) would be more appropriate.  As you point
out, if there are presently no dynamic libraries in use, it
doesn't have to be solved today.  I'm just trying to make
the code future proof, or at least leave a clear path forward.




The kernel still needs to provide the overlay identifiers
though.


Yes, which means it needs to parse the header (or libspe could
be enhanced to write the monitor info to a spufs file).


we thought about this in the past and discarded it because of
the complexity of an spufs interface that would handle this
correctly.


Not sure what would be difficult, and it would allow other
binary formats.   But parsing the headers in the kernel
means existing userspace doesn't have to be upgraded, so I
am not proposing this requirement.




yes, this sounds nice. But it does not at all help accuracy,
only performance, right?


It allows the user space to know when the sample was taken
and  be aware of the ambiguity.   If the sample rate is
high enough in relation to the overlay switch rate, user space
could decide to discard the ambiguous samples.


yes, good point.


This approach allows multiple objects by its nature.  A new
elf header could be constructed in memory that contained
the union of the elf objects load segments, and the tools
will magically work.   Alternatively the object id could
point to a new structure, identified via a new header, that
it points to other elf headers (easily differentiated by the
elf magic headers).   Other binary formats, including several
objects in an ar archive, could be supported.


Yes, that would be a new feature if the kernel passed dcookie
information for every section, but I doubt that it is worth
it. I have not seen any program that allows loading code
from more than one ELF file. In particular, the ELF format
on the SPU is currently lacking the relocation mechanisms
that you would need for resolving spu-side symbols at load
time


Actually, it could check all load segments, and only report
those where the dcookie changes (as opposed to the offset).


I'm not really following you here, but probably you misunderstood
my point as well.


I was thinking in terms of dynamic libraries, and totally skipped
your comment about the relocation info being missing.   My reply
point was that the table could be compressed to the current
entry if all hashed to the same vm area.




This seems to incur a run-time overhead on the SPU even if not
profiling; I would consider that not acceptable.


It definitely is overhead.  Which means it would have to be
optional, like gprof.


There is some work going on for another profiler independent
of the hardware feature that only relies on instrumenting the
spu executable for things like DMA transfers and overlay
changes.


Regardless, it's beyond the current scope.

milton



Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Milton Miller


On Feb 8, 2007, at 4:51 PM, Carl Love wrote:


On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote:

On Thursday 08 February 2007 15:18, Milton Miller wrote:


1) sample rate setup

In the current patch, the user specifies a sample rate as a time
interval.  The kernel is (a) calling cpufreq to get the current cpu
frequency, (b) converting the rate to a cycle count, (c) converting
this to a 24 bit LFSR count, an iterative algorithm (in this patch,
starting from one of 256 values so a max of 2^16 or 64k iterations),
and (d) calculating a trace unload interval.   In addition, a cpufreq
notifier is registered to recalculate on frequency changes.


No.  The user issues the command opcontrol --event:N  where N is the
number of events (cycles, l2 cache misses, instructions retired etc)
that are to elapse between collecting the samples.


So you are saying that b and c are primary, and a is used to calculate
a safe value for d.   All of the above work is done, just from a
different starting point?



OProfile passes the value N to the kernel via the variable
ctr[i].count, where i is the performance counter entry for that event.


Ok, I haven't looked at the API closely.


Specifically with SPU
profiling, we do not use performance counters because the CELL HW does
not allow the PPU to read the SPU PC when a performance
counter interrupt occurs.  We are using some additional hw support in
the chip that allows us to periodically capture the SPU PC.  There is an
LFSR hardware counter that can be started at an arbitrary LFSR value.
When the last LFSR value in the sequence is reached, a sample is taken
and stored in the trace buffer.  Hence, the value of N specified by the
user must get converted to the LFSR value that is N from the end of the
sequence.


Ok, so it's arbitrary load-count-to-max vs. count-and-compare.   A critical
detail when computing the value to load, but the net result is the
same; the value for the count is hard to determine.
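
For what it's worth, the conversion itself is cheap to do in user
space.  A sketch, with placeholder feedback taps -- the real polynomial
is whatever the trace hardware implements and would have to come from
its documentation:

#include <stdint.h>

#define LFSR_BITS	24
#define LFSR_PERIOD	((1u << LFSR_BITS) - 1)	/* assumes a maximal-length sequence */

static uint32_t lfsr_step(uint32_t v)
{
	/* placeholder taps; substitute the hardware's actual polynomial */
	uint32_t bit = ((v >> 0) ^ (v >> 1) ^ (v >> 3) ^ (v >> 5)) & 1;

	return ((v >> 1) | (bit << (LFSR_BITS - 1))) & LFSR_PERIOD;
}

/*
 * Value to load so that n more increments reach the last value of the
 * sequence.  Walking the full sequence is fine in user space; the patch
 * bounds the work in the kernel by starting from one of 256 precomputed
 * values.
 */
static uint32_t lfsr_value_n_from_end(uint32_t n)
{
	uint32_t v = 1;		/* conventional non-zero seed */
	uint32_t steps = LFSR_PERIOD - 1 - (n % LFSR_PERIOD);

	while (steps--)
		v = lfsr_step(v);
	return v;
}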


The same clock that the processor is running at is used to
control the LFSR count.  Hence the LFSR counter increments once per CPU
clock cycle regardless of the CPU frequency or changes in the 
frequency.

There is no calculation for the LFSR value that is a function of the
processor frequency.  There is no need to adjust the LFSR when the
processor frequency changes.




Oh, so the lfsr doesn't have to be recomputed, only the time
between unloads.



Milton had a comment about the code

 if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {

+ spu_cycle_reset = ctr[0].count;
+ return;
+ }


Well, given the above description, it is clear that if you are doing SPU
event profiling, the value N is put into the ctr[0].count entry since
there is only one event.  Thus in cell_global_start_spu() you use
spu_cycle_reset as the argument to the lfsr calculation routine to get
the LFSR value that is N from the end of the sequence.


I was looking at the patch and the context was not very good.   You
might consider adding -p to your diff command; it provides the function
name after the @@.

However, in this case, I think I just need to see the final result.





The obvious problem is step (c): running a loop potentially 64
thousand times in kernel space will have a noticeable impact on other
threads.

I propose instead that user space perform the above 4 steps, and
provide the kernel with two inputs: (1) the value to load in the LFSR
and (2) the periodic frequency / time interval at which to empty the
hardware trace buffer, perform sample analysis, and send the data to
the oprofile subsystem.

There should be no security issues with this approach.   If the LFSR
value is calculated incorrectly, either it will be too short, causing
the trace array to overfill and data to be dropped, or it will be too
long, and there will be fewer samples.   Likewise, the kernel periodic
poll can be too long, again causing overflow, or too frequent, causing
only timer execution overhead.

Various data is collected by the kernel while processing the periodic
timer; this approach would also allow the profiling tools to control
the frequency of this collection.   More frequent collection results
in more accurate sample data, with the linear cost of poll execution
overhead.

Frequency changes can be handled either by the profile code setting
collection at a higher than necessary rate, or by interacting with the
governor to limit the speeds.

Optionally, the kernel can add a record indicating that some data was
likely dropped if it is able to read all 256 entries without
underflowing the array.  This can be used as a hint to user space that
the kernel time was too long for the collection rate.
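
As a sketch of the user-space side, assuming the 256-entry trace array
mentioned above and leaving the LFSR value to a routine like the one
sketched earlier (everything here is illustrative, not an existing
interface):

#include <stdint.h>

#define TRACE_BUFFER_ENTRIES	256	/* depth of the hardware trace array */

/* Nanoseconds between kernel polls of the trace buffer. */
static uint64_t poll_interval_ns(uint64_t cpu_hz, uint64_t cycles_per_sample)
{
	/* time for the hardware to fill every entry (assumes the 64-bit
	 * intermediate does not overflow for realistic sample periods) */
	uint64_t fill_ns = cycles_per_sample * TRACE_BUFFER_ENTRIES *
			   1000000000ull / cpu_hz;

	/* empty it at twice that rate so entries are not overwritten */
	return fill_ns / 2;
}

The two numbers the kernel would be handed are then just the LFSR load
value and this interval; getting either one wrong costs samples, not
safety, as argued above.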


Moving the sample rate computation to user space sounds like the right
idea, but why not have a more drastic version of it:


No, I do

Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch

2007-02-08 Thread Milton Miller
of
the overlay.

When the hardware trace entries are processed, each address is
looked up in this structure and translated to the elf offset.  If
it is an overlay region, the overlay identifier word is read and
the list is searched for the matching overlay.  The resulting
offset is sent to the oprofile system.

The current patch specifically identifies that only single
elf objects are handled.  There is no code to handle dynamically
linked libraries or overlays.   Nor is there any method to
present samples that may have been collected during context
switch processing; they must be discarded.


My proposal is to change what is presented to user space.  Instead
of trying to translate the SPU address to the backing file
as the samples are recorded, store the samples as the SPU
context and address.  The context switch would record tid,
pid, object id as it does now.   In addition, if this is a
new object-id, the kernel would read elf headers as it does
today.  However, it would then proceed to provide accurate
dcookie information for each loader region and overlay.  To
identify which overlays are active, (instead of the present
read on use and search the list to translate approach) the
kernel would record the location of the overlay identifiers
as it parsed the headers, but would then read the identification
word and would record the present value as a sample from
a separate but related stream.   The kernel could maintain
the last value for each overlay and only send profile events
for the deltas.

This approach trades translation lookup overhead for each
recorded sample for a burst of data on new context activation.
In addition it exposes the sample point of the overlay identifier
vs the address collection.  This allows the ambiguity to be
exposed to user space.   In addition, with the above proposed
kernel timer vs sample collection, user space could limit the
elapsed time between the address collection and the overlay
id check.

This approach allows multiple objects by its nature.  A new
elf header could be constructed in memory that contained
the union of the elf objects load segments, and the tools
will magically work.   Alternatively the object id could
point to a new structure, identified via a new header, that
it points to other elf headers (easily differentiated by the
elf magic headers).   Other binary formats, including several
objects in an ar archive, could be supported.

If better overlay identification is required, in theory the
overlay switch code could be augmented to record the switches
(DMA reference time from the PowerPC memory and record a
relative decrementer in the SPU), this is obviously a future
item.  But it is facilitated by having user space resolve the
SPU to source file translation.

milton
--
[EMAIL PROTECTED]   Milton Miller
Speaking for myself only.



Re: [PATCH] ppc64: iSeries early printk breakage

2005-09-07 Thread Milton Miller

Does anything actually break without this patch?

My reading of unregister_console says we will acquire
the console semaphore, walk the list, fail to find the
console, release the semaphore, and return.
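
For reference, a simplified paraphrase of that flow (not the actual
kernel source; the list handling is reduced to its essentials):

struct console_sketch {
	struct console_sketch *next;
	/* ... */
};

/* stand-ins for acquire_console_sem() / release_console_sem() */
static void lock_consoles(void)   { }
static void unlock_consoles(void) { }

static int unregister_console_sketch(struct console_sketch **head,
				     struct console_sketch *con)
{
	struct console_sketch **cur;
	int found = 0;

	lock_consoles();
	for (cur = head; *cur; cur = &(*cur)->next) {
		if (*cur == con) {
			*cur = con->next;	/* unlink */
			found = 1;
			break;
		}
	}
	unlock_consoles();
	/* never-registered console: nothing found, nothing touched */
	return found ? 0 : 1;
}

So the only real question is whether the preferred-console /
CON_CONSDEV handling quoted below misbehaves when the console was never
in the list.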


Hmm... unless there is a problem with the console preference
code?  I don't see anything that should deref a bad pointer,
maybe it breaks the preference?

From gitweb:

	if (console_drivers == NULL)
		preferred_console = selected_console;
	else if (console->flags & CON_CONSDEV)
		console_drivers->flags |= CON_CONSDEV;

On Sep 7, 2005, at 4:52 AM, Stephen Rothwell wrote:


The earlier commit 8d9273918635f0301368c01b56c03a6f339e8d51
(Consolidate early console and PPCDBG code) broke iSeries because
it caused unregister_console(&udbg_console) to be called
unconditionally.  iSeries never registers the udbg_console.

This just reverts part of the change.

Signed-off-by: Stephen Rothwell <[EMAIL PROTECTED]>
---

 arch/ppc64/kernel/udbg.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

--
Cheers,
Stephen Rothwell[EMAIL PROTECTED]
http://www.canb.auug.org.au/~sfr/

234f5032f6ccb4d72e4b74d33af55716b67d8a27
diff --git a/arch/ppc64/kernel/udbg.c b/arch/ppc64/kernel/udbg.c
--- a/arch/ppc64/kernel/udbg.c
+++ b/arch/ppc64/kernel/udbg.c
@@ -158,14 +158,20 @@ static struct console udbg_console = {
.index  = -1,
 };

+static int early_console_initialized;
+
 void __init disable_early_printk(void)
 {
+   if (!early_console_initialized)
+   return;
	unregister_console(&udbg_console);
+   early_console_initialized = 0;
 }

 /* called by setup_system */
 void register_early_udbg_console(void)
 {
+   early_console_initialized = 1;
	register_console(&udbg_console);
 }




