RE: [PATCH v1 1/2] dt-bindings: mmc: sdhci-of-aspeed: Add power-gpio and power-switch-gpio
-"openbmc" wrote: - >To: Rob Herring >From: Steven Lee >Sent by: "openbmc" >Date: 04/12/2021 08:31PM >Cc: "open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS" >, Ulf Hansson , >Ryan Chen , "moderated list:ASPEED SD/MMC >DRIVER" , Andrew Jeffery >, "open list:ASPEED SD/MMC DRIVER" >, "moderated list:ASPEED SD/MMC DRIVER" >, Ryan Chen , >Adrian Hunter , open list >, Chin-Ting Kuo >, "moderated list:ARM/ASPEED MACHINE >SUPPORT" >Subject: [EXTERNAL] Re: [PATCH v1 1/2] dt-bindings: mmc: >sdhci-of-aspeed: Add power-gpio and power-switch-gpio > >The 04/10/2021 02:41, Rob Herring wrote: >> On Thu, Apr 08, 2021 at 09:52:17AM +0800, Steven Lee wrote: >> > AST2600-A2 EVB provides the reference design for enabling SD bus >power >> > and toggling SD bus signal voltage by GPIO pins. >> > Add the definition and example for power-gpio and >power-switch-gpio >> > properties. >> > >> > In the reference design, GPIOV0 of AST2600-A2 EVB is connected to >power >> > load switch that providing 3.3v to SD1 bus vdd. GPIOV1 is >connected to >> > a 1.8v and a 3.3v power load switch that providing signal voltage >to >> > SD1 bus. >> > If GPIOV0 is active high, SD1 bus is enabled. Otherwise, SD1 bus >is >> > disabled. >> > If GPIOV1 is active high, 3.3v power load switch is enabled, SD1 >signal >> > voltage is 3.3v. Otherwise, 1.8v power load switch will be >enabled, SD1 >> > signal voltage becomes 1.8v. >> > >> > AST2600-A2 EVB also support toggling signal voltage for SD2 bus. >> > The design is the same as SD1 bus. It uses GPIOV2 as power-gpio >and GPIOV3 >> > as power-switch-gpio. 
>> > >> > Signed-off-by: Steven Lee >> > --- >> > .../devicetree/bindings/mmc/aspeed,sdhci.yaml | 25 >+++ >> > 1 file changed, 25 insertions(+) >> > >> > diff --git >a/Documentation/devicetree/bindings/mmc/aspeed,sdhci.yaml >b/Documentation/devicetree/bindings/mmc/aspeed,sdhci.yaml >> > index 987b287f3bff..515a74614f3c 100644 >> > --- a/Documentation/devicetree/bindings/mmc/aspeed,sdhci.yaml >> > +++ b/Documentation/devicetree/bindings/mmc/aspeed,sdhci.yaml >> > @@ -37,6 +37,14 @@ properties: >> >clocks: >> > maxItems: 1 >> > description: The SD/SDIO controller clock gate >> > + power-gpio: >> >> '-gpios' is the preferred form even if just 1. >> > >Thanks for reviewing, I will change the name. Is this a clock gate or a power-on GPIO? > >> > +description: >> > + The GPIO for enabling/disabling SD bus power. >> > +maxItems: 1 >> >> blank line >> > >I will remove the blank line. > >> > + power-switch-gpio: >> > +description: >> > + The GPIO for toggling the signal voltage between 3.3v and >1.8v. Which way does it toggle for which voltage? Oh, you said in the change log but not in the binding. But please, use GPIO-controlled regulators as Ulf suggested and as is already used by other mmc controllers upstream. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/devicetree/bindings/regulator/gpio-regulator.yaml Ulf> Please do not model these as GPIO pins like this. Instead, it's better Ulf> to model them as gpio regulators, since the mmc core manages them as Ulf> regulators. Ulf> Ulf> We have a vmmc regulator (corresponding to vdd) and a vqmmc regulator Ulf> (corresponding to the signal-voltage level). These are also described in Ulf> the common mmc DT bindings, see Ulf> Documentation/devicetree/bindings/mmc/mmc-controller.yaml Ulf> .
milton >> > +maxItems: 1 >> > >> > patternProperties: >> >"^sdhci@[0-9a-f]+$": >> > @@ -61,6 +69,14 @@ patternProperties: >> >sdhci,auto-cmd12: >> > type: boolean >> > description: Specifies that controller should use auto >CMD12 >> > + power-gpio: >> > +description: >> > + The GPIO for enabling/disabling SD bus power. >> > +maxItems: 1 >> > + power-switch-gpio: >> > +description: >> > + The GPIO for toggling the signal voltage between 3.3v >and 1.8v. >> > +maxItems: 1 >> > required: >> >- compatible >> >- reg >> > @@ -80,6 +96,7 @@ required: >> > examples: >> >- | >> > #include >> > +#include >> > sdc@1e74 { >> > compatible = "aspeed,ast2500-sd-controller"; >> > reg = <0x1e74 0x100>; >> > @@ -94,6 +111,10 @@ examples: >> > interrupts = <26>; >> > sdhci,auto-cmd12; >> > clocks = < ASPEED_CLK_SDIO>; >> > +power-gpio = < ASPEED_GPIO(V, 0) >> > + GPIO_ACTIVE_HIGH>; >> > +power-switch-gpio = < ASPEED_GPIO(V, >1) >> > + GPIO_ACTIVE_HIGH>; >> > }; >> > >> > sdhci1: sdhci@200 { >> > @@ -102,5 +123,9 @@ examples: >> > interrupts = <26>; >> > sdhci,auto-cmd12; >> > clocks = <
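Following Ulf's suggestion, the two GPIOs would instead be described as regulators and consumed through the generic vmmc/vqmmc supplies. A rough sketch for the SD1 case described in the changelog (node names, labels, and the &gpio0 phandle are illustrative assumptions, not part of the posted patch):

```dts
/* Sketch only: GPIOV0 gates bus power, GPIOV1 selects 1.8v/3.3v signaling. */
vcc_sdhci1: regulator-vcc-sdhci1 {
	compatible = "regulator-fixed";
	regulator-name = "SDHCI1 Vcc";
	regulator-min-microvolt = <3300000>;
	regulator-max-microvolt = <3300000>;
	gpio = <&gpio0 ASPEED_GPIO(V, 0) GPIO_ACTIVE_HIGH>;
	enable-active-high;
};

vccq_sdhci1: regulator-vccq-sdhci1 {
	compatible = "regulator-gpio";
	regulator-name = "SDHCI1 VccQ";
	regulator-min-microvolt = <1800000>;
	regulator-max-microvolt = <3300000>;
	gpios = <&gpio0 ASPEED_GPIO(V, 1) GPIO_ACTIVE_HIGH>;
	states = <3300000 1>, <1800000 0>;
};

&sdhci1 {
	vmmc-supply = <&vcc_sdhci1>;
	vqmmc-supply = <&vccq_sdhci1>;
};
```

With this modeling the mmc core drives bus power and signal-voltage switching through the regulator framework, and no controller-specific power-gpio properties are needed in the binding.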
Re: [PATCH v4] hwmon:(adm1275) Enable adm1278 ADM1278_TEMP1_EN
On 06/19/2020 about 12:46PM in some timezone, Manikandan Elumalai wrote: >The adm1278 temp attribute need it for openbmc platform . >This feature not enabled by default, so PMON_CONFIG needs to enable >it. > >Signed-off-by: Manikandan Elumalai >--- >---v4 -Reported-by: kernel test robot The above tag should be adjacent to the Signed-off-by. >---v3 -fix invalid signed-off. >--- -removed checkpath warnings. >--- -write ADM1278_TEMP1_EN and ADM1278_VOUT_EN conf in single >line operation. >---v2 -add Signed-off-by. >--- -removed ADM1278_TEMP1_EN check. >--- The canonical format is to have a line of 3 dashes, then the trailing changelog. Please read the documentation at https://www.kernel.org/doc/html/latest/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes https://www.kernel.org/doc/html/latest/process/submitting-patches.html#the-canonical-patch-format milton
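For reference, the canonical layout those two documents describe would look roughly like this for the patch above (a hypothetical rewrite for illustration; the submitter's email is left as a placeholder):

```text
hwmon: (adm1278) enable temperature sampling

The adm1278 temperature attribute is needed on OpenBMC platforms.
It is not enabled by default, so set ADM1278_TEMP1_EN in PMON_CONFIG.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Manikandan Elumalai <email>
---
v4: place Reported-by adjacent to Signed-off-by
v3: fix invalid Signed-off-by; address checkpatch warnings;
    write ADM1278_TEMP1_EN and ADM1278_VOUT_EN config in a single operation
v2: add Signed-off-by; remove ADM1278_TEMP1_EN check
```

Tags sit together above the `---` line; the per-version changelog goes below it, where git am discards it.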
Re: [PATCH v2 2/8] [PATCH 2/8] Documentations: dt-bindings: Add a document of PECI adapter driver for Aspeed AST24xx/25xx SoCs
About 03/07/2018 04:12PM in some time zone, Pavel Machek wrote: >Subject: Re: [PATCH v2 2/8] [PATCH 2/8] Documentations: dt-bindings: >Add a document of PECI adapter driver for Aspeed AST24xx/25xx SoCs > >Hi! > >> >Are these SoCs x86-based? >> >> Yes, these are ARM SoCs. Please see Andrew's answer as well. > >Understood, thanks. > >> >>+ Read sampling point selection. The whole period of a bit time >will be >> >>+ divided into 16 time frames. This value will determine which >time frame >> >>+ this controller will sample PECI signal for data read back. >Usually in >> >>+ the middle of a bit time is the best. >> > >> >English? "This value will determine when this controller"? >> > >> >> Could I change it like below?: >> >> "This value will determine in which time frame this controller >samples PECI >> signal for data read back" > >I guess... I'm not native speaker, I guess this could be improved >some >more. > I agree this wording is still confusing. The problem is that the key subject, the time of the sampling, is in the descriptive clause "in which time frame". "This value will determine the time frame in which the controller will sample" or perhaps phrase it as saving a specific sample from the over-clock, or a phase of the clock. >Best regards, > Pavel > >-- >(english) http://www.livejournal.com/~pavelmachek >(cesky, pictures) >http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html > milton -- Speaking for myself not IBM.
Re: [PATCH v2 01/31] af9005: don't do DMA on stack
On Tue, 11 Oct 2016 07:09:16 -0300, Mauro wrote: > struct af9005_device_state { > > u8 sequence; > > int led_state; > > + unsigned char data[256]; > > + struct mutex data_mutex; > > }; This will not work on DMA incoherent architectures. When the data cache is invalidated at the end of the DMA, the stores to the mutex will be thrown away and the system will not be happy as changed to the mutex are lost. The data allocation needs to be a separate kmalloc, as is somewhat obtusely documented around line 226 of DMA-API.txt[1]. A separate kmalloc will be aligned to be in separate cache lines. Excerpt from DMA-API.txt: Warnings: Memory coherency operates at a granularity called the cache line width. In order for memory mapped by this API to operate correctly, the mapped region must begin exactly on a cache line boundary and end exactly on one (to prevent two separately mapped regions from sharing a single cache line). Since the cache line size may not be known at compile time, the API will not enforce this requirement. Therefore, it is recommended that driver writers who don't take special care to determine the cache line size at run time only map virtual regions that begin and end on page boundaries (which are guaranteed also to be cache line boundaries). [1]https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/DMA-API.txt#n226 milton PS: I normally read lkml from various web archives, sorry for loss of threading and cc.
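The fix described above, sketched as kernel-style C (not standalone code; the field names follow the quoted driver, and the init helper is hypothetical): keep the mutex in the state structure, but give the DMA buffer its own kmalloc allocation.

```c
/* Sketch (kernel code fragment, illustrative only). */
struct af9005_device_state {
	u8 sequence;
	int led_state;
	unsigned char *data;		/* separately allocated DMA buffer */
	struct mutex data_mutex;
};

static int af9005_alloc_state_buffer(struct af9005_device_state *st)
{
	/*
	 * kmalloc memory is aligned to ARCH_KMALLOC_MINALIGN, which on
	 * DMA-incoherent architectures is at least a cache line, so the
	 * buffer cannot share a cache line with the mutex or any other
	 * CPU-written field of the state structure.
	 */
	st->data = kmalloc(256, GFP_KERNEL);
	if (!st->data)
		return -ENOMEM;
	mutex_init(&st->data_mutex);
	return 0;
}
```

The pointer indirection is the point: cache invalidation at DMA completion then only affects the buffer's own lines, never the mutex.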
Re: [RFC] netconsole.txt: "nc" needs "-p" to specify the listening port
[adding Rob as Doc maintanier] On Sat, 28 Jul 2012 about 11:08:16 -, Dirk Gouders wrote: > Borislav Petkov writes: > > > On Fri, Jul 27, 2012 at 11:24:53AM +0200, Dirk Gouders wrote: > >> Cong Wang writes: > >> > >> > On Fri, Jul 27, 2012 at 2:35 PM, Dirk Gouders > >> > wrote: > >> >> Hi Jesse, > >> >> > >> >> I would like to ask you to check if the documentation of "nc" in > >> >> netconsole.txt is still correct. I tried two different netcat packages > >> >> and both require "-p" to specify the listening port. I am wondering if > >> >> that changed after the use of "nc" has been documented. > >> > > >> > On Fedora 16, `nc -u -l ` works fine. > >> > >> Thanks for checking that. > >> > >> If the information I found is correct, Fedora uses OpenBSD's nc > >> codebase. The two netcat packages I tested on a Gentoo system differ in > >> requiring the -p switch for the port specification. > > > > So say exactly that in the doc: that the *BSD's version of nc doesn't > > need the port number specified with '-p' and you're covered. > OK, I tried that in the attached patch. > I'm not sure if every exeption needs to/should be documented, though. > > >From 3cdeac3e814471053129145c5fa8391acb365fd8 Mon Sep 17 00:00:00 2001 > From: Dirk Gouders > Date: Sat, 28 Jul 2012 12:32:49 +0200 > Subject: [PATCH] netconsole.txt: non-BSD versions of nc(1) require '-p' > switch > > Gentoo for example uses non-BSD versions of nc(1) which require > the '-p' switch to specify the listening port. > > --- > Documentation/networking/netconsole.txt |3 ++- > 1 files changed, 2 insertions(+), 1 deletions(-) > > diff --git a/Documentation/networking/netconsole.txt > b/Documentation/networking/netconsole.txt > index 8d02207..9a362f8 100644 > --- a/Documentation/networking/netconsole.txt > +++ b/Documentation/networking/netconsole.txt > @@ -52,7 +52,8 @@ initialized and attempts to bring up the supplied dev at > the supplied > address. 
> > The remote host can run either 'netcat -u -l -p <port>', So the above line shows usage with -p > -'nc -l -u <port>' or syslogd. > +'nc -l -u <port>' (BSD version of nc(1) e.g. Fedora), now you add a comment about BSD and say Fedora, which is not obviously BSD (this is Documentation; reading the git history for clarification is not appropriate). > +'nc -l -u -p <port>' or syslogd. And now you add the original -p, which you probably skipped over since it was on the previous line? > > Dynamic reconfiguration: > milton
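One way to address both objections at once is to key the documentation to the nc variant rather than to a distribution. A suggested wording (illustrative only, not the posted patch):

```text
The remote host can run either syslogd or a netcat listener:

   nc -u -l <port>        (OpenBSD netcat, e.g. the Fedora default)
   nc -u -l -p <port>     (GNU / traditional netcat, e.g. on Gentoo)

Both forms listen for UDP datagrams on <port>.
```

This keeps the -p and non--p forms adjacent and explains why both exist.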
Re: PS3: trouble with SPARSEMEM_VMEMMAP and kexec
On Dec 2, 2007, at 9:59 PM, Geoff Levand wrote: Hi. I'm finding that recently kexec'ed kernels on PS3 will panic on startup. It seems the trouble was introduced with the ppc64 SPARSEMEM_VMEMMAP support. The problem is the same when starting either new or old kernels: 2.6.24 -> 2.6.23 ok 2.6.24 -> 2.6.23 panic 2.6.24 -> 2.6.24 panic I'm not sure I completely follow this. What is the difference between 1 and 2 ? Also, you are talking about starting with kexec, but I don't see how that fits in the failure you have below. In other words, there may be more than one failure. But I can talk a bit about the scope of the problem in the current traceback. These are the commits that seem to introduce the problem: d29eff7bca60c9ee401d691d4562a4abca8de543 ppc64: SPARSEMEM_VMEMMAP suppor 8f6aac419bd590f535fb110875a51f7db2b62b5b Generic Virtual Memmap support for SPARSEMEM Below is a startup dump. Any help in finding the problem would be appreciated. -Geoff ps3_mm_add_memory:317: start_addr 74032000h, start_pfn 74032h, nr_pages 17000h <4>swapper: page allocation failure. 
order:12, mode:0x80d0 Call Trace: [c6047820] [c000e700] .show_stack+0x68/0x1b0 (unreliable) [c60478c0] [c0089eb4] .__alloc_pages+0x358/0x3ac [c60479b0] [c00a3964] .vmemmap_alloc_block+0x6c/0xf4 [c6047a40] [c0026544] .vmemmap_populate+0x74/0x100 [c6047ae0] [c00a385c] .sparse_mem_map_populate+0x38/0x5c [c6047b70] [c00a36e4] .sparse_add_one_section+0x64/0x128 [c6047c20] [c00aa74c] .__add_pages+0xac/0x18c [c6047cd0] [c0025fd4] .arch_add_memory+0x44/0x60 [c6047d60] [c00aa5b0] .add_memory+0xd4/0x124 [c6047e00] [c0452544] .ps3_mm_add_memory+0x8c/0x108 [c6047ea0] [c04417c4] .kernel_init+0x1f4/0x3b8 [c6047f90] [c0021d88] .kernel_thread+0x4c/0x68 Mem-info: DMA per-cpu: CPU0: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0 CPU1: Hot: hi: 42, btch: 7 usd: 0 Cold: hi: 14, btch: 3 usd: 0 Active:0 inactive:0 dirty:0 writeback:0 unstable:0 free:18094 slab:122 mapped:0 pagetables:0 bounce:0 DMA free:72376kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:129280kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 0 0 DMA: 8*4kB 5*8kB 5*16kB 7*32kB 3*64kB 5*128kB 4*256kB 3*512kB 5*1024kB 3*2048kB 4*4096kB 5*8192kB 0*16384kB = 72376kB Swap cache: add 0, delete 0, find 0/0, race 0+0 Free swap = 0kB Total swap = 0kB Free swap:0kB 32768 pages of RAM 10403 reserved pages 0 pages shared 0 pages swap cached The kernel is using 16MB pages for the linear mapping and, since its in the same region, the sparse virtural memmap. PS3 uses hotplug for all most all of its memory. In this case, its trying to allocate an additional page to cover a new region of the memory map. However, the initial 128 MB is fragmented, we have 8 8M chunks but no 16MB ones. 
<1>Unable to handle kernel paging request for data at address 0xcf0001960b10 <1>Faulting instruction address: 0xc0087340 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2 PS3 Modules linked in: NIP: c0087340 LR: c008733c CTR: REGS: c6047900 TRAP: 0300 Not tainted (2.6.24-rc3-ps3-linux-dev-g91428d55-dirty) MSR: 80008032 CR: 2200 XER: DAR: cf0001960b10, DSISR: 4200 TASK = c6041080[1] 'swapper' THREAD: c6044000 CPU: 1 <6>GPR00: c6047b80 c052b410 c6001b40 <6>GPR04: 0001 0003 0008 <6>GPR08: 0002 cf0001960b08 c6051240 0003 <6>GPR12: 0003 c0484080 100d 00bc5000 <6>GPR16: 07fff000 0001 100a 100d <6>GPR20: 100df628 100df458 100df678 <6>GPR24: 00740336 c0492c00 0001 <6>GPR28: 000740325000 000740324924 c04ce9a8 cf0001960ae0 NIP [c0087340] .memmap_init_zone+0xf0/0x134 LR [c008733c] .memmap_init_zone+0xec/0x134 Call Trace: [c6047b80] [c01da530] .add_memory_block+0xd8/0x108 (unreliable) [c6047c20] [c00aa7ac] .__add_pages+0x10c/0x18c [c6047cd0] [c0025fd4] .arch_add_memory+0x44/0x60 [c6047d60] [c00aa5b0] .add_memory+0xd4/0x124 [c6047e00] [c0452544] .ps3_mm_add_memory+0x8c/0x108 [c6047ea0] [c04417c4] .kernel_init+0x1f4/0x3b8 [c6047f90] [c0021d88] .kernel_thread+0x4c/0x68 Instruction dump: 901f000c 38000400 7d20f8a8 7d290378 7d20f9ad 40a2fff4 7ba00521 7fe3fb78 3882 41820008 4b0d 393f0028 f93f0028 3bbd0001 3bff0038 <0>Kernel panic - not
[PATCH] Fix sched_domain sysctl registration again
commit 029190c515f15f512ac85de8fc686d4dbd0ae731 (cpuset sched_load_balance flag) was not tested with SCHED_DEBUG enabled as committed: it dereferences NULL when used, and it reordered the sysctl registration so that it never shows any domains or their tunables.

Fixes:

1) Restore arch_init_sched_domains ordering. We can't walk the domains before we build them; presently we register cpus with empty directories (no domain directories or files).

2) Make unregister_sched_domain_sysctl do nothing when already unregistered. detach_destroy_domains is now called one set of cpus at a time, and unregister_sysctl dereferences NULL if called with a null. While the function would always dereference null if called twice, in the previous code it was always called once and then followed by a register. So only the hidden bug of the sysctl_root_table not being allocated, followed by an attempt to free it, would have shown the error.

3) Always call unregister and register in partition_sched_domains. The code is "smart" about unregistering only needed domains. Since we aren't guaranteed any calls to unregister, always unregister. Without calling register on the way out we will not have a table or any sysctl tree.

4) Warn if register is called without unregistering. The previous table memory is lost, leaving pointers to the later-freed memory in sysctl and leaking the memory of the tables.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>
---
[resend with From set] against e8b8c977734193adedf2b0f607d6252c78e86394

Before this patch on a 2-core 4-thread box compiled for SMT and NUMA, the domains appear empty (there are actually 3 levels per cpu).
And as soon as two domains a null pointer is dereferenced (unreliable in this case is stack garbage): bu19a:~# ls -R /proc/sys/kernel/sched_domain/ /proc/sys/kernel/sched_domain/: cpu0 cpu1 cpu2 cpu3 /proc/sys/kernel/sched_domain/cpu0: /proc/sys/kernel/sched_domain/cpu1: /proc/sys/kernel/sched_domain/cpu2: /proc/sys/kernel/sched_domain/cpu3: bu19a:~# mkdir /dev/cpuset bu19a:~# mount -tcpuset cpuset /dev/cpuset/ bu19a:~# cd /dev/cpuset/ bu19a:/dev/cpuset# echo 0 > sched_load_balance bu19a:/dev/cpuset# mkdir one bu19a:/dev/cpuset# echo 1 > one/cpus bu19a:/dev/cpuset# echo 0 > one/sched_load_balance Unable to handle kernel paging request for data at address 0x0018 Faulting instruction address: 0xc006b608 NIP: c006b608 LR: c006b604 CTR: REGS: c00018d973f0 TRAP: 0300 Not tainted (2.6.23-bml) MSR: 90009032 CR: 28242442 XER: DAR: 0018, DSISR: 4000 TASK = c0001912e340[1987] 'bash' THREAD: c00018d94000 CPU: 2 .. NIP [c006b608] .unregister_sysctl_table+0x38/0x110 LR [c006b604] .unregister_sysctl_table+0x34/0x110 Call Trace: [c00018d97670] [c7017270] 0xc7017270 (unreliable) [c00018d97720] [c0058710] .detach_destroy_domains+0x30/0xb0 [c00018d977b0] [c005cf1c] .partition_sched_domains+0x1bc/0x230 [c00018d97870] [c009fdc4] .rebuild_sched_domains+0xb4/0x4c0 [c00018d97970] [c00a02e8] .update_flag+0x118/0x170 [c00018d97a80] [c00a1768] .cpuset_common_file_write+0x568/0x820 [c00018d97c00] [c009d95c] .cgroup_file_write+0x7c/0x180 [c00018d97cf0] [c00e76b8] .vfs_write+0xe8/0x1b0 [c00018d97d90] [c00e810c] .sys_write+0x4c/0x90 [c00018d97e30] [c000852c] syscall_exit+0x0/0x40 diff --git a/kernel/sched.c b/kernel/sched.c index afe76ec..2af50ec 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5463,11 +5463,12 @@ static void register_sched_domain_sysctl(void) struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1); char buf[32]; + WARN_ON(sd_ctl_dir[0].child); + sd_ctl_dir[0].child = entry; + if (entry == NULL) return; - sd_ctl_dir[0].child = entry; - for_each_online_cpu(i) { 
 		snprintf(buf, 32, "cpu%d", i);
 		entry->procname = kstrdup(buf, GFP_KERNEL);
@@ -5475,14 +5476,19 @@ static void register_sched_domain_sysctl(void)
 		entry->child = sd_alloc_ctl_cpu_table(i);
 		entry++;
 	}
+
+	WARN_ON(sd_sysctl_header);
 	sd_sysctl_header = register_sysctl_table(sd_ctl_root);
 }

+/* may be called multiple times per register */
 static void unregister_sched_domain_sysctl(void)
 {
-	unregister_sysctl_table(sd_sysctl_header);
+	if (sd_sysctl_header)
+		unregister_sysctl_table(sd_sysctl_header);
 	sd_sysctl_header = NULL;
-	sd_free_ctl_entry(&sd_ctl_dir[0].child);
+	if (sd_ctl_dir[0].child)
+		sd_free_ctl_entry(&sd_ctl_dir[0].child);
 }
 #else
 static void register_sched
[PATCH] restore arch/ppc/boot cflags
Commit 9a39e273d4df0560c724c5fe71f6314a0583ca2b removed the boot directory addition to CFLAGS that was being used by the subdirectory builds. For the other files, that patch set EXTRA_CFLAGS, but Makefile.build explicitly sets that to empty as it is explicitly for a single directory only. Append to KBUILD_CFLAGS instead. Signed-off-by: Milton Miller <[EMAIL PROTECTED]> --- The commit also changed xtensia to export EXTRA_CFLAGS from its boot directory, that needs to be fixed too. from ARCH=ppc prep_defconfig: /data/home/miltonm/work.git/arch/ppc/boot/of1275/write.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/read.c:11:20:/data/home/miltonm/work.git/arch/ppc/boot/of1275/ofstdio.c:11:20: error: error: of1275.h: No such file or directoryof1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/release.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/ofinit.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/map.c:12:22: error: nonstdio.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/map.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/getprop.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/common/ns16550.c:14:20: error: serial.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/common/ns16550.c:13:22: error: nonstdio.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/enter.c:11:20: error: of1275.h: No such file or directory error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/claim.c:12:22: error: nonstdio.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/claim.c:11:20: error: of1275.h: No such file or directory 
/data/home/miltonm/work.git/arch/ppc/boot/of1275/finddevice.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/common/bootinfo.c:14:22: error: nonstdio.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/common/misc-common.c:18:22: error: nonstdio.h: No such file or directory of1275.h: No such file or directory

diff --git a/arch/ppc/boot/Makefile b/arch/ppc/boot/Makefile
index 487dc66..52a0d38 100644
--- a/arch/ppc/boot/Makefile
+++ b/arch/ppc/boot/Makefile
@@ -13,6 +13,7 @@
 # modified by Cort ([EMAIL PROTECTED])
 #
+KBUILD_CFLAGS += -fno-builtin -D__BOOTER__ -Iarch/$(ARCH)/boot/include
 HOSTCFLAGS += -Iarch/$(ARCH)/boot/include
 BOOT_TARGETS = zImage zImage.initrd znetboot znetboot.initrd
[PATCH] restore arch/ppc/boot cflags
Commit 9a39e273d4df0560c724c5fe71f6314a0583ca2b removed the boot directory addition to CFLAGS that was being used by the subdirectory builds. For the other files, that patch set EXTRA_CFLAGS, but Makefile.build explicitly sets that to empty as it is explicitly for a single directory only. Append to KBUILD_CFLAGS instead. Signed-off-by: Milton Miller [EMAIL PROTECTED] --- The commit also changed xtensia to export EXTRA_CFLAGS from its boot directory, that needs to be fixed too. from ARCH=ppc prep_defconfig: /data/home/miltonm/work.git/arch/ppc/boot/of1275/write.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/read.c:11:20:/data/home/miltonm/work.git/arch/ppc/boot/of1275/ofstdio.c:11:20: error: error: of1275.h: No such file or directoryof1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/release.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/ofinit.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/map.c:12:22: error: nonstdio.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/map.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/getprop.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/common/ns16550.c:14:20: error: serial.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/common/ns16550.c:13:22: error: nonstdio.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/enter.c:11:20: error: of1275.h: No such file or directory error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/claim.c:12:22: error: nonstdio.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/of1275/claim.c:11:20: error: of1275.h: No such file or directory 
/data/home/miltonm/work.git/arch/ppc/boot/of1275/finddevice.c:11:20: error: of1275.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/common/bootinfo.c:14:22: error: nonstdio.h: No such file or directory /data/home/miltonm/work.git/arch/ppc/boot/common/misc-common.c:18:22: error: nonstdio.h: No such file or directory of1275.h: No such file or directory diff --git a/arch/ppc/boot/Makefile b/arch/ppc/boot/Makefile index 487dc66..52a0d38 100644 --- a/arch/ppc/boot/Makefile +++ b/arch/ppc/boot/Makefile @@ -13,6 +13,7 @@ # modified by Cort ([EMAIL PROTECTED]) # +KBUILD_CFLAGS += -fno-builtin -D__BOOTER__ -Iarch/$(ARCH)/boot/include HOSTCFLAGS += -Iarch/$(ARCH)/boot/include BOOT_TARGETS = zImage zImage.initrd znetboot znetboot.initrd - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Fix sched_domain sysctl registration again
commit 029190c515f15f512ac85de8fc686d4dbd0ae731 (cpuset sched_load_balance flag) was not tested with SCHED_DEBUG enabled as committed: it dereferences NULL when used, and it reordered the sysctl registration so that it never shows any domains or their tunables. Fixes: 1) restore arch_init_sched_domains ordering; we can't walk the domains before we build them. Presently we register cpus with empty directories (no domain directories or files). 2) make unregister_sched_domain_sysctl do nothing when already unregistered. detach_destroy_domains is now called one set of cpus at a time, and unregister_sysctl_table dereferences NULL if called with a NULL pointer. While the function would always dereference NULL if called twice, in the previous code it was always called once and then followed by a register, so only the hidden bug of the sysctl_root_table not being allocated followed by an attempt to free it would have shown the error. 3) always call unregister and register in partition_sched_domains. The code is smart about unregistering only needed domains, but since we aren't guaranteed any calls to unregister, always unregister. Without calling register on the way out we will not have a table or any sysctl tree. 4) warn if register is called without unregistering. The previous table memory is lost, leaving pointers to the later-freed memory in sysctl and leaking the memory of the tables. Signed-off-by: Milton Miller <[EMAIL PROTECTED]> --- [resend with From set] against e8b8c977734193adedf2b0f607d6252c78e86394 Before this patch on a 2-core 4-thread box compiled for SMT and NUMA, the domains appear empty (there are actually 3 levels per cpu). 
And as soon as two domains are created, a null pointer is dereferenced (unreliable in this case is stack garbage): bu19a:~# ls -R /proc/sys/kernel/sched_domain/ /proc/sys/kernel/sched_domain/: cpu0 cpu1 cpu2 cpu3 /proc/sys/kernel/sched_domain/cpu0: /proc/sys/kernel/sched_domain/cpu1: /proc/sys/kernel/sched_domain/cpu2: /proc/sys/kernel/sched_domain/cpu3: bu19a:~# mkdir /dev/cpuset bu19a:~# mount -tcpuset cpuset /dev/cpuset/ bu19a:~# cd /dev/cpuset/ bu19a:/dev/cpuset# echo 0 > sched_load_balance bu19a:/dev/cpuset# mkdir one bu19a:/dev/cpuset# echo 1 > one/cpus bu19a:/dev/cpuset# echo 0 > one/sched_load_balance Unable to handle kernel paging request for data at address 0x0018 Faulting instruction address: 0xc006b608 NIP: c006b608 LR: c006b604 CTR: REGS: c00018d973f0 TRAP: 0300 Not tainted (2.6.23-bml) MSR: 90009032 <EE,ME,IR,DR> CR: 28242442 XER: DAR: 0018, DSISR: 4000 TASK = c0001912e340[1987] 'bash' THREAD: c00018d94000 CPU: 2 .. NIP [c006b608] .unregister_sysctl_table+0x38/0x110 LR [c006b604] .unregister_sysctl_table+0x34/0x110 Call Trace: [c00018d97670] [c7017270] 0xc7017270 (unreliable) [c00018d97720] [c0058710] .detach_destroy_domains+0x30/0xb0 [c00018d977b0] [c005cf1c] .partition_sched_domains+0x1bc/0x230 [c00018d97870] [c009fdc4] .rebuild_sched_domains+0xb4/0x4c0 [c00018d97970] [c00a02e8] .update_flag+0x118/0x170 [c00018d97a80] [c00a1768] .cpuset_common_file_write+0x568/0x820 [c00018d97c00] [c009d95c] .cgroup_file_write+0x7c/0x180 [c00018d97cf0] [c00e76b8] .vfs_write+0xe8/0x1b0 [c00018d97d90] [c00e810c] .sys_write+0x4c/0x90 [c00018d97e30] [c000852c] syscall_exit+0x0/0x40 diff --git a/kernel/sched.c b/kernel/sched.c index afe76ec..2af50ec 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5463,11 +5463,12 @@ static void register_sched_domain_sysctl(void) struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1); char buf[32]; + WARN_ON(sd_ctl_dir[0].child); + sd_ctl_dir[0].child = entry; + if (entry == NULL) return; - sd_ctl_dir[0].child = entry; - for_each_online_cpu(i) { 
snprintf(buf, 32, "cpu%d", i); entry->procname = kstrdup(buf, GFP_KERNEL); @@ -5475,14 +5476,19 @@ static void register_sched_domain_sysctl(void) entry->child = sd_alloc_ctl_cpu_table(i); entry++; } + + WARN_ON(sd_sysctl_header); sd_sysctl_header = register_sysctl_table(sd_ctl_root); } +/* may be called multiple times per register */ static void unregister_sched_domain_sysctl(void) { - unregister_sysctl_table(sd_sysctl_header); + if (sd_sysctl_header) + unregister_sysctl_table(sd_sysctl_header); sd_sysctl_header = NULL; - sd_free_ctl_entry(&sd_ctl_dir[0].child); + if (sd_ctl_dir[0].child) + sd_free_ctl_entry(&sd_ctl_dir[0].child); } #else static void register_sched_domain_sysctl(void) @@ -6426,13 +6432,17
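Fix (2) above boils down to making unregister idempotent: test each pointer before releasing it and clear it afterwards, so a second call in a row is a harmless no-op. A minimal userspace C sketch of that pattern (the names are stand-ins, not the kernel's sd_sysctl_header or sd_ctl_dir):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the registration state (hypothetical names, not the
 * kernel symbols). */
static int *header;
static int *table;
static int frees;   /* counts frees so a double free would be visible */

static void do_register(void)
{
    table = malloc(sizeof(*table));
    header = malloc(sizeof(*header));
}

/* Idempotent unregister: every pointer is tested before release and
 * cleared afterwards, so calling twice cannot free the same memory
 * twice. */
static void do_unregister(void)
{
    if (header) {
        free(header);
        frees++;
    }
    header = NULL;
    if (table) {
        free(table);
        frees++;
    }
    table = NULL;
}
```

With this shape, a caller such as detach_destroy_domains can unregister unconditionally without tracking whether a register ever happened.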
[PATCH] sched domain sysctl: free kstrdup allocations
The procnames for the cpu and domain were allocated via kstrdup and so should also be freed. The names for the files are static, but we can differentiate them by the presence of the proc_handler. If a kstrdup (of < 32 characters) fails the sysctl code will not see the procname or remaining table entries, but any child tables and names will be reclaimed upon free. Signed-off-by: Milton Miller <[EMAIL PROTECTED]> --- Hi Ingo. It occurred to me this morning that the procname field was dynamically allocated and needed to be freed. I started to put in break statements when allocation failed but it was approaching 50% error handling code. I came up with this alternative of looping while entry->mode is set and checking proc_handler instead of ->table. Alternatively, the string version of the domain name and cpu number could be stored in the structs. I verified by compiling CONFIG_DEBUG_SLAB and checking the allocation counts after taking a cpuset exclusive and back. milton Index: kernel/kernel/sched.c === --- kernel.orig/kernel/sched.c 2007-10-15 12:21:38.0 -0500 +++ kernel/kernel/sched.c 2007-10-15 12:22:12.0 -0500 @@ -5290,11 +5290,20 @@ static struct ctl_table *sd_alloc_ctl_en static void sd_free_ctl_entry(struct ctl_table **tablep) { - struct ctl_table *entry = *tablep; + struct ctl_table *entry; - for (entry = *tablep; entry->procname; entry++) + /* +* In the intermediate directories, both the child directory and +* procname are dynamically allocated and could fail but the mode +* will always be set. In the lowest directory the names are +* static strings and all have proc handlers. 
+*/ + for (entry = *tablep; entry->mode; entry++) { if (entry->child) sd_free_ctl_entry(&entry->child); + if (entry->proc_handler == NULL) + kfree(entry->procname); + } kfree(*tablep); *tablep = NULL; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] sched domain debug sysctl fixes
I forgot to put the sequence numbers in the patch subjects (I did in the message ids). Here is the order I diffed them: sched domain sysctl: register online cpus sched domain sysctl: free and rebuild when changing domains sched.c: use kcalloc sched domain sysctl: don't bug on alloc failure sched domain sysctl: sysctl tables can have no holes milton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] sched domain sysctl: don't bug on alloc failure
Now that we are calling this at runtime, a more relaxed error path is suggested. If an allocation fails, we just register the partial table, which will show empty directories. Signed-off-by: Milton Miller <[EMAIL PROTECTED]> Index: kernel/kernel/sched.c === --- kernel.orig/kernel/sched.c 2007-10-12 03:59:02.0 -0500 +++ kernel/kernel/sched.c 2007-10-12 03:59:07.0 -0500 @@ -5285,8 +5285,6 @@ static struct ctl_table *sd_alloc_ctl_en struct ctl_table *entry = kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL); - BUG_ON(!entry); - return entry; } @@ -5319,6 +5317,9 @@ sd_alloc_ctl_domain_table(struct sched_d { struct ctl_table *table = sd_alloc_ctl_entry(14); + if (table == NULL) + return NULL; + set_table_entry(&table[0], "min_interval", &sd->min_interval, sizeof(long), 0644, proc_doulongvec_minmax); set_table_entry(&table[1], "max_interval", &sd->max_interval, @@ -5356,6 +5357,8 @@ static ctl_table *sd_alloc_ctl_cpu_table for_each_domain(cpu, sd) domain_num++; entry = table = sd_alloc_ctl_entry(domain_num + 1); + if (table == NULL) + return NULL; i = 0; for_each_domain(cpu, sd) { @@ -5376,6 +5379,9 @@ static void register_sched_domain_sysctl struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1); char buf[32]; + if (entry == NULL) + return; + sd_ctl_dir[0].child = entry; for_each_online_cpu(i) { - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
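The relaxed path can be illustrated in isolation: a failed allocation propagates up as NULL, and the caller shows an empty directory instead of calling BUG(). A hypothetical userspace sketch (the fail flag simulates an allocation failure rather than exhausting memory):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy directory node standing in for a sysctl sub-table. */
struct node { struct node *child; };

/* On failure return NULL; the caller tolerates this instead of
 * bringing the system down. */
static struct node *build_dir(int fail)
{
    if (fail)
        return NULL;
    return calloc(1, sizeof(struct node));
}

/* A NULL sub-table simply appears as an empty directory. */
static int visible_entries(const struct node *dir)
{
    return dir ? 1 : 0;
}
```

The registration partial table then degrades gracefully: worst case the admin sees empty cpu directories rather than an oops.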
[PATCH] sched domain sysctl: sysctl tables can have no holes
The register_sysctl_table stops on the first entry without a procname and ctl_name, so we need to fill in consecutive entries. Signed-off-by: Milton Miller <[EMAIL PROTECTED]> Index: kernel/kernel/sched.c === --- kernel.orig/kernel/sched.c 2007-10-12 03:59:07.0 -0500 +++ kernel/kernel/sched.c 2007-10-12 03:59:16.0 -0500 @@ -5315,7 +5315,7 @@ set_table_entry(struct ctl_table *entry, static struct ctl_table * sd_alloc_ctl_domain_table(struct sched_domain *sd) { - struct ctl_table *table = sd_alloc_ctl_entry(14); + struct ctl_table *table = sd_alloc_ctl_entry(12); if (table == NULL) return NULL; @@ -5338,11 +5338,12 @@ sd_alloc_ctl_domain_table(struct sched_d sizeof(int), 0644, proc_dointvec_minmax); set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct, sizeof(int), 0644, proc_dointvec_minmax); - set_table_entry(&table[10], "cache_nice_tries", + set_table_entry(&table[9], "cache_nice_tries", &sd->cache_nice_tries, sizeof(int), 0644, proc_dointvec_minmax); - set_table_entry(&table[12], "flags", &sd->flags, + set_table_entry(&table[10], "flags", &sd->flags, sizeof(int), 0644, proc_dointvec_minmax); + /* &table[11] is terminator */ return table; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
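The rule here is that a sysctl table is terminated by its first empty entry, so a hole in the middle silently truncates the table. A small stand-alone C illustration (struct ent is a toy, not struct ctl_table):

```c
#include <assert.h>
#include <stddef.h>

/* Registration walks until it sees an entry with no name, exactly
 * like register_sysctl_table() stopping at the first hole. */
struct ent { const char *procname; };

static int count_registered(const struct ent *t)
{
    int n = 0;

    while (t[n].procname)   /* first NULL procname ends the table */
        n++;
    return n;
}
```

This is why the patch renumbers the entries to be consecutive: an entry placed after a gap is never seen by the registration walk.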
[PATCH] sched domain sysctl: register online cpus
init_sched_domain_sysctl was walking cpus 0-n and referencing per_cpu variables. If the cpus_possible mask is not contiguous this will result in a crash referencing unallocated data. If the online mask is not contiguous then we would show offline cpus and miss online ones. Signed-off-by: Milton Miller <[EMAIL PROTECTED]> Index: kernel/kernel/sched.c === --- kernel.orig/kernel/sched.c 2007-10-11 23:48:10.0 -0500 +++ kernel/kernel/sched.c 2007-10-11 23:55:51.0 -0500 @@ -5367,11 +5367,12 @@ static void init_sched_domain_sysctl(voi sd_ctl_dir[0].child = entry; - for (i = 0; i < cpu_num; i++, entry++) { + for_each_online_cpu(i) { snprintf(buf, 32, "cpu%d", i); entry->procname = kstrdup(buf, GFP_KERNEL); entry->mode = 0555; entry->child = sd_alloc_ctl_cpu_table(i); + entry++; } sd_sysctl_header = register_sysctl_table(sd_ctl_root); } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
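The difference between a 0..n loop and walking the online mask shows up whenever the mask has holes. A toy sketch with a 32-bit mask standing in for cpu_online_map (the mask width and names are illustrative only):

```c
#include <assert.h>

/* Bit i set means cpu i is online.  Iterating the mask, as
 * for_each_online_cpu() does, visits exactly the online cpus even
 * when the mask is not contiguous; a plain 0..num_online loop would
 * touch offline cpus and miss later online ones. */
static int visit_online(unsigned int mask, int out[], int max)
{
    int i, n = 0;

    for (i = 0; i < 32 && n < max; i++)
        if (mask & (1u << i))
            out[n++] = i;
    return n;
}
```

With three cpus online at positions 0, 2 and 5, a naive loop over 0..2 would register offline cpu 1 and never reach cpu 5.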
[PATCH] sched.c: use kcalloc
kcalloc checks for n * sizeof(element) overflow and returns zeroed memory. Signed-off-by: Milton Miller <[EMAIL PROTECTED]> Index: kernel/kernel/sched.c === --- kernel.orig/kernel/sched.c 2007-10-12 03:58:35.0 -0500 +++ kernel/kernel/sched.c 2007-10-12 03:58:38.0 -0500 @@ -5283,10 +5283,9 @@ static struct ctl_table sd_ctl_root[] = static struct ctl_table *sd_alloc_ctl_entry(int n) { struct ctl_table *entry = - kmalloc(n * sizeof(struct ctl_table), GFP_KERNEL); + kcalloc(n, sizeof(struct ctl_table), GFP_KERNEL); BUG_ON(!entry); - memset(entry, 0, n * sizeof(struct ctl_table)); return entry; } @@ -6080,7 +6079,7 @@ static int build_sched_domains(const cpu /* * Allocate the per-node list of sched groups */ - sched_group_nodes = kzalloc(sizeof(struct sched_group *)*MAX_NUMNODES, + sched_group_nodes = kcalloc(MAX_NUMNODES, sizeof(struct sched_group *), GFP_KERNEL); if (!sched_group_nodes) { printk(KERN_WARNING "Can not alloc sched group node list\n"); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
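What kcalloc buys over kmalloc plus memset can be mirrored in userspace: one call that both rejects n * size overflow and returns zeroed memory. A sketch (checked_calloc is a hypothetical helper; modern libc calloc already performs this overflow check, it is spelled out here for clarity):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* calloc-style helper in the spirit of kcalloc: refuse a request
 * whose byte count would wrap, and hand back zeroed memory so the
 * caller needs no follow-up memset. */
static void *checked_calloc(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;            /* n * size would overflow */
    return calloc(n, size);
}
```

Writing `kmalloc(n * sizeof(elem))` by hand silently wraps for large n and then hands back a too-small buffer; the checked form fails loudly instead.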
[PATCH] sched domain sysctl: free and rebuild when changing domains
Unregister and free the sysctl table before destroying domains, then rebuild and register after creating the new domains. This prevents the sysctl table from pointing to freed memory for root to write. Signed-off-by: Milton Miller <[EMAIL PROTECTED]> Index: kernel/kernel/sched.c === --- kernel.orig/kernel/sched.c 2007-10-12 03:55:19.0 -0500 +++ kernel/kernel/sched.c 2007-10-12 03:58:35.0 -0500 @@ -5291,6 +5291,18 @@ static struct ctl_table *sd_alloc_ctl_en return entry; } +static void sd_free_ctl_entry(struct ctl_table **tablep) +{ + struct ctl_table *entry = *tablep; + + for (entry = *tablep; entry->procname; entry++) + if (entry->child) + sd_free_ctl_entry(&entry->child); + + kfree(*tablep); + *tablep = NULL; +} + static void set_table_entry(struct ctl_table *entry, const char *procname, void *data, int maxlen, @@ -5359,7 +5371,7 @@ static ctl_table *sd_alloc_ctl_cpu_table } static struct ctl_table_header *sd_sysctl_header; -static void init_sched_domain_sysctl(void) +static void register_sched_domain_sysctl(void) { int i, cpu_num = num_online_cpus(); struct ctl_table *entry = sd_alloc_ctl_entry(cpu_num + 1); @@ -5376,8 +5388,18 @@ static void init_sched_domain_sysctl(voi } sd_sysctl_header = register_sysctl_table(sd_ctl_root); } + +static void unregister_sched_domain_sysctl(void) +{ + unregister_sysctl_table(sd_sysctl_header); + sd_sysctl_header = NULL; + sd_free_ctl_entry(&sd_ctl_dir[0].child); +} #else -static void init_sched_domain_sysctl(void) +static void register_sched_domain_sysctl(void) +{ +} +static void unregister_sched_domain_sysctl(void) { } #endif @@ -6311,6 +6333,8 @@ static int arch_init_sched_domains(const err = build_sched_domains(&cpu_default_map); + register_sched_domain_sysctl(); + return err; } @@ -6327,6 +6351,8 @@ static void detach_destroy_domains(const { int i; + unregister_sched_domain_sysctl(); + for_each_cpu_mask(i, *cpu_map) cpu_attach_domain(NULL, i); synchronize_sched(); @@ -6357,6 +6383,8 @@ int partition_sched_domains(cpumask_t *p if (!err && 
!cpus_empty(*partition2)) err = build_sched_domains(partition2); + register_sched_domain_sysctl(); + return err; } @@ -6488,8 +6516,6 @@ void __init sched_init_smp(void) /* XXX: Theoretical race here - CPU may be hotplugged now */ hotcpu_notifier(update_sched_domains, 0); - init_sched_domain_sysctl(); - /* Move init over to a non-isolated CPU */ if (set_cpus_allowed(current, non_isolated_cpus) < 0) BUG(); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
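The ordering the patch enforces, tear down the sysctl view before freeing the domains and re-register only after the new domains exist, can be sketched as a small userspace program (all names are stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* 'domains' stands in for the sched-domain data, 'view' for the
 * sysctl table entries that point at it. */
static int *domains;
static int **view;

static void unregister_view(void)
{
    free(view);
    view = NULL;
}

static void register_view(void)
{
    view = malloc(sizeof(*view));
    *view = domains;               /* view points at the live data */
}

/* Rebuild sequence: drop the view first, then the data it referenced,
 * then build new data, then a fresh view.  At no point does a live
 * view reference freed memory. */
static void rebuild(void)
{
    unregister_view();
    free(domains);
    domains = malloc(sizeof(*domains));
    *domains = 42;
    register_view();
}
```

Reversing the first two steps would leave a window where root could write through the view into freed memory, which is exactly the bug the patch closes.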
[PATCH] sched domain debug sysctl fixes
The following 5 patches provide fixes for the sysctl code added under CONFIG_SCHED_DEBUG. After debugging a boot failure caused by dereferencing per-cpu data for an impossible cpu, I found the other issues by inspection. Following these fixes, /proc/sys/kernel/sched_domain/ should work with exclusive cpusets, cpu hotplug, offline and not present cpus, and will show all 11 files. milton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v2] powerpc: don't enable cpu hotplug on mpic-based pseries
Don't allow cpu hotplug on pSeries systems lacking XICS interrupt controller, since current code is hardcoded to call xics routines. Signed-off-by: Milton Miller <[EMAIL PROTECTED]> -- Olof's patch searched the device-tree again, looking for an mpic. This code instead checks that we found an xics the first time by checking the init function. diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index 9711eb0..20f010a 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -262,6 +262,12 @@ static int __init pseries_cpu_hotplug_init(void) return 0; } + if (ppc_md.init_IRQ != xics_init_IRQ) { + printk(KERN_INFO "pSeries CPU Hotplug only supported on xics " + "interrupt controllers - disabling"); + return 0; + } + ppc_md.cpu_die = pseries_mach_cpu_die; smp_ops->cpu_disable = pseries_cpu_disable; smp_ops->cpu_die = pseries_cpu_die; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
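The detection trick, comparing the installed init_IRQ function pointer instead of re-walking the device tree, is a general pattern for asking "which backend was selected at init time?". A hypothetical userspace sketch (the struct and names mimic ppc_md.init_IRQ, they are not the kernel's):

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down machine description: only the field the check needs. */
struct machdep {
    void (*init_irq)(void);
};

static void xics_init(void) { }   /* stands in for xics_init_IRQ */
static void mpic_init(void) { }   /* stands in for the mpic setup */

/* Feature check by pointer comparison: hotplug is supported only if
 * the xics backend was the one installed earlier. */
static int hotplug_supported(const struct machdep *md)
{
    return md->init_irq == xics_init;
}
```

Compared with searching the device tree again, this reuses the decision platform setup already made, so the two code paths cannot disagree.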
Re: [stable] [patch 09/12] Fix SMP poweroff hangs
On Tue, Oct 9 2007 Thomas Gleixner wrote: >On Tue, 9 Oct 2007, Linus Torvalds wrote: >>On Wed, 10 Oct 2007, Thomas Gleixner wrote: >>> >>> Wrapping it into a #ifdef CONFIG_X86 would be sufficient. >> >> Well, the ppc oops seems to be a ppc bug regardless. >> >> If CPU_HOTPLUG isn't defined, the thing does nothing. And if it is >> defined, I don't see why/how ppc can validly oops. So I think the first >> thing to do is to try to figure out why it oopses, not to disable it for >> ppc. > >Fair enough. OTOH for the affected PPC users it's a regression and that's >what I'm concerned of. > >tglx I agree this exposes a ppc arch bug (and the reason has been diagnosed), but might not other archs have similar problems? The patch ties power-off code to suspend and hibernate cpu hotplug code. Why not put this ACPI requirement in the ACPI code? There is already a hook called at the same place this function was called that is only used for power off. Also, the code is compiled for CONFIG_PM_SLEEP_SMP, which is not CONFIG_HOTPLUG_CPU, it requires the selection of either CONFIG_HIBERNATION or CONFIG_SUSPEND. I confirmed that both of these can be disabled on i386 without disabling acpi. In such a kernel the existing patch is insufficient. 
 drivers/acpi/sleep/main.c |    2 ++
 kernel/sys.c              |    1 -
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/acpi/sleep/main.c b/drivers/acpi/sleep/main.c
index 2cbb9aa..79589bd 100644
--- a/drivers/acpi/sleep/main.c
+++ b/drivers/acpi/sleep/main.c
@@ -15,6 +15,7 @@
 #include <linux/dmi.h>
 #include <linux/device.h>
 #include <linux/suspend.h>
+#include <linux/cpu.h>
 
 #include <asm/io.h>
@@ -382,6 +383,7 @@ static void acpi_power_off_prepare(void)
 {
 	/* Prepare to power off the system */
 	acpi_sleep_prepare(ACPI_STATE_S5);
+	disable_nonboot_cpus();
 }
 
 static void acpi_power_off(void)
diff --git a/kernel/sys.c b/kernel/sys.c
index 8ae2e63..138f5c8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -879,7 +879,6 @@ void kernel_power_off(void)
 	kernel_shutdown_prepare(SYSTEM_POWER_OFF);
 	if (pm_power_off_prepare)
 		pm_power_off_prepare();
-	disable_nonboot_cpus();
 	sysdev_shutdown();
 	printk(KERN_EMERG "Power down.\n");
 	machine_power_off();
[PATCH Kbuild] Call only one make with all targets for O=
Change the invocations of make in the output directory Makefile and the main Makefile for separate object trees to pass all goals to one $(MAKE) via a new phony target "sub-make" and the existing target _all.

When compiling with separate object directories, a separate make is called in the context of another directory (from the output directory the main Makefile is called, the Makefile is then restarted with the current directory set to the object tree). Before this patch, when multiple make command goals are specified, each target results in a separate make invocation. With make -j, these invocations may run in parallel, resulting in multiple commands running in the same directory clobbering each other's results.

I did not try to address make -j for mixed dot-config and no-dot-config targets. Because the order does matter, a solution was not obvious. Perhaps a simple check for MAKEFLAGS having -j and refusing to run would be appropriate.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>
---

I chose @: as the phony command after the sub-make target does all the work; is there a better alternative? It looks like :; is used for Makefile.
Index: kernel/Makefile
===
--- kernel.orig/Makefile	2007-09-19 01:55:45.0 -0500
+++ kernel/Makefile	2007-09-19 02:01:16.0 -0500
@@ -116,12 +116,16 @@ KBUILD_OUTPUT := $(shell cd $(KBUILD_OUT
 $(if $(KBUILD_OUTPUT),, \
      $(error output directory "$(saved-output)" does not exist))
 
-PHONY += $(MAKECMDGOALS)
+PHONY += $(MAKECMDGOALS) sub-make
 
-$(filter-out _all,$(MAKECMDGOALS)) _all:
+$(filter-out _all sub-make,$(MAKECMDGOALS)) _all: sub-make
+	@:
+
+sub-make: FORCE
 	$(if $(KBUILD_VERBOSE:1=),@)$(MAKE) -C $(KBUILD_OUTPUT) \
 	KBUILD_SRC=$(CURDIR) \
-	KBUILD_EXTMOD="$(KBUILD_EXTMOD)" -f $(CURDIR)/Makefile $@
+	KBUILD_EXTMOD="$(KBUILD_EXTMOD)" -f $(CURDIR)/Makefile \
+	$(filter-out _all sub-make,$(MAKECMDGOALS))
 
 # Leave processing to above invocation of make
 skip-makefile := 1
Index: kernel/scripts/mkmakefile
===
--- kernel.orig/scripts/mkmakefile	2007-09-19 01:55:45.0 -0500
+++ kernel/scripts/mkmakefile	2007-09-19 02:01:53.0 -0500
@@ -26,11 +26,13 @@ MAKEFLAGS += --no-print-directory
 
 .PHONY: all \$(MAKECMDGOALS)
 
+all := \$(filter-out all Makefile,\$(MAKECMDGOALS))
+
 all:
-	\$(MAKE) -C \$(KERNELSRC) O=\$(KERNELOUTPUT)
+	\$(MAKE) -C \$(KERNELSRC) O=\$(KERNELOUTPUT) \$(all)
 
 Makefile:;
 
-\$(filter-out all Makefile,\$(MAKECMDGOALS)) %/:
-	\$(MAKE) -C \$(KERNELSRC) O=\$(KERNELOUTPUT) \$@
+\$(all) %/: all
+	@:
 
 EOF
Re: [linux-pm] Re: Hibernation considerations
On Jul 20, 2007, at 1:17 PM, Alan Stern wrote:

On Fri, 20 Jul 2007, Milton Miller wrote:

On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:

On Fri, 20 Jul 2007, Milton Miller wrote:

We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation.

What can't we do? We've already worked with the drivers to quiesce the hardware and put any information to resume the device in ram. Now we ask them to put their device in low power mode so we can go to sleep. Even if we schedule, the only thing userspace could touch is memory.

Userspace can submit I/O requests. Someone will have to audit every driver to make sure that such I/O requests don't cause a quiesced device to become active. If the device is active, it will make the memory snapshot inconsistent with the on-device data.

If a driver is waking a device between the time it was told by hibernation "suspend all operations and save your device state to ram" and "resume your device" then it is a buggy driver.

That's exactly my point. As far as I know nobody has done a survey, but I bet you'd find _many_ drivers are buggy either in this way or the converse (forcing an I/O request to fail immediately instead of waiting until the suspend is over when it could succeed). They have this bug because they were written -- those which include any suspend/resume support at all -- under the assumption that they could rely on the freezer. And that's why Rafael said "We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation." Until the drivers are fixed -- which seems like a tremendous job -- none of this will work.

So this is in the way of removing the freezer ... but as we are not relying on doing any io other than suspend device operation, save state to ram, then later put device in low power mode for s3 and/or s4, and finally restore and resume to running.
I argue the process can make the io request after we write to disk, we just can't service it. If we are suspended it will go to the request queue, and eventually the process will wait for normal throttling mechanisms until the driver is woken up.

Many drivers don't have request queues. Even for the ones that do, there are I/O pathways that bypass the queue (think of ioctl or sysfs).

So it's not a flag in make_request, fine.

Actually, my point was more "what kernel services do the drivers need to transition from quiesced to low power for acpi S4 or suspend-to-ram"? We can't give them allocate-memory (but we give them a call "we are going to suspend" when they can), but does "run this tasklet" help? What timer facilities are needed?

Some drivers need the ability to schedule. Some will need the ability to allocate memory (although GFP_ATOMIC is probably sufficient). Some will need timers to run.

Can they allocate the memory in advance? (Call them when we know we want to suspend, they make the allocations they will need; we later call them again to release the allocations). If you need timers, you probably want some scheduling?

Do we need to differentiate init (por by bios) and resume from quiesced (for reboot, kexec start/resume)? I hope not.

Yes we do.

Can you elaborate? Note I was not asking resume-from-low power vs init-from-por. We still get that distinction. How do these drivers work today when we kexec?

The reason I'm asking is it's hard to tell the first kernel what happened. We can say "we powered off, and we were restarted", but it becomes much harder when each device may or may not have a driver in the save kernel if we have to differentiate for each device if it was initialized and later quiesced by the jump kernel during save or never touched. And we need to tell the resume from hibernate code "i touched it" "no i didn't" and "we resumed from s4" "no it was from s5".
This is why I've been proposing that we don't create the suspend image with devices in the low power state, but only in a quiesced state similar to the initial state.

I'm proposing a sequence like:

(1) start allocating pinned memory to reduce saved image size
(2) allocate and map blocks to save maximum image (we know how much ram is not in 1, so the max size)
(3) tell drivers we are going to suspend. userspace is still running, swapping still active, etc. now is the time to allocate memory to save device state.
(4) do what we want to slow down userspace making requests (ie run freezer today)
(5) call drivers while still scheduling with interrupts "save opportunity". From this point, any new request should be queued or the process put on a wait queue.
(6) suspend timers, turn off interrupts
(7) call drivers with interrupts off (final save)
(8) jump to other kernel to save the image
(9) call drivers to transition to low power
(10) finish operation
Re: [linux-pm] Re: Hibernation considerations
On Jul 20, 2007, at 11:20 AM, Alan Stern wrote:

On Fri, 20 Jul 2007, Milton Miller wrote:

We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation.

What can't we do? We've already worked with the drivers to quiesce the hardware and put any information to resume the device in ram. Now we ask them to put their device in low power mode so we can go to sleep. Even if we schedule, the only thing userspace could touch is memory.

Userspace can submit I/O requests. Someone will have to audit every driver to make sure that such I/O requests don't cause a quiesced device to become active. If the device is active, it will make the memory snapshot inconsistent with the on-device data.

If a driver is waking a device between the time it was told by hibernation "suspend all operations and save your device state to ram" and "resume your device" then it is a buggy driver.

I argue the process can make the io request after we write to disk, we just can't service it. If we are suspended it will go to the request queue, and eventually the process will wait for normal throttling mechanisms until the driver is woken up. It may mean the driver has to set a flag so that it knows it had an io request arrive while it was suspended and needs to wake the queue during its resume function.

Actually, my point was more "what kernel services do the drivers need to transition from quiesced to low power for acpi S4 or suspend-to-ram"? We can't give them allocate-memory (but we give them a call "we are going to suspend" when they can), but does "run this tasklet" help? What timer facilities are needed?

Do we need to differentiate init (por by bios) and resume from quiesced (for reboot, kexec start/resume)? I hope not.

milton
Re: [linux-pm] Re: Hibernation considerations
On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote:

On Friday, 20 July 2007 01:07, [EMAIL PROTECTED] wrote:

On Thu, 19 Jul 2007, Rafael J. Wysocki wrote:

On Thursday, 19 July 2007 17:46, Milton Miller wrote:

The currently identified problems under discussion include:
(1) how to interact with acpi to enter into S4.
(2) how to identify which memory needs to be saved
(3) how to communicate where to save the memory
(4) what state should devices be in when switching kernels
(5) the complicated setup required with the current patch
(6) what code restores the image
(7) how to avoid corrupting filesystems mounted by the hibernated kernel

I didn't realize this was a discussion item. I thought the options were clear, for some filesystem types you can mount them read-only, but for ext3 (and possibly other less common ones) you just plain cannot touch them.

That's correct. And since you cannot touch ext3, you need either to assume that you won't touch filesystems at all, or to have a code to recognize the filesystem you're dealing with.

Or add a small bit of infrastructure that errors writes at make_request if you don't have a magic "i am a direct block device write from userspace" flag on the bio. The hibernate may fail, but you don't corrupt the media. If you don't get the image out, resume back to the "this is resume" instead of the power-down path.

(2) Upon start-up (by which I mean what happens after the user has pressed the power button or something like that):
* check if the image is present (and valid) _without_ enabling ACPI (we don't do that now, but I see no reason for not doing it in the new framework)
* if the image is present (and valid), load it
* turn on ACPI (unless already turned on by the BIOS, that is)
* execute the _BFS global control method
* execute the _WAK global control method
* continue

Here, the first two things should be done by the image-loading kernel, but the remaining operations have to be carried out by the restored kernel.

Here I agree.
Here is my proposal. Instead of trying to both write the image and suspend, I think this all becomes much simpler if we limit the scope of the work of the second kernel. Its purpose is to write the image. After that it's done. The platform can be powered off if we are going to S5. However, to support suspend to ram and suspend to disk, we return to the first kernel.

We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation. In that case, however, the kexec-based approach would have only one advantage over the current one. Namely, it would allow us to create bigger images.

we all agree that tasks cannot run during the suspend-to-ram state, but the disagreement is over what this means. at one extreme it could mean that you would need the full freezer as per the current suspend projects. at the other extreme it could mean that all that's needed is to invoke the suspend-to-ram routine before anything else on the suspended kernel on the return from the save and restore kernel. we just need to figure out which it is (or if it's somewhere in between).

Well, I think that the "invoke the suspend-to-ram routine before anything else on the suspended kernel" thing won't be easy to implement in practice.

Why? You don't expect suspend-to-ram in drivers to be implemented? We need more separation of the quiesce drivers from power-down devices? Note that we are just talking about "suspend devices and put their state in ram", not actually invoking the platform to suspend to ram. And I'm actually saying we free memory and maybe allocate disk blocks for the save before we suspend (see below).

Message-ID: <[EMAIL PROTECTED]>
On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:

(1) Filesystems mounted before the hibernation are untouchable

This is because some file systems do a fsck or other activity even when mounted read only. For the kexec case, however, this should be "file systems mounted by the hibernated system must not be written".
As has been mentioned in the past, we should be able to use something like dm snapshot to allow fsck and the file system to see the cleaned copy while not actually writing the media.

We can't _require_ users to use the dm snapshot in order for the hibernation to work, sorry. And by _reading_ from a filesystem you generally update metadata.

not if the filesystem is mounted read-only (except on ext3)

Well, if the filesystem in question is a journaling one and the hibernated kernel has mounted this fs read-write, this seems to be tricky anyway.

Yes. I would argue writing to existing blocks of a file (not through the filesystem, just getting their blocks from the file system) should be safe, but it occurs to me that may not be the case if your fsck and bmap move data blocks from some update log to the f
Re: [linux-pm] Re: Hibernation considerations
On Jul 19, 2007, at 3:28 PM, Rafael J. Wysocki wrote:

On Thursday, 19 July 2007 17:46, Milton Miller wrote:

The currently identified problems under discussion include:
(1) how to interact with acpi to enter into S4.
(2) how to identify which memory needs to be saved
(3) how to communicate where to save the memory
(4) what state should devices be in when switching kernels
(5) the complicated setup required with the current patch
(6) what code restores the image
(7) how to avoid corrupting filesystems mounted by the hibernated kernel

Ok I talked on this too. I'll now start with quotes from several articles in this thread and my responses.

Message-ID: <[EMAIL PROTECTED]>
On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:

(1) Upon entering the sleep state, which IMO can be done _after_ the image has been saved:
* figure out which devices can wake up
* put devices into low power states (wake-up devices are placed in the Dx states compatible with the wake capability, the others are powered off)
* execute the _PTS global control method
* switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
* execute the _GTS global control method
* set the GPE enable registers (corresponding to the wake-up devices)
* make the platform enter S4 (there's a well defined procedure for that)

I think that this should be done by the image-saving kernel.

Message-ID: <[EMAIL PROTECTED]>
On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard expressed his agreement with this block but also confusion on the other blocks.

I strongly disagree. (1) as has been pointed out, this requires the new kernel to understand all io devices in the first kernel. (2) it requires both kernels to talk to ACPI. This is doomed to failure. How can the second kernel initialize ACPI? The platform thinks it has already been initialized. Do we plan to always undo all acpi initialization?

Good question. I don't know.
(2) Upon start-up (by which I mean what happens after the user has pressed the power button or something like that):
* check if the image is present (and valid) _without_ enabling ACPI (we don't do that now, but I see no reason for not doing it in the new framework)
* if the image is present (and valid), load it
* turn on ACPI (unless already turned on by the BIOS, that is)
* execute the _BFS global control method
* execute the _WAK global control method
* continue

Here, the first two things should be done by the image-loading kernel, but the remaining operations have to be carried out by the restored kernel.

Here I agree.

Here is my proposal. Instead of trying to both write the image and suspend, I think this all becomes much simpler if we limit the scope of the work of the second kernel. Its purpose is to write the image. After that it's done. The platform can be powered off if we are going to S5. However, to support suspend to ram and suspend to disk, we return to the first kernel.

We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation.

What can't we do? We've already worked with the drivers to quiesce the hardware and put any information to resume the device in ram. Now we ask them to put their device in low power mode so we can go to sleep. Even if we schedule, the only thing userspace could touch is memory. If we resume, they just run those computations again.

In that case, however, the kexec-based approach would have only one advantage over the current one. Namely, it would allow us to create bigger images.

The advantage is we don't have to come up with a way to teach drivers "wake up to run these requests, but no other requests". We don't have to figure out what we need to resume to allow them to process a request.

This means that the first kernel will need to know why it got resumed. Was the system powered off, and this is the resume from the user?
Or was it restarted because the image has been saved, and it's now time to actually suspend until woken up? If you look at it, this is the same interface we have with the magic arch_suspend hook -- did we just suspend and it's time to write the image, or did we just resume and it's time to wake everything up.

I think this can be easily solved by giving the image saving kernel two resume points: one for the image has been written, and one for we rebooted and have restored the image. I'm not familiar with ACPI. Perhaps we need a third to differentiate we read the image from S4 instead of from S5, but that information must be available to the OS because it needs that to know if it should resume from hibernate.

By making the split at image save and restore we have several advantages:
(1) the kernel always initializes with devices in the init or quiesced but active state.
(2) the kernel always resumes with devices in the init or quiesced but active state.
(3) the kjump save and restore kernel doe
Re: [linux-pm] Re: Hibernation considerations
On Jul 19, 2007, at 12:31 PM, [EMAIL PROTECTED] wrote:

On Thu, 19 Jul 2007, Milton Miller wrote:

(2) Upon start-up (by which I mean what happens after the user has pressed the power button or something like that):
* check if the image is present (and valid) _without_ enabling ACPI (we don't do that now, but I see no reason for not doing it in the new framework)
* if the image is present (and valid), load it
* turn on ACPI (unless already turned on by the BIOS, that is)
* execute the _BFS global control method
* execute the _WAK global control method
* continue

Here, the first two things should be done by the image-loading kernel, but the remaining operations have to be carried out by the restored kernel.

Here I agree.

Here is my proposal. Instead of trying to both write the image and suspend, I think this all becomes much simpler if we limit the scope of the work of the second kernel. Its purpose is to write the image. After that it's done. The platform can be powered off if we are going to S5. However, to support suspend to ram and suspend to disk, we return to the first kernel.

This means that the first kernel will need to know why it got resumed. Was the system powered off, and this is the resume from the user? Or was it restarted because the image has been saved, and it's now time to actually suspend until woken up? If you look at it, this is the same interface we have with the magic arch_suspend hook -- did we just suspend and it's time to write the image, or did we just resume and it's time to wake everything up.

I think this can be easily solved by giving the image saving kernel two resume points: one for the image has been written, and one for we rebooted and have restored the image. I'm not familiar with ACPI. Perhaps we need a third to differentiate we read the image from S4 instead of from S5, but that information must be available to the OS because it needs that to know if it should resume from hibernate.

are we sure that there are only 2-3 possible actions?
or should this be made into a simple jump table so that it's extendable?

At 2 I don't think we need a jump table. Even if we had a table, we have to identify what each entry means. If we start getting more then we can change from command line to table.

As noted in the thread Message-ID: <[EMAIL PROTECTED]> Subject: [linux-pm] Re: hibernation/snapshot design on Mon Jul 9 08:23:53 2007, Jeremy Maitin-Shepard wrote:

(3) how to communicate where to save the memory

This is an interesting topic. The suspended kernel has most IO and disk space. It also knows how much space is to be occupied by the kernel. So communicating a block map to the second kernel would be the obvious choice. But the second kernel must be able to find the image to restore it, and it must have drivers for the media. Also, this is not feasible for storing to nfs.

I think we will end up with several methods.

One would be supply a list of blocks, and implement a file system that reads the file by reading the scatter list from media. The restore kernel then only needs to read an anchor, and can build upon that until the image is read into memory. Or do this in userspace. I don't know how this compares to the current restore path. I wasn't able to identify the code that creates the on disk structure in my 10 minute perusal of kernel/power/.

A second method will be to supply a device and file that will be mounted by the save kernel, then unmounted and restored. This would require a partition that is not mounted or open by the suspended kernel (or use nfs or a similar protocol that is designed for multiple client concurrent access).

A third method would be to allocate a file with the first kernel, and make sure the blocks are flushed to disk. The save and restore kernels map the file system using a snapshot device. Writing would map the blocks and use the block offset to write to the real device using the method from the first option; reading could be done directly from the snapshot device.
The first and third option are dead on log based file systems (where the data is stored in the log).

remember that the save and restore kernel can access the memory of the suspending kernel, so as long as the data is in a known format and there is a pointer to the data in a known location, the save and restore kernel can retrieve the data from memory, there's no need to involve media.

I agree that the save kernel can read the list from the being-saved kernel. However, when restoring, the being-saved (being-restored) kernel is not accessible, so the save list has to be stored as part of the image.

Simplifying kjump: the proposal for v3.

The current code is trying to use the crash dump area as a safe, reserved area to run the second kernel. However, that means that the kernel has to be linked specially to run in the reserved area. I th
Re: [linux-pm] Re: Hibernation considerations
On Jul 19, 2007, at 12:31 PM, [EMAIL PROTECTED] wrote: On Thu, 19 Jul 2007, Milton Miller wrote: (2) Upon start-up (by which I mean what happens after the user has pressed the power button or something like that): * check if the image is present (and valid) _without_ enabling ACPI (we don't do that now, but I see no reason for not doing it in the new framework) * if the image is present (and valid), load it * turn on ACPI (unless already turned on by the BIOS, that is) * execute the _BFS global control method * execute the _WAK global control method * continue Here, the first two things should be done by the image-loading kernel, but the remaining operations have to be carried out by the restored kernel. Here I agree. Here is my proposal. Instead of trying to both write the image and suspend, I think this all becomes much simpler if we limit the scope the work of the second kernel. Its purpose is to write the image. After that its done. The platform can be powered off if we are going to S5. However, to support suspend to ram and suspend to disk, we return to the first kernel. This means that the first kernel will need to know why it got resumed. Was the system powered off, and this is the resume from the user? Or was it restarted because the image has been saved, and its now time to actually suspend until woken up? If you look at it, this is the same interface we have with the magic arch_suspend hook -- did we just suspend and its time to write the image, or did we just resume and its time to wake everything up. I think this can be easily solved by giving the image saving kernel two resume points: one for the image has been written, and one for we rebooted and have restored the image. I'm not familiar with ACPI. Perhaps we need a third to differentiate we read the image from S4 instead of from S5, but that information must be available to the OS because it needs that to know if it should resume from hibernate. are we sure that there are only 2-3 possible actions? 
or should this be made into a simple jump table so that it's extendable?

At 2 I don't think we need a jump table. Even if we had a table, we have to identify what each entry means. If we start getting more then we can change from command line to table.

As noted in the thread Message-ID: [EMAIL PROTECTED] Subject: [linux-pm] Re: hibernation/snapshot design on Mon Jul 9 08:23:53 2007, Jeremy Maitin-Shepard wrote:

(3) how to communicate where to save the memory

This is an interesting topic. The suspended kernel has most IO and disk space. It also knows how much space is to be occupied by the kernel. So communicating a block map to the second kernel would be the obvious choice. But the second kernel must be able to find the image to restore it, and it must have drivers for the media. Also, this is not feasible for storing to NFS. I think we will end up with several methods.

One would be to supply a list of blocks, and implement a file system that reads the file by reading the scatter list from media. The restore kernel then only needs to read an anchor, and can build upon that until the image is read into memory. Or do this in userspace. I don't know how this compares to the current restore path. I wasn't able to identify the code that creates the on-disk structure in my 10 minute perusal of kernel/power/.

A second method will be to supply a device and file that will be mounted by the save kernel, then unmounted and restored. This would require a partition that is not mounted or open by the suspended kernel (or use NFS or a similar protocol that is designed for multiple-client concurrent access).

A third method would be to allocate a file with the first kernel, and make sure the blocks are flushed to disk. The save and restore kernels map the file system using a snapshot device. Writing would map the blocks and use the block offset to write to the real device using the method from the first option; reading could be done directly from the snapshot device.
The first and third option are dead on log-based file systems (where the data is stored in the log).

remember that the save and restore kernel can access the memory of the suspending kernel, so as long as the data is in a known format and there is a pointer to the data in a known location, the save and restore kernel can retrieve the data from memory, there's no need to involve media.

I agree that the save kernel can read the list from the being-saved kernel. However, when restoring, the being-saved (being-restored) kernel is not accessible, so the save list has to be stored as part of the image.

Simplifying kjump: the proposal for v3. The current code is trying to use the crash dump area as a safe, reserved area to run the second kernel. However, that means that the kernel has to be linked specially to run in the reserved area. I think we
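The block-list method described above could be sketched as a tiny on-disk layout that the restore kernel walks from a single anchor: all structure and field names here are hypothetical illustrations, not the actual kernel/power/ on-disk format.

```c
#include <stdint.h>

/* Hypothetical extent: a contiguous run of blocks holding image data. */
struct image_extent {
	uint64_t start_block;	/* first block on the device */
	uint32_t nr_blocks;	/* length of the run in blocks */
};

/* Hypothetical anchor the restore kernel reads first; it locates the
 * scatter list, which in turn locates the image itself. */
struct image_anchor {
	uint64_t magic;			/* identifies a valid, unread image */
	uint64_t extent_list_block;	/* where the extent list was written */
	uint32_t nr_extents;		/* entries in the extent list */
	uint32_t image_pages;		/* sanity check against image size */
};
```

Since the being-restored kernel is not accessible at restore time, both the anchor and the extent list would have to live on media as part of the image, as argued above.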
Re: [linux-pm] Re: Hibernation considerations
On Jul 19, 2007, at 3:28 PM, Rafael J. Wysocki wrote: On Thursday, 19 July 2007 17:46, Milton Miller wrote:

The currently identified problems under discussion include:
(1) how to interact with acpi to enter into S4.
(2) how to identify which memory needs to be saved
(3) how to communicate where to save the memory
(4) what state should devices be in when switching kernels
(5) the complicated setup required with the current patch
(6) what code restores the image
(7) how to avoid corrupting filesystems mounted by the hibernated kernel

Ok, I talked on this too. I'll now start with quotes from several articles in this thread and my responses.

Message-ID: [EMAIL PROTECTED] On Tue Jul 17 13:10:00 2007, Rafael J. Wysocki wrote:

(1) Upon entering the sleep state, which IMO can be done _after_ the image has been saved:
* figure out which devices can wake up
* put devices into low power states (wake-up devices are placed in the Dx states compatible with the wake capability, the others are powered off)
* execute the _PTS global control method
* switch off the nonlocal CPUs (eg. nonboot CPUs on x86)
* execute the _GTS global control method
* set the GPE enable registers (corresponding to the wake-up devices)
* make the platform enter S4 (there's a well defined procedure for that)

I think that this should be done by the image-saving kernel.

Message-ID: [EMAIL PROTECTED] On Tue Jul 17 13:35:52 2007, Jeremy Maitin-Shepard expressed his agreement with this block but also confusion on the other blocks.

I strongly disagree. (1) As has been pointed out, this requires the new kernel to understand all io devices in the first kernel. (2) It requires both kernels to talk to ACPI. This is doomed to failure. How can the second kernel initialize ACPI? The platform thinks it has already been initialized. Do we plan to always undo all acpi initialization?

Good question. I don't know.
(2) Upon start-up (by which I mean what happens after the user has pressed the power button or something like that):
* check if the image is present (and valid) _without_ enabling ACPI (we don't do that now, but I see no reason for not doing it in the new framework)
* if the image is present (and valid), load it
* turn on ACPI (unless already turned on by the BIOS, that is)
* execute the _BFS global control method
* execute the _WAK global control method
* continue

Here, the first two things should be done by the image-loading kernel, but the remaining operations have to be carried out by the restored kernel.

Here I agree.

Here is my proposal. Instead of trying to both write the image and suspend, I think this all becomes much simpler if we limit the scope of the work of the second kernel. Its purpose is to write the image. After that it's done. The platform can be powered off if we are going to S5. However, to support suspend to ram and suspend to disk, we return to the first kernel.

We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation.

What can't we do? We've already worked with the drivers to quiesce the hardware and put any information to resume the device in ram. Now we ask them to put their device in low power mode so we can go to sleep. Even if we schedule, the only thing userspace could touch is memory. If we resume, they just run those computations again.

In that case, however, the kexec-based approach would have only one advantage over the current one. Namely, it would allow us to create bigger images.

The advantage is we don't have to come up with a way to teach drivers to wake up to run these requests, but no other requests. We don't have to figure out what we need to resume to allow them to process a request. This means that the first kernel will need to know why it got resumed. Was the system powered off, and this is the resume from the user?
Or was it restarted because the image has been saved, and it's now time to actually suspend until woken up? If you look at it, this is the same interface we have with the magic arch_suspend hook -- did we just suspend and it's time to write the image, or did we just resume and it's time to wake everything up? I think this can be easily solved by giving the image-saving kernel two resume points: one for "the image has been written", and one for "we rebooted and have restored the image". I'm not familiar with ACPI. Perhaps we need a third to differentiate "we read the image from S4 instead of from S5", but that information must be available to the OS because it needs it to know whether it should resume from hibernate.

By making the split at image save and restore we have several advantages:
(1) the kernel always initializes with devices in the init or quiesced but active state.
(2) the kernel always resumes with devices in the init or quiesced but active state.
(3) the kjump save and restore kernel does not need to know how to suspend
Re: [linux-pm] Re: Hibernation considerations
On Jul 20, 2007, at 6:17 AM, Rafael J. Wysocki wrote: On Friday, 20 July 2007 01:07, [EMAIL PROTECTED] wrote: On Thu, 19 Jul 2007, Rafael J. Wysocki wrote: On Thursday, 19 July 2007 17:46, Milton Miller wrote:

The currently identified problems under discussion include:
(1) how to interact with acpi to enter into S4.
(2) how to identify which memory needs to be saved
(3) how to communicate where to save the memory
(4) what state should devices be in when switching kernels
(5) the complicated setup required with the current patch
(6) what code restores the image
(7) how to avoid corrupting filesystems mounted by the hibernated kernel

I didn't realize this was a discussion item. I thought the options were clear: for some filesystem types you can mount them read-only, but for ext3 (and possibly other less common ones) you just plain cannot touch them.

That's correct. And since you cannot touch ext3, you need either to assume that you won't touch filesystems at all, or to have code to recognize the filesystem you're dealing with.

Or add a small bit of infrastructure that errors writes at make_request if you don't have a magic "I am a direct block device write from userspace" flag on the bio. The hibernate may fail, but you don't corrupt the media. If you don't get the image out, resume back through the "this is resume" path instead of the power-down path.

(2) Upon start-up (by which I mean what happens after the user has pressed the power button or something like that):
* check if the image is present (and valid) _without_ enabling ACPI (we don't do that now, but I see no reason for not doing it in the new framework)
* if the image is present (and valid), load it
* turn on ACPI (unless already turned on by the BIOS, that is)
* execute the _BFS global control method
* execute the _WAK global control method
* continue

Here, the first two things should be done by the image-loading kernel, but the remaining operations have to be carried out by the restored kernel.

Here I agree.
Here is my proposal. Instead of trying to both write the image and suspend, I think this all becomes much simpler if we limit the scope of the work of the second kernel. Its purpose is to write the image. After that it's done. The platform can be powered off if we are going to S5. However, to support suspend to ram and suspend to disk, we return to the first kernel.

We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation. In that case, however, the kexec-based approach would have only one advantage over the current one. Namely, it would allow us to create bigger images.

we all agree that tasks cannot run during the suspend-to-ram state, but the disagreement is over what this means. At one extreme it could mean that you would need the full freezer as per the current suspend projects. At the other extreme it could mean that all that's needed is to invoke the suspend-to-ram routine before anything else on the suspended kernel on the return from the save and restore kernel. We just need to figure out which it is (or if it's somewhere in between).

Well, I think that the "invoke the suspend-to-ram routine before anything else on the suspended kernel" thing won't be easy to implement in practice.

Why? You don't expect suspend-to-ram in drivers to be implemented? Do we need more separation of quiescing drivers from powering down devices? Note that we are just talking about suspending devices and putting their state in ram, not actually invoking the platform to suspend to ram. And I'm actually saying we free memory and maybe allocate disk blocks for the save before we suspend (see below).

Message-ID: [EMAIL PROTECTED] On Sun Jul 15 05:27:03 2007, Rafael J. Wysocki wrote:

(1) Filesystems mounted before the hibernation are untouchable

This is because some file systems do a fsck or other activity even when mounted read only. For the kexec case, however, this should be "file systems mounted by the hibernated system must not be written".
As has been mentioned in the past, we should be able to use something like dm snapshot to allow fsck and the file system to see the cleaned copy while not actually writing the media.

We can't _require_ users to use the dm snapshot in order for the hibernation to work, sorry. And by _reading_ from a filesystem you generally update metadata.

not if the filesystem is mounted read-only (except on ext3)

Well, if the filesystem in question is a journaling one and the hibernated kernel has mounted this fs read-write, this seems to be tricky anyway.

Yes. I would argue writing to existing blocks of a file (not through the filesystem, just getting their blocks from the file system) should be safe, but it occurs to me that may not be the case if your fsck and bmap move data blocks from some update log to the file system. But we know the (maximum) image size. So we could
Re: [linux-pm] Re: Hibernation considerations
On Jul 20, 2007, at 11:20 AM, Alan Stern wrote: On Fri, 20 Jul 2007, Milton Miller wrote:

We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation.

What can't we do? We've already worked with the drivers to quiesce the hardware and put any information to resume the device in ram. Now we ask them to put their device in low power mode so we can go to sleep. Even if we schedule, the only thing userspace could touch is memory.

Userspace can submit I/O requests. Someone will have to audit every driver to make sure that such I/O requests don't cause a quiesced device to become active. If the device is active, it will make the memory snapshot inconsistent with the on-device data.

If a driver is waking a device between the time it was told by hibernation "suspend all operations and save your device state to ram" and "resume your device" then it is a buggy driver. I argue the process can make the io request after we write to disk, we just can't service it. If we are suspended it will go to the request queue, and eventually the process will wait for normal throttling mechanisms until the driver is woken up. It may mean the driver has to set a flag so that it knows it had an I/O request arrive while it was suspended and needs to wake the queue during its resume function.

Actually, my point was more what kernel services do the drivers need to transition from quiesced to low power for acpi S4 or suspend-to-ram? We can't give them allocate-memory (but we give them a call "we are going to suspend" when they can), but does "run this tasklet" help? What timer facilities are needed? Do we need to differentiate init (POR by BIOS) and resume from quiesced (for reboot, kexec start/resume)? I hope not.

milton

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
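The "set a flag so the driver knows an I/O request arrived while it was suspended and wakes the queue on resume" idea might look like the following sketch; the structure and function names are purely illustrative, not from any real driver.

```c
#include <stdbool.h>

/* Hypothetical per-device power-management bookkeeping. */
struct drv_pm_state {
	bool suspended;			/* between suspend and resume calls */
	bool request_while_suspended;	/* queue must be kicked on resume */
};

/* Request submission: while suspended, only record that work is pending;
 * the request stays on the queue and the submitter eventually blocks on
 * normal throttling. */
static void drv_submit(struct drv_pm_state *s)
{
	if (s->suspended) {
		s->request_while_suspended = true;
		return;
	}
	/* ... hand the request to the (active) hardware ... */
}

/* Resume: report whether anything arrived during the suspend window,
 * so the caller can restart the request queue. */
static bool drv_resume(struct drv_pm_state *s)
{
	bool kick = s->request_while_suspended;

	s->suspended = false;
	s->request_while_suspended = false;
	return kick;
}
```

As Alan's follow-up notes, many drivers have no request queue and some paths (ioctl, sysfs) bypass it, so this flag alone cannot cover every driver.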
Re: [linux-pm] Re: Hibernation considerations
On Jul 20, 2007, at 1:17 PM, Alan Stern wrote: On Fri, 20 Jul 2007, Milton Miller wrote: On Jul 20, 2007, at 11:20 AM, Alan Stern wrote: On Fri, 20 Jul 2007, Milton Miller wrote:

We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation.

What can't we do? We've already worked with the drivers to quiesce the hardware and put any information to resume the device in ram. Now we ask them to put their device in low power mode so we can go to sleep. Even if we schedule, the only thing userspace could touch is memory.

Userspace can submit I/O requests. Someone will have to audit every driver to make sure that such I/O requests don't cause a quiesced device to become active. If the device is active, it will make the memory snapshot inconsistent with the on-device data.

If a driver is waking a device between the time it was told by hibernation "suspend all operations and save your device state to ram" and "resume your device" then it is a buggy driver.

That's exactly my point. As far as I know nobody has done a survey, but I bet you'd find _many_ drivers are buggy either in this way or the converse (forcing an I/O request to fail immediately instead of waiting until the suspend is over when it could succeed). They have this bug because they were written -- those which include any suspend/resume support at all -- under the assumption that they could rely on the freezer. And that's why Rafael said "We can't do this unless we have frozen tasks (this way, or another) before carrying out the entire operation." Until the drivers are fixed -- which seems like a tremendous job -- none of this will work.

So this is in the way of removing the freezer ... but we are not relying on doing any io other than the suspend device operations: save state to ram, then later put the device in low power mode for s3 and/or s4, and finally restore and resume to running.
I argue the process can make the io request after we write to disk, we just can't service it. If we are suspended it will go to the request queue, and eventually the process will wait for normal throttling mechanisms until the driver is woken up.

Many drivers don't have request queues. Even for the ones that do, there are I/O pathways that bypass the queue (think of ioctl or sysfs).

So it's not a flag in make_request, fine.

Actually, my point was more what kernel services do the drivers need to transition from quiesced to low power for acpi S4 or suspend-to-ram? We can't give them allocate-memory (but we give them a call "we are going to suspend" when they can), but does "run this tasklet" help? What timer facilities are needed?

Some drivers need the ability to schedule. Some will need the ability to allocate memory (although GFP_ATOMIC is probably sufficient). Some will need timers to run.

Can they allocate the memory in advance? (Call them when we know we want to suspend, they make the allocations they will need; we later call them again to release the allocations). If you need timers, you probably want some scheduling?

Do we need to differentiate init (POR by BIOS) and resume from quiesced (for reboot, kexec start/resume)? I hope not.

Yes we do.

Can you elaborate? Note I was not asking resume-from-low-power vs init-from-POR. We still get that distinction. How do these drivers work today when we kexec?

The reason I'm asking is it's hard to tell the first kernel what happened. We can say "we powered off, and we were restarted", but it becomes much harder when each device may or may not have a driver in the save kernel, if we have to differentiate for each device whether it was initialized and later quiesced by the jump kernel during save or never touched. And we need to tell the resume-from-hibernate code "I touched it" / "no I didn't", and "we resumed from S4" / "no, it was from S5".
This is why I've been proposing that we don't create the suspend image with devices in the low power state, but only in a quiesced state similar to the initial state.

I'm proposing a sequence like:
(1) start allocating pinned memory to reduce saved image size
(2) allocate and map blocks to save maximum image (we know how much ram is not in 1, so the max size)
(3) tell drivers we are going to suspend. userspace is still running, swapping still active, etc. now is the time to allocate memory to save device state.
(4) do what we want to slow down userspace making requests (ie run freezer today)
(5) call drivers while still scheduling with interrupts on (save opportunity). From this point, any new request should be queued or the process put on a wait queue.
(6) suspend timers, turn off interrupts
(7) call drivers with interrupts off (final save)
(8) jump to other kernel to save the image
(9) call drivers to transition to low power
(10) finish operations to platform suspend on hibernate
(11) call drivers to resume, telling them if from suspend-to-ram or suspend-to-disk, possibly in two stages
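As a reading aid only, the numbered sequence above can be written down as an ordered set of phases; these names are purely illustrative and do not exist in the kernel.

```c
/* Hypothetical phase ordering for the proposed hibernate sequence. */
enum hib_phase {
	HIB_SHRINK_MEMORY,	/* (1) pin memory to reduce the image size */
	HIB_MAP_SAVE_BLOCKS,	/* (2) allocate and map blocks for the image */
	HIB_NOTIFY_DRIVERS,	/* (3) drivers allocate save-state memory */
	HIB_THROTTLE_TASKS,	/* (4) slow userspace down (e.g. the freezer) */
	HIB_QUIESCE,		/* (5) new requests queued from here on */
	HIB_STOP_TIMERS,	/* (6) suspend timers, disable interrupts */
	HIB_FINAL_SAVE,		/* (7) final driver save, interrupts off */
	HIB_KJUMP_WRITE_IMAGE,	/* (8) jump to the image-writing kernel */
	HIB_LOW_POWER,		/* (9) devices to low power */
	HIB_PLATFORM_SUSPEND,	/* (10) platform suspend on hibernate */
	HIB_RESUME,		/* (11) resume from STR or STD */
};
```

The key property of the ordering is that the image is written (8) with devices merely quiesced, before any transition to low power (9).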
[PATCH] e100 rx: or s and el bits
In commit d52df4a35af569071fda3f4eb08e47cc7023f094, the description talks about emulating another driver by setting additional bits and being unable to test when submitted. Seeing the & operator used to set more bits made me suspicious, and indeed the bits are defined in positive logic:

	cb_s = 0x4000,
	cb_el = 0x8000,

So ANDing those together would be 0. I'm guessing they should be OR'd, but I don't have hardware here to test, much like the committed patch. In fact, I'll let someone else do the compile test too. I'll update the comment. (It looks like the end-of-list and S bits would not be set in the template nor cleared when linking in receive skbs, so as long as the kernel can keep up with the 100Mb card we wouldn't see the adapter go off the linked list, possibly explaining any successful use of this patch written against 2.6.14).

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>
---
--- linux-2.6/drivers/net/e100.c.orig	2007-05-01 05:19:17.0 -0500
+++ linux-2.6/drivers/net/e100.c	2007-05-01 05:22:14.0 -0500
@@ -947,7 +947,7 @@ static void e100_get_defaults(struct nic
 		((nic->mac >= mac_82558_D101_A4) ? cb_cid : cb_i));
 
 	/* Template for a freshly allocated RFD */
-	nic->blank_rfd.command = cpu_to_le16(cb_el & cb_s);
+	nic->blank_rfd.command = cpu_to_le16(cb_el | cb_s);
 	nic->blank_rfd.rbd = 0x;
 	nic->blank_rfd.size = cpu_to_le16(VLAN_ETH_FRAME_LEN);
@@ -1769,13 +1769,13 @@ static int e100_rx_alloc_skb(struct nic
 	}
 
 	/* Link the RFD to end of RFA by linking previous RFD to
-	 * this one, and clearing EL bit of previous. */
+	 * this one, and clearing EL and S bit of previous. */
 	if(rx->prev->skb) {
 		struct rfd *prev_rfd = (struct rfd *)rx->prev->skb->data;
 		put_unaligned(cpu_to_le32(rx->dma_addr),
 			(u32 *)&prev_rfd->link);
 		wmb();
-		prev_rfd->command &= ~cpu_to_le16(cb_el & cb_s);
+		prev_rfd->command &= ~cpu_to_le16(cb_el | cb_s);
 		pci_dma_sync_single_for_device(nic->pdev, rx->prev->dma_addr,
 			sizeof(struct rfd), PCI_DMA_TODEVICE);
 	}
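The arithmetic behind the fix can be checked standalone: since cb_el and cb_s are distinct single bits, the AND form is always zero, so it sets nothing in the template and clears nothing when the mask is complemented.

```c
#include <stdint.h>

/* Bit definitions quoted from the driver (positive logic). */
enum { cb_s = 0x4000, cb_el = 0x8000 };

/* The committed code used &, the proposed fix uses |. */
static const uint16_t and_form = cb_el & cb_s;	/* 0x0000: a no-op mask */
static const uint16_t or_form  = cb_el | cb_s;	/* 0xc000: both bits set */
```

With the AND form, `command &= ~cpu_to_le16(cb_el & cb_s)` complements zero to all-ones and therefore clears no bits at all, matching the observation that EL and S were neither set in the template nor cleared when linking.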
Re: [patch 00/12] Syslets, Threadlets, generic AIO support, v5
Anton Blanchard wrote:

Hi Ingo, this is the v5 release of the syslet/threadlet subsystem: http://redhat.com/~mingo/syslet-patches/

Nice! I too went and downloaded patches-v5 for review. First off, one problem I noticed in sys_async_wait:

+	ah->events_left = min_wait_events - (kernel_ring_idx - user_ring_idx);

This completely misses the wraparound case of kernel_ring_idx < user_ring_idx. I wonder if this is causing some of the benchmark problems? (add max_ring_index if kernel < user).

I tried to port this to ppc64 but found a few problems: The 64bit powerpc ABI has the concept of a TOC (r2) which is used for per-function data. This means this won't work: [deleted] I think we would want to change restore_ip to restore_function, and then create a per-arch helper, perhaps:

	void set_user_context(struct task_struct *task, unsigned long stack,
			      unsigned long function, unsigned long retval);

ppc64 could then grab the ip and r2 values from the function descriptor. The other issue involves the syscall table:

	asmlinkage struct syslet_uatom __user *
	sys_async_exec(struct syslet_uatom __user *uatom,
		       struct async_head_user __user *ahu)
	{
		return __sys_async_exec(uatom, ahu, sys_call_table, NR_syscalls);
	}

This exposes the layout of the syscall table. Unfortunately it won't work on ppc64. In arch/powerpc/kernel/systbl.S:

	#define COMPAT_SYS(func)	.llong .sys_##func,.compat_sys_##func

Both syscall tables are overlaid.

Anton

In addition, the entries in the table are not function pointers, they are the actual code targets. So we need an arch helper to invoke the system call. Here is another problem with your compat code. Just telling user space that everything is u64 and having the kernel retrieve pointers and ulong doesn't work; you have to actually copy in u64 values and truncate them down. Your current code is broken on all 32bit big endian kernels. Actually, the check needs to be that the upper 32 bits are 0, or return -EINVAL.
In addition, the compat syscall entry points assume that the arguments have been truncated to compat_ulong values by the syscall entry path, and that they only need to do sign extension (and/or pointer conversion on s390 with its 31 bit pointers). So all compat kernels are broken. The two of these things together makes me think we want two copy functions. At that point we may as well define the struct uatom in terms of ulong and compat_ulong for the compat_uatom. That would lead to two copies of exec_uatom, but the elimination of passing the syscall table as an argument down. The need_resched and signal check could become part of the common next_uatom routine, although it would need to know uatom+1 instead of doing the addition in itself.

Other observations:

All the logic setting at and async_ready is a bit hard to follow. After some analysis, t->at is only ever set to &t->__at and async_ready is only set to the same at or NULL. Both of these should become flags, and at->task should be converted to container_of. Also, the name at is hard to grep / search for.

The stop flags are decoded with a case but are not densely encoded; rather, they are one-hot. We either need to error on multiple stop bits being set, stop on each possible condition, or encode them densely. There is no check for flags being set that are not recognized. If we ever add a flag for another stop condition this would lead to incorrect execution by the kernel.

There are some syscalls that can return -EFAULT but later have force_syscall_noerror. We should create a stop on ERROR and clear the force_noerror flag between syscalls. The umem_add syscall should add force_noerror if the put_user succeeds.

In copy_uatom, you call verify_read on the entire uatom. This means that the struct with all user space size has to be within the process limit, which violates your assertion that userspace doesn't need the whole structure.
If we add the requirement that the space that would be occupied by the complete atom has to exist, then we can copy the whole struct uatom with copy_from_user and then copy the args with get_user. User space can still pack them more densely, and we can still stop copying on a null arg pointer. Actually, calling access_ok then __get_user can be more expensive on some architectures because they have to verify both start and length on access_ok but can only verify start on get_user because they have unmapped areas between user space and kernel space. This would also mean that we don't check arg_ptr for NULL without verifying that get_user actually worked. The gotos in exec_uatom are just a while loop with a break. sys_umem_add should be in /lib under lib-y in the Makefile. In fact declaring the function weak does not make it a weak syscall implementation on some architectures. Weak syscalls aliases to sys_ni_syscall are needed for when async support is not selected in Kconfig. The Documentation
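The ring-index wraparound problem flagged at the top of this review can be illustrated with a standalone helper; RING_SIZE and the function name are hypothetical, matching the suggested "add max_ring_index if kernel < user" fix rather than the actual syslet code.

```c
#define RING_SIZE 8	/* hypothetical ring size */

/* Completed-but-unconsumed events, correct even after the kernel index
 * wraps around past the user index. */
static unsigned int events_pending(unsigned int kernel_idx,
				   unsigned int user_idx)
{
	unsigned int n = kernel_idx - user_idx;

	if (kernel_idx < user_idx)	/* kernel index wrapped */
		n += RING_SIZE;
	return n;
}
```

Without the wraparound correction, a kernel index that has lapped the user index produces a huge unsigned difference, so `min_wait_events - (kernel_ring_idx - user_ring_idx)` undercounts how many events are still needed.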
Re: [patch 00/12] Syslets, Threadlets, generic AIO support, v5
Anton Blanchard wrote: Hi Ingo, this is the v5 release of the syslet/threadlet subsystem: http://redhat.com/~mingo/syslet-patches/ Nice! I too went and downloaded patches-v5 for review. First off, one problem I noticed in sys_async_wait: + ah-events_left = min_wait_events - (kernel_ring_idx - user_ring_idx); This completely misses the wraparound case of kernel_ring_idx user_ring_idx. I wonder if this is causing some of the benchmark problems? (add max_ring_index if kernel user). I tried to port this to ppc64 but found a few problems: The 64bit powerpc ABI has the concept of a TOC (r2) which is used for per function data. This means this wont work: [deleted] I think we would want to change restore_ip to restore_function, and then create a per arch helper, perhaps: void set_user_context(struct task_struct *task, unsigned long stack, unsigned long function, unsigned long retval); ppc64 could then grab the ip and r2 values from the function descriptor. The other issue involves the syscall table: asmlinkage struct syslet_uatom __user * sys_async_exec(struct syslet_uatom __user *uatom, struct async_head_user __user *ahu) { return __sys_async_exec(uatom, ahu, sys_call_table, NR_syscalls); } This exposes the layout of the syscall table. Unfortunately it wont work on ppc64. In arch/powerpc/kernel/systbl.S: #define COMPAT_SYS(func).llong .sys_##func,.compat_sys_##func Both syscall tables are overlaid. Anton In addition, the entries in the table are not function pointers, they are the actual code targets. So we need a arch helper to invoke the system call. Here is another problem with your compat code. Just telling user space that everything is u64 and having the kernel retrieve pointers and ulong doesn't work, you have to actually copy in u64 values and truncate them down. Your current code is broken on all 32bit big endian kernels. Actually, the check needs to be that the upper 32 bits are 0 or return -EINVAL. 
In addition, the compat syscall entry points assume that the arguments have been truncated to compat_ulong values by the syscall entry path, and that they only need to do sign extension (and/or pointer conversion on s390 with its 31 bit pointers). So all compat kernels are broken. The two of these things together makes me think we want two copy functions. At that point we may as well define the struct uatom in terms of ulong and compat_ulong for the compat_uatom. That would lead to two copies of exec_uatom, but the elimination of passing the syscall table as an argument down. The need_resched and signal check could become part of the common next_uatom routine, although it would need to know uatom+1 instead of doing the addition in itself. Other observations: All the logic setting at and async_ready is a bit hard to follow. After some analysis, t-at is only ever set to t-__at and async_ready is only set to the same at or NULL. Both of these should become flags, and at-task should be converted to container_of. Also, the name at is hard to grep / search for. The stop flags are decoded with a case but are not densely encoded, rather they are one hot. We either need to error on multiple stop bits being set, stop on each possible condition, or encode them densely. There is no check for flags being set that are not recognized. If we ever add a flag for another stop condition this would lead to incorrect execution by the kernel. There are some syscalls that can return -EFAULT but later have force_syscall_noerror. We should create a stop on ERROR and clear the force_noerror flag between syscalls. The umem_add syscall should add force_noerror if the put_user succeeds. In copy_uatom, you call verify_read on the entire uatom. This means that the struct with all user space size has to be within the process limit, which violates your assertion that userspace doesn't need the whole structure. 
If we add the requirement that the space that would be occupied by the complete atom has to exist, then we can copy the whole struct uatom with copy_from_user and then copy the args with get_user. User space can still pack them more densely, and we can still stop copying on a NULL arg pointer. Actually, calling access_ok then __get_user can be more expensive on some architectures, because they have to verify both start and length in access_ok but only need to verify the start in get_user, since they have unmapped areas between user space and kernel space. This would also mean that we don't check arg_ptr for NULL without verifying that get_user actually worked. The gotos in exec_uatom are just a while loop with a break. sys_umem_add should be in lib/ under lib-y in the Makefile. In fact, declaring the function weak does not make it a weak syscall implementation on some architectures. Weak syscall aliases to sys_ni_syscall are needed for when async support is not selected in Kconfig. The Documentation patch
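The copy strategy proposed above (one bulk copy of the atom, then per-argument fetches that stop at the first NULL pointer) can be modeled in user space. All names here are illustrative; `memcpy` stands in for `copy_from_user()` and the dereference for `get_user()`.

```c
#include <string.h>

/* Model of the two-stage atom copy discussed above. */
#define MAX_ARGS 6

struct uatom_model {
    unsigned long nr;                  /* syscall number */
    unsigned long *arg_ptr[MAX_ARGS];  /* NULL terminates a packed atom */
};

static int fetch_args(const struct uatom_model *user_atom,
                      unsigned long args[MAX_ARGS], int *nargs)
{
    struct uatom_model atom;

    /* One bulk copy of the complete atom (copy_from_user stand-in). */
    memcpy(&atom, user_atom, sizeof(atom));

    *nargs = 0;
    for (int i = 0; i < MAX_ARGS; i++) {
        if (!atom.arg_ptr[i])
            break;                     /* densely packed atoms stop early */
        args[i] = *atom.arg_ptr[i];    /* per-argument get_user stand-in */
        (*nargs)++;
    }
    return 0;
}
```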
Re: Make sure we populate the initroot filesystem late enough
On Feb 27, 2007, at 2:24 AM, David Woodhouse wrote: On Sun, 2007-02-25 at 20:13 -0800, Linus Torvalds wrote: On Sun, 25 Feb 2007, David Woodhouse wrote: Can you try adding something like memset(start, 0xf0, end - start); Yeah, I did that before giving up on it for the day and going in search of dinner. It changes the failure mode to a BUG() in cache_free_debugcheck(), at line 2876 of mm/slab.c Ok, that's just strange. In this case I hadn't left the 'return' in free_initrd_mem(). I was poisoning the pages and then returning them to the pool as usual. If I poison the pages and _don't_ return them to the pool, it boots fine. PageReserved is set on every page in the initrd region; total page_count() is equal to the number of pages (which doesn't _necessarily_ mean that page_count() for every page is equal to 1 but it's a strong hint that that's the case). Looking in /dev/mem after it boots, I see that my poison is still present throughout the whole region. One obvious thing to do would be to remove all the "__initdata" entries in mm/slab.c.. This is biting us long before we call free_initmem(). But I'd also like to see the full backtrace for the BUG_ON(), in case that gives any clues at all. I'll see if I can find a camera. It smells like the pages weren't actually reserved in the first place and we were blithely allocating them. The only problem with that theory is that the initrd doesn't seem to be getting corrupted -- and if we were handing out its pages like that then surely _something_ would have scribbled on it before we tried to read it. Yeah, I don't think it's necessarily initrd itself, I'd be more inclined to think that the reason you see this change with the initrd unpacking is simply that it does a lot of allocations for the initrd files, so I think it is only indirectly involved - just because it ends up being a slab user. Whatever happens, initrd as a 'slab user' is fine. 
The crashes happen _later_, when someone else is using the memory which used to belong to the initrd. In that 'BUG at slab.c:2876' I mentioned above, r3 was within the initrd region. As I said, I'll try to find a camera. Just a thought: any chance you are using one of the unusual code paths, like the bootloader moving the initrd, or using a kernel-crash region? milton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.6.20: Rebuild after trivial patch rebuilds way too much
On Mon, 12 Feb 2007 12:38:15 -0500 (EST), Andrey Borzenkov wrote: > Is it not too much for a trivial two-line patch to a single file? > > {pts/1}% stg import > ~/patch/Re_2_6_20_rc6_libata_PATA_ATAPI_CDROM_is_not_working.patch > Importing > patch "Re_2_6_20_rc6_libata_PATA_ATAPI_CDROM_is_not_working.patch"... done > Now at patch "Re_2_6_20_rc6_libata_PATA_ATAPI_CDROM_is_not_working.patch" > {pts/1}% make -C ~/src/linux-git O=$HOME/build/linux-2.6.20 > make: Entering directory `/home/bor/src/linux-git' .. 120 lines of make ... You can ask Kbuild why it's building with make V=2. make V=2 -C ~/src/linux-git O=$HOME/build/linux-2.6.20 From scripts/Kbuild.include: ### # why - tell why a a target got build # enabled by make V=2 # Output (listed in the order they are checked): # (1) - due to target is PHONY # (2) - due to target missing # (3) - due to: file1.h file2.h # (4) - due to command line change # (5) - due to missing .cmd file # (6) - due to target not in $(targets) # (1) PHONY targets are always build # (2) No target, so we better build it # (3) Prerequisite is newer than target # (4) The command line stored in the file named dir/.target.cmd # differed from actual command line. This happens when compiler # options changes # (5) No dir/.target.cmd file (used to store command line) # (6) No dir/.target.cmd file and target not listed in $(targets) # This is a good hint that there is a bug in the kbuild file Regards, milton
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 9, 2007, at 10:17 AM, Carl Love wrote: On Thu, 2007-02-08 at 20:46 -0600, Milton Miller wrote: On Feb 8, 2007, at 4:51 PM, Carl Love wrote: On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote: On Thursday 08 February 2007 15:18, Milton Miller wrote: 1) sample rate setup In the current patch, the user specifies a sample rate as a time interval. The kernel is (a) calling cpufreq to get the current cpu frequency, (b) converting the rate to a cycle count, (c) converting this to a 24-bit LFSR count, an iterative algorithm (in this patch, starting from one of 256 values, so a max of 2^16 or 64k iterations), (d) calculating a trace unload interval. In addition, a cpufreq notifier is registered to recalculate on frequency changes. No. The user issues the command opcontrol --event:N where N is the number of events (cycles, L2 cache misses, instructions retired, etc.) that are to elapse between collecting the samples. So you are saying that b and c are primary, and a is used to calculate a safe value for d. All of the above work is done, just from a different starting point? There are two things: 1) set up the LFSR to control how often the hardware puts samples into the trace buffer, and 2) set up the kernel timer to read the trace buffer (your d; d is a function of the cpu freq) and process the samples. (c is nonsense.) Well, it's calculate the rate from the count vs the count from the rate. The kernel timer was set with the goal that the hardware trace buffer would not get more than half full, to ensure we would not lose samples even for the maximum rate at which the hardware would be adding samples to the trace buffer (user-specified N=100,000). That should be well commented. OProfile passes the value N to the kernel via the variable ctr[i].count, where i is the performance counter entry for that event. OK, I haven't looked at the API closely. Actually, the oprofile userspace fills out files in a file system to tell the kernel what it needs to know. 
The powerpc code defines the resources needed to use the PMU hardware, which is (1) common mux selects, for processors that need them, and (2) a set of directories, one for each of the pmu counters, each of which contains the controls for that counter (such as enable kernel space, enable user space, event timeout to interrupt or sample collection, etc.). The ctr[i].count is one of these files. Specifically with SPU profiling, we do not use performance counters because the CELL HW does not allow the PPU to read the SPU PC when a performance counter interrupt occurs. We are using some additional hw support in the chip that allows us to periodically capture the SPU PC. There is an LFSR hardware counter that can be started at an arbitrary LFSR value. When the last LFSR value in the sequence is reached, a sample is taken and stored in the trace buffer. Hence, the value of N specified by the user must get converted to the LFSR value that is N from the end of the sequence. OK, so it's arbitrary-load-count-to-max vs count-and-compare. A critical detail when computing the value to load, but the net result is the same; the value for the count is hard to determine. The above statement makes no sense to me. I think I talked past you. Your description of the hardware vs mine differed in that your counter always ends at a specified point and is loaded with the variable count, whereas mine had it comparing to the count as it incremented. However, I now see that you were referring to the fact that what the user specifies, the count, has to be converted to an LFSR value. My point is this can be done in the user-space oprofile code. It already has to look up magic numbers for setting up event selection muxes for other hardware; adding an LFSR calculation is not beyond reason. Nor is having it provide two values in two files. Determining the initial LFSR value that is N values from the last value in the sequence is not easy to do. Well, it's easy; it's just order(N). 
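The "just order(N)" claim can be sketched by stepping the LFSR backwards N times from its terminal value, since the inverse of an LFSR step is itself a cheap shift-and-XOR. The 24-bit tap set below (bits 23, 22, 21, 16, i.e. polynomial x^24+x^23+x^22+x^17+1) is a generic maximal-length choice for illustration only; the actual Cell trace-logic polynomial may differ, and the function names are hypothetical.

```c
#include <stdint.h>

/* One forward step of an illustrative 24-bit Fibonacci LFSR. */
static uint32_t lfsr_forward(uint32_t v)
{
    uint32_t bit = ((v >> 23) ^ (v >> 22) ^ (v >> 21) ^ (v >> 16)) & 1;
    return ((v << 1) | bit) & 0xffffff;
}

/* Exact inverse step: recover the previous state from the current one. */
static uint32_t lfsr_back(uint32_t v)
{
    uint32_t b23 = (v ^ (v >> 23) ^ (v >> 22) ^ (v >> 17)) & 1;
    return (v >> 1) | (b23 << 23);
}

/* Value to load so the hardware reaches the terminal value (and takes a
 * sample) after exactly n ticks: walk back n steps -- O(n), not O(2^24). */
static uint32_t lfsr_load_value(uint32_t terminal, unsigned int n)
{
    while (n--)
        terminal = lfsr_back(terminal);
    return terminal;
}
```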
The same clock that the processor is running at is used to control the LFSR count. Hence the LFSR counter increments once per CPU clock cycle, regardless of the CPU frequency or changes in the frequency. There is no calculation for the LFSR value that is a function of the processor frequency, and no need to adjust the LFSR when the processor frequency changes. Oh, so the LFSR doesn't have to be recomputed, only the time between unloads. The LFSR value is computed ONCE when you start OProfile. The value is set up in the hardware once when OProfile starts. The hardware will always restart with the value given to it after it reaches the last value in the sequence. What you call the "time between unloads" is the time at which you schedule the kernel routine to empty the trace buffer. It is calculated once. It would only need to be recomputed if the cpu frequency changed. The obvious problem
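The "time between unloads" is the one quantity in this exchange that does depend on the CPU frequency, and the arithmetic is simple: drain before the buffer passes half full. The buffer size and clock rate below are illustrative stand-ins, not Cell's real figures; only `cpu_hz` would change on a frequency transition.

```c
#include <stdint.h>

/* Back-of-the-envelope for the trace-buffer drain timer discussed above:
 * half the buffer's worth of samples, at N cycles per sample, converted
 * to nanoseconds at the current clock rate. */
static uint64_t drain_interval_ns(uint64_t cycles_per_sample,
                                  uint64_t buf_entries,
                                  uint64_t cpu_hz)
{
    return (buf_entries / 2) * cycles_per_sample * 1000000000ull / cpu_hz;
}
```

With the user's N=100,000, an assumed 256-entry buffer, and an assumed 3.2 GHz clock, this gives a 4 ms drain period; halving the clock frequency would double it, which is why only the timer, not the LFSR, needs recomputing.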
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 9, 2007, at 1:10 PM, Arnd Bergmann wrote: On Friday 09 February 2007 19:47, Milton Miller wrote: On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote: Doing the translation in two stages in user space, as you suggest here, definitely makes sense to me. I think it can be done a little simpler though: Why would you need the accurate dcookie information to be provided by the kernel? The ELF loading is done in user space, and the kernel only reproduces what it thinks that came up with. If the kernel only gives the dcookie information about the SPU ELF binary to the oprofile user space, then that can easily recreate the same mapping. Actually, I was trying to cover issues such as anonymous memory. If the kernel doesn't generate dcookies for the load segments, the information is not recoverable once the task exits. This would also allow the loader to create an artificial elf header that covered both the base executable and a dynamically linked one. Other alternatives exist, such as a structure for context-id that would have its own identifying magic and an array of elf header pointers. But _why_ do you want to solve that problem? We don't have dynamically linked binaries, and I really don't see why the loader would want to create artificial elf headers... I'm explaining how they could be handled. Actually, I think the other proposal (an identified structure that points to several elf headers) would be more appropriate. As you point out, if there are presently no dynamic libraries in use, it doesn't have to be solved today. I'm just trying to make the code future-proof, or at least leave a clear path forward. The kernel still needs to provide the overlay identifiers though. Yes, which means it needs to parse the header (or libspe be enhanced to write the monitor info to a spufs file). We thought about this in the past and discarded it because of the complexity of an spufs interface that would handle this correctly. 
Not sure what would be difficult, and it would allow other binary formats. But parsing the headers in the kernel means existing userspace doesn't have to be upgraded, so I am not proposing this requirement. Yes, this sounds nice. But it does not at all help accuracy, only performance, right? It allows the user space to know when the sample was taken and be aware of the ambiguity. If the sample rate is high enough in relation to the overlay switch rate, user space could decide to discard the ambiguous samples. Yes, good point. This approach allows multiple objects by its nature. A new elf header could be constructed in memory that contained the union of the elf objects' load segments, and the tools would magically work. Alternatively, the object id could point to a new structure, identified via a new header, that points to other elf headers (easily differentiated by the elf magic headers). Other binary formats, including several objects in an ar archive, could be supported. Yes, that would be a new feature if the kernel passed dcookie information for every section, but I doubt that it is worth it. I have not seen any program that allows loading code from more than one ELF file. In particular, the ELF format on the SPU is currently lacking the relocation mechanisms that you would need for resolving spu-side symbols at load time. Actually, it could check all load segments, and only report those where the dcookie changes (as opposed to the offset). I'm not really following you here, but probably you misunderstood my point as well. I was thinking in terms of dynamic libraries, and totally skipped your comment about the relocation info being missing. My reply's point was that the table could be compressed to the current entry if all hashed to the same vm area. This seems to incur a run-time overhead on the SPU even if not profiling; I would consider that not acceptable. It definitely is overhead. Which means it would have to be optional, like gprof. 
There is some work going on for another profiler, independent of the hardware feature, that only relies on instrumenting the spu executable for things like DMA transfers and overlay changes. Regardless, it's beyond the current scope. milton
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote: On Thursday 08 February 2007 15:18, Milton Miller wrote: The current patch specifically identifies that only single elf objects are handled. There is no code to handle dynamically linked libraries or overlays. Nor is there any method to present samples that may have been collected during context switch processing; they must be discarded. I thought it already did handle overlays, what did I miss here? It does, see my reply to Maynard. Not sure what I was thinking when I wrote this; possibly I was thinking of the inaccuracies. My proposal is to change what is presented to user space. Instead of trying to translate the SPU address to the backing file as the samples are recorded, store the samples as the SPU context and address. The context switch would record tid, pid, and object id as it does now. In addition, if this is a new object-id, the kernel would read elf headers as it does today. However, it would then proceed to provide accurate dcookie information for each loader region and overlay. Doing the translation in two stages in user space, as you suggest here, definitely makes sense to me. I think it can be done a little simpler though: Why would you need the accurate dcookie information to be provided by the kernel? The ELF loading is done in user space, and the kernel only reproduces what it thinks that came up with. If the kernel only gives the dcookie information about the SPU ELF binary to the oprofile user space, then that can easily recreate the same mapping. Actually, I was trying to cover issues such as anonymous memory. If the kernel doesn't generate dcookies for the load segments, the information is not recoverable once the task exits. This would also allow the loader to create an artificial elf header that covered both the base executable and a dynamically linked one. Other alternatives exist, such as a structure for context-id that would have its own identifying magic and an array of elf header pointers. 
The kernel still needs to provide the overlay identifiers though. Yes, which means it needs to parse the header (or libspe be enhanced to write the monitor info to a spufs file). To identify which overlays are active (instead of the present read-on-use and search-the-list-to-translate approach), the kernel would record the location of the overlay identifiers as it parsed the ELF headers, but would then read the identification word and record the present value as a sample in a separate but related stream. The kernel could maintain the last value for each overlay and only send profile events for the deltas. Right. This approach trades translation lookup overhead on each recorded sample for a burst of data on new context activation. In addition, it exposes the sample point of the overlay identifier vs the address collection. This allows the ambiguity to be exposed to user space. In addition, with the above proposed kernel timer vs sample collection, user space could limit the elapsed time between the address collection and the overlay id check. Yes, this sounds nice. But it does not at all help accuracy, only performance, right? It allows the user space to know when the sample was taken and be aware of the ambiguity. If the sample rate is high enough in relation to the overlay switch rate, user space could decide to discard the ambiguous samples. This approach allows multiple objects by its nature. A new elf header could be constructed in memory that contained the union of the elf objects' load segments, and the tools would magically work. Alternatively, the object id could point to a new structure, identified via a new header, that points to other elf headers (easily differentiated by the elf magic headers). Other binary formats, including several objects in an ar archive, could be supported. Yes, that would be a new feature if the kernel passed dcookie information for every section, but I doubt that it is worth it. 
I have not seen any program that allows loading code from more than one ELF file. In particular, the ELF format on the SPU is currently lacking the relocation mechanisms that you would need for resolving spu-side symbols at load time. Actually, it could check all load segments, and only report those where the dcookie changes (as opposed to the offset). If better overlay identification is required, in theory the overlay switch code could be augmented to record the switches (DMA reference time from the PowerPC memory and record a relative decrementer in the SPU); this is obviously a future item. But it is facilitated by having user space resolve the SPU to source file translation. This seems to incur a run-time overhead on the SPU even if not profiling; I would consider that not acceptable. It definitely is overhead. Which means it would have to be optional, like gprof. milton
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 8, 2007, at 5:59 PM, Maynard Johnson wrote: Milton, Thank you for your comments. Carl will reply to certain parts of your posting where he's more knowledgeable than I. See my replies below. Thanks for the pleasant tone and dialog. Milton Miller wrote: On Feb 6, 2007, at 5:02 PM, Carl Love wrote: This is the first update to the patch previously posted by Maynard Johnson as "PATCH 4/4. Add support to OProfile for profiling CELL". Data collected: The current patch starts tackling these translation issues for the presently common case of a static, self-contained binary from a single file, either a single separate source file or embedded in the data of the host application. When creating the trace entry for an SPU context switch, it records the application owner, pid, tid, and dcookie of the main executable. In addition, it looks up the object-id as a virtual address and records the offset if it is non-zero, or the dcookie of the object if it is zero. The code then creates a data structure by reading the elf headers from the user process (at the address given by the object-id) and building a list of SPU address to elf object offsets, as specified by the ELF loader headers. In addition to the elf loader section, it processes the overlay headers and records the address, size, and magic number of each overlay. When the hardware trace entries are processed, each address is looked up in this structure and translated to the elf offset. If it is an overlay region, the overlay identifier word is read and the list is searched for the matching overlay. The resulting offset is sent to the oprofile system. The current patch specifically identifies that only single elf objects are handled. There is no code to handle dynamically linked libraries or overlays. Nor is there any method to Yes, we do handle overlays. (Note: I'm looking into a bug right now in our overlay support.) I knew you handled overlays, and I did not mean to say that you did not. I am not sure how that got there. 
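The per-sample lookup described above can be modeled compactly: search the table built from the ELF program headers, and for overlay regions compare the live identifier word before attributing the sample. The structures and names below are illustrative stand-ins, not the patch's own, and the `live_ident` parameter stands in for reading the overlay identifier word from the SPU's local store.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of the SPU-address-to-ELF-offset translation discussed above. */
struct spu_region {
    uint32_t spu_start, size;   /* SPU-local address range */
    uint32_t elf_offset;        /* offset of this region in the ELF object */
    uint32_t ovly_magic;        /* overlay identifier; 0 = not an overlay */
};

static long translate(const struct spu_region *tab, size_t n,
                      uint32_t addr, uint32_t live_ident)
{
    for (size_t i = 0; i < n; i++) {
        const struct spu_region *r = &tab[i];
        if (addr < r->spu_start || addr >= r->spu_start + r->size)
            continue;
        if (r->ovly_magic && r->ovly_magic != live_ident)
            continue;           /* overlay occupying this range isn't resident */
        return (long)(r->elf_offset + (addr - r->spu_start));
    }
    return -1;                  /* sample not attributable */
}
```

The ambiguity the thread keeps returning to lives in `live_ident`: it is read at drain time, not at sample time, so a sample taken just before an overlay switch can be attributed to the wrong overlay.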
I may have been thinking of the kernel-supplied context switch code discussion, or how the code supplied a dcookie or offset but not both. Actually, I might have been referring to the fact that overlays are guessed rather than recorded. present samples that may have been collected during context switch processing; they must be discarded. My proposal is to change what is presented to user space. Instead of trying to translate the SPU address to the backing file as the samples are recorded, store the samples as the SPU context and address. The context switch would record tid, pid, and object id as it does now. In addition, if this is a new object-id, the kernel would read elf headers as it does today. However, it would then proceed to provide accurate dcookie information for each loader region and overlay. To identify which overlays are active (instead of the present read-on-use and search-the-list-to-translate approach), the kernel would record the location of the overlay identifiers as it parsed the ELF headers, but would then read the identification word and record the present value as a sample in a separate but related stream. The kernel could maintain the last value for each overlay and only send profile events for the deltas. Discussions on this topic in the past have resulted in the current implementation precisely because we're able to record the samples as file offsets, just as the userspace tools expect. I was not part of the previous discussions, so please forgive me. I haven't had time to check out how much this would impact the userspace tools, but my gut feel is that it would be quite significant. If we were developing this module with a matching newly-created userspace tool, I would be more inclined to agree that this makes sense. I have not yet studied the user space tool. 
In fact, when I made this proposal, I had not studied the kernel oprofile code either, although I had read the concepts and discussion of the event buffer when the base patch was added to the kernel. I have read and now consider myself to have some understanding of the kernel side. I note that the user space tool calls itself alpha and the kernel support experimental. I only looked at the user space enough to determine it is written in C++. I would hope the tool would be modular enough to insert a data transformation pass to do this conversion that the kernel is doing. But you give no rationale for your proposal that justifies the change. The current implementation works, it has no impact on normal, non-profiling behavior, and the overhead during profiling is not noticeable. I was proposing this for several reasons. One, there were expressed limitations in the current proposal. The requirement that everything be linked into one ELF object for it to be profiled seemed significant to me. This implies that shared libraries (well, dynamically linked ones) have no path forward
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote: On Thursday 08 February 2007 15:18, Milton Miller wrote: The current patch specifically identifies that only single ELF objects are handled. There is no code to handle dynamically linked libraries or overlays. Nor is there any method to present samples that may have been collected during context switch processing, they must be discarded. I thought it already did handle overlays, what did I miss here? It does, see my reply to Maynard. Not sure what I was thinking when I wrote this, possibly I was thinking of the inaccuracies. My proposal is to change what is presented to user space. Instead of trying to translate the SPU address to the backing file as the samples are recorded, store the samples as the SPU context and address. The context switch would record tid, pid, object id as it does now. In addition, if this is a new object-id, the kernel would read the ELF headers as it does today. However, it would then proceed to provide accurate dcookie information for each loader region and overlay. Doing the translation in two stages in user space, as you suggest here, definitely makes sense to me. I think it can be done a little simpler though: Why would you need the accurate dcookie information to be provided by the kernel? The ELF loading is done in user space, and the kernel only reproduces what it thinks the loader came up with. If the kernel only gives the dcookie information about the SPU ELF binary to the oprofile user space, then that can easily recreate the same mapping. Actually, I was trying to cover issues such as anonymous memory. If the kernel doesn't generate dcookies for the load segments, the information is not recoverable once the task exits. This would also allow the loader to create an artificial ELF header that covered both the base executable and a dynamically linked one. Other alternatives exist, such as a structure for the context-id that would have its own identifying magic and an array of ELF header pointers. 
The kernel still needs to provide the overlay identifiers though. Yes, which means it needs to parse the header (or libspe be enhanced to write the monitor info to a spufs file). To identify which overlays are active, (instead of the present read-on-use, search-the-list-to-translate approach) the kernel would record the location of the overlay identifiers as it parsed the headers, but would then read the identification word and would record the present value as a sample from a separate but related stream. The kernel could maintain the last value for each overlay and only send profile events for the deltas. right. This approach trades translation lookup overhead for each recorded sample for a burst of data on new context activation. In addition it exposes the sample point of the overlay identifier vs the address collection. This allows the ambiguity to be exposed to user space. In addition, with the above proposed kernel timer vs sample collection, user space could limit the elapsed time between the address collection and the overlay id check. yes, this sounds nice. But it does not at all help accuracy, only performance, right? It allows the user space to know when the sample was taken and be aware of the ambiguity. If the sample rate is high enough in relation to the overlay switch rate, user space could decide to discard the ambiguous samples. This approach allows multiple objects by its nature. A new ELF header could be constructed in memory that contained the union of the ELF objects' load segments, and the tools will magically work. Alternatively the object id could point to a new structure, identified via a new header, that points to other ELF headers (easily differentiated by the ELF magic numbers). Other binary formats, including several objects in an ar archive, could be supported. Yes, that would be a new feature if the kernel passed dcookie information for every section, but I doubt that it is worth it. 
I have not seen any program that allows loading code from more than one ELF file. In particular, the ELF format on the SPU is currently lacking the relocation mechanisms that you would need for resolving spu-side symbols at load time Actually, it could check all load segments, and only report those where the dcookie changes (as opposed to the offset). If better overlay identification is required, in theory the overlay switch code could be augmented to record the switches (DMA reference time from the PowerPC memory and record a relative decrementer in the SPU); this is obviously a future item. But it is facilitated by having user space resolve the SPU to source file translation. This seems to incur a run-time overhead on the SPU even if not profiling, I would consider that not acceptable. It definitely is overhead. Which means it would have to be optional, like gprof. milton - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 9, 2007, at 1:10 PM, Arnd Bergmann wrote: On Friday 09 February 2007 19:47, Milton Miller wrote: On Feb 8, 2007, at 11:21 AM, Arnd Bergmann wrote: Doing the translation in two stages in user space, as you suggest here, definitely makes sense to me. I think it can be done a little simpler though: Why would you need the accurate dcookie information to be provided by the kernel? The ELF loading is done in user space, and the kernel only reproduces what it thinks the loader came up with. If the kernel only gives the dcookie information about the SPU ELF binary to the oprofile user space, then that can easily recreate the same mapping. Actually, I was trying to cover issues such as anonymous memory. If the kernel doesn't generate dcookies for the load segments, the information is not recoverable once the task exits. This would also allow the loader to create an artificial ELF header that covered both the base executable and a dynamically linked one. Other alternatives exist, such as a structure for the context-id that would have its own identifying magic and an array of ELF header pointers. But _why_ do you want to solve that problem? We don't have dynamically linked binaries and I really don't see why the loader would want to create artificial ELF headers... I'm explaining how they could be handled. Actually I think the other proposal (an identified structure that points to several ELF headers) would be more appropriate. As you point out, if there are presently no dynamic libraries in use, it doesn't have to be solved today. I'm just trying to make the code future proof, or at least leave a clear path forward. The kernel still needs to provide the overlay identifiers though. Yes, which means it needs to parse the header (or libspe be enhanced to write the monitor info to a spufs file). We thought about this in the past and discarded it because of the complexity of an spufs interface that would handle this correctly. 
Not sure what would be difficult, and it would allow other binary formats. But parsing the headers in the kernel means existing userspace doesn't have to be upgraded, so I am not proposing this requirement. yes, this sounds nice. But it does not at all help accuracy, only performance, right? It allows the user space to know when the sample was taken and be aware of the ambiguity. If the sample rate is high enough in relation to the overlay switch rate, user space could decide to discard the ambiguous samples. yes, good point. This approach allows multiple objects by its nature. A new ELF header could be constructed in memory that contained the union of the ELF objects' load segments, and the tools will magically work. Alternatively the object id could point to a new structure, identified via a new header, that points to other ELF headers (easily differentiated by the ELF magic numbers). Other binary formats, including several objects in an ar archive, could be supported. Yes, that would be a new feature if the kernel passed dcookie information for every section, but I doubt that it is worth it. I have not seen any program that allows loading code from more than one ELF file. In particular, the ELF format on the SPU is currently lacking the relocation mechanisms that you would need for resolving spu-side symbols at load time Actually, it could check all load segments, and only report those where the dcookie changes (as opposed to the offset). I'm not really following you here, but probably you misunderstood my point as well. I was thinking in terms of dynamic libraries, and totally skipped your comment about the relocation info being missing. My reply point was that the table could be compressed to the current entry if all hashed to the same vm area. This seems to incur a run-time overhead on the SPU even if not profiling, I would consider that not acceptable. It definitely is overhead. Which means it would have to be optional, like gprof. 
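The compression mentioned here, reporting a per-segment entry only when the backing dcookie differs from the previous segment's, could be sketched as follows (hypothetical types for illustration, not kernel code):

```c
#include <assert.h>

/* Hypothetical per-load-segment record: the backing file's dcookie
 * plus where the segment sits in SPU local store and in the file. */
struct seg {
	unsigned long dcookie;
	unsigned int spu_addr, elf_off;
};

/* Collapse consecutive segments that share a dcookie (they resolve
 * to the same backing object) into a single reported entry.
 * Returns the number of entries written to out. */
static int compress_segs(const struct seg *in, int n, struct seg *out)
{
	int m = 0;

	for (int i = 0; i < n; i++) {
		if (m > 0 && out[m - 1].dcookie == in[i].dcookie)
			continue; /* same backing file as previous entry */
		out[m++] = in[i];
	}
	return m;
}
```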
There is some work going on for another profiler independent of the hardware feature that only relies on instrumenting the spu executable for things like DMA transfers and overlay changes. Regardless, it's beyond the current scope. milton
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
On Feb 8, 2007, at 4:51 PM, Carl Love wrote: On Thu, 2007-02-08 at 18:21 +0100, Arnd Bergmann wrote: On Thursday 08 February 2007 15:18, Milton Miller wrote: 1) sample rate setup In the current patch, the user specifies a sample rate as a time interval. The kernel is (a) calling cpufreq to get the current cpu frequency, (b) converting the rate to a cycle count, (c) converting this to a 24 bit LFSR count, an iterative algorithm (in this patch, starting from one of 256 values so a max of 2^16 or 64k iterations), (d) calculating a trace unload interval. In addition, a cpufreq notifier is registered to recalculate on frequency changes. No. The user issues the command opcontrol --event:N where N is the number of events (cycles, l2 cache misses, instructions retired etc) that are to elapse between collecting the samples. So you are saying that b and c are primary, and a is used to calculate a safe value for d. All of the above work is done, just from a different starting point? OProfile passes the value N to the kernel via the variable ctr[i].count, where i is the performance counter entry for that event. OK, I haven't looked at the API closely. Specifically with SPU profiling, we do not use performance counters because the CELL HW does not allow the PPU to read the SPU PC when a performance counter interrupt occurs. We are using some additional hw support in the chip that allows us to periodically capture the SPU PC. There is an LFSR hardware counter that can be started at an arbitrary LFSR value. When the last LFSR value in the sequence is reached, a sample is taken and stored in the trace buffer. Hence, the value of N specified by the user must get converted to the LFSR value that is N from the end of the sequence. OK, so it's arbitrary load-count-to-max vs. count-and-compare. A critical detail when computing the value to load, but the net result is the same; the value for the count is hard to determine. 
The same clock that the processor is running at is used to control the LFSR count. Hence the LFSR counter increments once per CPU clock cycle regardless of the CPU frequency or changes in the frequency. There is no calculation for the LFSR value that is a function of the processor frequency. There is no need to adjust the LFSR when the processor frequency changes. Oh, so the LFSR doesn't have to be recomputed, only the time between unloads. Milton had a comment about the code:

if (ctr[0].event == SPU_CYCLES_EVENT_NUM) {
+	spu_cycle_reset = ctr[0].count;
+	return;
+ }

Well, given the above description, it is clear that if you are doing SPU event profiling, the value N is put into the ctr[0].count entry since there is only one event. Thus in cell_global_start_spu() you use spu_cycle_reset as the argument to the LFSR calculation routine to get the LFSR value that is N from the end of the sequence. I was looking at the patch and the context was not very good. You might consider adding -p to your diff command; it provides the function name after the @@. However, in this case, I think I just need to see the final result. The obvious problem is step (c): running a loop potentially 64 thousand times in kernel space will have a noticeable impact on other threads. I propose instead that user space perform the above 4 steps, and provide the kernel with two inputs: (1) the value to load in the LFSR and (2) the periodic frequency / time interval at which to empty the hardware trace buffer, perform sample analysis, and send the data to the oprofile subsystem. There should be no security issues with this approach. If the LFSR value is calculated incorrectly, either it will be too short, causing the trace array to overfill and data to be dropped, or it will be too long, and there will be fewer samples. Likewise, the kernel periodic poll can be too long, again causing overflow, or too frequent, causing only timer execution overhead. 
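As a rough sketch of the iterative seed search being discussed, here is a user-space model that finds the LFSR value N steps from the end of the sequence. The feedback taps below are invented for illustration; the real Cell trace-buffer polynomial differs:

```c
#include <assert.h>
#include <stdint.h>

/* One step of a 24-bit Fibonacci LFSR.  The taps (23, 22, 21, 16)
 * are chosen for illustration only; the actual Cell hardware
 * polynomial is different. */
static uint32_t lfsr_step(uint32_t v)
{
	uint32_t bit = ((v >> 23) ^ (v >> 22) ^ (v >> 21) ^ (v >> 16)) & 1;

	return ((v << 1) | bit) & 0xffffff;
}

/* The iterative search: walk the sequence once to learn its period,
 * then return the value that is n steps before "end", so loading
 * this seed makes the counter reach end (and the hardware take a
 * sample) after exactly n clocks. */
static uint32_t lfsr_seed_n_from_end(uint32_t end, uint32_t n)
{
	uint32_t period = 0, v = end;

	do {
		v = lfsr_step(v);
		period++;
	} while (v != end);

	v = end;
	for (uint32_t i = 0; i < period - (n % period); i++)
		v = lfsr_step(v);
	return v;
}
```

Running this loop in user space, and handing the kernel only the resulting seed, is exactly the division of labor the proposal argues for: the worst-case walk over the sequence is cheap in a user process but noticeable in kernel context.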
Various data is collected by the kernel while processing the periodic timer; this approach would also allow the profiling tools to control the frequency of this collection. More frequent collection results in more accurate sample data, with the linear cost of poll execution overhead. Frequency changes can be handled either by the profile code setting collection at a higher than necessary rate, or by interacting with the governor to limit the speeds. Optionally, the kernel can add a record indicating that some data was likely dropped if it is able to read all 256 entries without underflowing the array. This can be used as a hint to user space that the kernel time was too long for the collection rate. Moving the sample rate computation to user space sounds like the right idea, but why not have a more drastic version of it: No, I do
Re: [Cbe-oss-dev] [RFC, PATCH] CELL Oprofile SPU profiling updated patch
...of the overlay. When the hardware trace entries are processed, each address is looked up in this structure and translated to the ELF offset. If it is an overlay region, the overlay identify word is read and the list is searched for the matching overlay. The resulting offset is sent to the oprofile system. The current patch specifically identifies that only single ELF objects are handled. There is no code to handle dynamically linked libraries or overlays. Nor is there any method to present samples that may have been collected during context switch processing; they must be discarded. My proposal is to change what is presented to user space. Instead of trying to translate the SPU address to the backing file as the samples are recorded, store the samples as the SPU context and address. The context switch would record tid, pid, object id as it does now. In addition, if this is a new object-id, the kernel would read the ELF headers as it does today. However, it would then proceed to provide accurate dcookie information for each loader region and overlay. To identify which overlays are active, (instead of the present read-on-use, search-the-list-to-translate approach) the kernel would record the location of the overlay identifiers as it parsed the headers, but would then read the identification word and would record the present value as a sample from a separate but related stream. The kernel could maintain the last value for each overlay and only send profile events for the deltas. This approach trades translation lookup overhead for each recorded sample for a burst of data on new context activation. In addition it exposes the sample point of the overlay identifier vs the address collection. This allows the ambiguity to be exposed to user space. In addition, with the above proposed kernel timer vs sample collection, user space could limit the elapsed time between the address collection and the overlay id check. This approach allows multiple objects by its nature. 
A new ELF header could be constructed in memory that contained the union of the ELF objects' load segments, and the tools will magically work. Alternatively the object id could point to a new structure, identified via a new header, that points to other ELF headers (easily differentiated by the ELF magic numbers). Other binary formats, including several objects in an ar archive, could be supported. If better overlay identification is required, in theory the overlay switch code could be augmented to record the switches (DMA reference time from the PowerPC memory and record a relative decrementer in the SPU); this is obviously a future item. But it is facilitated by having user space resolve the SPU to source file translation. milton -- [EMAIL PROTECTED] Milton Miller Speaking for myself only.
Re: [PATCH] ppc64: iSeries early printk breakage
Does anything actually break without this patch? My reading of unregister_console says we will acquire the console semaphore, walk the list, fail to find the console, release the semaphore, and return. Hmm... unless there is a problem with the console preference code? I don't see anything that should deref a bad pointer, maybe it breaks the preference? From gitweb:

if (console_drivers == NULL)
	preferred_console = selected_console;
else if (console->flags & CON_CONSDEV)
	console_drivers->flags |= CON_CONSDEV;

On Sep 7, 2005, at 4:52 AM, Stephen Rothwell wrote: The earlier commit 8d9273918635f0301368c01b56c03a6f339e8d51 (Consolidate early console and PPCDBG code) broke iSeries because it caused unregister_console(&udbg_console) to be called unconditionally. iSeries never registers the udbg_console. This just reverts part of the change. Signed-off-by: Stephen Rothwell <[EMAIL PROTECTED]> --- arch/ppc64/kernel/udbg.c | 6 ++++++ 1 files changed, 6 insertions(+), 0 deletions(-) -- Cheers, Stephen Rothwell [EMAIL PROTECTED] http://www.canb.auug.org.au/~sfr/ 234f5032f6ccb4d72e4b74d33af55716b67d8a27

diff --git a/arch/ppc64/kernel/udbg.c b/arch/ppc64/kernel/udbg.c
--- a/arch/ppc64/kernel/udbg.c
+++ b/arch/ppc64/kernel/udbg.c
@@ -158,14 +158,20 @@ static struct console udbg_console = {
 	.index = -1,
 };
 
+static int early_console_initialized;
+
 void __init disable_early_printk(void)
 {
+	if (!early_console_initialized)
+		return;
 	unregister_console(&udbg_console);
+	early_console_initialized = 0;
 }
 
 /* called by setup_system */
 void register_early_udbg_console(void)
 {
+	early_console_initialized = 1;
 	register_console(&udbg_console);
 }