Re: What is 2.4 Linux networking performance like compared to BSD?
hans, we've found that the TCP and UDP performance on 2.4 is *dramatically* better than 2.2. with the acenic gig-e driver on PIII-933 UP (66MHz x 64bits PCI) we are getting 993 Mb/s with 2.4.0 with jumbo frames (about 850 Mb/s with standard ethernet frames). the best number we got with 2.2 was about 650 with jumbos and 550 with standard. i'd recommend it's networking performance to anyone. todd underwood [EMAIL PROTECTED] On Thu, 1 Mar 2001, Hans Reiser wrote: > Date: Thu, 01 Mar 2001 02:26:20 +0300 > From: Hans Reiser <[EMAIL PROTECTED]> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> > Subject: What is 2.4 Linux networking performance like compared to BSD? > > I have a client that wants to implement a webcache, but is very leery of > implementing it on Linux rather than BSD. > > They know that iMimic's polymix performance on Linux 2.2.* is half what it is on > BSD. Has the Linux 2.4 networking code caught up to BSD? > > Can I tell them not to worry about the Linux networking code strangling their > webcache product's performance, or not? > > Hans > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
UDP performance drop from -test9 to 2.4.0 (fwd)
sorry for the resend. i sent this earlier today but still haven't seen it, so i'm resending without attachments. the originally attached netperf results are at: http://www.unm.edu/~todd/udp.2.4.0-test9.9000mtu http://www.unm.edu/~todd/udp.2.4.0-test9.1500mtu http://www.unm.edu/~todd/udp.2.4.0.9000mtu http://www.unm.edu/~todd/udp.2.4.0.1500mtu -- Forwarded message -- Date: Mon, 8 Jan 2001 09:38:24 -0700 (MST) From: Todd <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Subject: UDP performance drop from -test9 to 2.4.0 folx, i'm seeing a signficant performance decline between the -test9 and the release version 2.4.0 on udp on acenic gig-e cards. i should say that the performance is startlingly better than it was under 2.2.14 (where the best we could get was 150+ Mbps on udp) but it still represents a signficant decline. i'm attaching netperf numbers with 1500 and 9000 byte mtus for both kernel versions (with interrupt coalescing and everything else as default--this is just a compile/boot/test scenario with no tuning in particular). standard kernel, standard /proc settings, standard acenic firmware and driver that shipped with the version in question. the summary of results: 12% reduction in performance at 1500-byte mtu, 6% reduction at 9000. when looking at the results remember that the second line in each case is the 'receive bandwith' line, which is the only useful number to report. kernel version max udp bandwithmtu message size == === 2.4.0-test9 554.9 15001470 2.4.0-test9 636.7 90008870 2.4.0 489.4 15001470 2.4.0 597.3 90007270 both versions show the horrible effect that the reassembly of fragments has on receive bandwith, but that's not really my question for the day :-). any info would be helpful. i'd be happy to run any further tests to get at the bottom of this. todd = Todd Underwood, [EMAIL PROTECTED] = - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
hw or other prob?
folx, on 2.4.0-test6 i got a bunch of the following log messages and then the console went dead (well, power-saving monitor and no amount of activity on the keyboard or mouse made it wake up), but the machine still responded to pings (could not connect w/ telnet or ssh or rlogin). does this look like a genuine hw problem or should i upgrade to a newer 2.4.0 test kernel? i'm happy to do any troubleshooting suggested. log messages (lots and lots of them. they all look just like this): Nov 12 06:43:30 zapata kernel: APIC error interrupt on CPU#0, should never happen. Nov 12 06:43:30 zapata kernel: ... APIC ESR0: 0002 Nov 12 06:43:30 zapata kernel: ... APIC ESR1: 0002 Nov 12 06:43:30 zapata kernel: ... bit 1: APIC Receive CS Error (hw problem). Nov 12 06:43:30 zapata kernel: APIC error interrupt on CPU#1, should never happen. Nov 12 06:43:30 zapata kernel: ... APIC ESR0: 0002 Nov 12 06:43:30 zapata kernel: ... APIC ESR1: 0002 Nov 12 06:43:30 zapata kernel: ... bit 1: APIC Receive CS Error (hw problem). Nov 12 07:43:18 zapata kernel: APIC error interrupt on CPU#1, should never happen. Nov 12 07:43:18 zapata kernel: ... APIC ESR0: 0002 Nov 12 07:43:18 zapata kernel: ... APIC ESR1: 0002 Nov 12 07:43:18 zapata kernel: ... bit 1: APIC Receive CS Error (hw problem). Nov 12 07:43:18 zapata kernel: APIC error interrupt on CPU#0, should never happen. machine is smp pIII450 running on a intel N440bx mb. main disk is ide, but i'm using sym53c8xx to access two scsi disks in a logical volume. thanks, todd underwood [EMAIL PROTECTED] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: zero-copy TCP
> > Jes Sorensen wrote: > > > It took me a little while in the beginning to convince Alteon to open > > > up and provide docs, but since they saw the light they have been > > > extremely helpful and went much further in their openness than I had > > > ever expected or dared to hope for. > > > > And now it's really showing in their favour. An amazing number of > > research groups are writing applications to run on the Alteon cards. While I agree with what's going on right now, the recent purchase of Alteon by Nortel (primarily for their switch line, not for the NICs) leaves quite a bit of doubt in my mind about the future of the card and the openness of the firmware in particular. todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE] Dolphin PCI-SCI RPM Drivers 1.1-4 released
folx, i must be missing something here. i'm not aware of a PCI bus that only supports 70 MBps but i am probably ignorant. this is why i was confused by jeff's performance numbers. 33MHz 32-bit PCI busses should do around 120MB/s (just do the math 33*32/8 allowing for some overhead of PCI bus negotiation), much greater than the numbers jeff is reporting. 66 MHz 64bit busses should do on the order of 500MB/s. the performance numbers that jeff is reporting are not very impressive even for the slowest PCI bus. we're seeing 993 Mbps (124MB/s) using the alteon acenic gig-e cards on 32-bit cards on a 66MHz bus. i would expect to get somewhat slower on a 33MHz bus but not catastrophically so (certainly nothing as slow as 60MB/s or 480Mb/s). what am i misunderstanding here? todd On Mon, 29 Jan 2001, Jeff V. Merkey wrote: > Date: Mon, 29 Jan 2001 16:49:53 -0700 > From: Jeff V. Merkey <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Cc: [EMAIL PROTECTED] > Subject: Re: [ANNOUNCE] Dolphin PCI-SCI RPM Drivers 1.1-4 released > > > Relative to some performance questions folks have asked, the SCI > adapters are limited by PCI bus speeds. If your system supports > 64-bit PCI you get much higher numbers. If you have a system > that supports 100+ Megabyte/second PCI throughput, the SCI > adapters will exploit it. > > This test was performed in on a 32-bit PCI system with a PCI bus > architecture that's limited to 70 MB/S. > > Jeff > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > Please read the FAQ at http://www.tux.org/lkml/ > = Todd Underwood, [EMAIL PROTECTED] criticaltv.com news, analysis and criticism. about tv. and other stuff. = - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE] Dolphin PCI-SCI RPM Drivers 1.1-4 released
folx, On Tue, 30 Jan 2001, Jeff V. Merkey wrote: > What numbers does G-Enet provide > doing userspace -> userspace transfers, and at what processor > overhead? using stock 2.4 kernel and alteon acenic cards with stock firmware we're seeing 993 MBps userspace->userspace (running netperf UDP_STREAM tests, which run as userspace client and server) with 88% CPU utilization. Using a modified version of the firmware that we wrote we're getting 993Mbps with 55% CPU utilization. > I posted the **ACCURATE** numbers from my test, but I did clarify that I > was using a system with a limp PCI bus. > > Jeff i appreciate that. i'm just trying to figure out why the numbers are so low compared to the network speed you mentioned. todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: [ANNOUNCE] Dolphin PCI-SCI RPM Drivers 1.1-4 released
jeff, On Tue, 30 Jan 2001, Jeff V. Merkey wrote: > On 32 bit PCI, the average we are seeing going userpace -> userspace is > 120-140 MB/S ranges in those systems that have a PCI bus with > bridge chipsets that can support these data rates. > > That's 2 x G-Enet. good numbers. not really 2 x Gig-e, though, is it. we're getting 993Mbps (124Mb/s) on alteon acenic gig-e adapters on 32bit/66MHz pci machines. i'm not saying that the nubmers aren't good or that the technology doesn't sound promising (SCI is definitely cool stuff). I'm just trying to put the numbers in perspective and show that the targets should be higher. i like your argument about cpu utilization. that's really the key. anything that moves significant traffic and can reduce cpu utilization will help enormously. unfortunately, you've not posted any cpu numbers. this is analogous to transmeta's complaint about battery life and processor speed: no one benchmarks both metrics at the same time. if they did, intel chips would not look nearly as good as they do. similarly, most people are still not posting bandwidth numbers along with cpu utilization numbers. if they did, lots of fast networks would be less appealing. todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Website question
Dear Linux Kernel, We have visited your website spinics.net today , and would like to request a link exchange between our websites. You may visit our website at barretire.com. Our website contains quality automotive content and tools that visitors of your website will find useful. Please reply if you are interested in exchanging links with us, and we will post a link to your website immediately. Here is the information you will need to create a link to our website: Title: Wheels & Tires shipped to your door Link: http://www.barretire.com Desc: Your internet wheel & tire source, Specializing in American Racing wheels, Motegi Wheels, Motto wheels, Riax wheels, Nitto tires, BF Goodrich tires, Yokohama tires & more. We also offer package deals shipped directly to your door. We look forward to partnering together. Best regards, Todd Hebert LA Wheels Direct - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
On Tue, 20 Nov 2012, Joe Jin wrote: > On 11/20/12 16:59, Dave, Tushar N wrote: >> Have you power off the system completely after modifying eeprom? If not >> please do so. > > Hi Tushar, > > Seems not works for me, would you please help to check what is wrong of my > operations? ... > # lspci -s :52:00.1 -vvv > 52:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet > Controller (rev 06) > <--snip--> > Capabilities: [e0] Express (v1) Endpoint, MSI 00 > DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, > L1 <64us > ExtTag- AttnBtn- AttnInd- PwrInd- RBE- FLReset- > DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ > Unsupported+ > RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ > MaxPayload 128 bytes, MaxReadReq 4096 bytes > ^ > <--snip--> If you look at the previous section, DevCap, you'll see that it's correctly advertising 256 bytes but the system is negotiating 128 for the link to the Ethernet controller. Things on the "other" side of the link are controlled outside of the e1000 driver. Tushar's first suggestion was to check the PCIe payload settings in the entire chain. Have you done that? Mismatches will cause hangs. Todd Fujinaka Technical Marketing Engineer LAN Access Division (LAD) Intel Corporation todd.fujin...@intel.com (503) 712-4565 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Forgive me if I'm being too repetitious as I think some of this has been mentioned in the past. We (and by we I mean the Ethernet part and driver) can only change the advertised availability of a larger MaxPayloadSize. The size is negotiated by both sides of the link when the link is established. The driver should not change the size of the link as it would be poking at registers outside of its scope and is controlled by the upstream bridge (not us). You also need to check all the PCIe links to get to the device. There can be several to get from the root complex, through bridges, to the endpoint Ethernet controller. The Ethernet part and driver has no control over any other links. You'll have to talk to the motherboard manufacturer about those links. Your original problem appears to be hangs and Tushar asked you to the entire path of PCIe connections from the root complex to the endpoint. Any mismatches in payload can cause hangs and I believe you have had the problem in the past. I'm sure you remember all the lspci commands to list the tree view and to dump all the details from each of the links and I would suggest you do that to check to see that the payload sizes match. What I do is "lspci -tvvv" to see what's connected, then "lspci -s xx:xx.x -vvv" to check the devices on the link. Thanks. Todd Fujinaka Technical Marketing Engineer LAN Access Division (LAD) Intel Corporation todd.fujin...@intel.com (503) 712-4565 -Original Message- From: Mary Mcgrath [mailto:mary.mcgr...@oracle.com] Sent: Monday, November 26, 2012 6:07 PM To: Joe Jin Cc: net...@vger.kernel.org; e1000-de...@lists.sf.net; linux-kernel@vger.kernel.org Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang Joe Thank you for working this. I would love to find out how they expect a customer to make the modification To "word 0x1A, and see if the 8th bit is 0 or 1, and to change to 0." I have in turn asked the ct for the lspci command on eth3, maybe the incorrect setting is upstream. Again, thank you. Regards Mary -Original Message- From: Joe Jin Sent: Monday, November 26, 2012 8:00 PM To: Fujinaka, Todd Cc: Dave, Tushar N; net...@vger.kernel.org; e1000-de...@lists.sf.net; linux-kernel@vger.kernel.org; Mary Mcgrath Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang On 11/27/12 00:23, Fujinaka, Todd wrote: > If you look at the previous section, DevCap, you'll see that it's > correctly advertising 256 bytes but the system is negotiating 128 for > the link to the Ethernet controller. Things on the "other" side of the > link are controlled outside of the e1000 driver. > > Tushar's first suggestion was to check the PCIe payload settings in > the entire chain. Have you done that? Mismatches will cause hangs. Hi Todd, So far I had to know how to modify the maxpayload size, since BIOS have not entry to change this, so I had to use ethtool, now I need to get the offset of MaxPayload size in eeprom, I ever tried to find from Intel online document but failed, any idea? Thanks in advance, Joe -- Monitor your physical, virtual and cloud infrastructure from a single web console. Get in-depth insight into apps, servers, databases, vmware, SAP, cloud infrastructure, etc. Download 30-day Free Trial. Pricing starts from $795 for 25 servers or applications! http://p.sf.net/sfu/zoho_dev2dev_nov ___ E1000-devel mailing list e1000-de...@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/e1000-devel To learn more about IntelĀ® Ethernet, visit http://communities.intel.com/community/wired -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Thanks for the clarification. I was just going by the PCIe spec, which says the lowest value of both ends is used, and I figured SOMETHING had to be looking at that and doing some sort of negotiation. I'm no BIOS guy, so I'm not sure what's actually going on, whether something walks the PCIe tree or if the BIOS just sets all the values to the minimum. Todd Fujinaka Technical Marketing Engineer LAN Access Division (LAD) Intel Corporation todd.fujin...@intel.com (503) 712-4565 -Original Message- From: Ben Hutchings [mailto:bhutchi...@solarflare.com] Sent: Tuesday, November 27, 2012 10:11 AM To: Fujinaka, Todd; Mary Mcgrath Cc: Joe Jin; net...@vger.kernel.org; e1000-de...@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci Subject: RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote: > Forgive me if I'm being too repetitious as I think some of this has > been mentioned in the past. > > We (and by we I mean the Ethernet part and driver) can only change the > advertised availability of a larger MaxPayloadSize. The size is > negotiated by both sides of the link when the link is established. The > driver should not change the size of the link as it would be poking at > registers outside of its scope and is controlled by the upstream > bridge (not us). [...] MaxPayloadSize (MPS) is not negotiated between devices but is programmed by the system firmware (at least for devices present at boot - the kernel may be responsible in case of hotplug). You can use the kernel parameter 'pci=pcie_bus_perf' (or one of several others) to set a policy that overrides this, but no policy will allow setting MPS above the device's MaxPayloadSizeSupported (MPSS). (These parameters are not documented in Documentation/kernel-parameters.txt! Someone ought to fix that.) Ben. -- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked.
RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
The only EEPROM I know about or can speak to is the one attached to the 82571 and it doesn't set the MaxPayloadSize. That's done by the BIOS. Todd Fujinaka Technical Marketing Engineer LAN Access Division (LAD) Intel Corporation todd.fujin...@intel.com (503) 712-4565 -Original Message- From: Joe Jin [mailto:joe@oracle.com] Sent: Wednesday, November 28, 2012 12:31 AM To: Ben Hutchings Cc: Fujinaka, Todd; Mary Mcgrath; net...@vger.kernel.org; e1000-de...@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang On 11/28/12 02:10, Ben Hutchings wrote: > On Tue, 2012-11-27 at 17:32 +0000, Fujinaka, Todd wrote: >> Forgive me if I'm being too repetitious as I think some of this has >> been mentioned in the past. >> >> We (and by we I mean the Ethernet part and driver) can only change >> the advertised availability of a larger MaxPayloadSize. The size is >> negotiated by both sides of the link when the link is established. >> The driver should not change the size of the link as it would be >> poking at registers outside of its scope and is controlled by the >> upstream bridge (not us). > [...] > > MaxPayloadSize (MPS) is not negotiated between devices but is > programmed by the system firmware (at least for devices present at > boot - the kernel may be responsible in case of hotplug). You can use > the kernel parameter 'pci=pcie_bus_perf' (or one of several others) to > set a policy that overrides this, but no policy will allow setting MPS > above the device's MaxPayloadSizeSupported (MPSS). > Ben, Unfortunately I'm using 3.0.x kernel and this is not included in the kernel. So I'm trying to use ethtool modify it from eeprom to see if help or no. Todd, I'll review all MaxPayload for all devices, but need to say if it mismatch, customer could not modify it from BIOS for there was not entry at there, to test it, we have to find how to verify if this is the root cause, so still need to find the offset in eeprom. Thanks in advance, Joe
RE: [E1000-devel] 82571EB: Detected Hardware Unit Hang
Someone else pointed this out to me locally. If you have a non-client BIOS, you should be able to set the MaxPayloadSize using setpci. You have to make sure that you're being consistent throughout all the associated links. Todd Fujinaka Technical Marketing Engineer LAN Access Division (LAD) Intel Corporation todd.fujin...@intel.com (503) 712-4565 -Original Message- From: Ethan Zhao [mailto:ethan.ker...@gmail.com] Sent: Wednesday, November 28, 2012 7:10 PM To: Fujinaka, Todd Cc: Joe Jin; Ben Hutchings; Mary Mcgrath; net...@vger.kernel.org; e1000-de...@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang Joe, Possibly your customer is running a kernel without source code on a platform whose vendor wouldn't like to fix BIOS issue( Is that a HP/Dell server ?). Anyway, to see if is a payload issue or, you could change the payload size with setpci tool to those devices and set the link retrain bit to trigger the link retraining to debug the issue and identity the root cause. I thinks it is much easier than modify the BIOS or eeprom of NIC. e.g. set device control register to 0f 00 (128 bytes payload size) # setpci -v -s 00:02.0 98.w=000f set device link control register to 60h (retrain the link) # setpci -v -s 00:02.0 a0.b=60 Hope it works, Just my 2 cents. ethan.z...@oracle.com On Wed, Nov 28, 2012 at 11:53 PM, Fujinaka, Todd wrote: > The only EEPROM I know about or can speak to is the one attached to the 82571 > and it doesn't set the MaxPayloadSize. That's done by the BIOS. > > Todd Fujinaka > Technical Marketing Engineer > LAN Access Division (LAD) > Intel Corporation > todd.fujin...@intel.com > (503) 712-4565 > > > -Original Message- > From: Joe Jin [mailto:joe@oracle.com] > Sent: Wednesday, November 28, 2012 12:31 AM > To: Ben Hutchings > Cc: Fujinaka, Todd; Mary Mcgrath; net...@vger.kernel.org; > e1000-de...@lists.sf.net; linux-kernel@vger.kernel.org; linux-pci > Subject: Re: [E1000-devel] 82571EB: Detected Hardware Unit Hang > > On 11/28/12 02:10, Ben Hutchings wrote: >> On Tue, 2012-11-27 at 17:32 +, Fujinaka, Todd wrote: >>> Forgive me if I'm being too repetitious as I think some of this has >>> been mentioned in the past. >>> >>> We (and by we I mean the Ethernet part and driver) can only change >>> the advertised availability of a larger MaxPayloadSize. The size is >>> negotiated by both sides of the link when the link is established. >>> The driver should not change the size of the link as it would be >>> poking at registers outside of its scope and is controlled by the >>> upstream bridge (not us). >> [...] >> >> MaxPayloadSize (MPS) is not negotiated between devices but is >> programmed by the system firmware (at least for devices present at >> boot - the kernel may be responsible in case of hotplug). You can >> use the kernel parameter 'pci=pcie_bus_perf' (or one of several >> others) to set a policy that overrides this, but no policy will allow >> setting MPS above the device's MaxPayloadSizeSupported (MPSS). >> > > Ben, > > Unfortunately I'm using 3.0.x kernel and this is not included in the kernel. > So I'm trying to use ethtool modify it from eeprom to see if help or no. > > > Todd, I'll review all MaxPayload for all devices, but need to say if it > mismatch, customer could not modify it from BIOS for there was not entry at > there, to test it, we have to find how to verify if this is the root cause, > so still need to find the offset in eeprom. > > Thanks in advance, > Joe > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] alarmtimer: add functions for timerfd support
Add functions needed for hooking up alarmtimer to timerfd: * alarm_restart: Similar to hrtimer_restart, restart an alarmtimer after the expires time has already been updated (as with alarm_forward). * alarm_forward_now: Similar to hrtimer_forward_now, move the expires time forward to an interval from the current time of the associated clock. * alarm_start_relative: Start an alarmtimer with an expires time relative to the current time of the associated clock. * alarm_expires_remaining: Similar to hrtimer_expires_remaining, return the amount of time remaining until alarm expiry. Signed-off-by: Todd Poynor --- include/linux/alarmtimer.h | 4 kernel/time/alarmtimer.c | 39 ++- 2 files changed, 42 insertions(+), 1 deletion(-) diff --git a/include/linux/alarmtimer.h b/include/linux/alarmtimer.h index 9069694..a899402 100644 --- a/include/linux/alarmtimer.h +++ b/include/linux/alarmtimer.h @@ -44,10 +44,14 @@ struct alarm { void alarm_init(struct alarm *alarm, enum alarmtimer_type type, enum alarmtimer_restart (*function)(struct alarm *, ktime_t)); int alarm_start(struct alarm *alarm, ktime_t start); +int alarm_start_relative(struct alarm *alarm, ktime_t start); +void alarm_restart(struct alarm *alarm); int alarm_try_to_cancel(struct alarm *alarm); int alarm_cancel(struct alarm *alarm); u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval); +u64 alarm_forward_now(struct alarm *alarm, ktime_t interval); +ktime_t alarm_expires_remaining(const struct alarm *alarm); /* Provide way to access the rtc device being used by alarmtimers */ struct rtc_device *alarmtimer_get_rtcdev(void); diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index f11d83b..3e5cba2 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -199,6 +199,12 @@ static enum hrtimer_restart alarmtimer_fired(struct hrtimer *timer) } +ktime_t alarm_expires_remaining(const struct alarm *alarm) +{ + struct alarm_base *base = &alarm_bases[alarm->type]; + return ktime_sub(alarm->node.expires, base->gettime()); +} + #ifdef CONFIG_RTC_CLASS /** * alarmtimer_suspend - Suspend time callback @@ -305,7 +311,7 @@ void alarm_init(struct alarm *alarm, enum alarmtimer_type type, } /** - * alarm_start - Sets an alarm to fire + * alarm_start - Sets an absolute alarm to fire * @alarm: ptr to alarm to set * @start: time to run the alarm */ @@ -325,6 +331,31 @@ int alarm_start(struct alarm *alarm, ktime_t start) } /** + * alarm_start_relative - Sets a relative alarm to fire + * @alarm: ptr to alarm to set + * @start: time relative to now to run the alarm + */ +int alarm_start_relative(struct alarm *alarm, ktime_t start) +{ + struct alarm_base *base = &alarm_bases[alarm->type]; + + start = ktime_add(start, base->gettime()); + return alarm_start(alarm, start); +} + +void alarm_restart(struct alarm *alarm) +{ + struct alarm_base *base = &alarm_bases[alarm->type]; + unsigned long flags; + + spin_lock_irqsave(&base->lock, flags); + hrtimer_set_expires(&alarm->timer, alarm->node.expires); + hrtimer_restart(&alarm->timer); + alarmtimer_enqueue(base, alarm); + spin_unlock_irqrestore(&base->lock, flags); +} + +/** * alarm_try_to_cancel - Tries to cancel an alarm timer * @alarm: ptr to alarm to be canceled * @@ -394,6 +425,12 @@ u64 alarm_forward(struct alarm *alarm, ktime_t now, ktime_t interval) return overrun; } +u64 alarm_forward_now(struct alarm *alarm, ktime_t interval) +{ + struct alarm_base *base = &alarm_bases[alarm->type]; + + return alarm_forward(alarm, base->gettime(), interval); +} -- 1.8.2.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 2/2] timerfd: add alarm timers
Add support for clocks CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM, thereby enabling wakeup alarm timers via file descriptors. Signed-off-by: Todd Poynor --- fs/timerfd.c | 131 --- 1 file changed, 108 insertions(+), 23 deletions(-) diff --git a/fs/timerfd.c b/fs/timerfd.c index 32b644f..9293121 100644 --- a/fs/timerfd.c +++ b/fs/timerfd.c @@ -8,6 +8,7 @@ * */ +#include #include #include #include @@ -26,7 +27,10 @@ #include struct timerfd_ctx { - struct hrtimer tmr; + union { + struct hrtimer tmr; + struct alarm alarm; + } t; ktime_t tintv; ktime_t moffs; wait_queue_head_t wqh; @@ -41,14 +45,19 @@ struct timerfd_ctx { static LIST_HEAD(cancel_list); static DEFINE_SPINLOCK(cancel_lock); +static inline bool isalarm(struct timerfd_ctx *ctx) +{ + return ctx->clockid == CLOCK_REALTIME_ALARM || + ctx->clockid == CLOCK_BOOTTIME_ALARM; +} + /* * This gets called when the timer event triggers. We set the "expired" * flag, but we do not re-arm the timer (in case it's necessary, * tintv.tv64 != 0) until the timer is accessed. */ -static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr) +static void timerfd_triggered(struct timerfd_ctx *ctx) { - struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, tmr); unsigned long flags; spin_lock_irqsave(&ctx->wqh.lock, flags); @@ -56,10 +65,25 @@ static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr) ctx->ticks++; wake_up_locked(&ctx->wqh); spin_unlock_irqrestore(&ctx->wqh.lock, flags); +} +static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr) +{ + struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, + t.tmr); + timerfd_triggered(ctx); return HRTIMER_NORESTART; } +static enum alarmtimer_restart timerfd_alarmproc(struct alarm *alarm, + ktime_t now) +{ + struct timerfd_ctx *ctx = container_of(alarm, struct timerfd_ctx, + t.alarm); + timerfd_triggered(ctx); + return ALARMTIMER_NORESTART; +} + /* * Called when the clock was set to cancel the timers in the cancel * list. This will wake up processes waiting on these timers. The @@ -107,8 +131,9 @@ static bool timerfd_canceled(struct timerfd_ctx *ctx) static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags) { - if (ctx->clockid == CLOCK_REALTIME && (flags & TFD_TIMER_ABSTIME) && - (flags & TFD_TIMER_CANCEL_ON_SET)) { + if ((ctx->clockid == CLOCK_REALTIME || +ctx->clockid == CLOCK_REALTIME_ALARM) && + (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) { if (!ctx->might_cancel) { ctx->might_cancel = true; spin_lock(&cancel_lock); @@ -124,7 +149,11 @@ static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx) { ktime_t remaining; - remaining = hrtimer_expires_remaining(&ctx->tmr); + if (isalarm(ctx)) + remaining = alarm_expires_remaining(&ctx->t.alarm); + else + remaining = hrtimer_expires_remaining(&ctx->t.tmr); + return remaining.tv64 < 0 ? ktime_set(0, 0): remaining; } @@ -142,11 +171,28 @@ static int timerfd_setup(struct timerfd_ctx *ctx, int flags, ctx->expired = 0; ctx->ticks = 0; ctx->tintv = timespec_to_ktime(ktmr->it_interval); - hrtimer_init(&ctx->tmr, clockid, htmode); - hrtimer_set_expires(&ctx->tmr, texp); - ctx->tmr.function = timerfd_tmrproc; + + if (isalarm(ctx)) { + alarm_init(&ctx->t.alarm, + ctx->clockid == CLOCK_REALTIME_ALARM ? + ALARM_REALTIME : ALARM_BOOTTIME, + timerfd_alarmproc); + } else { + hrtimer_init(&ctx->t.tmr, clockid, htmode); + hrtimer_set_expires(&ctx->t.tmr, texp); + ctx->t.tmr.function = timerfd_tmrproc; + } + if (texp.tv64 != 0) { - hrtimer_start(&ctx->tmr, texp, htmode); + if (isalarm(ctx)) { + if (flags & TFD_TIMER_ABSTIME) + alarm_start(&ctx->t.alarm, texp); + else + alarm_start_relative(&ctx->t.alarm, texp); + } else { + hrtimer_start(&ctx->t.tmr, texp, htmode); + } + if (timerfd_canceled(ctx)) return -ECANCELED; } @@ -
[PATCH 0/2] timerfd support for wakeup alarm timers
Enable wakeup alarm timers via file descriptors. Hook up alarmtimer to timerfd via clocks CLOCK_REALTIME_ALARM and CLOCK_BOOTTIME_ALARM, and add needed functions to alarmtimer. fs/timerfd.c | 131 + include/linux/alarmtimer.h |4 + kernel/time/alarmtimer.c | 39 + 3 files changed, 150 insertions(+), 24 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] alarmtimer: implement minimum alarm interval for allowing suspend
alarmtimer suspend return -EBUSY if the next alarm will fire in less than 2 seconds. This allows one RTC seconds tick to occur subsequent to this check before the alarm wakeup time is set, ensuring the wakeup time is still in the future (assuming the RTC does not tick one more second prior to setting the alarm). If suspend is rejected due to an imminent alarm, hold a wakeup source for 2 seconds to process the alarm prior to reattempting suspend. If setting the alarm incurs an -ETIME for an alarm set in the past, or any other problem setting the alarm, abort suspend and hold a wakelock for 1 second while the alarm is allowed to be serviced or other hopefully transient conditions preventing the alarm clear up. Signed-off-by: Todd Poynor --- kernel/time/alarmtimer.c | 18 +- 1 files changed, 13 insertions(+), 5 deletions(-) diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c index aa27d39..f979d85 100644 --- a/kernel/time/alarmtimer.c +++ b/kernel/time/alarmtimer.c @@ -46,6 +46,8 @@ static struct alarm_base { static ktime_t freezer_delta; static DEFINE_SPINLOCK(freezer_delta_lock); +static struct wakeup_source *ws; + #ifdef CONFIG_RTC_CLASS /* rtc timer and device for setting alarm wakeups at suspend */ static struct rtc_timerrtctimer; @@ -250,6 +252,7 @@ static int alarmtimer_suspend(struct device *dev) unsigned long flags; struct rtc_device *rtc; int i; + int ret; spin_lock_irqsave(&freezer_delta_lock, flags); min = freezer_delta; @@ -279,8 +282,10 @@ static int alarmtimer_suspend(struct device *dev) if (min.tv64 == 0) return 0; - /* XXX - Should we enforce a minimum sleep time? */ - WARN_ON(min.tv64 < NSEC_PER_SEC); + if (ktime_to_ns(min) < 2 * NSEC_PER_SEC) { + __pm_wakeup_event(ws, 2 * MSEC_PER_SEC); + return -EBUSY; + } /* Setup an rtc timer to fire that far in the future */ rtc_timer_cancel(rtc, &rtctimer); @@ -288,9 +293,11 @@ static int alarmtimer_suspend(struct device *dev) now = rtc_tm_to_ktime(tm); now = ktime_add(now, min); - rtc_timer_start(rtc, &rtctimer, now, ktime_set(0, 0)); - - return 0; + /* Set alarm, if in the past reject suspend briefly to handle */ + ret = rtc_timer_start(rtc, &rtctimer, now, ktime_set(0, 0)); + if (ret < 0) + __pm_wakeup_event(ws, 1 * MSEC_PER_SEC); + return ret; } #else static int alarmtimer_suspend(struct device *dev) @@ -821,6 +828,7 @@ static int __init alarmtimer_init(void) error = PTR_ERR(pdev); goto out_drv; } + ws = wakeup_source_register("alarmtimer"); return 0; out_drv: -- 1.7.7.3 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Hang when booting 2.4.0
My apologies in advance if I missed this in the archives (I found three or four threads discussing hangs at boot, but none related to 2.4.0 and the suggestions for older kernels that still applied to code in 2.4.0 didn't change the symptoms). My kernel hangs after the "Ok, booting the kernel." message. This is on a machine with a DX440LX motherboard and only a single 300Mhz PII installed. I can boot 2.2.16 and 2.2.18 kernels successfully. I've tried paring down the config to a minimum (.config included below), but still have the problem. Specifically removing any SCSI since I had heard of problems with the AIC-78xx. Thanks for any ideas or insight, Todd Goodman # # Automatically generated by make menuconfig: don't edit # CONFIG_X86=y CONFIG_ISA=y # CONFIG_SBUS is not set CONFIG_UID16=y # # Code maturity level options # CONFIG_EXPERIMENTAL=y # # Loadable module support # # CONFIG_MODULES is not set # # Processor type and features # # CONFIG_M386 is not set # CONFIG_M486 is not set # CONFIG_M586 is not set # CONFIG_M586TSC is not set # CONFIG_M586MMX is not set # CONFIG_M686 is not set CONFIG_M686FXSR=y # CONFIG_MPENTIUM4 is not set # CONFIG_MK6 is not set # CONFIG_MK7 is not set # CONFIG_MCRUSOE is not set # CONFIG_MWINCHIPC6 is not set # CONFIG_MWINCHIP2 is not set # CONFIG_MWINCHIP3D is not set CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_CMPXCHG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_X86_L1_CACHE_SHIFT=5 CONFIG_X86_TSC=y CONFIG_X86_GOOD_APIC=y CONFIG_X86_PGE=y CONFIG_X86_USE_PPRO_CHECKSUM=y CONFIG_X86_FXSR=y CONFIG_X86_XMM=y # CONFIG_TOSHIBA is not set # CONFIG_MICROCODE is not set # CONFIG_X86_MSR is not set # CONFIG_X86_CPUID is not set CONFIG_NOHIGHMEM=y # CONFIG_HIGHMEM4G is not set # CONFIG_HIGHMEM64G is not set CONFIG_MTRR=y # CONFIG_SMP is not set # CONFIG_X86_UP_IOAPIC is not set # # General setup # CONFIG_NET=y # CONFIG_VISWS is not set CONFIG_PCI=y # CONFIG_PCI_GOBIOS is not set # CONFIG_PCI_GODIRECT is not set CONFIG_PCI_GOANY=y CONFIG_PCI_BIOS=y CONFIG_PCI_DIRECT=y CONFIG_PCI_NAMES=y # CONFIG_EISA is not set # CONFIG_MCA is not set # CONFIG_HOTPLUG is not set # CONFIG_PCMCIA is not set CONFIG_SYSVIPC=y # CONFIG_BSD_PROCESS_ACCT is not set CONFIG_SYSCTL=y CONFIG_KCORE_ELF=y # CONFIG_KCORE_AOUT is not set CONFIG_BINFMT_AOUT=y CONFIG_BINFMT_ELF=y CONFIG_BINFMT_MISC=y # CONFIG_PM is not set # CONFIG_ACPI is not set # CONFIG_APM is not set # # Memory Technology Devices (MTD) # # CONFIG_MTD is not set # # Parallel port support # # CONFIG_PARPORT is not set # # Plug and Play configuration # # CONFIG_PNP is not set # CONFIG_ISAPNP is not set # # Block devices # CONFIG_BLK_DEV_FD=y # CONFIG_BLK_DEV_XD is not set # CONFIG_PARIDE is not set # CONFIG_BLK_CPQ_DA is not set # CONFIG_BLK_CPQ_CISS_DA is not set # CONFIG_BLK_DEV_DAC960 is not set # CONFIG_BLK_DEV_LOOP is not set # CONFIG_BLK_DEV_NBD is not set # CONFIG_BLK_DEV_RAM is not set # CONFIG_BLK_DEV_INITRD is not set # # Multi-device support (RAID and LVM) # # CONFIG_MD is not set # CONFIG_BLK_DEV_MD is not set # CONFIG_MD_LINEAR is not set # CONFIG_MD_RAID0 is not set # CONFIG_MD_RAID1 is not set # CONFIG_MD_RAID5 is not set # CONFIG_BLK_DEV_LVM is not set # CONFIG_LVM_PROC_FS is not set # # Networking options # CONFIG_PACKET=y # CONFIG_PACKET_MMAP is not set # CONFIG_NETLINK is not set # CONFIG_NETFILTER is not set # CONFIG_FILTER is not set CONFIG_UNIX=y CONFIG_INET=y CONFIG_IP_MULTICAST=y # CONFIG_IP_ADVANCED_ROUTER is not set # CONFIG_IP_PNP is not set # CONFIG_NET_IPIP is not set # CONFIG_NET_IPGRE is not set # CONFIG_IP_MROUTE is not set # CONFIG_INET_ECN is not set # CONFIG_SYN_COOKIES is not set # CONFIG_IPV6 is not set # CONFIG_KHTTPD is not set # CONFIG_ATM is not set # CONFIG_IPX is not set # CONFIG_ATALK is not set # CONFIG_DECNET is not set # CONFIG_BRIDGE is not set # CONFIG_X25 is not set # CONFIG_LAPB is not set # CONFIG_LLC is not set # CONFIG_NET_DIVERT is not set # CONFIG_ECONET is not set # CONFIG_WAN_ROUTER is not set # CONFIG_NET_FASTROUTE is not set # CONFIG_NET_HW_FLOWCONTROL is not set # # QoS and/or fair queueing # # CONFIG_NET_SCHED is not set # # Telephony Support # # CONFIG_PHONE is not set # CONFIG_PHONE_IXJ is not set # # ATA/IDE/MFM/RLL support # CONFIG_IDE=y # # IDE, ATA and ATAPI Block devices # CONFIG_BLK_DEV_IDE=y # CONFIG_BLK_DEV_HD_IDE is not set # CONFIG_BLK_DEV_HD is not set CONFIG_BLK_DEV_IDEDISK=y # CONFIG_IDEDISK_MULTI_MODE is not set # CONFIG_BLK_DEV_IDEDISK_VENDOR is not set # CONFIG_BLK_DEV_IDEDISK_FUJITSU is not set # CONFIG_BLK_DEV_IDEDISK_IBM is not set # CONFIG_BLK_DEV_IDEDISK_MAXTOR is not set # CONFIG_BLK_DEV_IDEDISK_QUANTUM is not set # CONFIG_BLK_DEV_IDEDISK_SEAGATE is not set # CONFIG_BLK_DEV_IDEDISK_WD is not set # CONFIG_BLK_DEV_COMMERIAL is not set # CONFIG_BLK_DEV_TIVO is not set # CONFIG_BLK_DEV_IDECS is not set CONFIG_BLK_DEV_IDECD=y # CONFIG_BLK_DEV_IDETAPE is not set # CONFIG_BLK_DEV_IDEFLOPPY is not set # CONFIG_B
RE: Hang when booting 2.4.0
Sorry, I was experiencing a senior moment and had the wrong CPU type configured. Once corrected everything is right in the world again. Thanks to Mark Hahn for pointing out my mistake. Todd > -Original Message- > My apologies in advance if I missed this in the archives (I found three > or four threads discussing hangs at boot, but none related to 2.4.0 and > the suggestions for older kernels that still applied to code in 2.4.0 > didn't change the symptoms). > > My kernel hangs after the "Ok, booting the kernel." message. > > This is on a machine with a DX440LX motherboard and only a single > 300Mhz PII installed. > > I can boot 2.2.16 and 2.2.18 kernels successfully. > > I've tried paring down the config to a minimum (.config included below), > but still have the problem. Specifically removing any SCSI since I > had heard of problems with the AIC-78xx. > > Thanks for any ideas or insight, > > Todd Goodman [SNIP of .config] > Please read the FAQ at http://www.tux.org/lkml/> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
PowerOP 0/3: System power operating point management API
PowerOP is a system power parameter management API submitted for discussion. PowerOP writes and reads power "operating points", comprised of arbitrary integer-valued values, called power parameters, that correspond to registers, clocks, dividers, voltage regulators, etc. that may be modified to set a basic power/performance point for the system. The core basically passes an array of integer-valued power parameters (with very little additional structure imposed by the core) to a platform-specific backend that interprets those values and makes the requested adjustments. PowerOP is intended to leave all power policy decisions to higher layers. An optional sysfs representation of power parameters is also available, primarily for diagnostic use. PowerOP can be thought of as a layer below cpufreq that actually accesses the hardware to make cpu frequency, voltage, core bus, and perhaps other modifications to set a power point, leaving cpufreq to manage the interfaces based around the "cpu frequency" abstraction, the policies and governors that select the frequency, its notifiers, and so forth. An example hooking up support for one cpufreq platform to PowerOP is in patch 3/3. Depending on the ability of the hardware to make software-controlled power/performance adjustments, this may be useful to select custom voltages, bus speeds, etc. in desktop/server systems. Various embedded systems have several parameters that can be set. For example, an XScale PXA27x could be considered to have six basic power parameters (mainly cpu run mode and memory and bus dividers) that for the most part should be set in tandem to known good sets of values as validated by the silicon vendor, plus other parameters possible for disabling PLLs during low-speed execution, and so forth. PowerOP is aimed at supporting this kind of system, where the cpu frequency abstraction specifies only part of the operating point that may be managed from software. It also pushes the hardware-level power parameter management down to a level that can be shared with other power management policy frameworks, in use in some embedded systems, that wish to deal with entire operating points as the basic unit of system power management.. There are many ways to tackle those issues, of course, and a new API layer is arguably rather heavyweight. This is one suggested way that tries to minimize disturbing existing power management code. Comments very much appreciated. Patch 2/3 is a desktop-oriented example of PowerOP; embedded examples will follow soon. -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
PowerOP 1/3: PowerOP core
The PowerOP core provides the struct powerop_point that defines an operating point, and trivial routines for calling the platform-specific backend to read and set operating points and for registering a single machine-dependent backend. Operating points are an array of 32-bit integer-valued power parameters, almost entirely interpreted by the platform-specific backend. Higher-layer software shares information on the power parameters made available by the backend (that is, the interpretation of the integers in the array) via header files. The value -1 is commonly used to denote an unspecified value, but in situations where all ones is a valid power parameter value some extra smarts may be needed. It optionally adds sysfs attributes for reading and writing individual power parameter values, mainly for diagnostic purposes. For example, using the Intel Centrino patch that follows, parameters for frequency and core voltage for CPU #0 appear: # ls /sys/powerop/param/ cpu0 v0 Index: linux-2.6.12/drivers/Makefile === --- linux-2.6.12.orig/drivers/Makefile 2005-06-30 01:32:17.0 + +++ linux-2.6.12/drivers/Makefile 2005-07-29 00:39:44.0 + @@ -58,6 +58,7 @@ obj-$(CONFIG_ISDN) += isdn/ obj-$(CONFIG_MCA) += mca/ obj-$(CONFIG_EISA) += eisa/ +obj-$(CONFIG_POWEROP) += powerop/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ obj-$(CONFIG_MMC) += mmc/ obj-$(CONFIG_INFINIBAND) += infiniband/ Index: linux-2.6.12/drivers/powerop/Kconfig === --- linux-2.6.12.orig/drivers/powerop/Kconfig 1970-01-01 00:00:00.0 + +++ linux-2.6.12/drivers/powerop/Kconfig2005-07-28 07:27:26.0 + @@ -0,0 +1,17 @@ +# +# powerop +# + +menu "Platform Core Power Management" + +config POWEROP + bool "PowerOP Platform Core Power Management" + help + +config POWEROP_SYSFS + bool " Enable PowerOP sysfs interface" + depends on POWEROP && SYSFS + help + +endmenu + Index: linux-2.6.12/drivers/powerop/Makefile === --- linux-2.6.12.orig/drivers/powerop/Makefile 1970-01-01 00:00:00.0 + +++ linux-2.6.12/drivers/powerop/Makefile 2005-07-28 07:29:04.0 + @@ -0,0 +1,2 @@ +obj-$(CONFIG_POWEROP) += powerop.o + Index: linux-2.6.12/drivers/powerop/powerop.c === --- linux-2.6.12.orig/drivers/powerop/powerop.c 1970-01-01 00:00:00.0 + +++ linux-2.6.12/drivers/powerop/powerop.c 2005-08-04 19:50:38.0 + @@ -0,0 +1,253 @@ +/* + * PowerOP generic core functions + * + * Author: Todd Poynor <[EMAIL PROTECTED]> + * + * 2005 (c) MontaVista Software, Inc. This file is licensed under + * the terms of the GNU General Public License version 2. This program + * is licensed "as is" without any warranty of any kind, whether express + * or implied. + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +static struct powerop_driver *powerop_driver; +static int powerop_subsys_init; + +int +powerop_set_point(struct powerop_point *point) +{ + return powerop_driver && powerop_driver->set_point ? + powerop_driver->set_point(point) : -EINVAL; +} + + +int +powerop_get_point(struct powerop_point *point) +{ + return powerop_driver && powerop_driver->get_point ? + powerop_driver->get_point(point) : -EINVAL; +} + + +EXPORT_SYMBOL_GPL(powerop_set_point); +EXPORT_SYMBOL_GPL(powerop_get_point); + +#ifdef CONFIG_POWEROP_SYSFS +decl_subsys(powerop, NULL, NULL); + +static void powerop_kobj_release(struct kobject *kobj) +{ + return; +} + +struct powerop_param_attribute { + int index; +struct attributeattr; +}; + +#define to_param_attr(_attr) container_of(_attr,\ + struct powerop_param_attribute,attr) + +static ssize_t +powerop_param_attr_show(struct kobject * kobj, struct attribute * attr, + char * buf) +{ + struct powerop_param_attribute * param_attr = to_param_attr(attr); + struct powerop_point point; + ssize_t ret = 0; + + if ((ret = powerop_get_point(&point)) == 0) + ret = sprintf(buf, "%d\n", point.param[param_attr->index]); + return ret; +} + +static ssize_t +powerop_param_attr_store(struct kobject * kobj, struct attribute * attr, +const char * buf, size_t count) +{ + struct powerop_param_attribute * param_attr = to_param_attr(attr); + struct powerop_point point; + int i; + ssize_t ret = 0; + + for (i = 0; i < powerop_driver->nr_params; i++) + if (i == pa
PowerOP 2/3: Intel Centrino support
Minimal PowerOP support for Intel Enhanced Speedstep "Centrino" notebooks. These systems run at an operating point comprised of a cpu frequency and a core voltage. The voltage could be set from the values recommended in the datasheets if left unspecified (-1) in the operating point, as cpufreq does. Index: linux-2.6.12/arch/i386/Kconfig === --- linux-2.6.12.orig/arch/i386/Kconfig 2005-08-02 23:39:05.0 + +++ linux-2.6.12/arch/i386/Kconfig 2005-08-03 01:11:21.0 + @@ -1098,6 +1098,7 @@ endmenu source "arch/i386/kernel/cpu/cpufreq/Kconfig" +source "arch/i386/kernel/cpu/powerop/Kconfig" endmenu Index: linux-2.6.12/arch/i386/kernel/cpu/Makefile === --- linux-2.6.12.orig/arch/i386/kernel/cpu/Makefile 2005-08-02 23:39:05.0 + +++ linux-2.6.12/arch/i386/kernel/cpu/Makefile 2005-08-03 01:11:21.0 + @@ -17,3 +17,4 @@ obj-$(CONFIG_MTRR) += mtrr/ obj-$(CONFIG_CPU_FREQ) += cpufreq/ +obj-$(CONFIG_POWEROP) += powerop/ Index: linux-2.6.12/arch/i386/kernel/cpu/powerop/Kconfig === --- linux-2.6.12.orig/arch/i386/kernel/cpu/powerop/Kconfig 1970-01-01 00:00:00.0 + +++ linux-2.6.12/arch/i386/kernel/cpu/powerop/Kconfig 2005-08-03 01:11:21.0 + @@ -0,0 +1,6 @@ +source "drivers/powerop/Kconfig" + +config POWEROP_CENTRINO + tristate "PowerOP for Intel Centrino Enhanced Speedstep" + depends on POWEROP + default n Index: linux-2.6.12/arch/i386/kernel/cpu/powerop/Makefile === --- linux-2.6.12.orig/arch/i386/kernel/cpu/powerop/Makefile 1970-01-01 00:00:00.0 + +++ linux-2.6.12/arch/i386/kernel/cpu/powerop/Makefile 2005-08-03 01:11:21.0 + @@ -0,0 +1,2 @@ +obj-$(CONFIG_POWEROP_CENTRINO) += powerop-centrino.o + Index: linux-2.6.12/arch/i386/kernel/cpu/powerop/powerop-centrino.c === --- linux-2.6.12.orig/arch/i386/kernel/cpu/powerop/powerop-centrino.c 1970-01-01 00:00:00.0 + +++ linux-2.6.12/arch/i386/kernel/cpu/powerop/powerop-centrino.c 2005-08-03 06:41:37.0 + @@ -0,0 +1,160 @@ +/* + * PowerOP support for Intel Enhanced Speedstep "Centrino" platforms. + * + * Based heavily on speedstep-centrino.c of cpufreq Copyright (c) 2003 + * by Jeremy Fitzhardinge <[EMAIL PROTECTED]> + * + * 2005 (c) MontaVista Software, Inc. This file is licensed under + * the terms of the GNU General Public License version 2. This program + * is licensed "as is" without any warranty of any kind, whether express + * or implied. + */ + +#include +#include +#include +#include +#include +#include + +#include +#include +#include + +static int +powerop_centrino_get_point(struct powerop_point *point) +{ + unsigned l, h; + unsigned cpu_freq; + + rdmsr(MSR_IA32_PERF_STATUS, l, h); + if (unlikely((cpu_freq = ((l >> 8) & 0xff) * 100) == 0)) { + /* +* On some CPUs, we can see transient MSR values (which are +* not present in _PSS), while CPU is doing some automatic +* P-state transition (like TM2). Get the last freq set +* in PERF_CTL. +*/ + rdmsr(MSR_IA32_PERF_CTL, l, h); + cpu_freq = ((l >> 8) & 0xff) * 100; + } + + point->param[POWEROP_CPU + smp_processor_id()] = cpu_freq; + point->param[POWEROP_V + smp_processor_id()] = ((l & 0xff) * 16) + 700; + return 0; +} + +static int +powerop_centrino_set_point(struct powerop_point *point) +{ + unsigned int msr = 0, oldmsr, h, mask = 0; + + if (point->param[POWEROP_CPU + smp_processor_id()] != -1) { + msr |= (point->param[POWEROP_CPU + smp_processor_id()]/100) + << 8; + mask |= 0xff00; + } + + if (point->param[POWEROP_V + smp_processor_id()] != -1) { + msr |= ((point->param[POWEROP_V + smp_processor_id()] - 700) + / 16); + mask |= 0xff; + } + + rdmsr(MSR_IA32_PERF_CTL, oldmsr, h); + + if (msr == (oldmsr & mask)) + return 0; + + /* all but 16 LSB are "reserved", so treat them with + care */ + oldmsr &= ~mask; + msr &= mask; + oldmsr |= msr; + + wrmsr(MSR_IA32_PERF_CTL, oldmsr, h); + return 0; +} + +#ifdef CONFIG_POWEROP_SYSFS +static char *powerop_param_names[POWEROP_DRIVER_MAX_PARAMS]; + +static int __init powerop_centrino_sysfs_init(void) +{ + int i; + + for (i = 0; i < NR_CPUS; i++) { + if (! (powerop_param_names[POWEROP_CPU + i] = + kmalloc(5 + i / 10, GFP_KERNEL)) || +
PowerOP 3/3: Intel Centrino cpufreq integration
A minimal example of modifying cpufreq to use PowerOP for reading and writing power parameters on Intel Centrino platforms. It would be possible to move voltage table lookups to the PowerOP layer. Index: linux-2.6.12/arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c === --- linux-2.6.12.orig/arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c 2005-08-04 19:49:29.0 + +++ linux-2.6.12/arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c 2005-08-05 01:21:45.0 + @@ -24,6 +24,7 @@ #include #include #include +#include #ifdef CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI #include @@ -322,29 +323,21 @@ /* Return the current CPU frequency in kHz */ static unsigned int get_cur_freq(unsigned int cpu) { - unsigned l, h; - unsigned clock_freq; + struct powerop_point point; + unsigned clock_freq = 0; cpumask_t saved_mask; saved_mask = current->cpus_allowed; set_cpus_allowed(current, cpumask_of_cpu(cpu)); if (smp_processor_id() != cpu) return 0; - rdmsr(MSR_IA32_PERF_STATUS, l, h); - clock_freq = extract_clock(l, cpu, 0); + if (powerop_get_point(&point)) + goto out; - if (unlikely(clock_freq == 0)) { - /* -* On some CPUs, we can see transient MSR values (which are -* not present in _PSS), while CPU is doing some automatic -* P-state transition (like TM2). Get the last freq set -* in PERF_CTL. -*/ - rdmsr(MSR_IA32_PERF_CTL, l, h); - clock_freq = extract_clock(l, cpu, 1); - } + clock_freq = point.param[POWEROP_CPU + cpu] * 1000; +out: set_cpus_allowed(current, saved_mask); return clock_freq; } @@ -610,6 +603,7 @@ unsigned intmsr, oldmsr, h, cpu = policy->cpu; struct cpufreq_freqsfreqs; cpumask_t saved_mask; + struct powerop_pointpoint; int retval; if (centrino_model[cpu] == NULL) @@ -650,14 +644,9 @@ cpufreq_notify_transition(&freqs, CPUFREQ_PRECHANGE); - /* all but 16 LSB are "reserved", so treat them with - care */ - oldmsr &= ~0x; - msr &= 0x; - oldmsr |= msr; - - wrmsr(MSR_IA32_PERF_CTL, oldmsr, h); - + point.param[POWEROP_CPU + cpu] = ((msr >> 8) & 0xff) * 100; + point.param[POWEROP_V + cpu] = ((msr & 0xff) * 16) + 700; + powerop_set_point(&point); cpufreq_notify_transition(&freqs, CPUFREQ_POSTCHANGE); retval = 0; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] PowerOP 1/3: PowerOP core
Geoff Levand wrote: I'm wondering if anything could be gained by having the whole struct powerop_point defined in asm/powerop.h, and treat it as an opaque structure at this level. That way, things other than just ints could be passed between the policy manager and the backend, although I guess that breaks the beauty of the simplicity and would complicate the sys-fs interface, etc. I'm interested to hear your comments. Making the "operating point" data structure entirely platform-specific should be OK. There's a little value to having generic pieces handle some common chores (such as the sysfs interfaces), but even for integers decimal vs. hex formatting is nicer depending on the type of value. Since most values that have been managed using similar interfaces thus far have been flags, register values, voltages, etc. using integers has worked well and nicely simplified the platform backend, but if there's a need for other data types then should be doable. Another point is that a policy manager would need to poll the system and/or get events and then act. Your powerop work here only provides a (one way) piece of the final action. Any comments regarding a more general interface? What's discussed here is probably the bottommost layer of a power management software stack: to read and write the platform-specific system power parameters, optionally arranged into a mutually-consistent set called an "operating point". Power policy management is a large, thorny topic that I wasn't trying to tackle now. So far as kernel-to-userspace event notification goes (assuming the power policy manager is in userspace, which is certainly where I'd recommend), ACPI has a procfs-based communication channel but the kobject_uevent stuff looks like the way I'd go, and it's somewhere on my list to come up with a patch that does that as well. If these general ideas of arbitrary platform power parameters and operating points are deemed worthy of continued consideration, I'll propose what I view is the next step: interfaces to create and activate operating points from userspace. At that point it should be possible to write power policy management applications for systems that can benefit from this generalized notion of operating points: create the operating points that match the system usage models (in the case of many embedded systems, the system is some mode with different power/performance characteristics such as audio playback vs. mobile phone call in progress) and power needs (e.g., low battery strength vs. high strength) and activate operating points based on events received (new app running, low battery warning, etc.). Any opinions on all that? Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] PowerOP 0/3: System power operating point management API
Patrick Mochel wrote: On Mon, 8 Aug 2005, Todd Poynor wrote: (apologies for use of obsolete cpufreq mailing list address in my initial message.) ... PowerOP is intended to leave all power policy decisions to higher layers. What do those higher layers look like? Do you have a userspace component that uses this interface? cpufreq is one example, it manages an abstraction of system power/performance levels based on cpu speed, which maps onto the PowerOP-level hardware capabilities in some fashion, and has both kernel and userspace components to manage the desired policy associated with this. Regardless of whether this notion of configurable operating points would remain a separate layer from cpufreq or was more tightly integrated, the code to set these operating points can handle things such as setting validated voltage levels to match cpu speeds, etc. For embedded systems, I am aware only of the Dynamic Power Management project, which you also mention and does indeed manage power policy based on the notions of power parameters and operating points. The settings of these are configured entirely from userspace via sysfs, using shell scripts or convenience libraries that access the sysfs attributes. A system designer chooses the operating points to be employed in the system based on the information from the processor or board vendor that describes validated, supported operating points and based on the characteristics of the system (how fast it needs to run while in use for different purposes and how much battery power can be spent for those purposes). For example, a designer implementing a system based on an Intel XScale PXA27x processor can choose from among about 16 validated operating points listed in the most recent specification update. Those operating points are comprised of register settings with inscrutable names such as CCCR[L], CCCR[2N], CLKCFG[T], CCCR[A], and two or three others. A few of those operating points run the CPU at identical frequencies, but have other changes in memory clocking, system bus clocking, and the ability to quickly switch between certain cpu frequencies based on other properties of the platform (so-called "Turbo-mode" frequency scaling). A DPM- or PowerOP-based system can be configured with the subset of desired operating points and a particular operating point activated as needed. The policy decision as to what operating point is appropriate to activate is a matter for custom code provided by the designer, tailored to their system. It is also possible to write automated operating point selection algorithms based on such criteria as system busyness. Who is using this code? Are there vendors that are already shipping systems with this enabled? Is this part of the DPM project? If so, what other components are left in DPM? The concepts and general Linux implementation of power parameters and operating points stems from the power-aware computing work done by Bishop Brock and Karthick Rajamani of IBM Research, and a somewhat different implementation is a part of the DPM project, which MontaVista (and reportedly others in the near future) does ship. So far as I understand there are or soon will be mobile phones that use that code as the low- to mid-layers of the power management stack (the high-layer policy management is performed by a custom application of which I have no knowledge). I mentioned in a previous email the next step of creating and activating operating points from userspace. If that were in place, DPM would additionally consist primarily of: 1. Machine-specific backends to set operating points for the systems that DPM has been ported to. If something like PowerOP is accepted into a broader community then that code would come along for the ride. XScale PXA27x and various ARM OMAPs are among the systems supported, as well as potentially others not yet making an appearance in open source. 2. DPM has further concepts of "operating state" (generally, whether the system is idle, processing interrupts, running a normal-power-usage task, running a background task without deadlines that can be assigned a low power/performance level, etc.) and the unfortunately-named "policy" that maps each operating state to an operating point, along with the code to switch in different operating points as the system switches operating states. The "policy" is a bit of a misnomer; a system designer must create the desired operating points and decide upon the state -> point mappings appropriate, as well as make decisions on when to update the mappings based on external events, changing workloads, etc. There are a few extra ramifications of modifying operating points in this fashion, including the need to handle such transitions while in interrupt context or in the idle loop, as well as a general concern for low overhead since switching may occur very freque
Re: PowerOP 0/3: System power operating point management API
Pavel Machek wrote: Depending on the ability of the hardware to make software-controlled power/performance adjustments, this may be useful to select custom voltages, bus speeds, etc. in desktop/server systems. Various embedded systems have several parameters that can be set. For example, an XScale PXA27x could be considered to have six basic power parameters (mainly cpu run mode and memory and bus dividers) that for the most part should This scares me a bit. Is table enough to handle this? I'm afraid that table will get very large on systems that allow you to do "almost anything". Exhaustive tables for all combinations of possible parameters aren't expected (or practical for many systems as you note). In practice, a subset of these possible operating points are created and activated over the lifetime of the system, where the subset is chosen by a system designer according to the needs of the particular system. It's a matter for the higher-layer power management software to decide whether to have in-kernel tables of the possible operating points (as cpufreq does for various platforms) or whether to require userspace to create only the ones wanted (as does DPM). There are cpufreq patches for PXA27x somewhere, for example, and in that case a subset of the supported operating points (and there are still only about 16 of those even for such a complicated piece of hardware) are represented in the kernel tables, choosing one of the possible combinations of memory/bus/etc. parameters for each unique cpu frequency. Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PowerOP 2/3: Intel Centrino support
Bruno Ducrot wrote: > ATM I'm wondering what are the pro for those patches wrt current cpufreq infrastructure (especially cpufreq's notion of notifiers). I still don't find a good one but I'm surely missing something obvious. This is lower layer than cpufreq, intended for use by cpufreq (and potentially DPM and other frameworks, although moving embedded to use cpufreq is certainly an option) to do the actual hardware modifications at the lowest layer. There are no notifiers et al because the higher-layer frameworks handle that portion of it. More on that in a moment. The main goal is to open up interfaces to operating points comprised of arbitrary power parameters, as an alternative to the "cpu frequency" parameter that cpufreq is designed around. In the desktop/server world this may not be a high concern for two reasons: (1) although various folks do want to manage voltages separately (namely so-called "undervolting"), in most cases the safe answer is to stick with a voltage preprogrammed into cpufreq that matches what the silicon vendor as validated and supports; (2) various additional clocks, dividers, etc. potential power parameters might exist in the hardware, but interfaces for software control are often not readily available, and the associated parameters are typically set once by the BIOS at boot time (possibly with a menu to allow you override the defaults). In the embedded world, however, silicon vendors typically produce hardware that are run by, and specify validated operating points that are comprised of, 6-7 basic parameters (clocks, voltages, dividers, etc.), and in many cases it is useful to manage all these parameters for additional battery savings at runtime (and these systems are often powered by much smaller batteries than the notebooks powered by desktop-class processors, and so more aggressive reductions in power consumption may be pursued). Similar interfaces are already in use for managing such hardware, with benefits that largely depend on the capabilities of the hardware (in the case of the IBM PowerPC 405LP for which it was originally developed this was pretty extensively studied). It is possible, of course, to choose operating points according to cpu frequency and hardcode the settings of other power parameters to match, restricting the choices for the other parameters. There are, however, power consumption and performance (primarily latency of transition between operating points) advantages to being able to choose memory bus speeds, core PLL speeds, voltages (yes, some embedded platforms do allow this to be chosen within limits imposed by other clock speeds), etc. The embedded hardware designers could probably express this more eloquently than I can. A secondary goal is to separate the code to perform platform-specific hardware modifications from the upper layers which by necessity carry various assumptions that may not be appropriate for all situations. The notifiers are one example of this: DPM typically minimizes use of notifiers and requires notifiers to be able to operate in interrupt and in idle loop context, due to those additional situations in which DPM manages power parameters. -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] Re: PowerOP 2/3: Intel Centrino support
Dave Jones wrote: I'm glad I'm not the only one who feels he's too dumb to see the advantages of this. The added complexity to expose something that in all cases, we actually don't want to expose seems a little pointless to me. For example, most of the x86 drivers, if you set a speed, and then start fiddling with the voltage, you can pretty much guarantee you'll crash within the next few seconds. They have to match, or at the least, be within a very small margin. I've attempted to extoll the benefits of adding these interfaces in previous emails, and if after that it still seems mystifying why anybody would want to do this then I'll take the heat for doing a lousy job of extolling. I've also admitted that it is primarily of use in embedded-specific hardware, and of less use in x86 and in desktop/server usage for various reasons (unless it turns out there's some fantastic savings to be had in modifying common x86 bus speeds independently of cpu speed, which seems unlikely). In the case of x86, undervolting is in practice by some folks, but yes it is risky and support for it in the basic interfaces probably shouldn't be a high priority. Given how long its taken us to get sane userspace parts for cpufreq, I'm loathe to changing the interfaces yet again unless there's a clear advantage to doing so, as it'll take at least another 12 months for userspace to catch up. Just to be clear, there are no cpufreq userspace interface changes required by this, it simply occupies the bottom layer of code that modifies platform power registers etc. The same speed/policy/governor etc. interfaces are used to specify the cpu speed. Interfaces to the power parameters can optionally be used for diagnostic purposes, and in the near future I'll propose alternative interfaces for setting entire operating points, but the existing cpufreq interfaces will work just fine regardless. -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] PowerOP 1/3: PowerOP core
Daniel Petrini wrote: I'd like to have an idea of how the powerop would evolve to address: a) exporting all operating points to sysfs - that would be useful for a policy manager in user space, or the user policy will already be aware of the ops? For different usage models I'd expect to see both situations. In one situation (classic desktop/server), a canned set of operating points may bee preconfigured (for example, the cpufreq support for Centrino), and automated software or an interactive user knows how to select an appropriate point (like the existing cpufreq algorithms such as "choose the lowest (powersave) point"). In the other situation (classic embedded), a system designer has chosen a number of useful operating points and configures software to choose an appropriate point based on system state ("the MPEG4 video app is running"), and that software knows what points are available and in what situations the points should apply. b) Constraints: if I would like to change to a op and such a transition is not allowed in hardware, what part of the software will test it? The set/get powerop functions, the higher layers or even some lower layer (don't know if expected) ? Any sort of policy on what operating points should be allowed is targeted for a higher layer than PowerOP. As you mention, there are situations where device needs constrain the operating points that can be set (for example, a PXA27x has modes that disable PLLs and/or run clocks at low speeds such that certain devices, if powered on, will wedge). Intelligence on what to do about that situation, if anything, can be placed in the high-layer power policy management application (this is probably the best answer), and/or the mid-layer power management framework code can also attempt to enforce these rules. Stepping up on my soapbox for a moment, it is always recommended to have the power policy management application be aware of what the constraints are on operating points and set an appropriate power policy based on that information. This may entail sending event messages from the drivers to the power policy manager app, to coordinate changes in device state with changes in policy. If the constraints change very rapidly then it may make sense to preconfigure the rules on which operating point to choose in the kernel to avoid userspace interactions (and this is what the DPM device constraints feature is intended to do). Relying on the power management mid-layer to block attempts to set an incompatible operating point is not recommended, but can function reasonably well if the mid layer is smart enough to know what the next best choice is and set that operating point instead. -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PowerOP 2/3: Intel Centrino support
Jordan Crouse wrote: When it comes to embedded power management concepts, a consistant theme is that people often question the usefulness, redundancy or complexity of a solution. This is perfectly understandable, since such a huge majority of the power management experts and users are concentrating intently on x86 desktops and servers. It also occurs to me that another reason for the disconnect between x86 desktop/server and embedded on this point is the lack of ACPI. We want to do things analogous to the power management performed by ACPI, but entirely in Linux, so we need to expose some of those low-level machine resources to our power management software. In many cases those power management decisions do not revolve around the question of the MHz at which the CPU is to run. Embedded Linux system power management exists for many of the same reasons ACPI exists. -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] PowerOP 1/3: PowerOP core
[EMAIL PROTECTED] wrote: How well would _this_ notion of an operating point scale up? I have this feeling that it's maybe better attuned to "scale down" sorts of problems (maybe cell phones) than to a big NUMA box. I can see how a batch scheduled server might want to fire up only enough components to run the next simulation, but I wonder if maybe systems dynamically managing lots of resources might not be better off with some other model ... where userspace makes higher level decisions, and the kernel is adaptive within a potentially big solution space. (Likewise, maybe some of the smaller systems would too.) If I understand correctly, that does seem to describe how such systems are used today: out of a potentially large number of choices (and some embedded SOCs do have a good variety of clocking, core voltage, and power domain choices), configure only those useful for the problem at hand into the kernel and then userspace gives marching orders to the kernel to activate whatever's appropriate according to system-specific logic in userspace. - Why have any parsing at all? It's opaque data; so insist that the kernel just get raw bytes. That's the traditional solution, not having the kernel parse arrays of integers. - Why try to standardize a data-based abstraction at all? Surely it'd be easier to use modprobe, and have it register operating points that include not just the data, but its interpretation. Configuring the definitions of desired operating points (and updating these on-the-fly) from userspace has its advantages, but inserting into the kernel as a module should work fine and avoids some awkward userspace interfaces. In the current code it is intended that both in-kernel interfaces (without parsing) and userspace interfaces are available, but I think non-diagnostic userspace interfaces could easily be dropped. One of the nice things about having it done from userspace is that apps can be in charge of configuring operating points and activating those operating points in response to changes in system state (with reduced chance of mismatched app + module). And since it can be done in userspace it's nice to simplify the kernel that way, which some folks feel strongly about. But not a big deal to me. Standard data structures have small benefits for implementing some code once instead of per-platform. Another possibility in addition to the configuration or diagnostic interfaces is the concept of "constraints" on operating points according to device needs that you and others have alluded to: the constraints can be expressed in a form that can be checked by platform-independent code ("check the value at this power parameter index in the array for this range of integer values"). Also not a big deal. All in all, it sounds like the next version of this should not have a userspace configuration interface and should not provide a platform-independent structure for the operating points. The operating point get and set functions are in machine-specific code and take an operating point argument defined in header files shared between the machine-specific code and whatever kernel code is creating and activating operating points. Any diagnostic interfaces need to be machine-dependent as well. Now that the operating points are created without a generic layer or structure, if one wants a generic interface to set (activate) one of those operating points, probably identified by name, from userspace then a little extra is needed. Can worry about that later. - If those numbers are needed, having single-valued sysfs attributes (maybe /sys/power/runstate/policy_name/XXX) would be preferable to relying on getting position right within a multivalued input. In case it helps, the code thus far and proposed interfaces use single-valued sysfs interfaces. Since you also mentioned: It's easier for me to see how "echo policy_name > /sys/power/runtime" would scale up and down (given pluggable policy_name components!) than "echo 0 35 98 141 66 -3 0x7efc0 > /sys/power/foo" would. and: That'd also be less error prone than "whoops, there wasn't supposed to be a space between 35 and 98" or "darn, I switched the 141 and 66 around again". It may be worth pointing out that these interfaces wouldn't do that (a two level hierarchy of operating point name and single power parameter attribute would be used, and the ordering into the array is handled by the generic PowerOP core, which knows how to associate parameter names with array indices). Older versions of DPM did use interfaces similar to what you describe, in case you've got that in mind. They weren't intended to be used interactively. Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel&q
Re: PowerOP 0/3: System power operating point management API
Dominik Brodowski wrote: First, the table interface you suggest is ugly. If there's indeed the need for such an abstraction, I'd favour something like I'm planning to adopt the previous suggestions of an opaque data structure and stop trying to have any generic structure to it. I'll try to leave dependency checking etc. to the upper layers as much as possible, since platforms vary greatly in this and so do the needs of different PM s/w stacks. Secondly, you do not adress the cross-relationships between operation points correctly. If you change the CPU frequency, you may have to switch other (memory, video) settings; you might even have to validate the frequency settings for these or even additional reasons (thermal and battery reasons - ACPI _PPC). This lowest layer basically assumes that upper-layer software has created an appropriate operating point (for example, in DPM we pretty much require a system designer to create operating points that match the h/w specs and don't go to great lengths to encode rules about this), and/or will call driver notifiers etc. as needed to adapt to the changes. Although there may be some sanity checking appropriate at the PowerOP level, cpufreq, DPM, etc. can for the most part continue to handle the larger issues of how valid operating points are constructed, driver callbacks, etc. If you do want to handle various dependencies at the PowerOP layer then there's nothing that prevents that, but PM frameworks tend to embody assumptions about how frequently operating points will change and in what contexts (interrupt, idle...), and this can influence the code for such things. Thirdly, who is to decide on the power management settings? The first and intuitive answer is the kernel. Therefore, kernel-space cpufreq governors exist. Only under rare circumstances, you want full userspace control -- that's what the userspace cpufreq governor is for. Also something left to the existing upper layers; PowerOP isn't intended to handle any of that. In the embedded space we usually let the system designer choose operating points supported by their h/w vendor and that match their particular system states (hardware enabled at any point in time, type and power/performance needs of software currently running). We do recommend that a userspace power policy manager be the component in charge of PM settings, based on messages from drivers and other apps on the state of the system. And so that userspace component activates the operating point (or set of operating points in the case of DPM) appropriate for current state. Foruthly, the code duplication which your implementation leads to is obvious for the speedstep-centrino case. We could move the tables of valid cpu speeds and corresponding voltages down to the PowerOP level, and there would probably be little duplication at that point (in fact, with the current patch there's not a lot of duplication since the actual MSR access was moved to PowerOP and PowerOP contains little else, but both levels know how to understand the MSR format, and a more aggressive port to PowerOP could do away with that). Your suggestions of changes to cpufreq governors and policies to handle governance of non-cpu-speed parameters sound interesting, and I'd be happy to help figure out what to do about those vs. the lower machine access layer I've discussed up until now. I'll think more about this real soon now. Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PowerOP 0/3: System power operating point management API
Dominik Brodowski wrote: A small add-on: We need to make sure that we're capable of handling smart CPUs like Transmeta Crusoe processors in a sane way. This means b) Setting of "values" is optional if the hardware itself can be set to a min/max value (step a above in previous mail). Although I haven't looked into the Crusoe processor support, it may be that there is a different set of power parameters, not cpu speed directly, that are appropriate to manage on these platforms (after a brief look, seems to be a range of frequencies and some sort of flags)? If so, these sorts of machine-specific power parameters are what PowerOP is trying to address, allowing management of the underlying machine-specific stuff to upper layers that may be presenting an abstracted view of power/performance, such as CPU speed or speed ranges, to the user. Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Resend:[RFC/Patch] Robust futexes
This is a resend of my patch to add robust futex support to the existing sys_futex system call. The patch applies to 2.6.12. Any comments or discussion will be welcome. Changes since my last posted version: - Applies to 2.6.12, was 2.6.12-rc6 - Added config option CONFIG_ROBUST_FUTEX, depends on existing CONFIG_FUTEX and defaults to no. - Commented functions, using kernel-doc style comments - Cleaned up some CodingStyle violations Sys_futex provides operations on futexes that can be local to a process, or shared between processes by placing the futex in shared memory. However, if a process terminates while it owns a locked shared futex, any other processes that use the same futex will hang. With this patch, if a process terminates while it owns a locked robust futex, the ownership of the lock will be transferred to the next waiting process, the waiting process will be awakened and will receive the status EOWNERDEAD. If there is no waiting process at the time of termination, then the next process that attempts to wait will receive ownership of the futex and the EOWNERDEAD status. The new owner can recover the futex and unlock it, in which case the futex can continue to be used. If the new owner only unlocks the futex, then the futex becomes unrecoverable and any attempt to use the futex will get the status ENOTRECOVERABLE. The patch does not change the existing sys_futex operations on non-robust mutexes, so the patch should not affect existing code that uses futexes. New op codes are added to the sys_futex system call for use by code that requires robust futexes. I have a patch to glibc and the nptl thread library that uses robust futexes. We are in the process of getting copyright assignments to the Free Software Foundation so that we can submit the glibc and nptl patches. Robust futexes have a different format from the non-robust futexes. The non-robust futexes can have the values 0 (unlocked), 1 (locked) or 2 (locked with waiters). In a robust futex the high bit indicates if there are processes waiting on the futex, the next bit indicates if the owning process died, and the next bit indicates if the futex is not recoverable. The rest of the futex contains the pid of the task that owns the futex lock or zero if the futex is not locked. Signed-off-by: Todd Kneisel <[EMAIL PROTECTED]> fs/dcache.c |3 fs/inode.c|2 include/linux/fs.h|4 include/linux/futex.h | 20 + init/Kconfig | 11 kernel/exit.c |3 kernel/futex.c| 686 +- 7 files changed, 728 insertions(+), 1 deletion(-) diff -uprN -X dontdiff linux-2.6.12/fs/dcache.c linux-2.6.12-todd/fs/dcache.c --- linux-2.6.12/fs/dcache.c2005-06-17 12:48:29.0 -0700 +++ linux-2.6.12-todd/fs/dcache.c 2005-06-20 10:44:40.738407891 -0700 @@ -32,6 +32,7 @@ #include #include #include +#include /* #define DCACHE_DEBUG 1 */ @@ -158,6 +159,8 @@ repeat: return; } + futex_free_robust_list(dentry->d_inode); + /* * AV: ->d_delete() is _NOT_ allowed to block now. */ diff -uprN -X dontdiff linux-2.6.12/fs/inode.c linux-2.6.12-todd/fs/inode.c --- linux-2.6.12/fs/inode.c 2005-06-17 12:48:29.0 -0700 +++ linux-2.6.12-todd/fs/inode.c2005-06-20 15:08:36.428029628 -0700 @@ -21,6 +21,7 @@ #include #include #include +#include /* * This is needed for the following functions: @@ -202,6 +203,7 @@ void inode_init_once(struct inode *inode INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear); spin_lock_init(&inode->i_lock); i_size_ordered_init(inode); + futex_init_inode(inode); } EXPORT_SYMBOL(inode_init_once); diff -uprN -X dontdiff linux-2.6.12/include/linux/fs.h linux-2.6.12-todd/include/linux/fs.h --- linux-2.6.12/include/linux/fs.h 2005-06-17 12:48:29.00000 -0700 +++ linux-2.6.12-todd/include/linux/fs.h2005-06-20 14:56:16.977970185 -0700 @@ -350,6 +350,10 @@ struct address_space { spinlock_t private_lock; /* for use by the address_space */ struct list_headprivate_list; /* ditto */ struct address_space*assoc_mapping; /* ditto */ +#ifdef CONFIG_ROBUST_FUTEX + struct list_headrobust_list;/* list of robust futexes */ + struct semaphorerobust_sem; /* protect list of robust futexes */ +#endif } __attribute__((aligned(sizeof(long; /* * On most architectures that alignment is already the case; but diff -uprN -X dontdiff linux-2.6.12/include/linux/futex.h linux-2.6.12-todd/include/linux/futex.h --- linux-2.6.12/include/linux/futex.h 2005-06-17 12:48:29.00000 -0700 +++ linux-2.6.12-todd/include/linux/futex.h 2005-06-20 16:43:46.293122664 -0700 @@ -1,6 +1,8 @@ #ifndef _LINUX_FUTEX_H #define _LINUX_FUTEX_H +#include + /* Second argument to futex syscall */
Re: Resend:[RFC/Patch] Robust futexes
I've made relatively minor changes to the glibc side since posting the patch to the robust mutexes list. - added EOWNERDEAD and ENOTRECOVERABLE to the glibc error reporting mechanism, so that strerror() works. - added function prototypes to pthread.h for pthread_mutexattr_getrobust_np, pthread_mutexattr_setrobust_np, and pthread_mutex_consistent_np. - defined a constant for accessing the pid that's stored in the futex. - Changed copyrights from Bull SA (the French division of our company) to Bull HN (The U.S. division). We expect to receive the FSF contracts for copyright assignment any day now. If anyone would like to see the glibc changes, I can provide the patch under the Bull HN copyright. The patch applies to glibc-2.3.4. Todd. Daniel Walker <[EMAIL PROTECTED]> wrote on 07/05/2005 05:00:32 PM: > > You might want to CC Andrew Morton , and Rusty Russell. > > What is the status of the glibc side of this? > > Daniel > > > On Tue, 2005-07-05 at 16:11 -0700, Todd Kneisel wrote: > > This is a resend of my patch to add robust futex support to the existing > > sys_futex system call. The patch applies to 2.6.12. Any comments or > > discussion will be welcome. > > > > Changes since my last posted version: > > - Applies to 2.6.12, was 2.6.12-rc6 > > - Added config option CONFIG_ROBUST_FUTEX, depends on existing CONFIG_FUTEX > >and defaults to no. > > - Commented functions, using kernel-doc style comments > > - Cleaned up some CodingStyle violations > > - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
PowerOP Take 2 1/3: ARM OMAP1 platform support
PowerOP support for OMAP1 platforms. Currently handles these power parameters: dpllmult DPLL_CTL reg PLL MULT bits dplldiv DPLL_CTL reg PLL DIV bits armdiv ARM_CKCTL reg ARMDIV bits dspdiv ARM_CKCTL reg DSPDIV bits tcdivARM_CKCTL reg TCDIV bits lowpwr 1 = assert ULPD LOW_PWR, voltage scale low Other parameters such as DSPMMUDIV, LCDDIV, and ARM_PERDIV might also be useful. Example usage will be shown with a follow-on sysfs UI patch. Index: linux-2.6.13-rc4/arch/arm/mach-omap1/powerop.c === --- /dev/null +++ linux-2.6.13-rc4/arch/arm/mach-omap1/powerop.c @@ -0,0 +1,157 @@ +/* + * PowerOP support for OMAP1 + * + * Based on DPM OMAP code by Matthew Locke, Dmitry Chigirev, Vladimir + * Barinov, and Todd Poynor. + * + * 2005 (c) MontaVista Software, Inc. This file is licensed under + * the terms of the GNU General Public License version 2. This program + * is licensed "as is" without any warranty of any kind, whether express + * or implied. + */ + +#include +#include +#include + +#include +#include +#include + +/* ARM_CKCTL bit shifts */ +#define CKCTL_PERDIV_OFFSET0 +#define CKCTL_LCDDIV_OFFSET2 +#define CKCTL_ARMDIV_OFFSET4 +#define CKCTL_DSPDIV_OFFSET6 +#define CKCTL_TCDIV_OFFSET 8 +#define CKCTL_DSPMMUDIV_OFFSET 10 + +#define DPLL_CTL_MASK 0xfe0 + +#define ULPD_MIN_MAX_REG (1 << 11) +#define ULPD_DVS_ENABLE (1 << 10) +#define ULPD_LOW_PWR_REQ (1 << 1) +#define LOW_POWER (ULPD_MIN_MAX_REG | ULPD_DVS_ENABLE | ULPD_LOW_PWR_REQ | \ + ULPD_LOW_PWR_EN) + +int +powerop_get_point(struct powerop_point *point) +{ + unsigned long flags; + int dpll_ctl, arm_cktl; + + local_irq_save(flags); + dpll_ctl = omap_readw(DPLL_CTL); + arm_cktl = omap_readw(ARM_CKCTL); + + point->dpllmult = dpll_ctl >> 7 & 0x1f; + point->dplldiv = dpll_ctl >> 5 & 0x3; + point->armdiv = arm_cktl >> CKCTL_ARMDIV_OFFSET & 0x3; + point->dspdiv = arm_cktl >> CKCTL_DSPDIV_OFFSET & 0x3; + point->tcdiv = arm_cktl >> CKCTL_TCDIV_OFFSET & 0x3; + point->lowpwr = (omap_readw(ULPD_POWER_CTRL) & (ULPD_LOW_PWR_REQ)) ? + 1 : 0; + local_irq_restore(flags); + return 0; +} + +static void scale_dpll(int dpll_ctl) +{ + int i; + + omap_writew((omap_readw(DPLL_CTL) & ~DPLL_CTL_MASK) | dpll_ctl, + DPLL_CTL); + + for (i = 0; i < 0x1FF; i++) + nop(); + + /* +* Wait for PLL relock. +*/ + + while ((omap_readw(DPLL_CTL) & 0x1) == 0); +} + +static void set_low_pwr(int lowpwr) +{ + int cur_lowpwr; + + if (lowpwr == -1) + return; + + cur_lowpwr = (omap_readw(ULPD_POWER_CTRL) & (ULPD_LOW_PWR_REQ)) ? + 1 : 0; + + if (cur_lowpwr != lowpwr) { + if (lowpwr) + omap_writew(omap_readw(ULPD_POWER_CTRL) | LOW_POWER, + ULPD_POWER_CTRL); + else + omap_writew(omap_readw(ULPD_POWER_CTRL) & ~LOW_POWER, + ULPD_POWER_CTRL); + } +} + +int +powerop_set_point(struct powerop_point *point) +{ + int dpll_ctl = 0; + int dpll_mod = 0; + int arm_ctl = 0; + int arm_msk = 0; + int cur_dpll_ctl; + unsigned long flags; + + if ((point->dpllmult != -1) && (point->dplldiv != -1)) { + dpll_ctl = (point->dpllmult << 7) | + (point->dplldiv << 5); + dpll_mod = 1; + } + + if (point->armdiv != -1) { + arm_ctl |= (point->armdiv << CKCTL_ARMDIV_OFFSET); + arm_msk |= (3 << CKCTL_ARMDIV_OFFSET); + } + + if (point->dspdiv != -1) { + arm_ctl |= (point->dspdiv << CKCTL_DSPDIV_OFFSET); + arm_msk |= (3 << CKCTL_DSPDIV_OFFSET); + } + + if (point->tcdiv != -1) { + arm_ctl |= (point->tcdiv << CKCTL_TCDIV_OFFSET); + arm_msk |= (3 << CKCTL_TCDIV_OFFSET); + } + + local_irq_save(flags); + cur_dpll_ctl = omap_readw(DPLL_CTL) & DPLL_CTL_MASK; + + if (dpll_mod && (dpll_ctl < cur_dpll_ctl)) + scale_dpll(dpll_ctl); + + if (arm_msk != 0) + omap_writew((omap_readw(ARM_CKCTL) & ~arm_msk) | arm_ctl, + ARM_CKCTL); + + if (dpll_mod && (dpll_ctl > cur_dpll_ctl)) + scale_dpll(dpll_ctl); + + set_low_pwr(point->lowpwr); + local_irq_restore(flags); + return 0; +} + +EXPORT_SYMBOL_GPL(powerop_get_point); +EXPORT_SYMBOL_GPL(powerop_se
PowerOP Take 2 0/3 Intro
PowerOP is a system power parameter management API submitted for discussion. PowerOP writes and reads power "operating points", comprised of arbitrary values, called power parameters, that correspond to registers, clocks, dividers, voltage regulators, etc. that may be modified to set a basic power/performance point for the system. Higher-layer power management software passes a platform-specific struct of power parameters to a backend that makes the requested adjustments. PowerOP can be thought of as a layer below cpufreq that actually accesses the hardware to make cpu frequency, voltage, core bus, and perhaps other modifications to set a power point. The main goal is to open up interfaces to operating points comprised of arbitrary power parameters, as an alternative to the "cpu frequency" parameter that cpufreq is designed around. It pushes the hardware-level power parameter management down to a level that can be shared with other power management policy frameworks that deal with entire operating points as the basic unit of system power management for reasons described below, and that The new layer is intended to leave all power policy decisions, and various other power management chores such as driver notification, to higher layers in a power management software stack. This work is aimed in part at supporting embedded systems, which often have several parameters that can be set and the cpu frequency abstraction specifies only part of the operating point that may be managed from software. For example, an XScale PXA27x could be considered to have six basic power parameters (mainly cpu run mode and memory and bus dividers) that for the most part should be set in tandem to known good sets of values as validated by the silicon vendor. Embedded systems typically handle more hardware clock manipulation directly in Linux than do, for example, desktop/server systems based on ACPI interfaces. Desktop/server systems may also benefit from the ability to select custom voltages, bus speeds, etc., although deviating from values validated by the hardware vendor is risky and controversial. An optional sysfs interface to create and activate operating points is also submitted for discussion. This interface could be used in an actual power management stack, or at least can serve as a starting point for discussing how to manage operating points at the next higher layer in the power management soctware stack. cpufreq is another possible interface. PowerOP can fit below cpufreq to make cpu frequency, voltage, core bus, and perhaps other modifications to set a power point, leaving cpufreq to manage the interfaces based around the "cpu frequency" abstraction, the policies and governors that select the frequency, its notifiers, and so forth. An example hooking up support for one cpufreq platform to PowerOP has been previously submitted. Some discussion regarding incorporating an expanded set of power parameters into the cpufreq interfaces has also taken place. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
PowerOP Take 2 3/3: OMAP1 sysfs UI
A PowerOP sysfs UI backend for OMAP1 platforms, adding sysfs attributes and show/store functions that correspond to the platform power parameters. An example usage on an OMAP1510 Innovator which does not have voltage scaling and ignoring the DSP: # echo -n 168-168-84 > /sys/powerop/new # DPLL 168MHz, ARM 168MHz, TC 84MHz # echo -n 14 > /sys/powerop/168-168-84/dpllmult # echo -n 0 > /sys/powerop/168-168-84/dplldiv # echo -n 0 > /sys/powerop/168-168-84/armdiv # echo -n 1 > /sys/powerop/168-168-84/tcdiv # echo -n -1 > /sys/powerop/168-168-84/dspdiv # echo -n -1 > /sys/powerop/168-168-84/lowpwr # echo -n 60-60-60 > /sys/powerop/new # DPLL 60MHz, ARM 60MHz, TC 60MHz # echo -n 5 > /sys/powerop/60-60-60/dpllmult # echo -n 0 > /sys/powerop/60-60-60/dplldiv # echo -n 0 > /sys/powerop/60-60-60/armdiv # echo -n 0 > /sys/powerop/60-60-60/tcdiv # echo -n -1 > /sys/powerop/60-60-60/dspdiv # echo -n -1 > /sys/powerop/60-60-60/lowpwr # echo -n 60-60-60 > /sys/powerop/active # cat /sys/powerop/hw/dpllmult 5 The lower-powered operating point consumes about .07A less power using my .config (it should be noted this is not a proper low-power evaluation platform). Index: linux-2.6.13-rc4/arch/arm/mach-omap1/Kconfig === --- linux-2.6.13-rc4.orig/arch/arm/mach-omap1/Kconfig +++ linux-2.6.13-rc4/arch/arm/mach-omap1/Kconfig @@ -155,3 +155,5 @@ source "arch/arm/plat-omap/dsp/Kconfig" config POWEROP bool "PowerOP Platform Core Power Management for OMAP1" help + +source "drivers/powerop/Kconfig" Index: linux-2.6.13-rc4/arch/arm/mach-omap1/powerop.c === --- linux-2.6.13-rc4.orig/arch/arm/mach-omap1/powerop.c +++ linux-2.6.13-rc4/arch/arm/mach-omap1/powerop.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include @@ -144,13 +145,141 @@ powerop_set_point(struct powerop_point * EXPORT_SYMBOL_GPL(powerop_get_point); EXPORT_SYMBOL_GPL(powerop_set_point); +#ifdef CONFIG_POWEROP_SYSFS + +ssize_t lowpwr_show(struct powerop_point *op, char * buf) +{ + return sprintf(buf,"%d\n",op->lowpwr); +} + +ssize_t lowpwr_store(struct powerop_point *op, const char * buf, size_t count) +{ + int error; + + if ((error = sscanf(buf, "%d", &op->lowpwr)) == 1) + return count; + return error; +} + +powerop_param_attr(lowpwr); + +ssize_t dpllmult_show(struct powerop_point *op, char * buf) +{ + return sprintf(buf,"%d\n",op->dpllmult); +} + +ssize_t dpllmult_store(struct powerop_point *op, const char * buf, size_t count) +{ + int error; + + if ((error = sscanf(buf, "%d", &op->dpllmult)) == 1) + return count; + return error; +} + +powerop_param_attr(dpllmult); + +ssize_t dplldiv_show(struct powerop_point *op, char * buf) +{ + return sprintf(buf,"%d\n",op->dplldiv); +} + +ssize_t dplldiv_store(struct powerop_point *op, const char * buf, size_t count) +{ + int error; + + if ((error = sscanf(buf, "%d", &op->dplldiv)) == 1) + return count; + return error; +} + +powerop_param_attr(dplldiv); + +ssize_t armdiv_show(struct powerop_point *op, char * buf) +{ + return sprintf(buf,"%d\n",op->armdiv); +} + +ssize_t armdiv_store(struct powerop_point *op, const char * buf, size_t count) +{ + int error; + + if ((error = sscanf(buf, "%d", &op->armdiv)) == 1) + return count; + return error; +} + +powerop_param_attr(armdiv); + +ssize_t dspdiv_show(struct powerop_point *op, char * buf) +{ + return sprintf(buf,"%d\n",op->dspdiv); +} + +ssize_t dspdiv_store(struct powerop_point *op, const char * buf, size_t count) +{ + int error; + + if ((error = sscanf(buf, "%d", &op->dspdiv)) == 1) + return count; + return error; +} + +powerop_param_attr(dspdiv); + +ssize_t tcdiv_show(struct powerop_point *op, char * buf) +{ + return sprintf(buf,"%d\n",op->tcdiv); +} + +ssize_t tcdiv_store(struct powerop_point *op, const char * buf, size_t count) +{ + int error; + + if ((error = sscanf(buf, "%d", &op->tcdiv)) == 1) + return count; + return error; +} + +powerop_param_attr(tcdiv); + +static struct attribute *powerop_sysfs_param_attrs[] = { + &lowpwr_attr.attr, + &dpllmult_attr.attr, + &dplldiv_attr.attr, + &armdiv_attr.attr, + &dspdiv_attr.attr, + &tcdiv_attr.attr, + NULL, +}; + +static void powerop_omap1_sysfs_init() +{ + powerop_sysfs_register(powerop_sysfs_param_attrs); +} + +static void powerop_omap1_sysfs_exit() +{ + powerop_sysfs_unregister(powerop_sysfs_param_attrs); +} +#else /* CONFIG_POWEROP_SYSFS */ +static void powerop_omap1_sysfs_init() +{ +} +static void powerop_omap1_sysfs_exit() +{ +} +#endif /* CONFIG_POWEROP_SYSFS */ + static int __init powerop_init(void) { + powerop_omap1_sysfs_init(); return 0; } static
PowerOP Take 2 2/3: sysfs UI core
A sysfs interface for PowerOP that allows operating points to be created and activated from userspace. The platform-specific backend provides the code to read and write sysfs attributes for each power parameter; the core sysfs interface has no knowledge of the struct powerop_point contents. This interface could be used independently of an integrated cpufreq or DPM interface. It is not an integral part of PowerOP and is provided in part to facilitate discussion and experimentation with PowerOP, but could serve as a basis for a basic userspace power policy management stack. Operating points are created by writing the name of the operating point to /sys/powerop/new. This may be a job for configfs. /sys/powerop// will contain an attribute for each power parameter that may be written to set the associated parameter for the new operating point. An operating point may be activated by writing its name to /sys/powerop/active. The hardware power parameters currently set may be read and written via /sys/powerop/hw/, a special operating point that reads and writes parameter attribute values immediately, primarily for diagnostic purposes. Buried in this interface is also the notion of a registry of "named operating points", allowing operating points created by some other interface (such as cpufreq or loading a module with the definitions as suggested previously by David Brownell) to be activated from userspace via /sys/powerop/active. Changing operating points (or other power-policy-based information that triggers changes in operating points) from userspace is a common scenario in some embedded systems, where power/performance needs change based on system state changes that are coordinated by a userspace process (for example, a mobile phone starting a multimedia application). Index: linux-2.6.13-rc4/drivers/powerop/Kconfig === --- /dev/null +++ linux-2.6.13-rc4/drivers/powerop/Kconfig @@ -0,0 +1,4 @@ +config POWEROP_SYSFS + bool " Enable PowerOP sysfs interface" + depends on POWEROP && SYSFS + help Index: linux-2.6.13-rc4/drivers/powerop/Makefile === --- /dev/null +++ linux-2.6.13-rc4/drivers/powerop/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_POWEROP_SYSFS)+= powerop_sysfs.o Index: linux-2.6.13-rc4/drivers/powerop/powerop_sysfs.c === --- /dev/null +++ linux-2.6.13-rc4/drivers/powerop/powerop_sysfs.c @@ -0,0 +1,398 @@ +/* + * PowerOP sysfs UI + * + * Author: Todd Poynor <[EMAIL PROTECTED]> + * + * 2005 (c) MontaVista Software, Inc. This file is licensed under + * the terms of the GNU General Public License version 2. This program + * is licensed "as is" without any warranty of any kind, whether express + * or implied. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +int powerop_register_point(const char *id, struct powerop_point *point); +int powerop_select_point(const char *id); + +struct namedop { + struct kobject kobj; + struct powerop_point *point; + struct list_head node; + struct completion released; +}; + +#define to_namedop(_kobj) container_of(_kobj,\ + struct namedop,kobj) + +static DECLARE_MUTEX(namedop_list_mutex); +static struct list_head namedop_list; +static struct namedop *activeop; + +struct sysfsop { + struct powerop_point point; + struct list_head node; +}; + +static DECLARE_MUTEX(sysfsop_list_mutex); +static struct list_head sysfsop_list; + +static struct powerop_point *hwop; + +#define powerop_attr(_name) \ +static struct subsys_attribute _name##_attr = { \ + .attr = { \ + .name = __stringify(_name), \ + .mode = 0644, \ + .owner = THIS_MODULE, \ + }, \ + .show = _name##_show, \ + .store = _name##_store,\ +} + +static struct attribute_group param_attr_group; + +#define to_powerop_param_attr(_attr) container_of(_attr,\ + struct powerop_param_attribute,attr) + + +decl_subsys(powerop, NULL, NULL); +static int subsys_reg; +static int sysfs_init; + +static void namedop_release(struct kobject *kobj) +{ + struct namedop *op = to_namedop(kobj); + + complete(&op->released); + return; +} + +static struct sysfsop *sysfsop_create(const char *id) +{ + struct sysfsop *op; + int error; + + if ((op = kmalloc(sizeof(struct sysfsop), GFP_KERNEL)) == NULL) + return ERR_PTR(-ENOMEM); + + down(&sysfsop_list_mutex); + list_add_tail(&op->node, &sysfsop_list); + up(&sysfsop_list_mutex); + memset(&op->point, 0, sizeof(struct powerop_
Re: PowerOP Take 2 0/3 Intro
Jordan Crouse wrote: Todd - do you have a ChangeLog from Take 1? :) Right, here's what's changed in this version... The generic structure of an operating point as an array of integers is dropped. A struct powerop_point is now an entirely backend-defined struct of arbitrary fields. There is no more PowerOP core layer; all data structures and functions for core functionality are provided by the machine-specific backend. The diagnostic sysfs UI has been split out into a separate, optional patch. A more full-featured UI allowing operating point creation and activation via sysfs has also been provided in that patch. This UI primarily serves as an example for experimentation purposes, but is pretty close to what a basic userspace-based policy manager might need to switch operating points in response to infrequent changes in system state. The UI also embodies the notion of a list of "named operating points" that could be registered by other means, such as loading a module with data structures that encode the desired operating points (as David Brownell has suggested). The named operating points registered from such other interfaces can also be activated from the sysfs UI (that is, the hardware can be told to run at that operating point), as an example of how to tie in userspace policy managers with such a scheme. The example platform backend this time is for an embedded system: the TI OMAP1 family of processors used for numerous mobile phones and PDAs. It may better illustrate why managing multiple power parameters might be a useful capability. I haven't put out an example of cpufreq integration this time, but the idea has changed little from before. In case it's getting lost in all these details, the main point of all this is to pose the question: "are arbitrary power parameters arranged into a set with mutually consistent values (called here an operating point) a good low-level abstraction for system power management of a wide variety of platforms?" Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] PowerOP Take 2 1/3: ARM OMAP1 platform support
it is that runs on the DSP (such as via shared memory message libs from the silicon vendor). Soon the other core will be running Linux as well, and the two OSes will need to coordinate the system power management, which will be an interesting thing to tackle. lowpwr1 = assert ULPD LOW_PWR, voltage scale low Could you describe the policy effect of this bit? I suspect a good "PCs don't work like that!" example is lurking here. That interacts with some other bits, and code ... when would setting this be good, or bad? This is how Dynamic Voltage Scaling is done on OMAP1 platforms. Assuming you've setup an operating point that is validated to work at the reduced voltage level on your hardware by TI (these are two voltage levels available), you can optionally specify to run at reduced voltage, possibly at an increased cost in latency of transitioning between operating points as voltage ramps up or down. In the case of DPM running on an OMAP 1610 H2, you could tell the system to run at 1.5V when not idle and at 1.1V when idle, although depending on the ramp time (I can't recall for that board, but for some non-OMAPs this can be significant) and the realtime constraints of your app there could be missed deadlines under such a policy. If the system isn't running anything with a tight deadline then it may be fine to stay at 1.1V or voltage scale between the two. Other parameters such as DSPMMUDIV, LCDDIV, and ARM_PERDIV might also be useful. Again, PERDIV changes would need to involve clock.c to cascade the changes through the clock tree. Change PERDIV and you'll need to recalculate the peripheral clocks that derive from it... better not do it while an I/O operation is actively using it! On some other platforms this actually becomes necessary, but for OMAP1's the trouble with doing so probably precludes anybody from using it. As with TCDIV, that makes a useful example of something that is clearly not within the "cpufreq" domain. I'll try to cook up an XScale PXA27x example, which adds multiple memory and system bus frequencies supported per CPU MHz, quick run vs.turbo mode switching of CPU MHz, and some other exotic features. It has a very specific set of "product points" validated by Intel that correspond to the operating point abstraction. If nothing else, it may be instructive to consider the variety of ways embedded platforms are being designed to be power managed. -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
2.4.x kernel BUG at filemap.c:81
Running slackware 10 and 10.1, with kernels 2.4.26, 2.4.27, 2.4.28, 2.4.29 with highmem 4GB, and highmem i/o support enabled, I get a system lockup. This happens in both X and console. Happens with and without my Nvidia drivers loaded. I cannot determine what makes this bug present it self besides highmem and high i/o support enabled. Im guessing the system is fine until highmem is actually used to some point and then it borks, but I really have no idea and so im just making a random guess. I ran memtest86 for a few hours a while ago thinking that it may be bad memory, but that did not seem to be the problem. If you need anymore information, or have questions, or wish me to test anything, PLEASE feel free to contact me, I would really like to see this bug resolved. =) -- Todd Shetter Feb 8 19:49:31 quark kernel: kernel BUG at filemap.c:81! Feb 8 19:49:31 quark kernel: invalid operand: Feb 8 19:49:31 quark kernel: CPU:0 Feb 8 19:49:31 quark kernel: EIP:0010:[]Tainted: P Feb 8 19:49:31 quark kernel: EFLAGS: 00210206 Feb 8 19:49:31 quark kernel: eax: 4000 ebx: c18012f0 ecx: edx: f7f06e60 Feb 8 19:49:31 quark kernel: esi: 0001 edi: f7f06e60 ebp: f3461e9c esp: f3461e98 Feb 8 19:49:31 quark kernel: ds: 0018 es: 0018 ss: 0018 Feb 8 19:49:31 quark kernel: Process hdparm (pid: 2191, stackpage=f3461000) Feb 8 19:49:31 quark kernel: Stack: c18012f0 f3461eb0 c0128c78 f07d0360 c18012f0 47a0 f3461edc c0128d20 Feb 8 19:49:31 quark kernel:c18012f0 f07d0360 47a0 f7f06e60 f7f06e60 eed4ab34 47a0 Feb 8 19:49:31 quark kernel:eed4ab34 f3461f00 c012941e 001f 479f 0020 012a1f16 f7f06de4 Feb 8 19:49:31 quark kernel: Call Trace:[] [] [] [] [] Feb 8 19:49:31 quark kernel: [] [] [] [] [] Feb 8 19:49:31 quark kernel: Feb 8 19:49:31 quark kernel: Code: 0f 0b 51 00 a8 05 2e c0 6a 01 53 e8 ef 7e 00 00 8b 5d fc 89 00:00.0 Host bridge: Intel Corp. 82875P Memory Controller Hub (rev 02) Subsystem: ABIT Computer Corp.: Unknown device 1014 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- Latency: 0 Region 0: Memory at e800 (32-bit, prefetchable) [size=128M] Capabilities: [e4] #09 [2106] Capabilities: [a0] AGP version 3.0 Status: RQ=32 Iso- ArqSz=2 Cal=2 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3+ Rate=x4,x8 Command: RQ=1 ArqSz=0 Cal=2 SBA+ AGP+ GART64- 64bit- FW- Rate=x8 00:01.0 PCI bridge: Intel Corp. 82875P Processor to AGP Controller (rev 02) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- Latency: 64 Bus: primary=00, secondary=01, subordinate=01, sec-latency=32 I/O behind bridge: f000-0fff Memory behind bridge: f800-faff Prefetchable memory behind bridge: f000-f7ff BridgeCtl: Parity- SERR- NoISA+ VGA+ MAbort- >Reset- FastB2B- 00:03.0 PCI bridge: Intel Corp. 82875P Processor to PCI to CSA Bridge (rev 02) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- Latency: 32 Bus: primary=00, secondary=02, subordinate=02, sec-latency=0 I/O behind bridge: a000-afff Memory behind bridge: fb00-fb0f Prefetchable memory behind bridge: fff0-000f BridgeCtl: Parity- SERR- NoISA+ VGA- MAbort- >Reset- FastB2B- 00:1e.0 PCI bridge: Intel Corp. 82801BA/CA/DB/EB/ER Hub interface to PCI Bridge (rev c2) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- Latency: 0 Bus: primary=00, secondary=03, subordinate=03, sec-latency=32 I/O behind bridge: f000-0fff Memory behind bridge: fff0-000f Prefetchable memory behind bridge: fff0-000f BridgeCtl: Parity- SERR+ NoISA+ VGA- MAbort- >Reset- FastB2B- 00:1f.0 ISA bridge: Intel Corp. 82801EB/ER (ICH5/ICH5R) LPC Bridge (rev 02) Control: I/O+ Mem+ BusMaster+ SpecCycle+ MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- Latency: 0 00:1f.2 IDE interface: Intel Corp. 82801EB (ICH5) Serial ATA 150 Storage Controller (rev 02) (prog-if 8a [Master SecP PriP]) Subsystem: ABIT Computer Corp.: Unknown device 1014 Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- Latency: 0 Interrupt: pin A routed to IRQ 11 Region 0: I/O ports at
Re: 2.4.x kernel BUG at filemap.c:81
Marcelo Tosatti wrote: On Wed, Feb 09, 2005 at 12:15:03AM -0500, Todd Shetter wrote: Running slackware 10 and 10.1, with kernels 2.4.26, 2.4.27, 2.4.28, 2.4.29 with highmem 4GB, and highmem i/o support enabled, I get a system lockup. This happens in both X and console. Happens with and without my Nvidia drivers loaded. I cannot determine what makes this bug present it self besides highmem and high i/o support enabled. Im guessing the system is fine until highmem is actually used to some point and then it borks, but I really have no idea and so im just making a random guess. I ran memtest86 for a few hours a while ago thinking that it may be bad memory, but that did not seem to be the problem. If you need anymore information, or have questions, or wish me to test anything, PLEASE feel free to contact me, I would really like to see this bug resolved. =) -- Todd Shetter Feb 8 19:49:31 quark kernel: kernel BUG at filemap.c:81! Feb 8 19:49:31 quark kernel: invalid operand: Feb 8 19:49:31 quark kernel: CPU:0 Feb 8 19:49:31 quark kernel: EIP:0010:[]Tainted: P Hi Todd, Why is your kernel tainted ? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ I had the nvidia 1.0-6629 driver loaded when I got that error. I compiled the kernel using the slackware 10.1 config, enabled highmem 4GB support, highmem i/o, and then some kernel hacking options including debugging for highmen related things. I booted, loaded X with KDE, opened firefox a few times, and then started running hdparm because some newer 2.4.x kernels dont play nice with my SATA, ICH5, and DMA. hdparm segfaulted while running the drive read access portion of its tests, and things locked up from there in about 30secs. I've gotten the same error with the nvidia driver not loaded, so I dont think that is part of the problem. As I said, if you want me to test or try anything feel free to ask. =) -- Todd Shetter - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.x kernel BUG at filemap.c:81
Marcelo Tosatti wrote: On Wed, Feb 09, 2005 at 11:30:05AM -0500, Todd Shetter wrote: Marcelo Tosatti wrote: On Wed, Feb 09, 2005 at 12:15:03AM -0500, Todd Shetter wrote: Running slackware 10 and 10.1, with kernels 2.4.26, 2.4.27, 2.4.28, 2.4.29 with highmem 4GB, and highmem i/o support enabled, I get a system lockup. This happens in both X and console. Happens with and without my Nvidia drivers loaded. I cannot determine what makes this bug present it self besides highmem and high i/o support enabled. Im guessing the system is fine until highmem is actually used to some point and then it borks, but I really have no idea and so im just making a random guess. I ran memtest86 for a few hours a while ago thinking that it may be bad memory, but that did not seem to be the problem. If you need anymore information, or have questions, or wish me to test anything, PLEASE feel free to contact me, I would really like to see this bug resolved. =) -- Todd Shetter Feb 8 19:49:31 quark kernel: kernel BUG at filemap.c:81! Feb 8 19:49:31 quark kernel: invalid operand: Feb 8 19:49:31 quark kernel: CPU:0 Feb 8 19:49:31 quark kernel: EIP:0010:[]Tainted: P Hi Todd, Why is your kernel tainted ? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ I had the nvidia 1.0-6629 driver loaded when I got that error. I compiled the kernel using the slackware 10.1 config, enabled highmem 4GB support, highmem i/o, and then some kernel hacking options including debugging for highmen related things. I booted, loaded X with KDE, opened firefox a few times, and then started running hdparm because some newer 2.4.x kernels dont play nice with my SATA, ICH5, and DMA. hdparm segfaulted while running the drive read access portion of its tests, and things locked up from there in about 30secs. I've gotten the same error with the nvidia driver not loaded, so I dont think that is part of the problem. As I said, if you want me to test or try anything feel free to ask. =) Todd, Would be interesting to have the oops output without the kernel nvidia module. Do you have that saved? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ Sorry, it took me FOREVER to get this bug to appear again, and this time its a little different. Feb 9 15:20:37 quark kernel: kernel BUG at page_alloc.c:142! Feb 9 15:20:37 quark kernel: invalid operand: Feb 9 15:20:37 quark kernel: CPU:0 Feb 9 15:20:37 quark kernel: EIP:0010:[]Not tainted Feb 9 15:20:37 quark kernel: EFLAGS: 00013206 Feb 9 15:20:37 quark kernel: eax: 0114 ebx: c17e1160 ecx: 4000 edx: Feb 9 15:20:37 quark kernel: esi: edi: eea037f0 ebp: f0f27f24 esp: f0f27ef0 Feb 9 15:20:37 quark kernel: ds: 0018 es: 0018 ss: 0018 Feb 9 15:20:37 quark kernel: Process X (pid: 2206, stackpage=f0f27000) Feb 9 15:20:37 quark kernel: Stack: c0324d68 00037000 c120 c17e11c0 c0324cb8 c1030020 c0324cf0 3207 Feb 9 15:20:37 quark kernel:c17e1160 f0f27f24 2a05c067 001fc000 eea037f0 f0f27f68 c012531c c17e1160 Feb 9 15:20:37 quark kernel:c17e1160 01fd 002f 4980 496f f0eee494 4940 Feb 9 15:20:37 quark kernel: Call Trace:[] [] [] [] Feb 9 15:20:37 quark kernel: Feb 9 15:20:37 quark kernel: Code: 0f 0b 8e 00 bf 06 2e c0 8b 53 08 85 d2 74 08 0f 0b 90 00 bf Feb 9 15:30:41 quark kernel: <6>SysRq : SAK 00:00.0 Host bridge: Intel Corp. 82875P Memory Controller Hub (rev 02) Subsystem: ABIT Computer Corp.: Unknown device 1014 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- Latency: 0 Region 0: Memory at e800 (32-bit, prefetchable) [size=128M] Capabilities: [e4] #09 [2106] Capabilities: [a0] AGP version 3.0 Status: RQ=32 Iso- ArqSz=2 Cal=2 SBA+ ITACoh- GART64- HTrans- 64bit- FW+ AGP3+ Rate=x4,x8 Command: RQ=1 ArqSz=0 Cal=2 SBA+ AGP- GART64- 64bit- FW- Rate= 00:01.0 PCI bridge: Intel Corp. 82875P Processor to AGP Controller (rev 02) (prog-if 00 [Normal decode]) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap- 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- Latency: 64 Bus: primary=00, secondary=01, subordinate=01, sec-latency=32 I/O behind bridge: f000-0fff Memory behind bridge: f800-faff Prefetchable memory behind bridge: f000-f7ff BridgeCtl: Par
Re: 2.4.x kernel BUG at filemap.c:81
Marcelo Tosatti wrote: On Wed, Feb 09, 2005 at 03:47:28PM -0500, Todd Shetter wrote: Running slackware 10 and 10.1, with kernels 2.4.26, 2.4.27, 2.4.28, 2.4.29 with highmem 4GB, and highmem i/o support enabled, I get a system lockup. This happens in both X and console. Happens with and without my Nvidia drivers loaded. I cannot determine what makes this bug present it self besides highmem and high i/o support enabled. Im guessing the system is fine until highmem is actually used to some point and then it borks, but I really have no idea and so im just making a random guess. I ran memtest86 for a few hours a while ago thinking that it may be bad memory, but that did not seem to be the problem. If you need anymore information, or have questions, or wish me to test anything, PLEASE feel free to contact me, I would really like to see this bug resolved. =) -- Todd Shetter Feb 8 19:49:31 quark kernel: kernel BUG at filemap.c:81! Feb 8 19:49:31 quark kernel: invalid operand: Feb 8 19:49:31 quark kernel: CPU:0 Feb 8 19:49:31 quark kernel: EIP:0010:[]Tainted: P Hi Todd, Why is your kernel tainted ? I had the nvidia 1.0-6629 driver loaded when I got that error. I compiled the kernel using the slackware 10.1 config, enabled highmem 4GB support, highmem i/o, and then some kernel hacking options including debugging for highmen related things. I booted, loaded X with KDE, opened firefox a few times, and then started running hdparm because some newer 2.4.x kernels dont play nice with my SATA, ICH5, and DMA. hdparm segfaulted while running the drive read access portion of its tests, and things locked up from there in about 30secs. I've gotten the same error with the nvidia driver not loaded, so I dont think that is part of the problem. As I said, if you want me to test or try anything feel free to ask. =) Todd, Would be interesting to have the oops output without the kernel nvidia module. Do you have that saved? Sorry, it took me FOREVER to get this bug to appear again, and this time its a little different. Hum, both BUGs are due to a page with alive ->buffers mapping. Did it crashed right after hdparm now too? Can you boot your box without SATA drivers, configuring the interface to IDE mode ? Which problems are you facing with newer v2.4.x kernels and SATA? Im waiting for the system to crash, so I figured I might as well get on with the SATA problems Running 2.4.29 neither the CONFIG_BLK_DEV_IDE_SATA nor the CONFIG_SCSI_SATA are set currently and DMA is not enabled on either of my drives, hda: ST380013AS, hdb: WDC WD2500SD-01KCB0, hdc: Maxtor 94610U6. Setting DMA manually on the hard drives yields a HDIO_SET_DMA failed: Operation not permitted error. Using 2.4.26, DMA worked fine on the drives. Under 2.4.27, 2.4.28, and 2.4.29 using CONFIG_SCSI_SATA does not allow setting of DMA on the drives, yielding a HDIO_SET_DMA failed: Operation not permitted error, and the transfer speeds reported by hdparm are at about 3MB/s. Under 2.4.29 using CONFIG_BLK_DEV_IDE_SATA the DMA is set fine upon boot, and I get good transfers, hdparm reports 58MB/s on my Western Digital drive. I have not tested using CONFIG_BLK_DEV_IDE_SATA on any previous kernel versions. Well, still no crash yetAgain, anything else you want me to try or do just let me know. -- Todd Shetter - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: 2.4.x kernel BUG at filemap.c:81
Jeff Garzik wrote: Marcelo Tosatti wrote: On Wed, Feb 09, 2005 at 11:23:42PM -0500, Todd Shetter wrote: Marcelo Tosatti wrote: On Wed, Feb 09, 2005 at 03:47:28PM -0500, Todd Shetter wrote: Running slackware 10 and 10.1, with kernels 2.4.26, 2.4.27, 2.4.28, 2.4.29 with highmem 4GB, and highmem i/o support enabled, I get a system lockup. This happens in both X and console. Happens with and without my Nvidia drivers loaded. I cannot determine what makes this bug present it self besides highmem and high i/o support enabled. Im guessing the system is fine until highmem is actually used to some point and then it borks, but I really have no idea and so im just making a random guess. I ran memtest86 for a few hours a while ago thinking that it may be bad memory, but that did not seem to be the problem. If you need anymore information, or have questions, or wish me to test anything, PLEASE feel free to contact me, I would really like to see this bug resolved. =) Todd Shetter Feb 8 19:49:31 quark kernel: kernel BUG at filemap.c:81! Feb 8 19:49:31 quark kernel: invalid operand: Feb 8 19:49:31 quark kernel: CPU:0 Feb 8 19:49:31 quark kernel: EIP:0010:[] Tainted: P Hi Todd, Why is your kernel tainted ? I had the nvidia 1.0-6629 driver loaded when I got that error. I compiled the kernel using the slackware 10.1 config, enabled highmem 4GB support, highmem i/o, and then some kernel hacking options including debugging for highmen related things. I booted, loaded X with KDE, opened firefox a few times, and then started running hdparm because some newer 2.4.x kernels dont play nice with my SATA, ICH5, and DMA. hdparm segfaulted while running the drive read access portion of its tests, and things locked up from there in about 30secs. I've gotten the same error with the nvidia driver not loaded, so I dont think that is part of the problem. As I said, if you want me to test or try anything feel free to ask. =) Todd, Would be interesting to have the oops output without the kernel nvidia module. Do you have that saved? Sorry, it took me FOREVER to get this bug to appear again, and this time its a little different. Hum, both BUGs are due to a page with alive ->buffers mapping. Did it crashed right after hdparm now too? Can you boot your box without SATA drivers, configuring the interface to IDE mode ? Which problems are you facing with newer v2.4.x kernels and SATA? Im waiting for the system to crash, so I figured I might as well get on with the SATA problems Running 2.4.29 neither the CONFIG_BLK_DEV_IDE_SATA nor the CONFIG_SCSI_SATA are set currently and DMA is not enabled on either of my drives, hda: ST380013AS, hdb: WDC WD2500SD-01KCB0, hdc: Maxtor 94610U6. Setting DMA manually on the hard drives yields a HDIO_SET_DMA failed: Operation not permitted error. Using 2.4.26, DMA worked fine on the drives. Under 2.4.27, 2.4.28, and 2.4.29 using CONFIG_SCSI_SATA does not allow setting of DMA on the drives, yielding a HDIO_SET_DMA failed: Operation not permitted error, and the transfer speeds reported by hdparm are at about 3MB/s. I think thats expected. Jeff? If (a) this is Intel ICH hardware and (b) it is running in combined mode, it will get poor performance. For the SATA drivers, they ALWAYS run in DMA mode, so there is no need to support HDIO_SET_DMA. Jeff Yes, I am running the SATA in combined mode. That, along with Auto mode, is the only way I can get all my devices to work properly under linux. Under Enhanced mode there is no primary IDE, only second and third. My ATA drive and CDROM are hdc and hdd, which is normal, my first SATA drive which would normally be hda should be hde and the other drive should be hdf. But the kernel doesnt see them and I cant boot off my normally hda, I get an invalid root= option kernel panic. I did edit lilo to reflect the change, but the drive wasnt even listed when hdc and hdd were. Under SATA only, I only get the SATA drives, and not my ATA drive nor my CDROM. With SATA diabled, I cant get the SATA drives at all which I expected. Under Auto, which is the same as combined as far as I can tell, the kernel doesnt enable DMA for the drives using libata, and trying to enable it by hand yields the HDIO_SET_DMA failed: Operation not permitted error. So, unless there is better performance from libata with ICH5 combined mode, I am 'forced' to stay with CONFIG_BLK_DEV_IDE_SATA, which isnt a bad thing really. I get good performance from the drives, and havent had any problems yet. On a another note, kernel 2.4.30-pre1 with highmem is working fine for me, using CONFIG_BLK_DEV_IDE_SATA, uptime was a little over 2 days. -- Todd Shetter - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Custom power states for non-ACPI systems
Advertise custom sets of system power states for non-ACPI systems. Currently, /sys/power/state shows and accepts a static set of choices that are not necessarily meaningful on all platforms (for example, suspend-to-disk is an option even on diskless embedded systems, and the meaning of standby vs. suspend-to-mem is not well-defined on non-ACPI-systems). This patch allows the platform to register power states with meaningful names that correspond to the platform's conventions (for example, "big sleep" and "deep sleep" on TI OMAP), and only those states that make sense for the platform. For the time being, the canned set of PM_SUSPEND_STANDBY/MEM/DISK etc. symbols are preserved, since knowledge of the meanings of those values have crept into drivers. There is a separate effort underway to divorce driver suspend flags from the platform suspend state identifiers. Once that is accomplished, we can then replace the suspend states available with an entirely custom set. For example, various embedded platforms have multiple power states that roughly correspond to suspend-to-mem, and each could be advertised and requested via the PM interfaces, once drivers no longer look for the one and only PM_SUSPEND_MEM system suspend state. If the platform does not register a custom set of power states then the present-day set remains available as a default. Will send separately a patch for an embedded platform to show usage. Comments appreciated. Index: linux-2.6.10/include/linux/pm.h === --- linux-2.6.10.orig/include/linux/pm.h2005-03-02 00:41:43.0 + +++ linux-2.6.10/include/linux/pm.h 2005-03-02 01:12:14.0 + @@ -216,8 +216,14 @@ #definePM_DISK_REBOOT ((__force suspend_disk_method_t) 4) #definePM_DISK_MAX ((__force suspend_disk_method_t) 5) +struct pm_suspend_method { + char *name; + suspend_state_t state; +}; + struct pm_ops { suspend_disk_method_t pm_disk_mode; + struct pm_suspend_method *pm_suspend_methods; int (*prepare)(suspend_state_t state); int (*enter)(suspend_state_t state); int (*finish)(suspend_state_t state); Index: linux-2.6.10/kernel/power/main.c === --- linux-2.6.10.orig/kernel/power/main.c 2005-03-02 00:41:41.0 + +++ linux-2.6.10/kernel/power/main.c2005-03-02 01:15:21.0 + @@ -228,11 +228,22 @@ -char * pm_states[] = { - [PM_SUSPEND_STANDBY]= "standby", - [PM_SUSPEND_MEM]= "mem", - [PM_SUSPEND_DISK] = "disk", - NULL, +struct pm_suspend_method pm_default_suspend_methods[] = { + { + .name = "standby", + .state = PM_SUSPEND_STANDBY, + }, + { + .name = "mem", + .state = PM_SUSPEND_MEM, + }, + { + .name = "disk", + .state = PM_SUSPEND_DISK, + }, + { + .name = NULL, + }, }; @@ -324,19 +335,22 @@ { int i; char * s = buf; + struct pm_suspend_method *methods = pm_ops->pm_suspend_methods; + + if (! methods) + methods = pm_default_suspend_methods; + + for (i=0; methods[i].name; i++) + s += sprintf(s,"%s ",methods[i].name); - for (i = 0; i < PM_SUSPEND_MAX; i++) { - if (pm_states[i]) - s += sprintf(s,"%s ",pm_states[i]); - } s += sprintf(s,"\n"); return (s - buf); } static ssize_t state_store(struct subsystem * subsys, const char * buf, size_t n) { - suspend_state_t state = PM_SUSPEND_STANDBY; - char ** s; + struct pm_suspend_method *methods = pm_ops->pm_suspend_methods; + int i; char *p; int error; int len; @@ -344,12 +358,15 @@ p = memchr(buf, '\n', n); len = p ? p - buf : n; - for (s = &pm_states[state]; state < PM_SUSPEND_MAX; s++, state++) { - if (*s && !strncmp(buf, *s, len)) + if (! methods) + methods = pm_default_suspend_methods; + + for (i = 0; methods[i].name; i++) { + if (!strncmp(buf, methods[i].name, len)) break; } - if (*s) - error = enter_state(state); + if (methods[i].name) + error = enter_state(methods[i].state); else error = -EINVAL; return error ? error : n; - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Custom power states for non-ACPI systems
An example of custom power states for the TI OMAP family. /sys/power/states supports a state named "deepsleep", which corresponds to the platform state actually entered by the present-day system suspend handler. It no longer offers the option of "disk" suspend which would not normally be available in an OMAP-based system, nor does it offer the choices "standby" or "mem", which are currently somewhat arbitrarily mapped to actual platform power states on OMAPs. In the future the OMAP could be extended to offer the choice of "big sleep" as well, another platform-specific low-power mode that falls under the general category of suspend-to-mem, once it is feasible to no longer use the same set of system suspend state values for all platforms and drivers (as mentioned in the base note). Index: linux-2.6.10/arch/arm/mach-omap/pm.c === --- linux-2.6.10.orig/arch/arm/mach-omap/pm.c 2005-03-02 01:10:27.0 + +++ linux-2.6.10/arch/arm/mach-omap/pm.c2005-03-02 01:13:41.0 + @@ -576,8 +576,20 @@ } +static struct pm_suspend_method omap_pm_suspend_methods[] = { + { + .name = "deepsleep", + .state = PM_SUSPEND_MEM, + }, + { + .name = NULL, + }, +}; + + struct pm_ops omap_pm_ops ={ .pm_disk_mode = 0, + .pm_suspend_methods = omap_pm_suspend_methods, .prepare= omap_pm_prepare, .enter = omap_pm_enter, .finish = omap_pm_finish, - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] [PATCH] Custom power states for non-ACPI systems
Pavel Machek wrote: Hi! Advertise custom sets of system power states for non-ACPI systems. Currently, /sys/power/state shows and accepts a static set of choices that are not necessarily meaningful on all platforms (for example, suspend-to-disk is an option even on diskless embedded systems, and the meaning of standby vs. suspend-to-mem is not well-defined on non-ACPI-systems). This patch allows the platform to register power states with meaningful names that correspond to the platform's conventions (for example, "big sleep" and "deep sleep" on TI OMAP), and only those states that make sense for the platform. Maybe this is a bit overdone? Of course you can have suspend-to-disk on most embedded systems; CF flash card looks just like disk, and you should be able to suspend to it. It's possible (on those with CF/PCMCIA etc.), although due to various problems with things like flash size, write speed, and wear leveling it's not very common to do so (I've seen two vendors abandon plans for this, but no doubt somebody does do it) -- that's why I'd like to have the particular platform register the capability if it happens to have it, but no, not a big deal. If OMAP has "big sleep" and "deep sleep", why not simply map them to "standby" and "suspend-to-ram"? In fact that's more or less what happens (or will happen once drivers like USB stop looking for PM_SUSPEND_MEM, etc.). There are other platforms with more than 2 sleep states (say, XScale PXA27x), so this will start to get a bit problematic. And it seens so easy to truly handle the platform's states instead of pretending ACPI S1/S3/S4 are the only methods to suspend any system. If it's preferable, how about replacing the /sys/power/state "standby" and "mem" values to "sleep", and have a /sys/power/sleep attribute that tells the methods of sleep available for the platform, much like suspend-to-disk methods are handled today? So the sleep attribute would handle "standby" and "mem" for ACPI systems, and other values for non-ACPI systems. Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] [PATCH] Custom power states for non-ACPI systems
Pavel Machek wrote: ... ...but adding new /sys/power/state might be okay. We should not have introduced "standby" in the first place [but I guess it is not worth removing now]. If something has more than 2 states (does user really want to enter different states in different usage?), I guess we can add something like "deepmem" or whatever. Is there something with more than 3 states? In most of the cases I'm thinking of, it wouldn't be a user requesting a state but rather software (say, a cell phone progressively entering lower power states due to inactivity). I haven't noticed a platform with more than 3 low-power modes so far, but I'm sure it'll happen soon. If the time isn't right for incompatible changes to these interfaces then I guess mapping standby and mem to platform-specific things will work for now, maybe with some tweak to allow a choice of actual state entered. At some more opportune time in the future I'll suggest an attribute that allows a choice of platform-specific method of suspend-to-mem, somewhat like the "disk" attribute for suspend-to-disk. Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] kernel/power/disk.c trivial cleanups
* Remove duplicate include. * Avoid "mode set to ''" message when error updating /sys/power/disk. Signed-off-by: Todd Poynor <[EMAIL PROTECTED]> --- linux-2.6.11-rc4-orig/kernel/power/disk.c 2005-02-23 09:47:03.0 -0800 +++ linux-2.6.11-rc4-pm/kernel/power/disk.c 2005-03-03 05:00:18.609860968 -0800 @@ -16,7 +16,6 @@ #include #include #include -#include #include "power.h" @@ -321,8 +320,9 @@ } else error = -EINVAL; - pr_debug("PM: suspend-to-disk mode set to '%s'\n", -pm_disk_modes[mode]); + if (mode == pm_disk_mode) + pr_debug("PM: suspend-to-disk mode set to '%s'\n", +pm_disk_modes[mode]); up(&pm_sem); return error ? error : n; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] [PATCH] Custom power states for non-ACPI systems
Pavel Machek wrote: ... In most of the cases I'm thinking of, it wouldn't be a user requesting a state but rather software (say, a cell phone progressively entering lower power states due to inactivity). I haven't noticed a platform with more than 3 low-power modes so far, Are not your power states more like cpu power states? These are expected to be system states, and sleeping system does not take calls, etc... There's a great variety of behaviors and usage models out there, not sure I can draw a useful distinction between cpu power states vs. system states, but the net effect could be considered to be approximately the same in typical embedded uses: the drivers are called to place appropriate devices in a low(er)-power state, various platform thingies are slowed or powered off, and the system stops waiting for something to wake it up. In some cases the system does not wake up until an explicit user action (button press, etc.), but more commonly wake-on-device-activity (including ring from telephony unit) or time-based actions (including wake on alarm from event in user's datebook) is also wanted (rather like wake-on-LAN et al). I don't think this would correspond well to hardware-managed CPU power states like ACPI C states, for example. Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [linux-pm] [PATCH] Custom power states for non-ACPI systems
Nigel Cunningham wrote: ... Two way communication between a userspace policy manager and kernel drivers is implemented via DBus. In this scheme, 'kernel drivers' doesn't just refer to the drivers for hardware. It refers to anything remotely power management related, including code to implement suspend-to-RAM, to disk or the like, ACPI drivers or code to implement system power states. The policy manager can enumerate devices and inter-relationships, capabilities, settings and status information, set and query policies and implementation results. The drivers can notify events. This communication doesn't use complicated structures or type definitions. Rather, all the nous regarding interpretation of the messages that are sent is in the policy manager and the drivers. One driver might say it's capable of states called "D0, D1 and D3", another (system) states called "Deep Sleep" and "Big Sleep". Nothing but the driver itself and userspace manager need to how to interpret & use these states. Inter-relationships between drivers are _not_ included in this information. The policy manager sets policy, the drivers deal with the specifics of implementing it. This all sounds exactly like the way we're headed as well, so I'm definitely interested in anything I can do to help. Was thinking that can start defining kobject_uevent power events and attributes (with enough detail that acpid could use it instead of /proc if the ACPI drivers were to convert to it). Capturing the relationships between drivers is difficult. If nobody's already looking into this then I'll take this up soon. The userspace manager can in turn [en|dis]able capabilites and send a list of run-time states that the driver can move between according to its own logic (eg lack of active children) without notifying the userspace manager. This would fit in with your power modes above, even to the level of "cpu idle". At dynamicpower.sf.net we do something similar for cpufreq-style scaling of platform clocks and voltages, setting up desired policy for various platform clocks/voltages according to changes in low-level system state (primarily scheduler state) from userspace and then letting the state machine run without interaction. Similar policy objects for devices sounds intriguing, although the device-specific nature of event triggers probably makes this quite difficult. Mac OS X support for some of these concepts is documented at developer.apple.com, looking for ideas to steal... Thanks, -- Todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
SMP races in proc with thread_struct
Perhaps this is old news...but... I can easily create a race when reading /proc//stat (fs/proc/{base.c,array.c}) where a rapidly reading application, such as "top", starts reading stats for a thread which goes away during the read. This is easily reproduced with a program that rapidly forks and exits while top is running. On inspection, I don't see how the code can expect the thread_struct to stay around since it is not holding any lock for the duration of its use. The code could hold the thread_struct's lock (after verifying it still exists while holding tasklist_lock I would imagine), but for performance I would think a better solution would be to copy the struct since stale data is probably ok in this case. Dereferencing a non-existent thread_struct is clearly not ok. Would anyone familiar with this code care to comment? -- -todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP races in proc with thread_struct
Alexander Viro wrote: > > On Tue, 1 May 2001, Todd Inglett wrote: > > > Perhaps this is old news...but... > > > > I can easily create a race when reading /proc//stat > > (fs/proc/{base.c,array.c}) where a rapidly reading application, such as > > "top", starts reading stats for a thread which goes away during the > > read. This is easily reproduced with a program that rapidly forks and > > exits while top is running. > > > > On inspection, I don't see how the code can expect the thread_struct to > > stay around since it is not holding any lock for the duration of its > > use. The code could hold the thread_struct's lock (after verifying it > > still exists while holding tasklist_lock I would imagine), but for > > performance I would think a better solution would be to copy the struct > > since stale data is probably ok in this case. > > > > Dereferencing a non-existent thread_struct is clearly not ok. > > > > Would anyone familiar with this code care to comment? > > Code bumps the reference count of couple of pages that hold task_struct > when it opens the file. Yes that's right. [Note that I erroneously wrote thread_struct when I meant task_struct]. On closer inspection I see it is the *parent* task_struct that is the problem here. The task has indeed exited and the memory for the task_struct is correctly held. However, the /proc code will grab tasklist_lock and dereference the parent task_struct to get the parent pid. Grabbing tasklist_lock is good, of course, when the task is live because the parent could be switched at any time. But in this case the child is already done. And so is the parent. So the child's parent task_struct is gone, but the stale held task_struct still points to where it once was. Does this sound like a possible scenerio? -- -todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP races in proc with thread_struct
Ok, I've got this isolated. Here's the sequence of events: 1. Some process T (probably "top") opens /proc/N/stat. 2. While holding tasklist_lock the proc code does a get_task_struct() to add a ref count to the page. 3. Process N exits. 4. The parent of process N exits. 5. Process T reads from the open file. This calls proc_pid_stat() which dereferences N's task_struct. This is ok as Alexander points out because a reference is held. 6. Using N's task_struct process T attempt to dereference the *parent* task struct. It assumes this is ok because: A) it is holding tasklist_lock so N cannot be reparented in a race. B) every process *always* has a valid parent. But this is where hell breaks loose. Every process has a valid parent -- unless it is dead and nobody cares. Process N has already exited and released from the tasklist while its parent was still alive. There was no reason to reparent it. It just got released. So N's task_struct has a dangling ptr to its parent. Nobody is holding the parent task_struct, either. When the parent died memory for its task_struct was released. This is ungood. My opinion here is that this is proc's problem. When we free a task_struct it could be "cleaned up" of dangling ptrs, but this is a hack to cover a bug in proc. This is not isolated to the parent task_struct, either. The task_struct mm is also dereferenced. It is pretty easy to validate a parent task_struct ptr (just hold tasklist_lock and run the list to check if it is still valid -- might not be the *right* task, but it will still be valid). However, how do you validate the mm is ok? It would be nice if there were an easy way to detect when the process gets into this state. Certainly it is in this state if the page reference count is 1. But multiple processes *could* be reading the same proc file holding additional artificial ref counts. Hmm...perhaps if I scan the tasklist for my own task? If I am not in the list I either return an error or faked info. I'll try this -- -todd BTW, by adding code to validate the parent my test has now run for 18 hours on a 4-way without failure. It otherwise would fail within the first 5 minutes. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SMP races in proc with thread_struct
Andreas Ferber wrote: > > On Fri, May 04, 2001 at 10:46:43PM +1000, Keith Owens wrote: > > > For a read only case, the only important > > thing is not to die, one occurrence of bad data is tolerable. > > Strong NACK. The pages where the bad data comes from may in some cases > already be reclaimed for other data, probably something security > relevant, which should never ever be given even read access by an > unauthorized user. Even if this event may be a very rare case, one > single occurrence of this is one to much. Agreed. Worse, it is not readonly. The /proc code task_lock's the task struct, thus writing to it. I'll post a patch shortly once I've tested it. Worse case only if the task is exiting I sweep the tasklist looking for the parent to see if the parent is still valid. I am not verifying if it is the actual parent (it might be a new task allocated at the same spot). I could just report 0 (or 1) for the parent for any process that is exiting, but then you won't be able to see the ppid for zombies. Or is there another state I can look for? What I really need is PF_EXITED :). I am a little concerned also about mm, file, tty and sig fields. These appear to be NULLed in do_exit(), but I haven't tracked down tty and sig yet. -- -todd - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] android: binder: Disable preemption while holding the global binder lock
In Android systems, the display pipeline relies on low latency binder transactions and is therefore sensitive to delays caused by contention for the global binder lock. Jank is siginificantly reduced by disabling preemption while the global binder lock is held. Originally-from: Riley Andrews Signed-off-by: Todd Kjos --- drivers/android/binder.c | 194 +++ 1 file changed, 146 insertions(+), 48 deletions(-) diff --git a/drivers/android/binder.c b/drivers/android/binder.c index 16288e7..c36e420 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -379,6 +379,7 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags) struct files_struct *files = proc->files; unsigned long rlim_cur; unsigned long irqs; + int ret; if (files == NULL) return -ESRCH; @@ -389,7 +390,11 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags) rlim_cur = task_rlimit(proc->tsk, RLIMIT_NOFILE); unlock_task_sighand(proc->tsk, &irqs); - return __alloc_fd(files, 0, rlim_cur, flags); + preempt_enable_no_resched(); + ret = __alloc_fd(files, 0, rlim_cur, flags); + preempt_disable(); + + return ret; } /* @@ -398,8 +403,11 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags) static void task_fd_install( struct binder_proc *proc, unsigned int fd, struct file *file) { - if (proc->files) + if (proc->files) { + preempt_enable_no_resched(); __fd_install(proc->files, fd, file); + preempt_disable(); + } } /* @@ -427,6 +435,7 @@ static inline void binder_lock(const char *tag) { trace_binder_lock(tag); mutex_lock(&binder_main_lock); + preempt_disable(); trace_binder_locked(tag); } @@ -434,8 +443,65 @@ static inline void binder_unlock(const char *tag) { trace_binder_unlock(tag); mutex_unlock(&binder_main_lock); + preempt_enable(); +} + +static inline void *kzalloc_nopreempt(size_t size) +{ + void *ptr; + + ptr = kzalloc(size, GFP_NOWAIT); + if (ptr) + return ptr; + + preempt_enable_no_resched(); + ptr = kzalloc(size, GFP_KERNEL); + preempt_disable(); + + return ptr; +} + +static inline long copy_to_user_nopreempt(void __user *to, + const void *from, long n) +{ + long ret; + + preempt_enable_no_resched(); + ret = copy_to_user(to, from, n); + preempt_disable(); + return ret; +} + +static inline long copy_from_user_nopreempt(void *to, +const void __user *from, +long n) +{ + long ret; + + preempt_enable_no_resched(); + ret = copy_from_user(to, from, n); + preempt_disable(); + return ret; } +#define get_user_nopreempt(x, ptr) \ +({ \ + int __ret; \ + preempt_enable_no_resched(); \ + __ret = get_user(x, ptr); \ + preempt_disable(); \ + __ret; \ +}) + +#define put_user_nopreempt(x, ptr) \ +({ \ + int __ret; \ + preempt_enable_no_resched(); \ + __ret = put_user(x, ptr); \ + preempt_disable(); \ + __ret; \ +}) + static void binder_set_nice(long nice) { long min_nice; @@ -568,6 +634,8 @@ static int binder_update_page_range(struct binder_proc *proc, int allocate, else mm = get_task_mm(proc->tsk); + preempt_enable_no_resched(); + if (mm) { down_write(&mm->mmap_sem); vma = proc->vma; @@ -622,6 +690,9 @@ static int binder_update_page_range(struct binder_proc *proc, int allocate, up_write(&mm->mmap_sem); mmput(mm); } + + preempt_disable(); + return 0; free_range: @@ -644,6 +715,9 @@ err_no_vma: up_write(&mm->mmap_sem); mmput(mm); } + + preempt_disable(); + return -ENOMEM; } @@ -903,7 +977,7 @@ static struct binder_node *binder_new_node(struct binder_proc *proc, return NULL; } - node = kzalloc(sizeof(*node), GFP_KERNEL); + node = kzalloc_nopreempt(sizeof(*node)); if (node == NULL) return NULL; binder_stats_created(BINDER_STAT_NODE); @@ -1040,7 +1114,7 @@ static struct binder_ref *binder_get_ref_for_node(struct binder_proc *proc, else return ref; } - new_ref = kzalloc(sizeof(*ref), GFP_KERNEL); + new_ref = kzalloc_nopreempt(sizeof(*ref)); if (new_ref == NULL) return NULL; binder_stats_created(BINDER_STAT_REF); @@ -1438,14 +1512,14 @@ static void binder_transaction(struct binder_proc *proc, e->to_proc = target_proc->pid; /* TODO: reuse incoming transaction for reply */ - t = kzalloc(sizeof(*t), GFP_KERNEL); + t = kzalloc_nopreempt(sizeof(*t)); if (t == NULL) { return_error = BR_FAILED_REPLY; goto err_alloc_t_failed; } binder_stats_created(BINDER_STAT_TRANSACTION); - tcomplete = kzalloc(sizeof(*tcomplete), GFP_KERNEL); + tcomplete = kzalloc_nopreempt(sizeof(*tcomplete)); if (tcomplete == NULL) { return_error = BR_FAILED_REPLY; goto err_alloc_tcomplete_failed; @@ -1502,15 +1576,16 @@ static void binder_transaction(struct binder_proc *proc, offp = (binder_size_t *)(t->buffer->data + ALIGN(tr->data_size, sizeof(void *))); - if (copy_from_user(t->buffer->data, (const void __user *)(uintptr_t) - tr->data.ptr.buffer, tr-&g
Re: [PATCH] android: binder: Disable preemption while holding the global binder lock
This was introduced in the 2015 Nexus devices and should have been submitted to the kernel then since we keep forward porting it to each new device. On Thu, Sep 8, 2016 at 9:12 AM, Todd Kjos wrote: > In Android systems, the display pipeline relies on low > latency binder transactions and is therefore sensitive to > delays caused by contention for the global binder lock. > Jank is siginificantly reduced by disabling preemption > while the global binder lock is held. > > Originally-from: Riley Andrews > Signed-off-by: Todd Kjos > --- > drivers/android/binder.c | 194 > +++ > 1 file changed, 146 insertions(+), 48 deletions(-) > > diff --git a/drivers/android/binder.c b/drivers/android/binder.c > index 16288e7..c36e420 100644 > --- a/drivers/android/binder.c > +++ b/drivers/android/binder.c > @@ -379,6 +379,7 @@ static int task_get_unused_fd_flags(struct > binder_proc *proc, int flags) > struct files_struct *files = proc->files; > unsigned long rlim_cur; > unsigned long irqs; > + int ret; > > if (files == NULL) > return -ESRCH; > @@ -389,7 +390,11 @@ static int task_get_unused_fd_flags(struct > binder_proc *proc, int flags) > rlim_cur = task_rlimit(proc->tsk, RLIMIT_NOFILE); > unlock_task_sighand(proc->tsk, &irqs); > > - return __alloc_fd(files, 0, rlim_cur, flags); > + preempt_enable_no_resched(); > + ret = __alloc_fd(files, 0, rlim_cur, flags); > + preempt_disable(); > + > + return ret; > } > > /* > @@ -398,8 +403,11 @@ static int task_get_unused_fd_flags(struct > binder_proc *proc, int flags) > static void task_fd_install( > struct binder_proc *proc, unsigned int fd, struct file *file) > { > - if (proc->files) > + if (proc->files) { > + preempt_enable_no_resched(); > __fd_install(proc->files, fd, file); > + preempt_disable(); > + } > } > > /* > @@ -427,6 +435,7 @@ static inline void binder_lock(const char *tag) > { > trace_binder_lock(tag); > mutex_lock(&binder_main_lock); > + preempt_disable(); > trace_binder_locked(tag); > } > > @@ -434,8 +443,65 @@ static inline void binder_unlock(const char *tag) > { > trace_binder_unlock(tag); > mutex_unlock(&binder_main_lock); > + preempt_enable(); > +} > + > +static inline void *kzalloc_nopreempt(size_t size) > +{ > + void *ptr; > + > + ptr = kzalloc(size, GFP_NOWAIT); > + if (ptr) > + return ptr; > + > + preempt_enable_no_resched(); > + ptr = kzalloc(size, GFP_KERNEL); > + preempt_disable(); > + > + return ptr; > +} > + > +static inline long copy_to_user_nopreempt(void __user *to, > + const void *from, long n) > +{ > + long ret; > + > + preempt_enable_no_resched(); > + ret = copy_to_user(to, from, n); > + preempt_disable(); > + return ret; > +} > + > +static inline long copy_from_user_nopreempt(void *to, > +const void __user *from, > +long n) > +{ > + long ret; > + > + preempt_enable_no_resched(); > + ret = copy_from_user(to, from, n); > + preempt_disable(); > + return ret; > } > > +#define get_user_nopreempt(x, ptr) \ > +({ \ > + int __ret; \ > + preempt_enable_no_resched(); \ > + __ret = get_user(x, ptr); \ > + preempt_disable(); \ > + __ret; \ > +}) > + > +#define put_user_nopreempt(x, ptr) \ > +({ \ > + int __ret; \ > + preempt_enable_no_resched(); \ > + __ret = put_user(x, ptr); \ > + preempt_disable(); \ > + __ret; \ > +}) > + > static void binder_set_nice(long nice) > { > long min_nice; > @@ -568,6 +634,8 @@ static int binder_update_page_range(struct > binder_proc *proc, int allocate, > else > mm = get_task_mm(proc->tsk); > > + preempt_enable_no_resched(); > + > if (mm) { > down_write(&mm->mmap_sem); > vma = proc->vma; > @@ -622,6 +690,9 @@ static int binder_update_page_range(struct > binder_proc *proc, int allocate, > up_write(&mm->mmap_sem); > mmput(mm); > } > + > + preempt_disable(); > + > return 0; > > free_range: > @@ -644,6 +715,9 @@ err_no_vma: > up_write(&mm->mmap_sem); > mmput(mm); > } > + > + preempt_disable(); > + > return -ENOMEM; > } > > @@ -903,7 +977,7 @@ static struct binder_node *binder_new_node(struct > binder_proc *proc, > return NULL; > } > > - node = kzalloc(sizeof(*node), GFP_KERNEL); > + node = kzalloc_nopreempt(sizeof(*node)); > if (node == NULL) > return NULL; > binder_stats_created(BINDER_STAT_NODE); > @@ -1040,7 +1114,7 @@ static struct binder_ref > *binder_get_ref_for_node(struct binder_proc *proc, > else &g
[PATCH] sg: recheck MMAP_IO request length with lock held
Commit 1bc0eb044615 ("scsi: sg: protect accesses to 'reserved' page array") adds needed concurrency protection for the "reserve" buffer. Some checks that are initially made outside the lock are replicated once the lock is taken to ensure the checks and resulting decisions are made using consistent state. The check that a request with flag SG_FLAG_MMAP_IO set fits in the reserve buffer also needs to be performed again under the lock to ensure the reserve buffer length compared against matches the value in effect when the request is linked to the reserve buffer. An -ENOMEM should be returned in this case, instead of switching over to an indirect buffer as for non-MMAP_IO requests. Signed-off-by: Todd Poynor --- drivers/scsi/sg.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index d7ff71e0c85c..3a44b4bc872b 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -1735,9 +1735,12 @@ sg_start_req(Sg_request *srp, unsigned char *cmd) !sfp->res_in_use) { sfp->res_in_use = 1; sg_link_reserve(sfp, srp, dxfer_len); - } else if ((hp->flags & SG_FLAG_MMAP_IO) && sfp->res_in_use) { + } else if (hp->flags & SG_FLAG_MMAP_IO) { + res = -EBUSY; /* sfp->res_in_use == 1 */ + if (dxfer_len > rsv_schp->bufflen) + res = -ENOMEM; mutex_unlock(&sfp->f_mutex); - return -EBUSY; + return res; } else { res = sg_build_indirect(req_schp, sfp, dxfer_len); if (res) { -- 2.14.1.480.gb18f417b89-goog
[PATCH] sg: protect against races between mmap() and SG_SET_RESERVED_SIZE
Take f_mutex around mmap() processing to protect against races with the SG_SET_RESERVED_SIZE ioctl. Ensure the reserve buffer length remains consistent during the mapping operation, and set the "mmap called" flag to prevent further changes to the reserved buffer size as an atomic operation with the mapping. Signed-off-by: Todd Poynor --- drivers/scsi/sg.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 3a44b4bc872b..a20718e9f1f4 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -1233,6 +1233,7 @@ sg_mmap(struct file *filp, struct vm_area_struct *vma) unsigned long req_sz, len, sa; Sg_scatter_hold *rsv_schp; int k, length; +int ret = 0; if ((!filp) || (!vma) || (!(sfp = (Sg_fd *) filp->private_data))) return -ENXIO; @@ -1243,8 +1244,11 @@ sg_mmap(struct file *filp, struct vm_area_struct *vma) if (vma->vm_pgoff) return -EINVAL; /* want no offset */ rsv_schp = &sfp->reserve; - if (req_sz > rsv_schp->bufflen) - return -ENOMEM; /* cannot map more than reserved buffer */ + mutex_lock(&sfp->f_mutex); + if (req_sz > rsv_schp->bufflen) { + ret = -ENOMEM; /* cannot map more than reserved buffer */ + goto out; + } sa = vma->vm_start; length = 1 << (PAGE_SHIFT + rsv_schp->page_order); @@ -1258,7 +1262,9 @@ sg_mmap(struct file *filp, struct vm_area_struct *vma) vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; vma->vm_private_data = sfp; vma->vm_ops = &sg_mmap_vm_ops; - return 0; +out: + mutex_unlock(&sfp->f_mutex); + return ret; } static void -- 2.14.1.480.gb18f417b89-goog
[PATCH] binder: fix proc->files use-after-free
proc->files cleanup is initiated by binder_vma_close. Therefore a reference on the binder_proc is not enough to prevent the files_struct from being released while the binder_proc still has a reference. This can lead to an attempt to dereference the stale pointer obtained from proc->files prior to proc->files cleanup. This has been seen once in task_get_unused_fd_flags() when __alloc_fd() is called with a stale "files". The fix is to always use get_files_struct() to obtain struct_files so that the refcount on the files_struct is used to prevent a premature free. proc->files is removed since we get it every time. Signed-off-by: Todd Kjos --- drivers/android/binder.c | 63 +++- 1 file changed, 30 insertions(+), 33 deletions(-) diff --git a/drivers/android/binder.c b/drivers/android/binder.c index fddf76ef5bd6..794e58c14b15 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -458,9 +458,8 @@ struct binder_ref { }; enum binder_deferred_state { - BINDER_DEFERRED_PUT_FILES= 0x01, - BINDER_DEFERRED_FLUSH= 0x02, - BINDER_DEFERRED_RELEASE = 0x04, + BINDER_DEFERRED_FLUSH= 0x01, + BINDER_DEFERRED_RELEASE = 0x02, }; /** @@ -481,8 +480,6 @@ enum binder_deferred_state { *(invariant after initialized) * @tsk task_struct for group_leader of process *(invariant after initialized) - * @files files_struct for process - *(invariant after initialized) * @deferred_work_node: element for binder_deferred_list *(protected by binder_deferred_lock) * @deferred_work:bitmap of deferred work to perform @@ -529,7 +526,6 @@ struct binder_proc { struct list_head waiting_threads; int pid; struct task_struct *tsk; - struct files_struct *files; struct hlist_node deferred_work_node; int deferred_work; bool is_dead; @@ -875,22 +871,34 @@ static void binder_free_thread(struct binder_thread *thread); static void binder_free_proc(struct binder_proc *proc); static void binder_inc_node_tmpref_ilocked(struct binder_node *node); +struct files_struct *binder_get_files_struct(struct binder_proc *proc) +{ + return get_files_struct(proc->tsk); +} + static int task_get_unused_fd_flags(struct binder_proc *proc, int flags) { - struct files_struct *files = proc->files; + struct files_struct *files; unsigned long rlim_cur; unsigned long irqs; + int ret; + files = binder_get_files_struct(proc); if (files == NULL) return -ESRCH; - if (!lock_task_sighand(proc->tsk, &irqs)) - return -EMFILE; + if (!lock_task_sighand(proc->tsk, &irqs)) { + ret = -EMFILE; + goto err; + } rlim_cur = task_rlimit(proc->tsk, RLIMIT_NOFILE); unlock_task_sighand(proc->tsk, &irqs); - return __alloc_fd(files, 0, rlim_cur, flags); + ret = __alloc_fd(files, 0, rlim_cur, flags); +err: + put_files_struct(files); + return ret; } /* @@ -899,8 +907,12 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags) static void task_fd_install( struct binder_proc *proc, unsigned int fd, struct file *file) { - if (proc->files) - __fd_install(proc->files, fd, file); + struct files_struct *files = binder_get_files_struct(proc); + + if (files) { + __fd_install(files, fd, file); + put_files_struct(files); + } } /* @@ -908,18 +920,20 @@ static void task_fd_install( */ static long task_close_fd(struct binder_proc *proc, unsigned int fd) { + struct files_struct *files = binder_get_files_struct(proc); int retval; - if (proc->files == NULL) + if (files == NULL) return -ESRCH; - retval = __close_fd(proc->files, fd); + retval = __close_fd(files, fd); /* can't restart close syscall because file table entry was cleared */ if (unlikely(retval == -ERESTARTSYS || retval == -ERESTARTNOINTR || retval == -ERESTARTNOHAND || retval == -ERESTART_RESTARTBLOCK)) retval = -EINTR; + put_files_struct(files); return retval; } @@ -4561,7 +4575,6 @@ static void binder_vma_close(struct vm_area_struct *vma) (vma->vm_end - vma->vm_start) / SZ_1K, vma->vm_flags, (unsigned long)pgprot_val(vma->vm_page_prot)); binder_alloc_vma_close(&proc->alloc); - binder_defer_work(proc, BINDER_DEFERRED_PUT_FILES); } static int binder_vm_fault(struct vm_fault *vmf) @@ -4603,10 +4616,8 @@ static int binder_mmap(struct fil
[PATCH 2/4] staging: gasket: core: device register debug log cleanups
From: Todd Poynor At device/driver registration time, convert a not-very-informative info message to a more informative debug message, drop some not overly helpful debug messages. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index fa477d0c3c74c..91db71c238804 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -1735,7 +1735,8 @@ int gasket_register_device(const struct gasket_driver_desc *driver_desc) int desc_idx = -1; struct gasket_internal_desc *internal; - pr_info("Initializing Gasket framework device\n"); + pr_debug("Loading %s driver version %s\n", driver_desc->name, +driver_desc->driver_version); /* Check for duplicates and find a free slot. */ mutex_lock(&g_mutex); @@ -1764,8 +1765,6 @@ int gasket_register_device(const struct gasket_driver_desc *driver_desc) return -EBUSY; } - /* Internal structure setup. */ - pr_debug("Performing initial internal structure setup.\n"); internal = &g_descs[desc_idx]; mutex_init(&internal->mutex); memset(internal->devs, 0, sizeof(struct gasket_dev *) * GASKET_DEV_MAX); @@ -1788,7 +1787,6 @@ int gasket_register_device(const struct gasket_driver_desc *driver_desc) * Not using pci_register_driver() (without underscores), as it * depends on KBUILD_MODNAME, and this is a shared file. */ - pr_debug("Registering PCI driver.\n"); ret = __pci_register_driver(&internal->pci, driver_desc->module, driver_desc->name); if (ret) { @@ -1796,7 +1794,6 @@ int gasket_register_device(const struct gasket_driver_desc *driver_desc) goto fail1; } - pr_debug("Registering char driver.\n"); ret = register_chrdev_region(MKDEV(driver_desc->major, driver_desc->minor), GASKET_DEV_MAX, driver_desc->name); -- 2.18.0.597.ga71716f1ad-goog
[PATCH 4/4] Revert "staging: gasket: core: hold reference to pci_dev while used"
From: Todd Poynor There's no need to take an additional reference on the pci_dev structure for the pointer copy saved in gasket data structures. This reverts commit: 8dd8a48b9a7d ("staging: gasket: core: hold reference to pci_dev while used") Reported-by: Dmitry Torokhov Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index 93a4d9f08eaab..2d209e36cf372 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -255,7 +255,6 @@ static void gasket_free_dev(struct gasket_dev *gasket_dev) internal_desc->devs[gasket_dev->dev_idx] = NULL; mutex_unlock(&internal_desc->mutex); put_device(gasket_dev->dev); - pci_dev_put(gasket_dev->pci_dev); kfree(gasket_dev); } @@ -1477,7 +1476,7 @@ static int gasket_pci_probe(struct pci_dev *pci_dev, ret = gasket_alloc_dev(internal_desc, parent, &gasket_dev, kobj_name); if (ret) return ret; - gasket_dev->pci_dev = pci_dev_get(pci_dev); + gasket_dev->pci_dev = pci_dev; if (IS_ERR_OR_NULL(gasket_dev->dev_info.device)) { pr_err("Cannot create %s device %s [ret = %ld]\n", driver_desc->name, gasket_dev->dev_info.name, -- 2.18.0.597.ga71716f1ad-goog
[PATCH 3/4] staging: gasket: core: add subsystem and device info to logs
From: Todd Poynor Identify gasket as the subsystem printing various messages. Add the driver name to appropriate messages to indicate which driver has a problem. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 13 + 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index 91db71c238804..93a4d9f08eaab 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -6,6 +6,9 @@ * * Copyright (C) 2018 Google, Inc. */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + #include "gasket_core.h" #include "gasket_interrupt.h" @@ -208,7 +211,7 @@ static int gasket_alloc_dev(struct gasket_internal_desc *internal_desc, gasket_dev = *pdev = kzalloc(sizeof(*gasket_dev), GFP_KERNEL); if (!gasket_dev) { - pr_err("no memory for device\n"); + pr_err("no memory for device %s\n", kobj_name); return -ENOMEM; } internal_desc->devs[dev_idx] = gasket_dev; @@ -1760,7 +1763,7 @@ int gasket_register_device(const struct gasket_driver_desc *driver_desc) mutex_unlock(&g_mutex); if (desc_idx == -1) { - pr_err("Too many Gasket drivers loaded: %d\n", + pr_err("too many drivers loaded, max %d\n", GASKET_FRAMEWORK_DESC_MAX); return -EBUSY; } @@ -1790,7 +1793,8 @@ int gasket_register_device(const struct gasket_driver_desc *driver_desc) ret = __pci_register_driver(&internal->pci, driver_desc->module, driver_desc->name); if (ret) { - pr_err("cannot register pci driver [ret=%d]\n", ret); + pr_err("cannot register %s pci driver [ret=%d]\n", + driver_desc->name, ret); goto fail1; } @@ -1798,7 +1802,8 @@ int gasket_register_device(const struct gasket_driver_desc *driver_desc) driver_desc->minor), GASKET_DEV_MAX, driver_desc->name); if (ret) { - pr_err("cannot register char driver [ret=%d]\n", ret); + pr_err("cannot register %s char driver [ret=%d]\n", + driver_desc->name, ret); goto fail2; } -- 2.18.0.597.ga71716f1ad-goog
[PATCH 0/4 v2] staging: gasket: cleanups du jour
From: Todd Poynor More cleanups for the gasket+apex drivers. Patched changed in v2 from v1: staging: gasket: core: print driver version code at registration time staging: gasket: core: move driver loaded log after error cases Above 2 patches replaced by new patch: staging: gasket: core: remove registration logs staging: gasket: core: device register debug log cleanups Drop explicit "gasket:" prefix, use pr_fmt instead (in following patch) staging: gasket: core: add subsystem and device info to error logs Renamed: staging: gasket: core: add subsystem and device info to logs Use pr_fmt for modname prefix on logs Revert "staging: gasket: core: hold reference to pci_dev while used" Fixup SHA format in commit text. Patches dropped from v2, already merged from v1: staging: gasket: apex: enable power save mode by default staging: gasket: remove "reset type" param from framework staging: gasket: apex: drop reset type param Todd Poynor (4): staging: gasket: core: remove registration logs staging: gasket: core: device register debug log cleanups staging: gasket: core: add subsystem and device info to logs Revert "staging: gasket: core: hold reference to pci_dev while used" drivers/staging/gasket/gasket_core.c | 27 --- 1 file changed, 12 insertions(+), 15 deletions(-) -- 2.18.0.597.ga71716f1ad-goog
[PATCH 1/4] staging: gasket: core: remove registration logs
From: Todd Poynor Remove logs for loading gasket drivers. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 4 1 file changed, 4 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index 2b75f100da4d3..fa477d0c3c74c 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -1758,9 +1758,6 @@ int gasket_register_device(const struct gasket_driver_desc *driver_desc) } mutex_unlock(&g_mutex); - pr_info("Loaded %s driver, framework version %s\n", - driver_desc->name, GASKET_FRAMEWORK_VERSION); - if (desc_idx == -1) { pr_err("Too many Gasket drivers loaded: %d\n", GASKET_FRAMEWORK_DESC_MAX); @@ -1808,7 +1805,6 @@ int gasket_register_device(const struct gasket_driver_desc *driver_desc) goto fail2; } - pr_info("Driver registered successfully.\n"); return 0; fail2: -- 2.18.0.597.ga71716f1ad-goog
[PATCH 05/15] staging: gasket: core: remove device enable and disable callbacks
From: Todd Poynor Device enable/disable operations are moving from being initiated through the gasket framework to being initiated by the gasket device driver. The driver can perform any processing needed for these operations before or after the calls into the framework. Neither of these callbacks are implemented for the only gasket driver upstream today, apex. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 9 - drivers/staging/gasket/gasket_core.h | 27 ++- 2 files changed, 2 insertions(+), 34 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index 2741256eacfe8..b070efaf0d41c 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -648,8 +648,6 @@ static void gasket_disable_dev(struct gasket_dev *gasket_dev) gasket_page_table_cleanup(gasket_dev->page_table[i]); } } - - check_and_invoke_callback(gasket_dev, driver_desc->disable_dev_cb); } /* @@ -1408,13 +1406,6 @@ static int gasket_enable_dev(struct gasket_internal_desc *internal_desc, } gasket_dev->hardware_revision = ret; - ret = check_and_invoke_callback(gasket_dev, driver_desc->enable_dev_cb); - if (ret) { - dev_err(gasket_dev->dev, "Error in enable device cb: %d\n", - ret); - return ret; - } - /* device_status_cb returns a device status, not an error code. */ gasket_dev->status = gasket_get_hw_status(gasket_dev); if (gasket_dev->status == GASKET_STATUS_DEAD) diff --git a/drivers/staging/gasket/gasket_core.h b/drivers/staging/gasket/gasket_core.h index 9f9bc66a0daa0..5d40bc7f52e91 100644 --- a/drivers/staging/gasket/gasket_core.h +++ b/drivers/staging/gasket/gasket_core.h @@ -473,34 +473,11 @@ struct gasket_driver_desc { */ int (*device_close_cb)(struct gasket_dev *dev); - /* -* enable_dev_cb: Callback immediately before enabling the device. -* @dev: Pointer to the gasket_dev struct for this driver instance. -* -* This callback is invoked after the device has been added and all BAR -* spaces mapped, immediately before registering and enabling the -* [character] device via cdev_add. If this call fails (returns -* nonzero), disable_dev_cb will be called. -* -* Note that cdev are initialized but not active -* (cdev_add has not yet been called) when this callback is invoked. -*/ - int (*enable_dev_cb)(struct gasket_dev *dev); - - /* -* disable_dev_cb: Callback immediately after disabling the device. -* @dev: Pointer to the gasket_dev struct for this driver instance. -* -* Called during device shutdown, immediately after disabling device -* operations via cdev_del. -*/ - int (*disable_dev_cb)(struct gasket_dev *dev); - /* * sysfs_setup_cb: Callback to set up driver-specific sysfs nodes. * @dev: Pointer to the gasket_dev struct for this device. * -* Called just before enable_dev_cb. +* Called during the add gasket device call. * */ int (*sysfs_setup_cb)(struct gasket_dev *dev); @@ -509,7 +486,7 @@ struct gasket_driver_desc { * sysfs_cleanup_cb: Callback to clean up driver-specific sysfs nodes. * @dev: Pointer to the gasket_dev struct for this device. * -* Called just before disable_dev_cb. +* Called during device disable processing. * */ int (*sysfs_cleanup_cb)(struct gasket_dev *dev); -- 2.18.0.597.ga71716f1ad-goog
[PATCH 02/15] staging: gasket: core: move core PCI calls to device drivers
From: Todd Poynor Remove gasket wrapping of PCI probe, enable, disable, and remove functions. Replace with calls to add and remove PCI gasket devices, to be called by the gasket device drivers. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 82 drivers/staging/gasket/gasket_core.h | 6 ++ 2 files changed, 28 insertions(+), 60 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index 2d209e36cf372..01cafe1ff6605 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -51,9 +51,6 @@ struct gasket_internal_desc { /* Kernel-internal device class. */ struct class *class; - /* PCI subsystem metadata associated with this driver. */ - struct pci_driver pci; - /* Instantiated / present devices of this type. */ struct gasket_dev *devs[GASKET_DEV_MAX]; }; @@ -368,10 +365,10 @@ static void gasket_unmap_pci_bar(struct gasket_dev *dev, int bar_num) } /* - * Setup PCI & set up memory mapping for the specified device. + * Setup PCI memory mapping for the specified device. * - * Enables the PCI device, reads the BAR registers and sets up pointers to the - * device's memory mapped IO space. + * Reads the BAR registers and sets up pointers to the device's memory mapped + * IO space. * * Returns 0 on success and a negative value otherwise. */ @@ -380,14 +377,6 @@ static int gasket_setup_pci(struct pci_dev *pci_dev, { int i, mapped_bars, ret; - ret = pci_enable_device(pci_dev); - if (ret) { - dev_err(gasket_dev->dev, "cannot enable PCI device\n"); - return ret; - } - - pci_set_master(pci_dev); - for (i = 0; i < GASKET_NUM_BARS; i++) { ret = gasket_map_pci_bar(gasket_dev, i); if (ret) { @@ -402,19 +391,16 @@ static int gasket_setup_pci(struct pci_dev *pci_dev, for (i = 0; i < mapped_bars; i++) gasket_unmap_pci_bar(gasket_dev, i); - pci_disable_device(pci_dev); return -ENOMEM; } -/* Unmaps memory and cleans up PCI for the specified device. */ +/* Unmaps memory for the specified device. */ static void gasket_cleanup_pci(struct gasket_dev *gasket_dev) { int i; for (i = 0; i < GASKET_NUM_BARS; i++) gasket_unmap_pci_bar(gasket_dev, i); - - pci_disable_device(gasket_dev->pci_dev); } /* Determine the health of the Gasket device. */ @@ -1443,15 +1429,14 @@ static int gasket_enable_dev(struct gasket_internal_desc *internal_desc, } /* - * PCI subsystem probe function. - * - * Called when a Gasket device is found. Allocates device metadata, maps device - * memory, and calls gasket_enable_dev to prepare the device for active use. + * Add PCI gasket device. * - * Returns 0 if successful and a negative value otherwise. + * Called by Gasket device probe function. + * Allocates device metadata, maps device memory, and calls gasket_enable_dev + * to prepare the device for active use. */ -static int gasket_pci_probe(struct pci_dev *pci_dev, - const struct pci_device_id *id) +int gasket_pci_add_device(struct pci_dev *pci_dev, + struct gasket_dev **gasket_devp) { int ret; const char *kobj_name = dev_name(&pci_dev->dev); @@ -1460,13 +1445,14 @@ static int gasket_pci_probe(struct pci_dev *pci_dev, const struct gasket_driver_desc *driver_desc; struct device *parent; - pr_info("Add Gasket device %s\n", kobj_name); + pr_debug("add PCI device %s\n", kobj_name); mutex_lock(&g_mutex); internal_desc = lookup_internal_desc(pci_dev); mutex_unlock(&g_mutex); if (!internal_desc) { - pr_err("PCI probe called for unknown driver type\n"); + dev_err(&pci_dev->dev, + "PCI add device called for unknown driver type\n"); return -ENODEV; } @@ -1530,6 +1516,7 @@ static int gasket_pci_probe(struct pci_dev *pci_dev, goto fail5; } + *gasket_devp = gasket_dev; return 0; fail5: @@ -1545,14 +1532,10 @@ static int gasket_pci_probe(struct pci_dev *pci_dev, gasket_free_dev(gasket_dev); return ret; } +EXPORT_SYMBOL(gasket_pci_add_device); -/* - * PCI subsystem remove function. - * - * Called to remove a Gasket device. Finds the device in the device list and - * cleans up metadata. - */ -static void gasket_pci_remove(struct pci_dev *pci_dev) +/* Remove a PCI gasket device. */ +void gasket_pci_remove_device(struct pci_dev *pci_dev) { int i; struct gasket_internal_desc *internal_desc; @@ -1583,8 +1566,8 @@ static void gasket_pci_remove(struct pci_dev *pci_dev) if (!gasket_dev
[PATCH 03/15] staging: gasket: apex: move PCI core calls to apex driver
From: Todd Poynor Apex driver moves PCI core calls like probe, enable, and remove from gasket to apex. Call new functions in gasket to register apex as a PCI device to the gasket framework. Signed-off-by: Todd Poynor --- drivers/staging/gasket/apex_driver.c | 49 +++- 1 file changed, 48 insertions(+), 1 deletion(-) diff --git a/drivers/staging/gasket/apex_driver.c b/drivers/staging/gasket/apex_driver.c index 42cef68eb4c19..b47661442009d 100644 --- a/drivers/staging/gasket/apex_driver.c +++ b/drivers/staging/gasket/apex_driver.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -621,6 +622,36 @@ static void apex_pci_fixup_class(struct pci_dev *pdev) DECLARE_PCI_FIXUP_CLASS_HEADER(APEX_PCI_VENDOR_ID, APEX_PCI_DEVICE_ID, PCI_CLASS_NOT_DEFINED, 8, apex_pci_fixup_class); +static int apex_pci_probe(struct pci_dev *pci_dev, + const struct pci_device_id *id) +{ + int ret; + struct gasket_dev *gasket_dev; + + ret = pci_enable_device(pci_dev); + if (ret) { + dev_err(&pci_dev->dev, "error enabling PCI device\n"); + return ret; + } + + pci_set_master(pci_dev); + + ret = gasket_pci_add_device(pci_dev, &gasket_dev); + if (ret) { + dev_err(&pci_dev->dev, "error adding gasket device\n"); + pci_disable_device(pci_dev); + return ret; + } + + return 0; +} + +static void apex_pci_remove(struct pci_dev *pci_dev) +{ + gasket_pci_remove_device(pci_dev); + pci_disable_device(pci_dev); +} + static struct gasket_driver_desc apex_desc = { .name = "apex", .driver_version = APEX_DRIVER_VERSION, @@ -672,13 +703,29 @@ static struct gasket_driver_desc apex_desc = { .device_reset_cb = apex_reset, }; +static struct pci_driver apex_pci_driver = { + .name = "apex", + .probe = apex_pci_probe, + .remove = apex_pci_remove, + .id_table = apex_pci_ids, +}; + static int __init apex_init(void) { - return gasket_register_device(&apex_desc); + int ret; + + ret = gasket_register_device(&apex_desc); + if (ret) + return ret; + ret = pci_register_driver(&apex_pci_driver); + if (ret) + gasket_unregister_device(&apex_desc); + return ret; } static void apex_exit(void) { + pci_unregister_driver(&apex_pci_driver); gasket_unregister_device(&apex_desc); } MODULE_DESCRIPTION("Google Apex driver"); -- 2.18.0.597.ga71716f1ad-goog
[PATCH 04/15] staging: gasket: core: convert remaining info logs to debug
From: Todd Poynor Remaining info-level logs in gasket core converted to debug-level; the information is not needed during normal system operation. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index 01cafe1ff6605..2741256eacfe8 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -1819,7 +1819,7 @@ void gasket_unregister_device(const struct gasket_driver_desc *driver_desc) g_descs[desc_idx].driver_desc = NULL; mutex_unlock(&g_mutex); - pr_info("removed %s driver\n", driver_desc->name); + pr_debug("removed %s driver\n", driver_desc->name); } EXPORT_SYMBOL(gasket_unregister_device); @@ -1827,7 +1827,7 @@ static int __init gasket_init(void) { int i; - pr_info("Performing one-time init of the Gasket framework.\n"); + pr_debug("%s\n", __func__); /* Check for duplicates and find a free slot. */ mutex_lock(&g_mutex); for (i = 0; i < GASKET_FRAMEWORK_DESC_MAX; i++) { @@ -1843,8 +1843,7 @@ static int __init gasket_init(void) static void __exit gasket_exit(void) { - /* No deinit/dealloc needed at present. */ - pr_info("Removing Gasket framework module.\n"); + pr_debug("%s\n", __func__); } MODULE_DESCRIPTION("Google Gasket driver framework"); MODULE_VERSION(GASKET_FRAMEWORK_VERSION); -- 2.18.0.597.ga71716f1ad-goog
[PATCH 09/15] staging: gasket: core: delete device add and remove callbacks
From: Todd Poynor Gasket device drivers are now in charge of orchestrating the device add and removal sequences, so the callbacks from the framework to the device drivers for these events are no longer needed. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 10 -- drivers/staging/gasket/gasket_core.h | 29 2 files changed, 39 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index fad4883e6332c..0d76e18fcde5b 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -1468,12 +1468,6 @@ int gasket_pci_add_device(struct pci_dev *pci_dev, if (ret) goto fail2; - ret = check_and_invoke_callback(gasket_dev, driver_desc->add_dev_cb); - if (ret) { - dev_err(gasket_dev->dev, "Error in add device cb: %d\n", ret); - goto fail2; - } - ret = gasket_sysfs_create_mapping(gasket_dev->dev_info.device, gasket_dev); if (ret) @@ -1512,7 +1506,6 @@ int gasket_pci_add_device(struct pci_dev *pci_dev, gasket_sysfs_remove_mapping(gasket_dev->dev_info.device); fail2: gasket_cleanup_pci(gasket_dev); - check_and_invoke_callback(gasket_dev, driver_desc->remove_dev_cb); device_destroy(internal_desc->class, gasket_dev->dev_info.devt); fail1: gasket_free_dev(gasket_dev); @@ -1559,9 +1552,6 @@ void gasket_pci_remove_device(struct pci_dev *pci_dev) check_and_invoke_callback(gasket_dev, driver_desc->sysfs_cleanup_cb); gasket_sysfs_remove_mapping(gasket_dev->dev_info.device); - - check_and_invoke_callback(gasket_dev, driver_desc->remove_dev_cb); - device_destroy(internal_desc->class, gasket_dev->dev_info.devt); gasket_free_dev(gasket_dev); } diff --git a/drivers/staging/gasket/gasket_core.h b/drivers/staging/gasket/gasket_core.h index 9c143ebeba452..0ef0a2640f0fe 100644 --- a/drivers/staging/gasket/gasket_core.h +++ b/drivers/staging/gasket/gasket_core.h @@ -302,12 +302,6 @@ struct gasket_dev { /* Hardware revision value for this device. */ int hardware_revision; - /* -* Device-specific data; allocated in gasket_driver_desc.add_dev_cb() -* and freed in gasket_driver_desc.remove_dev_cb(). -*/ - void *cb_data; - /* Protects access to per-device data (i.e. this structure). */ struct mutex mutex; @@ -415,29 +409,6 @@ struct gasket_driver_desc { int interrupt_pack_width; /* Driver callback functions - all may be NULL */ - /* -* add_dev_cb: Callback when a device is found. -* @dev: The gasket_dev struct for this driver instance. -* -* This callback should initialize the device-specific cb_data. -* Called when a device is found by the driver, -* before any BAR ranges have been mapped. If this call fails (returns -* nonzero), remove_dev_cb will be called. -* -*/ - int (*add_dev_cb)(struct gasket_dev *dev); - - /* -* remove_dev_cb: Callback for when a device is removed from the system. -* @dev: The gasket_dev struct for this driver instance. -* -* This callback should free data allocated in add_dev_cb. -* Called immediately before a device is unregistered by the driver. -* All framework-managed resources will have been cleaned up by the time -* this callback is invoked (PCI BARs, character devices, ...). -*/ - int (*remove_dev_cb)(struct gasket_dev *dev); - /* * device_open_cb: Callback for when a device node is opened in write * mode. -- 2.18.0.597.ga71716f1ad-goog
[PATCH 00/15] staging: gasket: unwrap pci core and more
From: Todd Poynor Stop wrapping PCI core calls like probe, enable, remove, etc. in the gasket framework, move these calls to the device driver instead. Have gasket drivers call into framework on init, enable, disable, etc. sequences, rather than the other way around. Remove the gasket-to-device callbacks associated with these sequences. Plus a few other fixes and cleanups. Todd Poynor (15): staging: gasket: sysfs: clean up state if ENOMEM removing mapping staging: gasket: core: move core PCI calls to device drivers staging: gasket: apex: move PCI core calls to apex driver staging: gasket: core: convert remaining info logs to debug staging: gasket: core: remove device enable and disable callbacks staging: gasket: apex: remove device enable and disable callbacks staging: gasket: core: let device driver enable/disable gasket device staging: gasket: apex: enable/disable gasket device from apex staging: gasket: core: delete device add and remove callbacks staging: gasket: apex: fold device add/remove logic inline staging: gasket: core: remove sysfs setup and cleanup callbacks staging: gasket: apex: move sysfs setup code to probe function staging: gasket: core: protect against races during unregister staging: gasket: apex: place in low power reset until opened staging: gasket: core: remove incorrect extraneous comment drivers/staging/gasket/apex_driver.c | 145 +- drivers/staging/gasket/gasket_core.c | 140 ++--- drivers/staging/gasket/gasket_core.h | 82 +++ drivers/staging/gasket/gasket_sysfs.c | 13 ++- 4 files changed, 148 insertions(+), 232 deletions(-) -- 2.18.0.597.ga71716f1ad-goog
[PATCH 01/15] staging: gasket: sysfs: clean up state if ENOMEM removing mapping
From: Todd Poynor If kcalloc() returns NULL in put_mapping(), continue to clean up state, including dropping the reference on the struct device and free attribute memory. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_sysfs.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/staging/gasket/gasket_sysfs.c b/drivers/staging/gasket/gasket_sysfs.c index 56d62aea51118..fc45f0d13e87d 100644 --- a/drivers/staging/gasket/gasket_sysfs.c +++ b/drivers/staging/gasket/gasket_sysfs.c @@ -101,13 +101,12 @@ static void put_mapping(struct gasket_sysfs_mapping *mapping) files_to_remove = kcalloc(num_files_to_remove, sizeof(*files_to_remove), GFP_KERNEL); - if (!files_to_remove) { - mutex_unlock(&mapping->mutex); - return; - } - - for (i = 0; i < num_files_to_remove; i++) - files_to_remove[i] = mapping->attributes[i].attr; + if (files_to_remove) + for (i = 0; i < num_files_to_remove; i++) + files_to_remove[i] = + mapping->attributes[i].attr; + else + num_files_to_remove = 0; kfree(mapping->attributes); mapping->attributes = NULL; -- 2.18.0.597.ga71716f1ad-goog
[PATCH 15/15] staging: gasket: core: remove incorrect extraneous comment
From: Todd Poynor A copy-and-pasted comment from another code sequence is removed from gasket core init sequence. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index a6462b6d702f1..d12ab560411f7 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -1792,7 +1792,6 @@ static int __init gasket_init(void) int i; pr_debug("%s\n", __func__); - /* Check for duplicates and find a free slot. */ mutex_lock(&g_mutex); for (i = 0; i < GASKET_FRAMEWORK_DESC_MAX; i++) { g_descs[i].driver_desc = NULL; -- 2.18.0.597.ga71716f1ad-goog
[PATCH 07/15] staging: gasket: core: let device driver enable/disable gasket device
From: Todd Poynor Move gasket device enable/disable functions from internal calls to external calls from the gasket device drivers. The device driver will call these functions at appropriate times in its processing, placing the device driver in control of this sequence and reducing the need for callbacks from framework back to the device drivers during the enable/disable sequences. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 22 -- drivers/staging/gasket/gasket_core.h | 6 ++ 2 files changed, 14 insertions(+), 14 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index b070efaf0d41c..fad4883e6332c 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -628,7 +628,7 @@ static int gasket_add_cdev(struct gasket_cdev_info *dev_info, } /* Disable device operations. */ -static void gasket_disable_dev(struct gasket_dev *gasket_dev) +void gasket_disable_device(struct gasket_dev *gasket_dev) { const struct gasket_driver_desc *driver_desc = gasket_dev->internal_desc->driver_desc; @@ -649,6 +649,7 @@ static void gasket_disable_dev(struct gasket_dev *gasket_dev) } } } +EXPORT_SYMBOL(gasket_disable_device); /* * Registered descriptor lookup. @@ -1350,13 +1351,12 @@ static const struct file_operations gasket_file_ops = { }; /* Perform final init and marks the device as active. */ -static int gasket_enable_dev(struct gasket_internal_desc *internal_desc, -struct gasket_dev *gasket_dev) +int gasket_enable_device(struct gasket_dev *gasket_dev) { int tbl_idx; int ret; const struct gasket_driver_desc *driver_desc = - internal_desc->driver_desc; + gasket_dev->internal_desc->driver_desc; ret = gasket_interrupt_init(gasket_dev, driver_desc->name, driver_desc->interrupt_type, @@ -1418,13 +1418,15 @@ static int gasket_enable_dev(struct gasket_internal_desc *internal_desc, return 0; } +EXPORT_SYMBOL(gasket_enable_device); /* * Add PCI gasket device. * * Called by Gasket device probe function. - * Allocates device metadata, maps device memory, and calls gasket_enable_dev - * to prepare the device for active use. + * Allocates device metadata and maps device memory. The device driver must + * call gasket_enable_device after driver init is complete to place the device + * in active use. */ int gasket_pci_add_device(struct pci_dev *pci_dev, struct gasket_dev **gasket_devp) @@ -1500,13 +1502,6 @@ int gasket_pci_add_device(struct pci_dev *pci_dev, goto fail5; } - ret = gasket_enable_dev(internal_desc, gasket_dev); - if (ret) { - pr_err("cannot setup %s device\n", driver_desc->name); - gasket_disable_dev(gasket_dev); - goto fail5; - } - *gasket_devp = gasket_dev; return 0; @@ -1560,7 +1555,6 @@ void gasket_pci_remove_device(struct pci_dev *pci_dev) dev_dbg(gasket_dev->dev, "remove %s PCI gasket device\n", internal_desc->driver_desc->name); - gasket_disable_dev(gasket_dev); gasket_cleanup_pci(gasket_dev); check_and_invoke_callback(gasket_dev, driver_desc->sysfs_cleanup_cb); diff --git a/drivers/staging/gasket/gasket_core.h b/drivers/staging/gasket/gasket_core.h index 5d40bc7f52e91..9c143ebeba452 100644 --- a/drivers/staging/gasket/gasket_core.h +++ b/drivers/staging/gasket/gasket_core.h @@ -590,6 +590,12 @@ int gasket_pci_add_device(struct pci_dev *pci_dev, /* Remove a PCI gasket device. */ void gasket_pci_remove_device(struct pci_dev *pci_dev); +/* Enable a Gasket device. */ +int gasket_enable_device(struct gasket_dev *gasket_dev); + +/* Disable a Gasket device. */ +void gasket_disable_device(struct gasket_dev *gasket_dev); + /* * Reset the Gasket device. * @gasket_dev: Gasket device struct. -- 2.18.0.597.ga71716f1ad-goog
[PATCH 14/15] staging: gasket: apex: place in low power reset until opened
From: Todd Poynor The apex device is left out of reset mode at the end of device probe/initialize processing. Add a call to enter reset at the end of the sequence, triggering power gating and other low power features. Signed-off-by: Todd Poynor --- drivers/staging/gasket/apex_driver.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/staging/gasket/apex_driver.c b/drivers/staging/gasket/apex_driver.c index 55319619b2e66..c747e9ca45186 100644 --- a/drivers/staging/gasket/apex_driver.c +++ b/drivers/staging/gasket/apex_driver.c @@ -644,6 +644,10 @@ static int apex_pci_probe(struct pci_dev *pci_dev, goto remove_device; } + /* Place device in low power mode until opened */ + if (allow_power_save) + apex_enter_reset(gasket_dev); + return 0; remove_device: -- 2.18.0.597.ga71716f1ad-goog
[PATCH 13/15] staging: gasket: core: protect against races during unregister
From: Todd Poynor Keep mutex held across the unregistration operation, until the driver_desc field of the global table is removed, to prevent a concurrent accessor from looking up the driver_desc while gasket_unregister_device() is in the processing of removing it. Reported-by: Guenter Roeck Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index ace92f107ed58..a6462b6d702f1 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -1765,9 +1765,9 @@ void gasket_unregister_device(const struct gasket_driver_desc *driver_desc) break; } } - mutex_unlock(&g_mutex); if (!internal_desc) { + mutex_unlock(&g_mutex); pr_err("request to unregister unknown desc: %s, %d:%d\n", driver_desc->name, driver_desc->major, driver_desc->minor); @@ -1780,7 +1780,6 @@ void gasket_unregister_device(const struct gasket_driver_desc *driver_desc) class_destroy(internal_desc->class); /* Finally, effectively "remove" the driver. */ - mutex_lock(&g_mutex); g_descs[desc_idx].driver_desc = NULL; mutex_unlock(&g_mutex); -- 2.18.0.597.ga71716f1ad-goog
[PATCH 11/15] staging: gasket: core: remove sysfs setup and cleanup callbacks
From: Todd Poynor Gasket device drivers now call into the gasket framework to initialize and de-initialize, rather than the other way around. The calling code can perform sysfs setup and cleanup actions without callbacks from the framework. Remove the sysfs setup and cleanup callbacks. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 10 -- drivers/staging/gasket/gasket_core.h | 18 -- 2 files changed, 28 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index 0d76e18fcde5b..ace92f107ed58 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -1489,18 +1489,9 @@ int gasket_pci_add_device(struct pci_dev *pci_dev, if (ret) goto fail4; - ret = check_and_invoke_callback(gasket_dev, - driver_desc->sysfs_setup_cb); - if (ret) { - dev_err(gasket_dev->dev, "Error in sysfs setup cb: %d\n", ret); - goto fail5; - } - *gasket_devp = gasket_dev; return 0; -fail5: - check_and_invoke_callback(gasket_dev, driver_desc->sysfs_cleanup_cb); fail4: fail3: gasket_sysfs_remove_mapping(gasket_dev->dev_info.device); @@ -1550,7 +1541,6 @@ void gasket_pci_remove_device(struct pci_dev *pci_dev) gasket_cleanup_pci(gasket_dev); - check_and_invoke_callback(gasket_dev, driver_desc->sysfs_cleanup_cb); gasket_sysfs_remove_mapping(gasket_dev->dev_info.device); device_destroy(internal_desc->class, gasket_dev->dev_info.devt); gasket_free_dev(gasket_dev); diff --git a/drivers/staging/gasket/gasket_core.h b/drivers/staging/gasket/gasket_core.h index 0ef0a2640f0fe..275fd0b345b6e 100644 --- a/drivers/staging/gasket/gasket_core.h +++ b/drivers/staging/gasket/gasket_core.h @@ -444,24 +444,6 @@ struct gasket_driver_desc { */ int (*device_close_cb)(struct gasket_dev *dev); - /* -* sysfs_setup_cb: Callback to set up driver-specific sysfs nodes. -* @dev: Pointer to the gasket_dev struct for this device. -* -* Called during the add gasket device call. -* -*/ - int (*sysfs_setup_cb)(struct gasket_dev *dev); - - /* -* sysfs_cleanup_cb: Callback to clean up driver-specific sysfs nodes. -* @dev: Pointer to the gasket_dev struct for this device. -* -* Called during device disable processing. -* -*/ - int (*sysfs_cleanup_cb)(struct gasket_dev *dev); - /* * get_mappable_regions_cb: Get descriptors of mappable device memory. * @gasket_dev: Pointer to the struct gasket_dev for this device. -- 2.18.0.597.ga71716f1ad-goog
[PATCH 12/15] staging: gasket: apex: move sysfs setup code to probe function
From: Todd Poynor The gasket framework no longer provides callbacks to the device driver for sysfs setup and teardown. Move the sysfs setup code to the device probe function. Apex does not implement sysfs cleanup code. Signed-off-by: Todd Poynor --- drivers/staging/gasket/apex_driver.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/drivers/staging/gasket/apex_driver.c b/drivers/staging/gasket/apex_driver.c index 69ca7fb10eddc..55319619b2e66 100644 --- a/drivers/staging/gasket/apex_driver.c +++ b/drivers/staging/gasket/apex_driver.c @@ -568,12 +568,6 @@ static struct gasket_sysfs_attribute apex_sysfs_attrs[] = { GASKET_END_OF_ATTR_ARRAY }; -static int apex_sysfs_setup_cb(struct gasket_dev *gasket_dev) -{ - return gasket_sysfs_create_entries(gasket_dev->dev_info.device, - apex_sysfs_attrs); -} - /* On device open, perform a core reinit reset. */ static int apex_device_open_cb(struct gasket_dev *gasket_dev) { @@ -639,6 +633,11 @@ static int apex_pci_probe(struct pci_dev *pci_dev, goto remove_device; } + ret = gasket_sysfs_create_entries(gasket_dev->dev_info.device, + apex_sysfs_attrs); + if (ret) + dev_err(&pci_dev->dev, "error creating device sysfs entries\n"); + ret = gasket_enable_device(gasket_dev); if (ret) { dev_err(&pci_dev->dev, "error enabling gasket device\n"); @@ -695,9 +694,6 @@ static struct gasket_driver_desc apex_desc = { .interrupts = apex_interrupts, .interrupt_pack_width = 7, - .sysfs_setup_cb = apex_sysfs_setup_cb, - .sysfs_cleanup_cb = NULL, - .device_open_cb = apex_device_open_cb, .device_close_cb = apex_device_cleanup, -- 2.18.0.597.ga71716f1ad-goog
[PATCH 08/15] staging: gasket: apex: enable/disable gasket device from apex
From: Todd Poynor Gasket framework now places device drivers in charge of calling APIs to enable and disable gasket device operations. Make the appropriate calls from the apex driver. Signed-off-by: Todd Poynor --- drivers/staging/gasket/apex_driver.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/staging/gasket/apex_driver.c b/drivers/staging/gasket/apex_driver.c index e2bc06b5244f7..1d8a100c52885 100644 --- a/drivers/staging/gasket/apex_driver.c +++ b/drivers/staging/gasket/apex_driver.c @@ -643,11 +643,23 @@ static int apex_pci_probe(struct pci_dev *pci_dev, return ret; } + pci_set_drvdata(pci_dev, gasket_dev); + ret = gasket_enable_device(gasket_dev); + if (ret) { + dev_err(&pci_dev->dev, "error enabling gasket device\n"); + gasket_pci_remove_device(pci_dev); + pci_disable_device(pci_dev); + return ret; + } + return 0; } static void apex_pci_remove(struct pci_dev *pci_dev) { + struct gasket_dev *gasket_dev = pci_get_drvdata(pci_dev); + + gasket_disable_device(gasket_dev); gasket_pci_remove_device(pci_dev); pci_disable_device(pci_dev); } -- 2.18.0.597.ga71716f1ad-goog
[PATCH 10/15] staging: gasket: apex: fold device add/remove logic inline
From: Todd Poynor Gasket device drivers are now in charge of the device add and remove sequences; the framework callbacks for these are deleted. Move the apex device add callback code to the probe function. Apex did not implement the removal callback. Signed-off-by: Todd Poynor --- drivers/staging/gasket/apex_driver.c | 69 +--- 1 file changed, 32 insertions(+), 37 deletions(-) diff --git a/drivers/staging/gasket/apex_driver.c b/drivers/staging/gasket/apex_driver.c index 1d8a100c52885..69ca7fb10eddc 100644 --- a/drivers/staging/gasket/apex_driver.c +++ b/drivers/staging/gasket/apex_driver.c @@ -448,37 +448,6 @@ static int apex_reset(struct gasket_dev *gasket_dev) return ret; } -static int apex_add_dev_cb(struct gasket_dev *gasket_dev) -{ - ulong page_table_ready, msix_table_ready; - int retries = 0; - - apex_reset(gasket_dev); - - while (retries < APEX_RESET_RETRY) { - page_table_ready = - gasket_dev_read_64(gasket_dev, APEX_BAR_INDEX, - APEX_BAR2_REG_KERNEL_HIB_PAGE_TABLE_INIT); - msix_table_ready = - gasket_dev_read_64(gasket_dev, APEX_BAR_INDEX, - APEX_BAR2_REG_KERNEL_HIB_MSIX_TABLE_INIT); - if (page_table_ready && msix_table_ready) - break; - schedule_timeout(msecs_to_jiffies(APEX_RESET_DELAY)); - retries++; - } - - if (retries == APEX_RESET_RETRY) { - if (!page_table_ready) - dev_err(gasket_dev->dev, "Page table init timed out\n"); - if (!msix_table_ready) - dev_err(gasket_dev->dev, "MSI-X table init timed out\n"); - return -ETIMEDOUT; - } - - return 0; -} - /* * Check permissions for Apex ioctls. * Returns true if the current user may execute this ioctl, and false otherwise. @@ -626,6 +595,8 @@ static int apex_pci_probe(struct pci_dev *pci_dev, const struct pci_device_id *id) { int ret; + ulong page_table_ready, msix_table_ready; + int retries = 0; struct gasket_dev *gasket_dev; ret = pci_enable_device(pci_dev); @@ -644,15 +615,42 @@ static int apex_pci_probe(struct pci_dev *pci_dev, } pci_set_drvdata(pci_dev, gasket_dev); + apex_reset(gasket_dev); + + while (retries < APEX_RESET_RETRY) { + page_table_ready = + gasket_dev_read_64(gasket_dev, APEX_BAR_INDEX, + APEX_BAR2_REG_KERNEL_HIB_PAGE_TABLE_INIT); + msix_table_ready = + gasket_dev_read_64(gasket_dev, APEX_BAR_INDEX, + APEX_BAR2_REG_KERNEL_HIB_MSIX_TABLE_INIT); + if (page_table_ready && msix_table_ready) + break; + schedule_timeout(msecs_to_jiffies(APEX_RESET_DELAY)); + retries++; + } + + if (retries == APEX_RESET_RETRY) { + if (!page_table_ready) + dev_err(gasket_dev->dev, "Page table init timed out\n"); + if (!msix_table_ready) + dev_err(gasket_dev->dev, "MSI-X table init timed out\n"); + ret = -ETIMEDOUT; + goto remove_device; + } + ret = gasket_enable_device(gasket_dev); if (ret) { dev_err(&pci_dev->dev, "error enabling gasket device\n"); - gasket_pci_remove_device(pci_dev); - pci_disable_device(pci_dev); - return ret; + goto remove_device; } return 0; + +remove_device: + gasket_pci_remove_device(pci_dev); + pci_disable_device(pci_dev); + return ret; } static void apex_pci_remove(struct pci_dev *pci_dev) @@ -697,9 +695,6 @@ static struct gasket_driver_desc apex_desc = { .interrupts = apex_interrupts, .interrupt_pack_width = 7, - .add_dev_cb = apex_add_dev_cb, - .remove_dev_cb = NULL, - .sysfs_setup_cb = apex_sysfs_setup_cb, .sysfs_cleanup_cb = NULL, -- 2.18.0.597.ga71716f1ad-goog
[PATCH 06/15] staging: gasket: apex: remove device enable and disable callbacks
From: Todd Poynor These are not implemented for apex, and are now being removed from the gasket framework. Signed-off-by: Todd Poynor --- drivers/staging/gasket/apex_driver.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/staging/gasket/apex_driver.c b/drivers/staging/gasket/apex_driver.c index b47661442009d..e2bc06b5244f7 100644 --- a/drivers/staging/gasket/apex_driver.c +++ b/drivers/staging/gasket/apex_driver.c @@ -688,9 +688,6 @@ static struct gasket_driver_desc apex_desc = { .add_dev_cb = apex_add_dev_cb, .remove_dev_cb = NULL, - .enable_dev_cb = NULL, - .disable_dev_cb = NULL, - .sysfs_setup_cb = apex_sysfs_setup_cb, .sysfs_cleanup_cb = NULL, -- 2.18.0.597.ga71716f1ad-goog
[RFC] vruntime updated incorrectly when rt_mutex boots prio?
This issue was discovered on a 4.9-based android device, but the relevant mainline code appears to be the same. The symptom is that over time the some workloads become sluggish resulting in missed frames or sluggishness. It appears to be the same issue described in http://lists.infradead.org/pipermail/linux-arm-kernel/2018-March/567836.html. Here is the scenario: A task is deactivated while still in the fair class. The task is then boosted to RT, so rt_mutex_setprio() is called. This changes the task to RT and calls check_class_changed(), which eventually calls detach_task_cfs_rq(), which is where vruntime_normalized() sees that the task's state is TASK_WAKING, which results in skipping the subtraction of the rq's min_vruntime from the task's vruntime. Later, when the prio is deboosted and the task is moved back to the fair class, the fair rq's min_vruntime is added to the task's vruntime, resulting in vruntime inflation. When investigating the problem, it was found that the change below fixes the problem by forcing vruntime_normalized() to return false if the sched_class is not CFS (though we're concerned that it might introduce other issues): diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 91f7b3322a15..267056f2e2ca 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -11125,7 +11125,7 @@ static inline bool vruntime_normalized(struct task_struct *p) * - A task which has been woken up by try_to_wake_up() and * waiting for actually being woken up by sched_ttwu_pending(). */ - if (!se->sum_exec_runtime || p->state == TASK_WAKING) + if (!se->sum_exec_runtime || (p->state == TASK_WAKING && p->sched_class == &fair_sched_class)) return true; return false; Do folks agree that this is incorrect behavior? Does this fix look appropriate and safe? Other ideas? -Todd
[PATCH] acpi: bus: Fixed a pointer coding style issue
Fixed a coding style issue. Signed-off-by: Tom Todd --- drivers/acpi/bus.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c index 84b4a62018eb..73e43d81ec63 100644 --- a/drivers/acpi/bus.c +++ b/drivers/acpi/bus.c @@ -962,7 +962,7 @@ static int acpi_device_probe(struct device *dev) return 0; } -static int acpi_device_remove(struct device * dev) +static int acpi_device_remove(struct device *dev) { struct acpi_device *acpi_dev = to_acpi_device(dev); struct acpi_driver *acpi_drv = acpi_dev->driver; -- 2.18.0
[PATCH v2] drivers: base: initcall_debug logs for driver probe times
From: Todd Poynor Add initcall_debug logs for each driver device probe call, for example: probe of a380.ramoops returned 1 after 3007 usecs This replaces the previous code added to report times for deferred probes. It also reports OF platform bus device creates that were formerly lumped together in a single entry for function of_platform_default_populate_init, as well as helping to annotate other initcalls that involve device probing. Remove restriction on printing probe times only during initcalls, since initcall_debug now continues to show driver timing info past the boot phase. Signed-off-by: Todd Poynor --- drivers/base/dd.c | 50 +-- 1 file changed, 22 insertions(+), 28 deletions(-) Changes from v1: * unsigned long long -> s64 * use ktime_to_us() instead of ktime_to_ns with shift * print probe return value * drop code that stopped output on end of initcall processing diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 1435d7281c66..6ea9c5cece71 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -53,7 +53,6 @@ static DEFINE_MUTEX(deferred_probe_mutex); static LIST_HEAD(deferred_probe_pending_list); static LIST_HEAD(deferred_probe_active_list); static atomic_t deferred_trigger_count = ATOMIC_INIT(0); -static bool initcalls_done; /* * In some cases, like suspend to RAM or hibernation, It might be reasonable @@ -62,26 +61,6 @@ static bool initcalls_done; */ static bool defer_all_probes; -/* - * For initcall_debug, show the deferred probes executed in late_initcall - * processing. - */ -static void deferred_probe_debug(struct device *dev) -{ - ktime_t calltime, delta, rettime; - unsigned long long duration; - - printk(KERN_DEBUG "deferred probe %s @ %i\n", dev_name(dev), - task_pid_nr(current)); - calltime = ktime_get(); - bus_probe_device(dev); - rettime = ktime_get(); - delta = ktime_sub(rettime, calltime); - duration = (unsigned long long) ktime_to_ns(delta) >> 10; - printk(KERN_DEBUG "deferred probe %s returned after %lld usecs\n", - dev_name(dev), duration); -} - /* * deferred_probe_work_func() - Retry probing devices in the active list. */ @@ -125,11 +104,7 @@ static void deferred_probe_work_func(struct work_struct *work) device_pm_move_to_tail(dev); dev_dbg(dev, "Retrying from deferred list\n"); - if (initcall_debug && !initcalls_done) - deferred_probe_debug(dev); - else - bus_probe_device(dev); - + bus_probe_device(dev); mutex_lock(&deferred_probe_mutex); put_device(dev); @@ -237,7 +212,6 @@ static int deferred_probe_initcall(void) driver_deferred_probe_trigger(); /* Sort as many dependencies as possible before exiting initcalls */ flush_work(&deferred_probe_work); - initcalls_done = true; return 0; } late_initcall(deferred_probe_initcall); @@ -527,6 +501,23 @@ static int really_probe(struct device *dev, struct device_driver *drv) return ret; } +/* + * For initcall_debug, show the driver probe time. + */ +static int really_probe_debug(struct device *dev, struct device_driver *drv) +{ + ktime_t calltime, delta, rettime; + int ret; + + calltime = ktime_get(); + ret = really_probe(dev, drv); + rettime = ktime_get(); + delta = ktime_sub(rettime, calltime); + printk(KERN_DEBUG "probe of %s returned %d after %lld usecs\n", + dev_name(dev), ret, (s64) ktime_to_us(delta)); + return ret; +} + /** * driver_probe_done * Determine if the probe sequence is finished or not. @@ -585,7 +576,10 @@ int driver_probe_device(struct device_driver *drv, struct device *dev) pm_runtime_get_sync(dev->parent); pm_runtime_barrier(dev); - ret = really_probe(dev, drv); + if (initcall_debug) + ret = really_probe_debug(dev, drv); + else + ret = really_probe(dev, drv); pm_request_idle(dev); if (dev->parent) -- 2.18.0.rc1.244.gcf134e6275-goog
Re: [GIT PULL] Staging/IIO driver patches for 4.19-rc1
On Tue, Aug 28, 2018 at 3:38 AM Ahmed S. Darwish wrote: >[...] > On Sat, 18 Aug 2018 17:57:24 +0200, Greg KH wrote: > [...] > > addition of some new IIO drivers. Also added was a "gasket" driver from > > Google that needs loads of work and the erofs filesystem. > > > > Why are we adding __a whole new in-kernel framework__ for > developing basic user-space drivers? > > We already have a frameowrk for that, and it's UIO. [1] The UIO > code is a very stable and simple subsystem; it's also heavily used > in the embedded industry.. Yeah, it's clear all the userspace I/O framework code needs to move to UIO with some patches to add agreed-upon new pieces. I think everyone agrees this code shouldn't move out of staging until that happens. A whole lot of work is needed on these drivers, and UIO conversion is on my list to address soon. > > I've looked at the gasket documentation [2], and the first user > of this new in-kernel API [3], and this is almost replicating UIO > it's not funny. [4] True, the gasket APIs adds some extra new > conveniences (PCI BAR re-mapping, MSI, ..), but there's no > technical reason this cannot be added to the UIO code instead. > > More-over, the exposed user-space API is just some ioctls. So if > google hase some shipped user-space code that is using this, > hopefully the driver can still be re-implemented through UIO > without changing these bits.. > > Thanks, > > [1] https://www.kernel.org/doc/html/v4.18/driver-api/uio-howto.html > [2] drivers/staging/gasket/gasket_core.h :: struct gasket_driver_desc > [3] drivers/staging/gasket/apex_driver.c > [4] include/linux/uio_driver.h > > -- > Darwi > http://darwish.chasingpointers.com Thanks -- Todd
[PATCH] binder: use standard functions to allocate fds
Binder uses internal fs interfaces to allocate and install fds: __alloc_fd __fd_install __close_fd get_files_struct put_files_struct These were used to support the passing of fds between processes as part of a transaction. The actual allocation and installation of the fds in the target process was handled by the sending process so the standard functions, alloc_fd() and fd_install() which assume task==current couldn't be used. This patch refactors this mechanism so that the fds are allocated and installed by the target process allowing the standard functions to be used. The sender now creates a list of fd fixups that contains the struct *file and the address to fixup with the new fd once it is allocated. This list is processed by the target process when the transaction is dequeued. A new error case is introduced by this change. If an async transaction with file descriptors cannot allocate new fds in the target (probably due to out of file descriptors), the transaction is discarded with a log message. In the old implementation this would have been detected in the sender context and failed prior to sending. Signed-off-by: Todd Kjos --- drivers/android/Kconfig| 2 +- drivers/android/binder.c | 387 - drivers/android/binder_trace.h | 36 ++- 3 files changed, 260 insertions(+), 165 deletions(-) diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig index 432e9ad770703..51e8250d113fa 100644 --- a/drivers/android/Kconfig +++ b/drivers/android/Kconfig @@ -10,7 +10,7 @@ if ANDROID config ANDROID_BINDER_IPC bool "Android Binder IPC Driver" - depends on MMU + depends on MMU && !CPU_CACHE_VIVT default n ---help--- Binder is used in Android for both communication between processes, diff --git a/drivers/android/binder.c b/drivers/android/binder.c index d58763b6b0090..50856319bc7da 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -71,6 +71,7 @@ #include #include #include +#include #include @@ -457,9 +458,8 @@ struct binder_ref { }; enum binder_deferred_state { - BINDER_DEFERRED_PUT_FILES= 0x01, - BINDER_DEFERRED_FLUSH= 0x02, - BINDER_DEFERRED_RELEASE = 0x04, + BINDER_DEFERRED_FLUSH= 0x01, + BINDER_DEFERRED_RELEASE = 0x02, }; /** @@ -480,9 +480,6 @@ enum binder_deferred_state { *(invariant after initialized) * @tsk task_struct for group_leader of process *(invariant after initialized) - * @files files_struct for process - *(protected by @files_lock) - * @files_lockmutex to protect @files * @deferred_work_node: element for binder_deferred_list *(protected by binder_deferred_lock) * @deferred_work:bitmap of deferred work to perform @@ -527,8 +524,6 @@ struct binder_proc { struct list_head waiting_threads; int pid; struct task_struct *tsk; - struct files_struct *files; - struct mutex files_lock; struct hlist_node deferred_work_node; int deferred_work; bool is_dead; @@ -611,6 +606,23 @@ struct binder_thread { bool is_dead; }; +/** + * struct binder_txn_fd_fixup - transaction fd fixup list element + * @fixup_entry: list entry + * @file: struct file to be associated with new fd + * @offset: offset in buffer data to this fixup + * + * List element for fd fixups in a transaction. Since file + * descriptors need to be allocated in the context of the + * target process, we pass each fd to be processed in this + * struct. + */ +struct binder_txn_fd_fixup { + struct list_head fixup_entry; + struct file *file; + size_t offset; +}; + struct binder_transaction { int debug_id; struct binder_work work; @@ -628,6 +640,7 @@ struct binder_transaction { longpriority; longsaved_priority; kuid_t sender_euid; + struct list_head fd_fixups; /** * @lock: protects @from, @to_proc, and @to_thread * @@ -920,66 +933,6 @@ static void binder_free_thread(struct binder_thread *thread); static void binder_free_proc(struct binder_proc *proc); static void binder_inc_node_tmpref_ilocked(struct binder_node *node); -static int task_get_unused_fd_flags(struct binder_proc *proc, int flags) -{ - unsigned long rlim_cur; - unsigned long irqs; - int ret; - - mutex_lock(&proc->files_lock); - if (proc->files == NULL) { - ret = -ESRCH; - goto err; - } - if (!lock_task_sighand(proc->tsk, &irqs)) { - ret = -EMFILE; - goto err; - } - rlim_cur = task_rlimit(proc->tsk, RLIMIT_NOFILE); - unlock_task_sighand(proc->tsk, &am
[PATCH v2] binder: use standard functions to allocate fds
Binder uses internal fs interfaces to allocate and install fds: __alloc_fd __fd_install __close_fd get_files_struct put_files_struct These were used to support the passing of fds between processes as part of a transaction. The actual allocation and installation of the fds in the target process was handled by the sending process so the standard functions, alloc_fd() and fd_install() which assume task==current couldn't be used. This patch refactors this mechanism so that the fds are allocated and installed by the target process allowing the standard functions to be used. The sender now creates a list of fd fixups that contains the struct *file and the address to fixup with the new fd once it is allocated. This list is processed by the target process when the transaction is dequeued. A new error case is introduced by this change. If an async transaction with file descriptors cannot allocate new fds in the target (probably due to out of file descriptors), the transaction is discarded with a log message. In the old implementation this would have been detected in the sender context and failed prior to sending. Signed-off-by: Todd Kjos --- v2: use "%zu" printk format for size_t drivers/android/Kconfig| 2 +- drivers/android/binder.c | 387 - drivers/android/binder_trace.h | 36 ++- 3 files changed, 260 insertions(+), 165 deletions(-) diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig index 432e9ad770703..51e8250d113fa 100644 --- a/drivers/android/Kconfig +++ b/drivers/android/Kconfig @@ -10,7 +10,7 @@ if ANDROID config ANDROID_BINDER_IPC bool "Android Binder IPC Driver" - depends on MMU + depends on MMU && !CPU_CACHE_VIVT default n ---help--- Binder is used in Android for both communication between processes, diff --git a/drivers/android/binder.c b/drivers/android/binder.c index d58763b6b0090..50856319bc7da 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -71,6 +71,7 @@ #include #include #include +#include #include @@ -457,9 +458,8 @@ struct binder_ref { }; enum binder_deferred_state { - BINDER_DEFERRED_PUT_FILES= 0x01, - BINDER_DEFERRED_FLUSH= 0x02, - BINDER_DEFERRED_RELEASE = 0x04, + BINDER_DEFERRED_FLUSH= 0x01, + BINDER_DEFERRED_RELEASE = 0x02, }; /** @@ -480,9 +480,6 @@ enum binder_deferred_state { *(invariant after initialized) * @tsk task_struct for group_leader of process *(invariant after initialized) - * @files files_struct for process - *(protected by @files_lock) - * @files_lockmutex to protect @files * @deferred_work_node: element for binder_deferred_list *(protected by binder_deferred_lock) * @deferred_work:bitmap of deferred work to perform @@ -527,8 +524,6 @@ struct binder_proc { struct list_head waiting_threads; int pid; struct task_struct *tsk; - struct files_struct *files; - struct mutex files_lock; struct hlist_node deferred_work_node; int deferred_work; bool is_dead; @@ -611,6 +606,23 @@ struct binder_thread { bool is_dead; }; +/** + * struct binder_txn_fd_fixup - transaction fd fixup list element + * @fixup_entry: list entry + * @file: struct file to be associated with new fd + * @offset: offset in buffer data to this fixup + * + * List element for fd fixups in a transaction. Since file + * descriptors need to be allocated in the context of the + * target process, we pass each fd to be processed in this + * struct. + */ +struct binder_txn_fd_fixup { + struct list_head fixup_entry; + struct file *file; + size_t offset; +}; + struct binder_transaction { int debug_id; struct binder_work work; @@ -628,6 +640,7 @@ struct binder_transaction { longpriority; longsaved_priority; kuid_t sender_euid; + struct list_head fd_fixups; /** * @lock: protects @from, @to_proc, and @to_thread * @@ -920,66 +933,6 @@ static void binder_free_thread(struct binder_thread *thread); static void binder_free_proc(struct binder_proc *proc); static void binder_inc_node_tmpref_ilocked(struct binder_node *node); -static int task_get_unused_fd_flags(struct binder_proc *proc, int flags) -{ - unsigned long rlim_cur; - unsigned long irqs; - int ret; - - mutex_lock(&proc->files_lock); - if (proc->files == NULL) { - ret = -ESRCH; - goto err; - } - if (!lock_task_sighand(proc->tsk, &irqs)) { - ret = -EMFILE; - goto err; - } - rlim_cur = task_rlimit(proc->tsk, RLIMIT_
[PATCH] binder: use standard functions to allocate fds
Binder uses internal fs interfaces to allocate and install fds: __alloc_fd __fd_install __close_fd get_files_struct put_files_struct These were used to support the passing of fds between processes as part of a transaction. The actual allocation and installation of the fds in the target process was handled by the sending process so the standard functions, alloc_fd() and fd_install() which assume task==current couldn't be used. This patch refactors this mechanism so that the fds are allocated and installed by the target process allowing the standard functions to be used. The sender now creates a list of fd fixups that contains the struct *file and the address to fixup with the new fd once it is allocated. This list is processed by the target process when the transaction is dequeued. A new error case is introduced by this change. If an async transaction with file descriptors cannot allocate new fds in the target (probably due to out of file descriptors), the transaction is discarded with a log message. In the old implementation this would have been detected in the sender context and failed prior to sending. Signed-off-by: Todd Kjos --- v2: use "%zu" printk format for size_t drivers/android/Kconfig| 2 +- drivers/android/binder.c | 387 - drivers/android/binder_trace.h | 36 ++- 3 files changed, 260 insertions(+), 165 deletions(-) diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig index 432e9ad770703..51e8250d113fa 100644 --- a/drivers/android/Kconfig +++ b/drivers/android/Kconfig @@ -10,7 +10,7 @@ if ANDROID config ANDROID_BINDER_IPC bool "Android Binder IPC Driver" - depends on MMU + depends on MMU && !CPU_CACHE_VIVT default n ---help--- Binder is used in Android for both communication between processes, diff --git a/drivers/android/binder.c b/drivers/android/binder.c index d58763b6b0090..50856319bc7da 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -71,6 +71,7 @@ #include #include #include +#include #include @@ -457,9 +458,8 @@ struct binder_ref { }; enum binder_deferred_state { - BINDER_DEFERRED_PUT_FILES= 0x01, - BINDER_DEFERRED_FLUSH= 0x02, - BINDER_DEFERRED_RELEASE = 0x04, + BINDER_DEFERRED_FLUSH= 0x01, + BINDER_DEFERRED_RELEASE = 0x02, }; /** @@ -480,9 +480,6 @@ enum binder_deferred_state { *(invariant after initialized) * @tsk task_struct for group_leader of process *(invariant after initialized) - * @files files_struct for process - *(protected by @files_lock) - * @files_lockmutex to protect @files * @deferred_work_node: element for binder_deferred_list *(protected by binder_deferred_lock) * @deferred_work:bitmap of deferred work to perform @@ -527,8 +524,6 @@ struct binder_proc { struct list_head waiting_threads; int pid; struct task_struct *tsk; - struct files_struct *files; - struct mutex files_lock; struct hlist_node deferred_work_node; int deferred_work; bool is_dead; @@ -611,6 +606,23 @@ struct binder_thread { bool is_dead; }; +/** + * struct binder_txn_fd_fixup - transaction fd fixup list element + * @fixup_entry: list entry + * @file: struct file to be associated with new fd + * @offset: offset in buffer data to this fixup + * + * List element for fd fixups in a transaction. Since file + * descriptors need to be allocated in the context of the + * target process, we pass each fd to be processed in this + * struct. + */ +struct binder_txn_fd_fixup { + struct list_head fixup_entry; + struct file *file; + size_t offset; +}; + struct binder_transaction { int debug_id; struct binder_work work; @@ -628,6 +640,7 @@ struct binder_transaction { longpriority; longsaved_priority; kuid_t sender_euid; + struct list_head fd_fixups; /** * @lock: protects @from, @to_proc, and @to_thread * @@ -920,66 +933,6 @@ static void binder_free_thread(struct binder_thread *thread); static void binder_free_proc(struct binder_proc *proc); static void binder_inc_node_tmpref_ilocked(struct binder_node *node); -static int task_get_unused_fd_flags(struct binder_proc *proc, int flags) -{ - unsigned long rlim_cur; - unsigned long irqs; - int ret; - - mutex_lock(&proc->files_lock); - if (proc->files == NULL) { - ret = -ESRCH; - goto err; - } - if (!lock_task_sighand(proc->tsk, &irqs)) { - ret = -EMFILE; - goto err; - } - rlim_cur = task_rlimit(proc->tsk, RLIMIT_
Re: [PATCH v2] binder: use standard functions to allocate fds
Sorry, forgot to bump the version. Ignore this one. On Tue, Aug 28, 2018 at 1:43 PM Todd Kjos wrote: > > Binder uses internal fs interfaces to allocate and install fds: > > __alloc_fd > __fd_install > __close_fd > get_files_struct > put_files_struct > > These were used to support the passing of fds between processes > as part of a transaction. The actual allocation and installation > of the fds in the target process was handled by the sending > process so the standard functions, alloc_fd() and fd_install() > which assume task==current couldn't be used. > > This patch refactors this mechanism so that the fds are > allocated and installed by the target process allowing the > standard functions to be used. > > The sender now creates a list of fd fixups that contains the > struct *file and the address to fixup with the new fd once > it is allocated. This list is processed by the target process > when the transaction is dequeued. > > A new error case is introduced by this change. If an async > transaction with file descriptors cannot allocate new > fds in the target (probably due to out of file descriptors), > the transaction is discarded with a log message. In the old > implementation this would have been detected in the sender > context and failed prior to sending. > > Signed-off-by: Todd Kjos > --- > v2: use "%zu" printk format for size_t > > drivers/android/Kconfig| 2 +- > drivers/android/binder.c | 387 - > drivers/android/binder_trace.h | 36 ++- > 3 files changed, 260 insertions(+), 165 deletions(-) > > diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig > index 432e9ad770703..51e8250d113fa 100644 > --- a/drivers/android/Kconfig > +++ b/drivers/android/Kconfig > @@ -10,7 +10,7 @@ if ANDROID > > config ANDROID_BINDER_IPC > bool "Android Binder IPC Driver" > - depends on MMU > + depends on MMU && !CPU_CACHE_VIVT > default n > ---help--- > Binder is used in Android for both communication between processes, > diff --git a/drivers/android/binder.c b/drivers/android/binder.c > index d58763b6b0090..50856319bc7da 100644 > --- a/drivers/android/binder.c > +++ b/drivers/android/binder.c > @@ -71,6 +71,7 @@ > #include > #include > #include > +#include > > #include > > @@ -457,9 +458,8 @@ struct binder_ref { > }; > > enum binder_deferred_state { > - BINDER_DEFERRED_PUT_FILES= 0x01, > - BINDER_DEFERRED_FLUSH= 0x02, > - BINDER_DEFERRED_RELEASE = 0x04, > + BINDER_DEFERRED_FLUSH= 0x01, > + BINDER_DEFERRED_RELEASE = 0x02, > }; > > /** > @@ -480,9 +480,6 @@ enum binder_deferred_state { > *(invariant after initialized) > * @tsk task_struct for group_leader of process > *(invariant after initialized) > - * @files files_struct for process > - *(protected by @files_lock) > - * @files_lockmutex to protect @files > * @deferred_work_node: element for binder_deferred_list > *(protected by binder_deferred_lock) > * @deferred_work:bitmap of deferred work to perform > @@ -527,8 +524,6 @@ struct binder_proc { > struct list_head waiting_threads; > int pid; > struct task_struct *tsk; > - struct files_struct *files; > - struct mutex files_lock; > struct hlist_node deferred_work_node; > int deferred_work; > bool is_dead; > @@ -611,6 +606,23 @@ struct binder_thread { > bool is_dead; > }; > > +/** > + * struct binder_txn_fd_fixup - transaction fd fixup list element > + * @fixup_entry: list entry > + * @file: struct file to be associated with new fd > + * @offset: offset in buffer data to this fixup > + * > + * List element for fd fixups in a transaction. Since file > + * descriptors need to be allocated in the context of the > + * target process, we pass each fd to be processed in this > + * struct. > + */ > +struct binder_txn_fd_fixup { > + struct list_head fixup_entry; > + struct file *file; > + size_t offset; > +}; > + > struct binder_transaction { > int debug_id; > struct binder_work work; > @@ -628,6 +640,7 @@ struct binder_transaction { > longpriority; > longsaved_priority; > kuid_t sender_euid; > + struct list_head fd_fixups; > /** > * @lock: protects @from, @to_proc
[PATCH v2] binder: use standard functions to allocate fds
Binder uses internal fs interfaces to allocate and install fds: __alloc_fd __fd_install __close_fd get_files_struct put_files_struct These were used to support the passing of fds between processes as part of a transaction. The actual allocation and installation of the fds in the target process was handled by the sending process so the standard functions, alloc_fd() and fd_install() which assume task==current couldn't be used. This patch refactors this mechanism so that the fds are allocated and installed by the target process allowing the standard functions to be used. The sender now creates a list of fd fixups that contains the struct *file and the address to fixup with the new fd once it is allocated. This list is processed by the target process when the transaction is dequeued. A new error case is introduced by this change. If an async transaction with file descriptors cannot allocate new fds in the target (probably due to out of file descriptors), the transaction is discarded with a log message. In the old implementation this would have been detected in the sender context and failed prior to sending. Signed-off-by: Todd Kjos --- v2: use "%zu" printk format for size_t drivers/android/Kconfig| 2 +- drivers/android/binder.c | 387 - drivers/android/binder_trace.h | 36 ++- 3 files changed, 260 insertions(+), 165 deletions(-) diff --git a/drivers/android/Kconfig b/drivers/android/Kconfig index 432e9ad770703..51e8250d113fa 100644 --- a/drivers/android/Kconfig +++ b/drivers/android/Kconfig @@ -10,7 +10,7 @@ if ANDROID config ANDROID_BINDER_IPC bool "Android Binder IPC Driver" - depends on MMU + depends on MMU && !CPU_CACHE_VIVT default n ---help--- Binder is used in Android for both communication between processes, diff --git a/drivers/android/binder.c b/drivers/android/binder.c index d58763b6b0090..50856319bc7da 100644 --- a/drivers/android/binder.c +++ b/drivers/android/binder.c @@ -71,6 +71,7 @@ #include #include #include +#include #include @@ -457,9 +458,8 @@ struct binder_ref { }; enum binder_deferred_state { - BINDER_DEFERRED_PUT_FILES= 0x01, - BINDER_DEFERRED_FLUSH= 0x02, - BINDER_DEFERRED_RELEASE = 0x04, + BINDER_DEFERRED_FLUSH= 0x01, + BINDER_DEFERRED_RELEASE = 0x02, }; /** @@ -480,9 +480,6 @@ enum binder_deferred_state { *(invariant after initialized) * @tsk task_struct for group_leader of process *(invariant after initialized) - * @files files_struct for process - *(protected by @files_lock) - * @files_lockmutex to protect @files * @deferred_work_node: element for binder_deferred_list *(protected by binder_deferred_lock) * @deferred_work:bitmap of deferred work to perform @@ -527,8 +524,6 @@ struct binder_proc { struct list_head waiting_threads; int pid; struct task_struct *tsk; - struct files_struct *files; - struct mutex files_lock; struct hlist_node deferred_work_node; int deferred_work; bool is_dead; @@ -611,6 +606,23 @@ struct binder_thread { bool is_dead; }; +/** + * struct binder_txn_fd_fixup - transaction fd fixup list element + * @fixup_entry: list entry + * @file: struct file to be associated with new fd + * @offset: offset in buffer data to this fixup + * + * List element for fd fixups in a transaction. Since file + * descriptors need to be allocated in the context of the + * target process, we pass each fd to be processed in this + * struct. + */ +struct binder_txn_fd_fixup { + struct list_head fixup_entry; + struct file *file; + size_t offset; +}; + struct binder_transaction { int debug_id; struct binder_work work; @@ -628,6 +640,7 @@ struct binder_transaction { longpriority; longsaved_priority; kuid_t sender_euid; + struct list_head fd_fixups; /** * @lock: protects @from, @to_proc, and @to_thread * @@ -920,66 +933,6 @@ static void binder_free_thread(struct binder_thread *thread); static void binder_free_proc(struct binder_proc *proc); static void binder_inc_node_tmpref_ilocked(struct binder_node *node); -static int task_get_unused_fd_flags(struct binder_proc *proc, int flags) -{ - unsigned long rlim_cur; - unsigned long irqs; - int ret; - - mutex_lock(&proc->files_lock); - if (proc->files == NULL) { - ret = -ESRCH; - goto err; - } - if (!lock_task_sighand(proc->tsk, &irqs)) { - ret = -EMFILE; - goto err; - } - rlim_cur = task_rlimit(proc->tsk, RLIMIT_
Re: [PATCH] binder: use standard functions to allocate fds
On Wed, Aug 29, 2018 at 12:00 AM Christoph Hellwig wrote: > > > config ANDROID_BINDER_IPC > > bool "Android Binder IPC Driver" > > - depends on MMU > > + depends on MMU && !CPU_CACHE_VIVT > > Thats is a purely arm specific symbol which should not be > used in common code. Nevermind that there generally should > be no good reason for it. It is true that the current design (10+ years old) has this flaw. VIVT cache support was the rationale for using the non-standard functions to allocate file descriptors from the sender context. There are no known cases of recent android releases running on a device with a VIVT cache architecture, but I did want to call this out. We're working on refactoring to eliminate this issue. > > > + fixup->offset = (uintptr_t)fdp - (uintptr_t)t->buffer->data; > > This looks completely broken. Why would you care at what exact > place the fd is placed? Oh, because you share an array with fds > with userspace, which is a hell of a bad idea, and then maninpulate > that buffer mapped to userspace from kernel threads. Well, android apps rely on this behavior (and have for 10+ years) so there is no way to avoid the need to manipulate the buffer from the kernel. We are working to refactor the driver to do this using standard mechanisms instead of relying on low-level internal functions. > > I think we just need to rm -rf drivers/android/binder*.c and be done > with it, as this piece of crap should never have been merged to start > with.
[PATCH 04/16] staging: gasket: core: remove kobj_name param from gasket_alloc_dev
From: Todd Poynor gasket_alloc_dev can retrieve the device name from the parent parameter, a separate parameter isn't needed for this. Rename the variable to better reflect its meaning, as the name of the parent device for which a gasket device is being allocated. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 17 - 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index 3fb805204d700..5f54b3615f67c 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -189,26 +189,26 @@ static int gasket_find_dev_slot(struct gasket_internal_desc *internal_desc, * Returns 0 if successful, a negative error code otherwise. */ static int gasket_alloc_dev(struct gasket_internal_desc *internal_desc, - struct device *parent, struct gasket_dev **pdev, - const char *kobj_name) + struct device *parent, struct gasket_dev **pdev) { int dev_idx; const struct gasket_driver_desc *driver_desc = internal_desc->driver_desc; struct gasket_dev *gasket_dev; struct gasket_cdev_info *dev_info; + const char *parent_name = dev_name(parent); - pr_debug("Allocating a Gasket device %s.\n", kobj_name); + pr_debug("Allocating a Gasket device, parent %s.\n", parent_name); *pdev = NULL; - dev_idx = gasket_find_dev_slot(internal_desc, kobj_name); + dev_idx = gasket_find_dev_slot(internal_desc, parent_name); if (dev_idx < 0) return dev_idx; gasket_dev = *pdev = kzalloc(sizeof(*gasket_dev), GFP_KERNEL); if (!gasket_dev) { - pr_err("no memory for device %s\n", kobj_name); + pr_err("no memory for device, parent %s\n", parent_name); return -ENOMEM; } internal_desc->devs[dev_idx] = gasket_dev; @@ -217,7 +217,7 @@ static int gasket_alloc_dev(struct gasket_internal_desc *internal_desc, gasket_dev->internal_desc = internal_desc; gasket_dev->dev_idx = dev_idx; - snprintf(gasket_dev->kobj_name, GASKET_NAME_MAX, "%s", kobj_name); + snprintf(gasket_dev->kobj_name, GASKET_NAME_MAX, "%s", parent_name); gasket_dev->dev = get_device(parent); /* gasket_bar_data is uninitialized. */ gasket_dev->num_page_tables = driver_desc->num_page_tables; @@ -1431,13 +1431,12 @@ int gasket_pci_add_device(struct pci_dev *pci_dev, struct gasket_dev **gasket_devp) { int ret; - const char *kobj_name = dev_name(&pci_dev->dev); struct gasket_internal_desc *internal_desc; struct gasket_dev *gasket_dev; const struct gasket_driver_desc *driver_desc; struct device *parent; - pr_debug("add PCI device %s\n", kobj_name); + dev_dbg(&pci_dev->dev, "add PCI gasket device\n"); mutex_lock(&g_mutex); internal_desc = lookup_internal_desc(pci_dev); @@ -1451,7 +1450,7 @@ int gasket_pci_add_device(struct pci_dev *pci_dev, driver_desc = internal_desc->driver_desc; parent = &pci_dev->dev; - ret = gasket_alloc_dev(internal_desc, parent, &gasket_dev, kobj_name); + ret = gasket_alloc_dev(internal_desc, parent, &gasket_dev); if (ret) return ret; gasket_dev->pci_dev = pci_dev; -- 2.18.0.597.ga71716f1ad-goog
[PATCH 01/16] MAINTAINERS: Switch a maintainer for drivers/staging/gasket
From: Todd Poynor Todd Poynor takes over for John Joseph. Signed-off-by: John Joseph Signed-off-by: Todd Poynor --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index af64fe0f0b41f..f3466b5c50482 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5939,7 +5939,7 @@ F:Documentation/gcc-plugins.txt GASKET DRIVER FRAMEWORK M: Rob Springer -M: John Joseph +M: Todd Poynor M: Ben Chan S: Maintained F: drivers/staging/gasket/ -- 2.18.0.597.ga71716f1ad-goog
[PATCH 15/16] staging: gasket: interrupt: simplify interrupt init parameters
From: Todd Poynor Pass the gasket driver descriptor to the interrupt init function, rather than exploding out separate parameters from various fields of that structure. This allows us to make more localized changes to the types of interrupts supported (MSIX vs. wire, etc.) without affecting the calling sequence, and seems nicer for simplification purposes. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.c | 8 +-- drivers/staging/gasket/gasket_interrupt.c | 27 +++ drivers/staging/gasket/gasket_interrupt.h | 24 +--- 3 files changed, 15 insertions(+), 44 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.c b/drivers/staging/gasket/gasket_core.c index 99f3f11d75ce2..f230bec76ae4e 100644 --- a/drivers/staging/gasket/gasket_core.c +++ b/drivers/staging/gasket/gasket_core.c @@ -1357,13 +1357,7 @@ int gasket_enable_device(struct gasket_dev *gasket_dev) const struct gasket_driver_desc *driver_desc = gasket_dev->internal_desc->driver_desc; - ret = gasket_interrupt_init(gasket_dev, driver_desc->name, - driver_desc->interrupt_type, - driver_desc->interrupts, - driver_desc->num_interrupts, - driver_desc->interrupt_pack_width, - driver_desc->interrupt_bar_index, - driver_desc->wire_interrupt_offsets); + ret = gasket_interrupt_init(gasket_dev); if (ret) { dev_err(gasket_dev->dev, "Critical failure to allocate interrupts: %d\n", ret); diff --git a/drivers/staging/gasket/gasket_interrupt.c b/drivers/staging/gasket/gasket_interrupt.c index f94e4ea9a7ded..eb5dfbe08e214 100644 --- a/drivers/staging/gasket/gasket_interrupt.c +++ b/drivers/staging/gasket/gasket_interrupt.c @@ -331,31 +331,30 @@ static struct gasket_sysfs_attribute interrupt_sysfs_attrs[] = { GASKET_END_OF_ATTR_ARRAY, }; -int gasket_interrupt_init(struct gasket_dev *gasket_dev, const char *name, - int type, - const struct gasket_interrupt_desc *interrupts, - int num_interrupts, int pack_width, int bar_index, - const struct gasket_wire_interrupt_offsets *wire_int_offsets) +int gasket_interrupt_init(struct gasket_dev *gasket_dev) { int ret; struct gasket_interrupt_data *interrupt_data; + const struct gasket_driver_desc *driver_desc = + gasket_get_driver_desc(gasket_dev); interrupt_data = kzalloc(sizeof(struct gasket_interrupt_data), GFP_KERNEL); if (!interrupt_data) return -ENOMEM; gasket_dev->interrupt_data = interrupt_data; - interrupt_data->name = name; - interrupt_data->type = type; + interrupt_data->name = driver_desc->name; + interrupt_data->type = driver_desc->interrupt_type; interrupt_data->pci_dev = gasket_dev->pci_dev; - interrupt_data->num_interrupts = num_interrupts; - interrupt_data->interrupts = interrupts; - interrupt_data->interrupt_bar_index = bar_index; - interrupt_data->pack_width = pack_width; + interrupt_data->num_interrupts = driver_desc->num_interrupts; + interrupt_data->interrupts = driver_desc->interrupts; + interrupt_data->interrupt_bar_index = driver_desc->interrupt_bar_index; + interrupt_data->pack_width = driver_desc->interrupt_pack_width; interrupt_data->num_configured = 0; - interrupt_data->wire_interrupt_offsets = wire_int_offsets; + interrupt_data->wire_interrupt_offsets = + driver_desc->wire_interrupt_offsets; - interrupt_data->eventfd_ctxs = kcalloc(num_interrupts, + interrupt_data->eventfd_ctxs = kcalloc(driver_desc->num_interrupts, sizeof(struct eventfd_ctx *), GFP_KERNEL); if (!interrupt_data->eventfd_ctxs) { @@ -363,7 +362,7 @@ int gasket_interrupt_init(struct gasket_dev *gasket_dev, const char *name, return -ENOMEM; } - interrupt_data->interrupt_counts = kcalloc(num_interrupts, + interrupt_data->interrupt_counts = kcalloc(driver_desc->num_interrupts, sizeof(ulong), GFP_KERNEL); if (!interrupt_data->interrupt_counts) { diff --git a/drivers/staging/gasket/gasket_interrupt.h b/drivers/staging/gasket/gasket_interrupt.h index 835af439e96a9..85526a1374a1a 100644 --- a/drivers/staging/gasket/gasket_interrupt.h +++ b/drivers/staging/gask
[PATCH 09/16] staging: gasket: core: switch to relaxed memory-mapped I/O
From: Todd Poynor Use of readl() is deprecated; readl_relaxed() with appropriate memory barriers is preferred. Switch to relaxed reads and writes for better performance as well. Memory barriers required for I/O vs. normal memory access on Apex devices have already been explicitly coded in the page table routines. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.h b/drivers/staging/gasket/gasket_core.h index 275fd0b345b6e..fd7e75b765a6d 100644 --- a/drivers/staging/gasket/gasket_core.h +++ b/drivers/staging/gasket/gasket_core.h @@ -590,25 +590,25 @@ const char *gasket_num_name_lookup(uint num, static inline ulong gasket_dev_read_64(struct gasket_dev *gasket_dev, int bar, ulong location) { - return readq(&gasket_dev->bar_data[bar].virt_base[location]); + return readq_relaxed(&gasket_dev->bar_data[bar].virt_base[location]); } static inline void gasket_dev_write_64(struct gasket_dev *dev, u64 value, int bar, ulong location) { - writeq(value, &dev->bar_data[bar].virt_base[location]); + writeq_relaxed(value, &dev->bar_data[bar].virt_base[location]); } static inline void gasket_dev_write_32(struct gasket_dev *dev, u32 value, int bar, ulong location) { - writel(value, &dev->bar_data[bar].virt_base[location]); + writel_relaxed(value, &dev->bar_data[bar].virt_base[location]); } static inline u32 gasket_dev_read_32(struct gasket_dev *dev, int bar, ulong location) { - return readl(&dev->bar_data[bar].virt_base[location]); + return readl_relaxed(&dev->bar_data[bar].virt_base[location]); } static inline void gasket_read_modify_write_64(struct gasket_dev *dev, int bar, -- 2.18.0.597.ga71716f1ad-goog
[PATCH 10/16] staging: gasket: page table: remove extraneous memory barriers
From: Todd Poynor Some explicit memory barriers in the page table code are not necessary, either because: (a) The barrier follows a non-relaxed MMIO access that already performs a read or write memory barrier. (b) The barrier follows DMA API calls for which the device-visible effects of IOMMU programming are guaranteed to be flushed to the IOMMU prior to the call returning, and doesn't need to sync with normal memory access. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_page_table.c | 15 +-- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/drivers/staging/gasket/gasket_page_table.c b/drivers/staging/gasket/gasket_page_table.c index 4d2499269499b..53492f4fad6aa 100644 --- a/drivers/staging/gasket/gasket_page_table.c +++ b/drivers/staging/gasket/gasket_page_table.c @@ -317,8 +317,6 @@ static void gasket_free_extended_subtable(struct gasket_page_table *pg_tbl, /* Release the page table from the device */ writeq(0, slot); - /* Force sync around the address release. */ - mb(); if (pte->dma_addr) dma_unmap_page(pg_tbl->device, pte->dma_addr, PAGE_SIZE, @@ -504,8 +502,6 @@ static int gasket_perform_mapping(struct gasket_page_table *pg_tbl, (void *)page_to_phys(page)); return -1; } - /* Wait until the page is mapped. */ - mb(); } /* Make the DMA-space address available to the device. */ @@ -604,12 +600,13 @@ static void gasket_perform_unmapping(struct gasket_page_table *pg_tbl, */ for (i = 0; i < num_pages; i++) { /* release the address from the device, */ - if (is_simple_mapping || ptes[i].status == PTE_INUSE) + if (is_simple_mapping || ptes[i].status == PTE_INUSE) { writeq(0, &slots[i]); - else + } else { ((u64 __force *)slots)[i] = 0; - /* Force sync around the address release. */ - mb(); + /* sync above PTE update before updating mappings */ + wmb(); + } /* release the address from the driver, */ if (ptes[i].status == PTE_INUSE) { @@ -898,8 +895,6 @@ static int gasket_alloc_extended_subtable(struct gasket_page_table *pg_tbl, /* Map the page into DMA space. */ pte->dma_addr = dma_map_page(pg_tbl->device, pte->page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL); - /* Wait until the page is mapped. */ - mb(); /* make the addresses available to the device */ dma_addr = (pte->dma_addr + pte->offset) | GASKET_VALID_SLOT_FLAG; -- 2.18.0.597.ga71716f1ad-goog
[PATCH 16/16] staging: gasket: interrupt: remove unimplemented interrupt types
From: Todd Poynor Interrupt types PCI_MSI and PLATFORM_WIRE are unused and unimplemented. Remove these. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_core.h | 11 drivers/staging/gasket/gasket_interrupt.c | 34 +-- 2 files changed, 1 insertion(+), 44 deletions(-) diff --git a/drivers/staging/gasket/gasket_core.h b/drivers/staging/gasket/gasket_core.h index fd7e75b765a6d..0203460f48957 100644 --- a/drivers/staging/gasket/gasket_core.h +++ b/drivers/staging/gasket/gasket_core.h @@ -50,8 +50,6 @@ enum gasket_interrupt_packing { /* Type of the interrupt supported by the device. */ enum gasket_interrupt_type { PCI_MSIX = 0, - PCI_MSI = 1, - PLATFORM_WIRE = 2, }; /* @@ -69,12 +67,6 @@ struct gasket_interrupt_desc { int packing; }; -/* Offsets to the wire interrupt handling registers */ -struct gasket_wire_interrupt_offsets { - u64 pending_bit_array; - u64 mask_array; -}; - /* * This enum is used to identify memory regions being part of the physical * memory that belongs to a device. @@ -384,9 +376,6 @@ struct gasket_driver_desc { */ struct gasket_coherent_buffer_desc coherent_buffer_description; - /* Offset of wire interrupt registers. */ - const struct gasket_wire_interrupt_offsets *wire_interrupt_offsets; - /* Interrupt type. (One of gasket_interrupt_type). */ int interrupt_type; diff --git a/drivers/staging/gasket/gasket_interrupt.c b/drivers/staging/gasket/gasket_interrupt.c index eb5dfbe08e214..2cd262be65ca0 100644 --- a/drivers/staging/gasket/gasket_interrupt.c +++ b/drivers/staging/gasket/gasket_interrupt.c @@ -45,9 +45,6 @@ struct gasket_interrupt_data { /* The width of a single interrupt in a packed interrupt register. */ int pack_width; - /* offset of wire interrupt registers */ - const struct gasket_wire_interrupt_offsets *wire_interrupt_offsets; - /* * Design-wise, these elements should be bundled together, but * pci_enable_msix's interface requires that they be managed @@ -92,19 +89,6 @@ static void gasket_interrupt_setup(struct gasket_dev *gasket_dev) dev_dbg(gasket_dev->dev, "Running interrupt setup\n"); - if (interrupt_data->type == PLATFORM_WIRE || - interrupt_data->type == PCI_MSI) { - /* Nothing needs to be done for platform or PCI devices. */ - return; - } - - if (interrupt_data->type != PCI_MSIX) { - dev_dbg(gasket_dev->dev, - "Cannot handle unsupported interrupt type %d\n", - interrupt_data->type); - return; - } - /* Setup the MSIX table. */ for (i = 0; i < interrupt_data->num_interrupts; i++) { @@ -351,8 +335,6 @@ int gasket_interrupt_init(struct gasket_dev *gasket_dev) interrupt_data->interrupt_bar_index = driver_desc->interrupt_bar_index; interrupt_data->pack_width = driver_desc->interrupt_pack_width; interrupt_data->num_configured = 0; - interrupt_data->wire_interrupt_offsets = - driver_desc->wire_interrupt_offsets; interrupt_data->eventfd_ctxs = kcalloc(driver_desc->num_interrupts, sizeof(struct eventfd_ctx *), @@ -379,12 +361,7 @@ int gasket_interrupt_init(struct gasket_dev *gasket_dev) force_msix_interrupt_unmasking(gasket_dev); break; - case PCI_MSI: - case PLATFORM_WIRE: default: - dev_err(gasket_dev->dev, - "Cannot handle unsupported interrupt type %d\n", - interrupt_data->type); ret = -EINVAL; } @@ -439,12 +416,7 @@ int gasket_interrupt_reinit(struct gasket_dev *gasket_dev) force_msix_interrupt_unmasking(gasket_dev); break; - case PCI_MSI: - case PLATFORM_WIRE: default: - dev_dbg(gasket_dev->dev, - "Cannot handle unsupported interrupt type %d\n", - gasket_dev->interrupt_data->type); ret = -EINVAL; } @@ -489,12 +461,8 @@ void gasket_interrupt_cleanup(struct gasket_dev *gasket_dev) gasket_interrupt_msix_cleanup(interrupt_data); break; - case PCI_MSI: - case PLATFORM_WIRE: default: - dev_dbg(gasket_dev->dev, - "Cannot handle unsupported interrupt type %d\n", - interrupt_data->type); + break; } kfree(interrupt_data->interrupt_counts); -- 2.18.0.597.ga71716f1ad-goog
[PATCH 08/16] staging: gasket: page table: use dma_mapping_error for error detection
From: Todd Poynor gasket_perform_mapping() call dma_mapping_error() to determine if mapping failed. Signed-off-by: Todd Poynor --- drivers/staging/gasket/gasket_page_table.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/staging/gasket/gasket_page_table.c b/drivers/staging/gasket/gasket_page_table.c index bd921dc6094de..4d2499269499b 100644 --- a/drivers/staging/gasket/gasket_page_table.c +++ b/drivers/staging/gasket/gasket_page_table.c @@ -493,7 +493,8 @@ static int gasket_perform_mapping(struct gasket_page_table *pg_tbl, (void *)page_to_pfn(page), (unsigned long long)ptes[i].dma_addr); - if (ptes[i].dma_addr == -1) { + if (dma_mapping_error(pg_tbl->device, + ptes[i].dma_addr)) { dev_dbg(pg_tbl->device, "%s i %d -> fail to map page %llx " "[pfn %p ohys %p]\n", -- 2.18.0.597.ga71716f1ad-goog