Re: 2.6.20.6 vanilla doesn't boot
I showed the dmesg output from the currently running kernel. When booting kernel 2.6.20.6 I see only the lines I described above.

--
Regards, Denis

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH 1/2] Apple SMC driver - standardize and sanitize sysfs tree + minor features addition
Hi again,

Jean Delvare wrote:
>> However, I'm not really satisfied with the way sysfs files are created:
>> I use a lot of preprocessor macros to avoid repetition of code.
>> The files created with these macros in /sys/devices/platform/applesmc are
>> the following (on a Macbook Pro):
>> fan0_actual_speed
>> fan0_manual
>> fan0_maximum_speed
>> fan0_minimum_speed
>> fan0_safe_speed
>> fan0_target_speed
>> fan1_actual_speed
>> fan1_manual
>> fan1_maximum_speed
>> fan1_minimum_speed
>> fan1_safe_speed
>> fan1_target_speed
>> temperature_0
>> temperature_1
>> temperature_2
>> temperature_3
>> temperature_4
>> temperature_5
>> temperature_6
>>
>
> First of all, please read Documentation/hwmon/sysfs-interface, and
> rename the entries to match the standard names whenever possible. Also
> make sure that you use the standard units. If you use the standard
> names and units and if you register your device with the hwmon class,
> standard monitoring applications will be able to support your driver.
>

Fixed.

[snip]

>> Also, I never call any sysfs_remove_* function, as the files are
>> deleted when the module is unloaded. Is it safe to do so? Doesn't it
>> cause any memory leak?
>>
>
> This is considered a bad practice, as in theory your driver shouldn't
> create the device by itself, and the files are associated to the device,
> not the driver. All hardware monitoring drivers have been fixed now, so
> please add the file removal calls in your driver too. You might find it
> easier to use file groups rather than individual files. Again, see for
> example the f71805f driver, and in particular the f71805f_attributes
> array and f71805f_group structure, and the sysfs_create_group() and
> sysfs_remove_group() calls.
>

Fixed too.

I also added some sanity checks, and some minor features I discovered using key enumeration (see next patch).
Best regards,

Nicolas

- Standardize applesmc to use sysfs filenames recommended by
  Documentation/hwmon/sysfs-interface, and register the device with the
  hwmon class.
- Use snprintf instead of sprintf in sysfs show handlers.
- Remove the sysfs files properly in case of initialisation problem, and
  when the driver is unloaded.
- Add data buffer length sanity checks.
- Improvements of SMC keys' comments (add data type reported by the device).
- Add temperature sensors to Macbook Pro.
- Add support for reading fan physical position (e.g. "Left Side")

Signed-off-by: Nicolas Boichat <[EMAIL PROTECTED]>

---

 drivers/hwmon/applesmc.c | 280 --
 1 files changed, 192 insertions(+), 88 deletions(-)

diff --git a/drivers/hwmon/applesmc.c b/drivers/hwmon/applesmc.c
index f7b59fc..531bc9a 100644
--- a/drivers/hwmon/applesmc.c
+++ b/drivers/hwmon/applesmc.c
@@ -37,40 +37,48 @@
 #include
 #include
 #include
+#include

-/* data port used by apple SMC */
+/* data port used by Apple SMC */
 #define APPLESMC_DATA_PORT	0x300
-/* command/status port used by apple SMC */
+/* command/status port used by Apple SMC */
 #define APPLESMC_CMD_PORT	0x304

-#define APPLESMC_NR_PORTS	5 /* 0x300-0x304 */
+#define APPLESMC_NR_PORTS	32 /* 0x300-0x31f */
+
+#define APPLESMC_MAX_DATA_LENGTH 32

 #define APPLESMC_STATUS_MASK	0x0f
 #define APPLESMC_READ_CMD	0x10
 #define APPLESMC_WRITE_CMD	0x11

-#define LIGHT_SENSOR_LEFT_KEY	"ALV0" /* r-o length 6 */
-#define LIGHT_SENSOR_RIGHT_KEY	"ALV1" /* r-o length 6 */
-#define BACKLIGHT_KEY		"LKSB" /* w-o */
+#define LIGHT_SENSOR_LEFT_KEY	"ALV0" /* r-o {alv (6 bytes) */
+#define LIGHT_SENSOR_RIGHT_KEY	"ALV1" /* r-o {alv (6 bytes) */
+#define BACKLIGHT_KEY		"LKSB" /* w-o {lkb (2 bytes) */

-#define CLAMSHELL_KEY		"MSLD" /* r-o length 1 (unused) */
+#define CLAMSHELL_KEY		"MSLD" /* r-o ui8 (unused) */

-#define MOTION_SENSOR_X_KEY	"MO_X" /* r-o length 2 */
-#define MOTION_SENSOR_Y_KEY	"MO_Y" /* r-o length 2 */
-#define MOTION_SENSOR_Z_KEY	"MO_Z" /* r-o length 2 */
-#define MOTION_SENSOR_KEY	"MOCN" /* r/w length 2 */
+#define MOTION_SENSOR_X_KEY	"MO_X" /* r-o sp78 (2 bytes) */
+#define MOTION_SENSOR_Y_KEY	"MO_Y" /* r-o sp78 (2 bytes) */
+#define MOTION_SENSOR_Z_KEY	"MO_Z" /* r-o sp78 (2 bytes) */
+#define MOTION_SENSOR_KEY	"MOCN" /* r/w ui16 */

-#define FANS_COUNT		"FNum" /* r-o length 1 */
-#define FANS_MANUAL		"FS! " /* r-w length 2 */
-#define FAN_ACTUAL_SPEED	"F0Ac" /* r-o length 2 */
-#define FAN_MIN_SPEED		"F0Mn" /* r-o length 2 */
-#define FAN_MAX_SPEED		"F0Mx" /* r-o length 2 */
-#define FAN_SAFE_SPEED		"F0Sf" /* r-o length 2 */
-#define FAN_TARGET_SPEED	"F0Tg" /* r-w length 2 */
+#define FANS_COUNT		"FNum" /* r-o ui8 */
+#define FANS_MANUAL		"FS! " /* r-w ui16 */
+#define FAN_ACTUAL_SPEED	"F0Ac" /* r-o fpe2 (2 bytes) */
+#define FAN_MIN_SPEED		"F0Mn" /* r-o fpe2 (2 bytes) */
+#define FAN_MAX_SPEED
RE: sched_yield proposals/rationale
> From: Bill Davidsen
>
> And having gotten same, are you going to code up what appears to be a
> solution, based on this feedback?

The feedback was helpful in verifying whether there are any arguments against my approach. The real proof is in the pudding.

I'm running a kernel with these changes as we speak. Overall system throughput is up about 20%. By 'system throughput' I mean the measured performance of a rather large (experimental) system. The patch isn't even 24h old... Application latency has also improved.

Additional settings: my patch is *also* running with a kernel modified to have only 8 default time slices at the 250Hz setting. And no, the overall number of context switches per second hasn't blown up. The kernel was compiled with low latency and in-kernel preemption enabled, BKL preemption enabled. I haven't checked the patch stand-alone yet.

> I'm curious how well it would run poorly written programs, having
> recently worked with a company which seemed to have a whole part of
> purchasing dedicated to buying same. :-(

So first signs are positive; note that it requires much more run time and a slew of other tests/scrutiny before we can be really sure.

W.r.t. the remarks: I am most interested in possibilities of DoS attacks that could exploit this change in sched_yield. Therefore the comments of Andi were interesting, but I haven't heard back from him yet. I'm still not sure how a task could juggle more slices from the system because of these changes.

Last remark, on the O(1)'ness being violated: I think it's a moot point. The sched_yield is executed on the CPU time of the yielder. Being O(1) is most important for the scheduler proper at each timer tick (interrupt). That being O(1) is crucial.

Steven Buytaert

--
La perfection est atteinte non quand il ne reste rien à ajouter, mais quand il ne reste rien à enlever.
(Antoine de Saint-Exupéry)
2.6.20.6 vanilla doesn't boot
Hi all!

I installed a new kernel, 2.6.20.6, and it is unable to boot. During loading, I get some messages from the kernel, similar to the following:

PCI: BIOS Bug: MCFG area at x is not E820-reserved
PCI: Not using MMCONFIG
udevplug: make_queue: unable to create /dev/.udev/queue: No such file or directory
udevplug: make_queue: unable to create /dev/.udev/queue: File exists
sda: assuming drive cache: write through

Then loading stops, and after about 2 minutes it drops into BusyBox. Kernel versions 2.6.18.8 and 2.6.16.44-rc2 load properly.

The output of dmesg:

[17179569.184000] Linux version 2.6.15-26-686 ([EMAIL PROTECTED]) (gcc version 4.0.3 (Ubuntu 4.0.3-1ubuntu5)) #1 SMP PREEMPT Fri Sep 8 20:16:40 UTC 2006
[17179569.184000] BIOS-provided physical RAM map:
[17179569.184000] BIOS-e820: - 0009fc00 (usable)
[17179569.184000] BIOS-e820: 0009fc00 - 000a (reserved)
[17179569.184000] BIOS-e820: 000e4000 - 0010 (reserved)
[17179569.184000] BIOS-e820: 0010 - 7ffa (usable)
[17179569.184000] BIOS-e820: 7ffa - 7ffae000 (ACPI data)
[17179569.184000] BIOS-e820: 7ffae000 - 7ffe (ACPI NVS)
[17179569.184000] BIOS-e820: 7ffe - 8000 (reserved)
[17179569.184000] BIOS-e820: ffb0 - 0001 (reserved)
[17179569.184000] 1151MB HIGHMEM available.
[17179569.184000] 896MB LOWMEM available.
[17179569.184000] found SMP MP-table at 000ff780
[17179569.184000] On node 0 totalpages: 524192
[17179569.184000] DMA zone: 4096 pages, LIFO batch:0
[17179569.184000] DMA32 zone: 0 pages, LIFO batch:0
[17179569.184000] Normal zone: 225280 pages, LIFO batch:31
[17179569.184000] HighMem zone: 294816 pages, LIFO batch:31
[17179569.184000] DMI 2.3 present.
[17179569.184000] ACPI: RSDP (v000 ACPIAM ) @ 0x000fad00
[17179569.184000] ACPI: RSDT (v001 A M I OEMRSDT 0x05000504 MSFT 0x0097) @ 0x7ffa
[17179569.184000] ACPI: FADT (v001 A M I OEMFACP 0x05000504 MSFT 0x0097) @ 0x7ffa0200
[17179569.184000] ACPI: MADT (v001 A M I OEMAPIC 0x05000504 MSFT 0x0097) @ 0x7ffa0390
[17179569.184000] ACPI: OEMB (v001 A M I AMI_OEM 0x05000504 MSFT 0x0097) @ 0x7ffae040
[17179569.184000] >>> ERROR: Invalid checksum
[17179569.184000] ACPI: MCFG (v001 A M I OEMMCFG 0x05000504 MSFT 0x0097) @ 0x7ffa8810
[17179569.184000] ACPI: DSDT (v001 A0229 A0229000 0x INTL 0x02002026) @ 0x
[17179569.184000] ACPI: PM-Timer IO Port: 0x808
[17179569.184000] ACPI: Local APIC address 0xfee0
[17179569.184000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[17179569.184000] Processor #0 15:4 APIC version 20
[17179569.184000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
[17179569.184000] Processor #1 15:4 APIC version 20
[17179569.184000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x82] disabled)
[17179569.184000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x83] disabled)
[17179569.184000] ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
[17179569.184000] IOAPIC[0]: apic_id 2, version 32, address 0xfec0, GSI 0-23
[17179569.184000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[17179569.184000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[17179569.184000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[17179569.184000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[17179569.184000] ACPI: IRQ0 used by override.
[17179569.184000] ACPI: IRQ2 used by override.
[17179569.184000] ACPI: IRQ9 used by override.
[17179569.184000] Enabling APIC mode: Flat. Using 1 I/O APICs
[17179569.184000] Using ACPI (MADT) for SMP configuration information
[17179569.184000] Allocating PCI resources starting at 8800 (gap: 8000:7fb0)
[17179569.184000] Built 1 zonelists
[17179569.184000] Kernel command line: root=/dev/sda2 ro quiet vga=795
[17179569.184000] mapped APIC to d000 (fee0)
[17179569.184000] mapped IOAPIC to c000 (fec0)
[17179569.184000] Initializing CPU#0
[17179569.184000] PID hash table entries: 4096 (order: 12, 65536 bytes)
[17179569.184000] Detected 3011.186 MHz processor.
[17179569.184000] Using pmtmr for high-res timesource
[17179569.184000] Console: colour dummy device 80x25
[17179572.764000] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
[17179572.764000] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
[17179572.868000] Memory: 2065764k/2096768k available (2115k kernel code, 29864k reserved, 595k data, 332k init, 1179264k highmem)
[17179572.868000] Checking if this processor honours the WP bit even in supervisor mode... Ok.
[17179572.948000] Calibrating delay using timer specific routine.. 6029.29 BogoMIPS (lpj=12058592)
[17179572.948000] Security Framework v1.0.0 initialized
[17179572.948000] SELinux: Disabled at boot.
[17179572.948000] Mount-cache hash table entries: 512
[17179572.948000] CPU: After generic identify, caps: bfebfbff 2000
[PATCH]Fix parsing kernelcore boot option for ia64
Hello.

cmdline_parse_kernelcore() should return the next pointer of the boot option, like memparse() does. If it does not, it causes an endless loop on ia64 boxes.

This patch is for 2.6.21-rc6-mm1.

Signed-off-by: Yasunori Goto <[EMAIL PROTECTED]>

 arch/ia64/kernel/efi.c |    2 +-
 include/linux/mm.h     |    2 +-
 mm/page_alloc.c        |    4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

Index: current_test/arch/ia64/kernel/efi.c
===
--- current_test.orig/arch/ia64/kernel/efi.c	2007-04-12 17:33:28.0 +0900
+++ current_test/arch/ia64/kernel/efi.c	2007-04-13 12:13:21.0 +0900
@@ -424,7 +424,7 @@ efi_init (void)
 		} else if (memcmp(cp, "max_addr=", 9) == 0) {
 			max_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
 		} else if (memcmp(cp, "kernelcore=",11) == 0) {
-			cmdline_parse_kernelcore(cp+11);
+			cmdline_parse_kernelcore(cp+11, &cp);
 		} else if (memcmp(cp, "min_addr=", 9) == 0) {
 			min_addr = GRANULEROUNDDOWN(memparse(cp + 9, &cp));
 		} else {
Index: current_test/mm/page_alloc.c
===
--- current_test.orig/mm/page_alloc.c	2007-04-12 18:25:37.0 +0900
+++ current_test/mm/page_alloc.c	2007-04-13 12:12:58.0 +0900
@@ -3736,13 +3736,13 @@ void __init free_area_init_nodes(unsigne
  * kernelcore=size sets the amount of memory for use for allocations that
  * cannot be reclaimed or migrated.
  */
-int __init cmdline_parse_kernelcore(char *p)
+int __init cmdline_parse_kernelcore(char *p, char **retp)
 {
 	unsigned long long coremem;
 	if (!p)
 		return -EINVAL;

-	coremem = memparse(p, &p);
+	coremem = memparse(p, retp);
 	required_kernelcore = coremem >> PAGE_SHIFT;

 	/* Paranoid check that UL is enough for required_kernelcore */
Index: current_test/include/linux/mm.h
===
--- current_test.orig/include/linux/mm.h	2007-04-11 14:15:33.0 +0900
+++ current_test/include/linux/mm.h	2007-04-13 12:12:20.0 +0900
@@ -1051,7 +1051,7 @@ extern unsigned long find_max_pfn_with_a
 extern void free_bootmem_with_active_regions(int nid,
 						unsigned long max_low_pfn);
 extern void sparse_memory_present_with_active_regions(int nid);
-extern int cmdline_parse_kernelcore(char *p);
+extern int cmdline_parse_kernelcore(char *p, char **retp);
 #ifndef CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
 extern int early_pfn_to_nid(unsigned long pfn);
 #endif /* CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID */

--
Yasunori Goto
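To illustrate why the handler has to hand back the end pointer, here is a minimal userspace sketch, not the kernel code. memparse_sketch() and parse_cmdline_sketch() are hypothetical stand-ins for memparse() and the option loop in efi_init(); only the option names "kernelcore=" and "max_addr=" come from the patch. The loop makes progress only because every branch moves cp past the text it consumed; a handler that parsed the value but left cp alone would see the same "kernelcore=" prefix on every iteration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in for memparse(): parse a number with an optional
 * K/M/G suffix and report the first unparsed character through *retp. */
static unsigned long long memparse_sketch(const char *p, char **retp)
{
	char *end;
	unsigned long long val = strtoull(p, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (retp)
		*retp = end;
	return val;
}

/* Hypothetical option walker: returns the number of options it handled.
 * Each branch must advance cp, otherwise the while loop never ends. */
static int parse_cmdline_sketch(char *cmdline, unsigned long long *kernelcore,
				unsigned long long *max_addr)
{
	char *cp = cmdline;
	int handled = 0;

	while (*cp) {
		if (strncmp(cp, "kernelcore=", 11) == 0) {
			*kernelcore = memparse_sketch(cp + 11, &cp);
			handled++;
		} else if (strncmp(cp, "max_addr=", 9) == 0) {
			*max_addr = memparse_sketch(cp + 9, &cp);
			handled++;
		} else {
			/* Unknown option: skip to the next separator. */
			while (*cp != ' ' && *cp)
				cp++;
		}
		while (*cp == ' ')
			cp++;
	}
	return handled;
}
```

With the command line "mem=1 kernelcore=512M max_addr=4G foo" the walker handles two options and stops at the terminating NUL.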
Re: [patch 05/10] add "permit user mounts in new namespace" clone flag
"Serge E. Hallyn" <[EMAIL PROTECTED]> writes:

> Quoting Miklos Szeredi ([EMAIL PROTECTED]):
>> From: Miklos Szeredi <[EMAIL PROTECTED]>
>>
>> If CLONE_NEWNS and CLONE_NEWNS_USERMNT are given to clone(2) or
>> unshare(2), then allow user mounts within the new namespace.
>>
>> This is not flexible enough, because user mounts can't be enabled for
>> the initial namespace.
>>
>> The remaining clone bits also getting dangerously few...
>>
>> Alternatives are:
>>
>>   - prctl() flag
>>   - setting through the containers filesystem
>
> Sorry, I know I had mentioned it, but this is definitely my least
> favorite approach.
>
> Curious whether there are any other suggestions/opinions from the containers
> list?

Given the existence of shared subtrees, allowing/denying this at the mount namespace level is silly and wrong.

If we need more than just the filesystem permission checks, can we make it a mount flag, settable with mount and remount, that gives non-privileged users the ability to create mount points under it in directories they have full read/write access to?

I don't like the use of clone flags for this purpose, but in this case shared subtrees are a much more fundamental reason for not doing this at the namespace level.

Eric
[PATCH Trivial Resend] [DOC] Add webpages' URL and summarize 3 lines.
Trivial patch, against -rc6. Please apply, thanks.
---

CREDITS:
 - Summarize 3 lines into one.
 - Add webpage.

MAINTAINERS:
 - Add auxdisplay drivers/tree webpages.

 CREDITS     |    7 +++----
 MAINTAINERS |    4 ++++
 2 files changed, 7 insertions(+), 4 deletions(-)

Signed-off-by: Miguel Ojeda Sandonis <[EMAIL PROTECTED]>
---
diff --git a/CREDITS b/CREDITS
index 6bd8ab8..f990730 100644
--- a/CREDITS
+++ b/CREDITS
@@ -2573,10 +2573,9 @@ S: Australia

 N: Miguel Ojeda Sandonis
 E: [EMAIL PROTECTED]
-D: Author: Auxiliary LCD Controller driver (ks0108)
-D: Author: Auxiliary LCD driver (cfag12864b)
-D: Author: Auxiliary LCD framebuffer driver (cfag12864bfb)
-D: Maintainer: Auxiliary display drivers tree (drivers/auxdisplay/*)
+W: http://maxextreme.googlepages.com/
+D: Author of the ks0108, cfag12864b and cfag12864bfb auxiliary display drivers.
+D: Maintainer of the auxiliary display drivers tree (drivers/auxdisplay/*)
 S: C/ Mieses 20, 9-B
 S: Valladolid 47009
 S: Spain
diff --git a/MAINTAINERS b/MAINTAINERS
index 829407f..2a658ef 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -672,6 +672,7 @@ AUXILIARY DISPLAY DRIVERS
 P:	Miguel Ojeda Sandonis
 M:	[EMAIL PROTECTED]
 L:	[EMAIL PROTECTED]
+W:	http://auxdisplay.googlepages.com/
 S:	Maintained

 AVR32 ARCHITECTURE
@@ -884,12 +885,14 @@ CFAG12864B LCD DRIVER
 P:	Miguel Ojeda Sandonis
 M:	[EMAIL PROTECTED]
 L:	[EMAIL PROTECTED]
+W:	http://auxdisplay.googlepages.com/
 S:	Maintained

 CFAG12864BFB LCD FRAMEBUFFER DRIVER
 P:	Miguel Ojeda Sandonis
 M:	[EMAIL PROTECTED]
 L:	[EMAIL PROTECTED]
+W:	http://auxdisplay.googlepages.com/
 S:	Maintained

 COMMON INTERNET FILE SYSTEM (CIFS)
@@ -2020,6 +2023,7 @@ KS0108 LCD CONTROLLER DRIVER
 P:	Miguel Ojeda Sandonis
 M:	[EMAIL PROTECTED]
 L:	[EMAIL PROTECTED]
+W:	http://auxdisplay.googlepages.com/
 S:	Maintained

 LAPB module

--
Miguel Ojeda
http://maxextreme.googlepages.com/index.htm
Re: [KJ][PATCH 02/03]ROUND_UP|DOWN macro cleanup in arch/ia64,x86_64
On 14:13 Thu 12 Apr , Luck, Tony wrote:
> On Fri, Apr 13, 2007 at 02:01:40AM +0530, Milind Arun Choudhary wrote:
> > -	size = ROUNDUP(size, iovp_size);
> > +	size = ALIGN(size, iovp_size);
>
> Why is "ALIGN" better than "ROUNDUP"? I can't see any point
> to this change.

It's janitorial work. I'm trying to clean up all the corners where ROUNDUP/DOWN and the like are defined. kernel.h currently has macros like ALIGN, roundup and DIV_ROUND_UP. In this patch series I've added ALIGN_DOWN and round_down [waiting for comments on the same].

So, as the ALIGN macro does the same work as ROUNDUP, sits in a common place and is accessible to everyone, it should be used instead... I think.

--
Milind Arun Choudhary
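For readers comparing the two families of macros, a small userspace sketch follows. The *_SKETCH names are made up for this note; the mask form mirrors the usual power-of-two ALIGN idiom and the division form mirrors a generic roundup. The round-down variant is an assumption, since the ALIGN_DOWN definition from the patch series is not shown in this thread. The two forms agree only when the alignment is a power of two, which is worth checking before swapping one macro for the other.

```c
#include <assert.h>

/* Mask-based alignment: add a-1, then clear the low bits.
 * Only correct when a is a power of two. */
#define ALIGN_SKETCH(x, a)      (((x) + ((a) - 1)) & ~((a) - 1))

/* Assumed round-down counterpart (hypothetical definition). */
#define ALIGN_DOWN_SKETCH(x, a) ((x) & ~((a) - 1))

/* Division-based rounding: correct for any y > 0, but costs a divide. */
#define ROUNDUP_SKETCH(x, y)    ((((x) + ((y) - 1)) / (y)) * (y))

static unsigned long align_up_sketch(unsigned long x, unsigned long a)
{
	return ALIGN_SKETCH(x, a);
}

static unsigned long align_down_sketch(unsigned long x, unsigned long a)
{
	return ALIGN_DOWN_SKETCH(x, a);
}

static unsigned long roundup_any_sketch(unsigned long x, unsigned long y)
{
	return ROUNDUP_SKETCH(x, y);
}
```

For a 4096-byte granule both forms round 4097 up to 8192. For a granule of 3 the division form gives the next multiple, while the mask form does not, so the substitution is only safe where the alignment is known to be a power of two.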
Re: [Feature Request?] Inline compression of process core dumps
On Thu, Apr 12, 2007 at 10:57:37PM -0400, Christopher S. Aker wrote:
> The process is a UML instance (skas mode, so at least a kernel,
> userspace, and io thread), which will generate a single, usable, core
> file just fine with a non-pipe core_pattern...

Yeah, but can you get a core file without the .pid on the end? I just tried, with core_pattern == core and core_uses_pid == 0, and I still got core.pid.

I can fix this on my end - just have to kill off a bunch of things before aborting.

Jeff

--
Work email - jdike at linux dot intel dot com
Re: [patch 05/10] add "permit user mounts in new namespace" clone flag
On Thu, Apr 12, 2007 at 03:32:08PM -0500, Serge E. Hallyn wrote:
> Quoting Miklos Szeredi ([EMAIL PROTECTED]):
> > From: Miklos Szeredi <[EMAIL PROTECTED]>
> >
> > If CLONE_NEWNS and CLONE_NEWNS_USERMNT are given to clone(2) or
> > unshare(2), then allow user mounts within the new namespace.
> >
> > This is not flexible enough, because user mounts can't be enabled for
> > the initial namespace.
> >
> > The remaining clone bits also getting dangerously few...

ATM I think we do not have that many CLONE flags available, so this feature will have to wait for a clone2/64 or similar ...

> > Alternatives are:
> >
> >   - prctl() flag
> >   - setting through the containers filesystem
>
> Sorry, I know I had mentioned it, but this is definitely my least
> favorite approach.
>
> Curious whether there are any other suggestions/opinions from the containers
> list?

question: how is mounting filesystems (loopback, fuse, etc) secured in such a way that the user cannot 'create' device nodes with 'unfortunate' permissions?

TIA,
Herbert

> thanks,
> -serge
>
> > Signed-off-by: Miklos Szeredi <[EMAIL PROTECTED]>
> > ---
> >
> > Index: linux/fs/namespace.c
> > ===
> > --- linux.orig/fs/namespace.c	2007-04-12 13:46:19.0 +0200
> > +++ linux/fs/namespace.c	2007-04-12 13:54:36.0 +0200
> > @@ -1617,6 +1617,8 @@ struct mnt_namespace *copy_mnt_ns(int fl
> >  		return ns;
> >
> >  	new_ns = dup_mnt_ns(ns, new_fs);
> > +	if (new_ns && (flags & CLONE_NEWNS_USERMNT))
> > +		new_ns->flags |= MNT_NS_PERMIT_USERMOUNTS;
> >
> >  	put_mnt_ns(ns);
> >  	return new_ns;
> > Index: linux/include/linux/sched.h
> > ===
> > --- linux.orig/include/linux/sched.h	2007-04-12 13:26:48.0
> > +0200
> > +++ linux/include/linux/sched.h	2007-04-12 13:54:36.0 +0200
> > @@ -26,6 +26,7 @@
> >  #define CLONE_STOPPED		0x0200	/* Start in stopped
> > state */
> >  #define CLONE_NEWUTS		0x0400	/* New utsname group? */
> >  #define CLONE_NEWIPC		0x0800	/* New ipcs */
> > +#define CLONE_NEWNS_USERMNT	0x1000	/* Allow user mounts in
> > ns? */
> >
> >  /*
> >   * Scheduling policies
> > Index: linux/kernel/fork.c
> > ===
> > --- linux.orig/kernel/fork.c	2007-04-11 18:27:46.0 +0200
> > +++ linux/kernel/fork.c	2007-04-12 13:59:10.0 +0200
> > @@ -1586,7 +1586,7 @@ asmlinkage long sys_unshare(unsigned lon
> >  	err = -EINVAL;
> >  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
> >  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
> > -				CLONE_NEWUTS|CLONE_NEWIPC))
> > +				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNS_USERMNT))
> >  		goto bad_unshare_out;
> >
> >  	if ((err = unshare_thread(unshare_flags)))
> >
> > --
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [EMAIL PROTECTED]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> ___
> Containers mailing list
> [EMAIL PROTECTED]
> https://lists.linux-foundation.org/mailman/listinfo/containers
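The two mechanisms in the quoted patch (rejecting unknown bits in the flags word, then copying one clone bit into a per-namespace flag) can be sketched in userspace as below. Everything here is an assumption for illustration: the SK_* bit positions are invented and do not match include/linux/sched.h, and struct mnt_ns_sketch is a stand-in for the real mount namespace structure. The sketch also shows why spare bits matter: every new behaviour selected this way consumes one bit of a fixed-width flags word.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical bit assignments, NOT the kernel's values. */
#define SK_CLONE_NEWNS          (1UL << 0)
#define SK_CLONE_NEWUTS         (1UL << 1)
#define SK_CLONE_NEWIPC         (1UL << 2)
#define SK_CLONE_NEWNS_USERMNT  (1UL << 3)

/* Hypothetical per-namespace flag. */
#define SK_MNT_NS_PERMIT_USERMOUNTS (1U << 0)

#define SK_UNSHARE_ALLOWED (SK_CLONE_NEWNS | SK_CLONE_NEWUTS | \
			    SK_CLONE_NEWIPC | SK_CLONE_NEWNS_USERMNT)

struct mnt_ns_sketch {
	unsigned int flags;
};

/* Reject any bit outside the allowed mask, as the unshare check does. */
static int check_unshare_flags_sketch(unsigned long flags)
{
	if (flags & ~SK_UNSHARE_ALLOWED)
		return -EINVAL;
	return 0;
}

/* Translate the clone bit into the namespace flag after duplication. */
static void copy_ns_flags_sketch(struct mnt_ns_sketch *new_ns,
				 unsigned long clone_flags)
{
	if (new_ns && (clone_flags & SK_CLONE_NEWNS_USERMNT))
		new_ns->flags |= SK_MNT_NS_PERMIT_USERMOUNTS;
}
```

A caller passing a bit outside the mask gets -EINVAL, while a caller passing both namespace bits ends up with user mounts permitted in the new namespace only.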
Re: xmon debugger doc?
On Thu, 2007-04-12 at 22:48 -0500, Olof Johansson wrote:
> On Thu, Apr 12, 2007 at 03:44:07PM -0500, Steve Wise wrote:
> > Can someone please point me at ppc64 xmon debugger usage /
> > documentation? I've had little luck finding info on-line.
>
> The help output from it is pretty much all there is.
>
> You might have better luck asking on [EMAIL PROTECTED] though
> (adding as Cc).
>
> There's also an old writeup at
> http://mbligh.org/linuxdocs/Kernel/DebuggingPPC64 for the very basics
> of digging through a crash. Some of it is likely out of date by now.

A good trick which the help output doesn't mention is that % and $ are special in input, so you can do:

0:mon> di %pc

to disassemble instructions at the address pointed to by register PC. Other regs are e.g. %lr, %r1, %r12. And it works with di, d and other commands.

Also:

0:mon> di $.xmon_register_spus

disassembles instructions at the address of the symbol .xmon_register_spus.

cheers

--
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote:

On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though.

Can't you just traverse arbitrary kernel data structures at a given point in time, exactly like the /proc/ call is doing?

Do a full pagetable walk, with all the associated locking from within a systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C, perhaps.

Then you just end up with the same thing, don't you?

And my problem isn't with the hardcoded pagetable walker. Yeah, we'd probably still keep the pagetable callback walker thingy with Matt's associated cleanups (and my subsequent ones to clean it up more and move it to mm/): there are other in-kernel users for that anyway.

The point is the proc API, and exposing random little parts of deep kernel internals that some people happen to find useful at the time (which is why we have an incredible proliferation of these things).

With systemtap scripts, you could walk pagetables and print *the exact page information you want*, or you could walk pfns, or LRU, or page_tree, or walk the page tree then the rmap structures. And you can selectively cull out items you don't care about if you only care about a subset of items, based on arbitrary criteria. And you can most likely do all that more efficiently than with a conglomeration of various /proc files (assuming they even provide what you want in the first place).

--
SUSE Labs, Novell Inc.
Re: xmon debugger doc?
On Thu, Apr 12, 2007 at 03:44:07PM -0500, Steve Wise wrote:
> Can someone please point me at ppc64 xmon debugger usage /
> documentation? I've had little luck finding info on-line.

The help output from it is pretty much all there is.

You might have better luck asking on [EMAIL PROTECTED] though (adding as Cc).

There's also an old writeup at http://mbligh.org/linuxdocs/Kernel/DebuggingPPC64 for the very basics of digging through a crash. Some of it is likely out of date by now.

-Olof
Re: HPA patches
On Fri, Mar 23, 2007 at 01:03:15PM -0700, Randy Dunlap wrote:
> > It's 0x40. Its a "command dependant bit" - no useful name.
>
> dependent. OK, thanks.
>

Hi,

Pondering about this, it's ATA_LBA according to the docs, specifying that the address is an LBA.

Cheers,
Kyle
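As a rough illustration of what that bit does, the sketch below builds a device register value for a 28-bit LBA command. It assumes the usual ATA device register layout (bit 6 selects LBA addressing, bit 4 selects device 1, the low nibble carries LBA bits 27:24). The SK_* names and the helper are made up for this note and are not taken from the HPA patches.

```c
#include <assert.h>
#include <stdint.h>

#define SK_ATA_LBA  (1u << 6)	/* 0x40: the address is an LBA */
#define SK_ATA_DEV1 (1u << 4)	/* 0x10: select device 1 */

/* Compose the device register for an LBA28 command (hypothetical helper). */
static uint8_t lba28_device_reg_sketch(uint32_t lba, int dev1)
{
	uint8_t reg = (uint8_t)(SK_ATA_LBA | ((lba >> 24) & 0x0f));

	if (dev1)
		reg |= SK_ATA_DEV1;
	return reg;
}
```

Using a named constant instead of a bare 0x40 makes it clear at the call site that the command is being issued in LBA mode.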
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Nick Piggin wrote:

Andrew Morton wrote:

Then you just end up with the same thing, don't you?

Well _you_ do, because that happens to be exactly what you want. Bill ends up with something that displays page_mapcount instead. And I end up with something that traverses LRU lists rather than pfns.

And none of it goes in /proc/ or linux-2.6/.

Oh, and you get to change it without recompiling and rebooting your kernel.

--
SUSE Labs, Novell Inc.
[PATCH] USB: BandRich BandLuxe HSDPA Data Card Driver
This patch adds detection of the BandRich BandLuxe C100/C100S/C120 HSDPA Data Card. With the vendor and product IDs set properly, the data card is detected and works fine. The patch is based on kernel 2.6.20.1.

Signed-off-by: Leon Leong <[EMAIL PROTECTED]>
---

Index: drivers/usb/serial/option.c
===
--- linux-2.6.20.1/drivers/usb/serial/option.c.orig	2007-02-05 02:44:54.0 +0800
+++ linux-2.6.20.1/drivers/usb/serial/option.c	2007-04-13 10:36:33.0 +0800
@@ -72,6 +72,7 @@ static int option_send_setup(struct usb
 #define AUDIOVOX_VENDOR_ID 0x0F3D
 #define NOVATELWIRELESS_VENDOR_ID 0x1410
 #define ANYDATA_VENDOR_ID 0x16d5
+#define BANDRICH_VENDOR_ID 0x1A8D

 #define OPTION_PRODUCT_OLD 0x5000
 #define OPTION_PRODUCT_FUSION 0x6000
@@ -84,6 +85,8 @@ static int option_send_setup(struct usb
 #define AUDIOVOX_PRODUCT_AIRCARD 0x0112
 #define NOVATELWIRELESS_PRODUCT_U740 0x1400
 #define ANYDATA_PRODUCT_ID 0x6501
+#define BANDRICH_PRODUCT_C100_1 0x1002
+#define BANDRICH_PRODUCT_C100_2 0x1003

 static struct usb_device_id option_ids[] = {
 	{ USB_DEVICE(OPTION_VENDOR_ID, OPTION_PRODUCT_OLD) },
@@ -97,6 +100,8 @@ static struct usb_device_id option_ids[]
 	{ USB_DEVICE(AUDIOVOX_VENDOR_ID, AUDIOVOX_PRODUCT_AIRCARD) },
 	{ USB_DEVICE(NOVATELWIRELESS_VENDOR_ID,NOVATELWIRELESS_PRODUCT_U740) },
 	{ USB_DEVICE(ANYDATA_VENDOR_ID, ANYDATA_PRODUCT_ID) },
+	{ USB_DEVICE(BANDRICH_VENDOR_ID, BANDRICH_PRODUCT_C100_1) },
+	{ USB_DEVICE(BANDRICH_VENDOR_ID, BANDRICH_PRODUCT_C100_2) },
 	{ } /* Terminating entry */
 };

@@ -112,6 +117,8 @@ static struct usb_device_id option_ids1[
 	{ USB_DEVICE(AUDIOVOX_VENDOR_ID, AUDIOVOX_PRODUCT_AIRCARD) },
 	{ USB_DEVICE(NOVATELWIRELESS_VENDOR_ID,NOVATELWIRELESS_PRODUCT_U740) },
 	{ USB_DEVICE(ANYDATA_VENDOR_ID, ANYDATA_PRODUCT_ID) },
+	{ USB_DEVICE(BANDRICH_VENDOR_ID, BANDRICH_PRODUCT_C100_1) },
+	{ USB_DEVICE(BANDRICH_VENDOR_ID, BANDRICH_PRODUCT_C100_2) },
 	{ } /* Terminating entry */
 };

===
-
Leon Leong
[EMAIL PROTECTED]
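The reason two new table rows are enough to get the card detected is that the driver's ID table is a sentinel-terminated array that is scanned for a vendor/product match. A simplified userspace sketch of that idea is below. struct id_sketch, ID_SKETCH() and match_id_sketch() are hypothetical and much smaller than the real struct usb_device_id and the USB core matching code; only the numeric IDs are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a device ID table entry. */
struct id_sketch {
	uint16_t vendor;
	uint16_t product;
};

#define ID_SKETCH(v, p) { (v), (p) }

#define ANYDATA_VENDOR_ID        0x16d5
#define ANYDATA_PRODUCT_ID       0x6501
#define BANDRICH_VENDOR_ID       0x1A8D
#define BANDRICH_PRODUCT_C100_1  0x1002
#define BANDRICH_PRODUCT_C100_2  0x1003

static const struct id_sketch option_ids_sketch[] = {
	ID_SKETCH(ANYDATA_VENDOR_ID, ANYDATA_PRODUCT_ID),
	ID_SKETCH(BANDRICH_VENDOR_ID, BANDRICH_PRODUCT_C100_1),
	ID_SKETCH(BANDRICH_VENDOR_ID, BANDRICH_PRODUCT_C100_2),
	{ 0, 0 }	/* Terminating entry */
};

/* Walk the table until the all-zero terminator; NULL means no match. */
static const struct id_sketch *match_id_sketch(const struct id_sketch *table,
					       uint16_t vendor, uint16_t product)
{
	for (; table->vendor || table->product; table++)
		if (table->vendor == vendor && table->product == product)
			return table;
	return NULL;
}
```

A device reporting 0x1A8D:0x1003 matches the third entry, while an unlisted product ID from the same vendor falls through to the terminator and is not claimed.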
Re: [Feature Request?] Inline compression of process core dumps
Randy Dunlap wrote:
> On Thu, 12 Apr 2007 22:22:18 -0400 Christopher S. Aker wrote:
>> Alan Cox wrote:
>>> Indeed. So useful that in current kernels you can set the core dump
>>> path to be
>>>
>>> "|application"
>>
>> Cool stuff! However, it's not working (2.6.20.6):
>>
>> Core dump to |/home/caker/bin/dumper.pl.4442 pipe failed
>>
>> even though...
>>
>> # cat /proc/sys/kernel/core_uses_pid
>> 0
>> # cat /proc/sys/kernel/core_pattern
>> |/home/caker/bin/dumper.pl
>>
>> Looking at the code, it seems to me that format_corename() is appending
>> .pid, regardless if !core_uses_pid and corename[0]=='|', in which case
>> it creates an invalid path for call_usermodehelper_pipe().
>>
>> Bug in the code, or bug in my methods?
>
> What are you trying to dump? Is it a multi-thread group app, not
> a "simple" app? I ask because of this (I'm looking at 2.6.21-rc6)
> reference (not that I know what that is):
>
>	if (!pid_in_pattern
>	    && (core_uses_pid || atomic_read(&current->mm->mm_users) != 1)) {
>		rc = snprintf(out_ptr, out_end - out_ptr,
>			      ".%d", current->tgid);
>		if (rc > out_end - out_ptr)
>			goto out;
>		out_ptr += rc;
>	}

I saw that too, and unfortunately I don't know what that condition represents, either. It's the only other element in that if statement that could make it take that path, so I'm assuming that's part of the problem.

The process is a UML instance (skas mode, so at least a kernel, userspace, and io thread), which will generate a single, usable, core file just fine with a non-pipe core_pattern...

-Chris
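The effect of that condition on a pipe pattern can be reproduced with a small userspace sketch. corename_sketch() is a hypothetical, heavily simplified stand-in for format_corename(): it copies the pattern verbatim (no % expansion) and then applies the same suffix test as the quoted snippet. The mm_users value of 3 in the test below is an assumption standing in for a process whose address space is shared by several tasks.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the pattern, then append ".<tgid>" when the pattern has no pid of
 * its own and either core_uses_pid is set or the mm has several users.
 * Returns 0 on success, -1 if the name does not fit. */
static int corename_sketch(char *out, size_t len, const char *pattern,
			   int pid_in_pattern, int core_uses_pid,
			   int mm_users, int tgid)
{
	int rc = snprintf(out, len, "%s", pattern);

	if (rc < 0 || (size_t)rc >= len)
		return -1;

	if (!pid_in_pattern && (core_uses_pid || mm_users != 1)) {
		int n = snprintf(out + rc, len - (size_t)rc, ".%d", tgid);

		if (n < 0 || (size_t)n >= len - (size_t)rc)
			return -1;
	}
	return 0;
}
```

With core_uses_pid set to 0 but more than one mm user, the sketch still appends the pid, producing the same "|/home/caker/bin/dumper.pl.4442" name as in the failure message, which is not a usable helper path. A single-user mm with the pattern "core" is left untouched.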
Re: [PATCH] Stop pmac_zilog from abusing 8250's device numbers.
On Thu, 12 Apr 2007, Gerhard Mack wrote:

> Sometimes it's not the speed it's the cost.. The best I've ever done is
> 5.5 interfaces per u. Although with a better motherboard and case it
> might have been different.

I have a bunch of servers from rackable:

dual core cpu, 2G ram
2xgigE on the motherboard (1x100M on motherboard),
4x Intel E1000 quad port cards,
120G SATA drive, DVD burner, floppy
3u, 18 gig ports, just under $5k

if you have 36" deep racks you can put them back to back and have two of
these in 3u (12 gig ports per u)

not necessarily the cheapest available, but they've been reliable, and
there's plenty of CPU and ram to handle firewall tasks.

besides, sometimes you don't want to trust the closed-source vlan
implementations on the switches ;-)

David Lang

> http://innerfire.net/pics/projects/21portfirewall_2.jpg
> (assigns each port its IP range and blocks any address not assigned to
> that port)
>
> On Thu, 12 Apr 2007, Roland Dreier wrote:
>
>> Date: Thu, 12 Apr 2007 08:34:40 -0700
>> From: Roland Dreier <[EMAIL PROTECTED]>
>> To: Benny Amorsen <[EMAIL PROTECTED]>
>> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
>> Subject: Re: [PATCH] Stop pmac_zilog from abusing 8250's device numbers.
>>
>> > Indeed, port density is disappointingly poor in modern servers. Do you
>> > know any with more than 14 ports per U? (That's an MBX 1U server with
>> > 8 on-board and a 6-port expansion).
>>
>> If you really need a ton of ports you could probably build a 1U server
>> with 2 * 2-port 10gig NICs, and use VLAN-capable switches with 10gig
>> and 1gig ports to fan out each 10gig link from your server to 10 1-gig
>> ports. That would get you 40 ports of 1-gig from each server (plus
>> whatever the server has on board).
>>
>> - R.
>
> --
> Gerhard Mack
> [EMAIL PROTECTED]
> <>< As a computer I find your faith in technology amusing.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
>> Matt Mackall wrote:
>>> On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
>>>> If kprobes is simply crappy and doesn't work properly for this, then I
>>>> could accept that. I'm not someone trying to get this info. So why can't
>>>> it be used? (not just for kpagemap, but for clear_refs and all that gunk
>>>> too).
>>>
>>> kprobes is good for looking at events, but bad for looking at state.
>>> Especially metric shitloads of state.
>>
>> Why? Why is a kprobes trap significantly more expensive than a read
>> syscall?
>
> I guess I'm not clear on what you're proposing. From my understanding
> of kprobes (admittedly not an expert), this is hard to do and not a
> very good match.

But you have an idea that it is bad for exposing lots of data. Why? (I'm
not a kprobes expert either, these are not rhetorical questions)

From what it looks like, you can traverse data structures and copy data
back to userspace. Which is what makes me think it might be suitable (or
could be made suitable).

>>>> Maybe. How about LRU? Reclaim performance is bad, and you want to work out
>>>> which pages keep going off the end of it, or which pages keep getting
>>>> written out via it, or whose pages are on the active list, forcing mine
>>>> out.
>>>
>>> Those are actually probably a good match for systemtap as they're all
>>> events.
>>
>> Traverse the LRU? Which files do they belong to? What process maps them?
>
> -ENOPARSE.

Basically, any "stuff" other than what you're exposing.

--
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote:
> On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>>> I guess one could generate an answer to the static question with systemtap,
>>> by accumulating running counts across the application lifetime and then
>>> snapshotting them. Sounds hard though.
>>
>> Can't you just traverse arbitrary kernel data structures at a given point
>> in time, exactly like the /proc/ call is doing?
>
> Do a full pagetable walk, with all the associated locking from within a
> systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C,
> perhaps.

It looks like you can traverse arbitrary data structures, yes. It
definitely seems like you can use some kernel functions, but the ones I
saw may just be systemtap facilities. But what is so surprising about
being able to call a kernel function when running in kernel context?
Perhaps there is some fundamental limitation of kprobes that I don't
understand.

> Then you just end up with the same thing, don't you?

Well _you_ do, because that happens to be exactly what you want. Bill
ends up with something that displays page_mapcount instead. And I end up
with something that traverses LRU lists rather than pfns. And none of it
goes in /proc/ or linux-2.6/. So it isn't really the same thing at all.

--
SUSE Labs, Novell Inc.
Re: [PATCH] Stop pmac_zilog from abusing 8250's device numbers.
Sometimes it's not the speed it's the cost.. The best I've ever done is
5.5 interfaces per u. Although with a better motherboard and case it
might have been different.

http://innerfire.net/pics/projects/21portfirewall_2.jpg
(assigns each port its IP range and blocks any address not assigned to
that port)

On Thu, 12 Apr 2007, Roland Dreier wrote:

> Date: Thu, 12 Apr 2007 08:34:40 -0700
> From: Roland Dreier <[EMAIL PROTECTED]>
> To: Benny Amorsen <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Subject: Re: [PATCH] Stop pmac_zilog from abusing 8250's device numbers.
>
> > Indeed, port density is disappointingly poor in modern servers. Do you
> > know any with more than 14 ports per U? (That's an MBX 1U server with
> > 8 on-board and a 6-port expansion).
>
> If you really need a ton of ports you could probably build a 1U server
> with 2 * 2-port 10gig NICs, and use VLAN-capable switches with 10gig
> and 1gig ports to fan out each 10gig link from your server to 10 1-gig
> ports. That would get you 40 ports of 1-gig from each server (plus
> whatever the server has on board).
>
> - R.

--
Gerhard Mack
[EMAIL PROTECTED]
<>< As a computer I find your faith in technology amusing.
Re: Cheap lock for user mode processes release when process exits
On Fri, 13 Apr 2007 01:54:28 + (GMT) <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> Maybe someone here knows better.
>
> I have several user-mode processes using shared mmap. There can be several
> reader processes and only one writer. Readers access the shared region
> frequently, writer seldom.
>
> Naturally, multi-reader/single-writer locks work best. I tried this with
> futex on 2.6.9-42.EL. However, if one of the processes is killed/exits, the
> lock doesn't get released.
>
> I can trap the signal to release the lock, but not all signals like kill.
>
> Any way I can achieve this without a potential deadlock?
>

Robust futexes: http://lwn.net/Articles/172149/

But I don't know whether RH backported them into RHEL4.
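From userspace, robust futexes are normally reached through robust pthread mutexes rather than raw futex calls. The following is a minimal sketch under several assumptions: a kernel and C library that implement the POSIX robust-mutex calls (whether the poster's 2.6.9-based kernel does is exactly the open question above), a mutex object placed inside the shared mmap region, and a plain mutex instead of the reader/writer lock the poster wants. The helper names `shared_lock_init()` and `shared_lock_acquire()` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/*
 * Initialise a mutex (normally located in the shared mapping) so that it
 * works across processes and reports the death of its owner.
 * Returns 0 on success, otherwise an errno-style error number.
 */
int shared_lock_init(pthread_mutex_t *m)
{
    assert(m != NULL);

    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}

/*
 * Take the lock. If the previous owner died while holding it, the lock is
 * still granted: *recovered is set to 1 and the mutex is marked consistent
 * again. The caller must then repair the shared data before unlocking.
 * Returns 0 when the lock is held.
 */
int shared_lock_acquire(pthread_mutex_t *m, int *recovered)
{
    assert(m != NULL && recovered != NULL);

    int rc = pthread_mutex_lock(m);
    *recovered = 0;
    if (rc == EOWNERDEAD) {
        *recovered = 1;
        rc = pthread_mutex_consistent(m);
    }
    return rc;
}
```

The point of the design is that a process killed with SIGKILL cannot run a signal handler, so the kernel has to hand the lock to the next waiter; EOWNERDEAD is how that waiter learns the protected data may be half-updated.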
Re: [Kernel-discuss] Re: [PATCH 3/7] [RFC] Battery monitoring class
On Thu, Apr 12, 2007 at 10:34:06PM -0400, Shem Multinymous wrote:
> Hi,
>
> On 4/12/07, Henrique de Moraes Holschuh <[EMAIL PROTECTED]> wrote:
> >On Fri, 13 Apr 2007, Anton Vorontsov wrote:
> >> * Yup, I've read last discussion regarding batteries, and I've seen
> >> objections against "charge" term, quoting Shem Multinymous:
> >>
> >> "And, for the reasons I explained earlier, I strongly suggest not using
> >> the term "charge" except when referring to the action of charging.
> >> Hence:
> >> s/charge_rate/rate/; s/charge/capacity/"
> >>
> >> But let's think about it once again? We'll make things much cleaner
> >> if we'll drop "capacity" at all.
> >
> >I stand with Shem on this one. The people behind the SBS specification
> >seem to agree... that specification is aimed at *engineers* and still
> >avoids the obvious trap of using "charge" due to its high potential for
> >confusion.
> >
> >I don't even want to know how much of a mess the people writing applets
> >would make of it...
>
> With fixed-units files, having *_energy and *_capacity isn't too clear
> either... Nor is it consistent with SBS, since SBS uses "capacity" to
> refer to either energy or charge, depending on a units attribute.
>
> As a compromise, how about using "energy" and "charge" for quantities,
> and "charging" (i.e., a verb) when referring to the operation?

It would be a great compromise! Please please please!

--
Anton Vorontsov
email: [EMAIL PROTECTED]
backup email: [EMAIL PROTECTED]
irc://irc.freenode.org/bd2
Re: [Feature Request?] Inline compression of process core dumps
On Thu, 12 Apr 2007 22:22:18 -0400 Christopher S. Aker wrote:

> Alan Cox wrote:
> > Indeed. So useful that in current kernels you can set the core dump
> > path to be
> >
> >"|application"
>
> Cool stuff! However, it's not working (2.6.20.6):
>
> Core dump to |/home/caker/bin/dumper.pl.4442 pipe failed
>
> even though...
>
> # cat /proc/sys/kernel/core_uses_pid
> 0
> # cat /proc/sys/kernel/core_pattern
> |/home/caker/bin/dumper.pl
>
> Looking at the code, it seems to me that format_corename() is appending
> .pid, regardless if !core_uses_pid and corename[0]=='|', in which case
> it creates an invalid path for call_usermodehelper_pipe().
>
> Bug in the code, or bug in my methods?

What are you trying to dump? is it a multi-thread group app, not a
"simple" app? I ask because of this (I'm looking at 2.6.21-rc6)
reference (not that I know what that is):

    if (!pid_in_pattern
        && (core_uses_pid || atomic_read(&current->mm->mm_users) != 1)) {
            rc = snprintf(out_ptr, out_end - out_ptr,
                          ".%d", current->tgid);
            if (rc > out_end - out_ptr)
                    goto out;
            out_ptr += rc;
    }

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 12:21:25PM +1000, Nick Piggin wrote:
> Matt Mackall wrote:
> >On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
>
> >>If kprobes is simply crappy and doesn't work properly for this, then I
> >>could accept that. I'm not someone trying to get this info. So why can't
> >>it be used? (not just for kpagemap, but for clear_refs and all that gunk
> >>too).
> >
> >
> >kprobes is good for looking at events, but bad for looking at state.
> >Especially metric shitloads of state.
>
> Why? Why is a kprobes trap significantly more expensive than a read
> syscall?

I guess I'm not clear on what you're proposing. From my understanding
of kprobes (admittedly not an expert), this is hard to do and not a
very good match.

> >>Maybe. How about LRU? Reclaim performance is bad, and you want to work out
> >>which pages keep going off the end of it, or which pages keep getting
> >>written out via it, or whose pages are on the active list, forcing mine
> >>out.
> >
> >
> >Those are actually probably a good match for systemtap as they're all
> >events.
>
> Traverse the LRU? Which files do they belong to? What process maps them?

-ENOPARSE.

--
Mathematics is the supreme nostalgia of our time.
Re: [Kernel-discuss] Re: [PATCH 3/7] [RFC] Battery monitoring class
Hi,

On 4/12/07, Henrique de Moraes Holschuh <[EMAIL PROTECTED]> wrote:
> On Fri, 13 Apr 2007, Anton Vorontsov wrote:
> > * Yup, I've read last discussion regarding batteries, and I've seen
> > objections against "charge" term, quoting Shem Multinymous:
> >
> > "And, for the reasons I explained earlier, I strongly suggest not using
> > the term "charge" except when referring to the action of charging.
> > Hence:
> > s/charge_rate/rate/; s/charge/capacity/"
> >
> > But let's think about it once again? We'll make things much cleaner
> > if we'll drop "capacity" at all.
>
> I stand with Shem on this one. The people behind the SBS specification
> seem to agree... that specification is aimed at *engineers* and still
> avoids the obvious trap of using "charge" due to its high potential for
> confusion.
>
> I don't even want to know how much of a mess the people writing applets
> would make of it...

With fixed-units files, having *_energy and *_capacity isn't too clear
either... Nor is it consistent with SBS, since SBS uses "capacity" to
refer to either energy or charge, depending on a units attribute.

As a compromise, how about using "energy" and "charge" for quantities,
and "charging" (i.e., a verb) when referring to the operation?

BTW, tp_smapi uses "charge" and "charging" interchangeably; that was a
mistake.

Shem
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 12:18:56 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:

> > I guess one could generate an answer to the static question with systemtap,
> > by accumulating running counts across the application lifetime and then
> > snapshotting them. Sounds hard though.
>
> Can't you just traverse arbitrary kernel data structures at a given point
> in time, exactly like the /proc/ call is doing?

Do a full pagetable walk, with all the associated locking from within a
systemtap script? I'd be surprised. Maybe if it's mostly hand-coded in C,
perhaps.

Then you just end up with the same thing, don't you?
D-Link DFE-580TX 4 port Server Adapter problem: only 2 of 4 ports
I've run into a problem with my DFE-580TX cards after installing them in
a new server box. One card definitely worked before in a test box; here
is a dmesg snippet from when everything was OK:

Apr 3 22:10:38 cyrax kernel: sundance.c:v1.2 11-Sep-2006 Written by Donald Becker
Apr 3 22:10:38 cyrax kernel: http://www.scyld.com/network/sundance.html
Apr 3 22:10:38 cyrax kernel: ACPI: PCI Interrupt :02:04.0[A] -> GSI 21 (level, low) -> IRQ 23
Apr 3 22:10:38 cyrax kernel: eth2: D-Link DFE-580TX 4 port Server Adapter at 0001a800, 00:0d:88:cc:da:dc, IRQ 23.
Apr 3 22:10:38 cyrax kernel: eth2: MII PHY found at address 1, status 0x7809 advertising 01e1.
Apr 3 22:10:38 cyrax kernel: ACPI: PCI Interrupt :02:05.0[A] -> GSI 22 (level, low) -> IRQ 18
Apr 3 22:10:39 cyrax kernel: eth3: D-Link DFE-580TX 4 port Server Adapter at 0001b000, 00:0d:88:cc:da:dd, IRQ 18.
Apr 3 22:10:39 cyrax kernel: eth3: MII PHY found at address 1, status 0x7809 advertising 01e1.
Apr 3 22:10:39 cyrax kernel: ACPI: PCI Interrupt :02:06.0[A] -> GSI 23 (level, low) -> IRQ 19
Apr 3 22:10:39 cyrax kernel: eth4: D-Link DFE-580TX 4 port Server Adapter at 0001b400, 00:0d:88:cc:da:de, IRQ 19.
Apr 3 22:10:39 cyrax kernel: eth4: MII PHY found at address 1, status 0x7809 advertising 01e1.
Apr 3 22:10:39 cyrax kernel: ACPI: PCI Interrupt :02:07.0[A] -> GSI 20 (level, low) -> IRQ 20
Apr 3 22:10:39 cyrax kernel: eth5: D-Link DFE-580TX 4 port Server Adapter at 0001b800, 00:0d:88:cc:da:df, IRQ 20.
Apr 3 22:10:39 cyrax kernel: eth5: MII PHY found at address 1, status 0x7809 advertising 01e1.

And here is the current dmesg from the new box, where I get only 2 of 4
ports on each card:

sundance.c:v1.2 11-Sep-2006 Written by Donald Becker
http://www.scyld.com/network/sundance.html
ACPI: PCI Interrupt :05:04.0[A] -> GSI 21 (level, low) -> IRQ 22
eth2: D-Link DFE-580TX 4 port Server Adapter at 00012180, 00:00:00:00:00:00, IRQ 22.
eth2: No MII transceiver found, aborting.
ASIC status
ACPI: PCI Interrupt :05:05.0[A] -> GSI 22 (level, low) -> IRQ 23
eth2: D-Link DFE-580TX 4 port Server Adapter at 00012100, 00:00:00:00:00:00, IRQ 23.
eth2: No MII transceiver found, aborting.
ASIC status
ACPI: PCI Interrupt :05:06.0[A] -> GSI 23 (level, low) -> IRQ 19
eth2: D-Link DFE-580TX 4 port Server Adapter at 00012080, 00:0d:88:cc:da:ee, IRQ 19.
eth2: MII PHY found at address 1, status 0x7809 advertising 01e1.
ACPI: PCI Interrupt :05:07.0[A] -> GSI 20 (level, low) -> IRQ 21
eth3: D-Link DFE-580TX 4 port Server Adapter at 00012000, 00:0d:88:cc:da:ef, IRQ 21.
eth3: MII PHY found at address 1, status 0x7809 advertising 01e1.
ACPI: PCI Interrupt :06:04.0[A] -> GSI 22 (level, low) -> IRQ 23
eth4: D-Link DFE-580TX 4 port Server Adapter at 00011180, 00:00:00:00:00:00, IRQ 23.
eth4: No MII transceiver found, aborting.
ASIC status
ACPI: PCI Interrupt :06:05.0[A] -> GSI 21 (level, low) -> IRQ 22
eth4: D-Link DFE-580TX 4 port Server Adapter at 00011100, 00:00:00:00:00:00, IRQ 22.
eth4: No MII transceiver found, aborting.
ASIC status
ACPI: PCI Interrupt :06:06.0[A] -> GSI 20 (level, low) -> IRQ 21
eth4: D-Link DFE-580TX 4 port Server Adapter at 00011080, 00:0d:88:cc:da:de, IRQ 21.
eth4: MII PHY found at address 1, status 0x7809 advertising 01e1.
ACPI: PCI Interrupt :06:07.0[A] -> GSI 23 (level, low) -> IRQ 19
eth5: D-Link DFE-580TX 4 port Server Adapter at 00011000, 00:0d:88:cc:da:df, IRQ 19.
eth5: MII PHY found at address 1, status 0x7809 advertising 01e1.

Kernel version is vanilla 2.6.20.3, Sundance MMIO disabled in the
config. I can send lspci, full dmesg and .config if anyone is
interested. Maybe it's a BIOS problem? What should I try?

thanks,
--
d
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote:
> On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote:
>
>> I guess one could generate an answer to the static question with systemtap,
>> by accumulating running counts across the application lifetime and then
>> snapshotting them. Sounds hard though.
>
> You'd have to do it from boot onward to get a complete system image.
> One way to look at it is that systemtap can give you the derivative of
> the information, and you have to integrate it.

So everyone keeps saying. Would you tell me why you can't just traverse
the data structures in the same way as your proc handler? From the
systemtap example scripts it seems like you can traverse arbitrary
kernel data structures.

--
SUSE Labs, Novell Inc.
Re: [PATCH 0/4] i386 - pte update optimizations
H. Peter Anvin wrote:
> Zachary Amsden wrote:
>> Some PTE optimizations for native and paravirt-ops kernels; this
>> provides a huge win for shadow mode hypervisors and gets rid of some
>> unnecessary atomic instructions in native kernels, saving even more on
>> UP by getting rid of implicit LOCK on xchg instruction.
>
> You do know that P6 and higher don't do locked bus references as long as
> the value is in the cache, right?

Yes. Even then, last time I clocked instructions, xchg was still slower
than read / write, although I could be misremembering. And it's not
totally clear that they will always be in cached state, however, and for
SMP, we still want to drop the implicit lock in cases where the
processor might not know they are cached exclusive, but we know there
are no other racing users. And there are plenty of old processors out
there to still make it worthwhile.

Zach
Re: [Feature Request?] Inline compression of process core dumps
Alan Cox wrote:
> Indeed. So useful that in current kernels you can set the core dump
> path to be
>
>"|application"

Cool stuff! However, it's not working (2.6.20.6):

Core dump to |/home/caker/bin/dumper.pl.4442 pipe failed

even though...

# cat /proc/sys/kernel/core_uses_pid
0
# cat /proc/sys/kernel/core_pattern
|/home/caker/bin/dumper.pl

Looking at the code, it seems to me that format_corename() is appending
.pid, regardless if !core_uses_pid and corename[0]=='|', in which case
it creates an invalid path for call_usermodehelper_pipe().

Bug in the code, or bug in my methods?

-Chris
Re: [PATCH 2/3] NET: [UPDATED] Multiqueue network device support implementation.
Waskiewicz Jr, Peter P wrote:
>>Still leaks the device
>
>
> I explained this in a previous response, and you seemed to be ok with
> the explanation. Can you elaborate if this is still an issue?

I'm OK with allocating subqueues even for single queue devices, not
with leaking memory on error.

>>I stand by my point, this needs to be explicitly enabled by
>>the user since it changes the behaviour of prio on multiqueue
>>capable device.
>
>
> The user can enable this by using TC filters. I do understand what you
> mean that PRIO's behavior somewhat changes because of how the queues
> turn off and on, but how is this different than today? Today, if the
> queue on the NIC shuts down, all the PRIO queues are down.

Right. And if the queue is enabled again, bands continue to be dequeued
in strict priority order.

> This way
> it's actually helping get traffic out. I don't see how the user can
> control which queue is shut down; that is a function of how congested
> the network is. So if I can clarify what you're saying, are you asking
> that the user actually setup the band to queue mapping? Because if so,
> I don't see how that would help since queues today don't have any
> priority, and you would have no control which one stops over another
> one.

No, I'm asking that the user explicitly states that he wants the driver
to control which bands are dequeued (by stopping and starting subqueues)
and not the established strict priority order. You assume everyone using
prio on e1000 wants to use multiple HW queues, which is not necessarily
true. Additionally the prio qdisc might be used as child of a classful
qdisc that assumes it can always dequeue packets as long as q.qlen > 0
(HFSC for example will complain if it can't since that is a
configuration error).

So I'm asking that you only enable this behaviour if the user does
something like this:

tc qdisc add dev eth0 root handle 1: prio bands N multiqueue

Ideally the band2queue mapping would be supplied by userspace as well,
but currently that doesn't seem to be possible in a clean way since
userspace has no way of finding out how many queues the HW supports.
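The difference between the two dequeue policies argued over above can be shown with a toy model. This is an illustration only, not the prio qdisc code: `prio_pick_band()` and all of its parameters are hypothetical names, and the whole-device stop of a single-queue NIC is deliberately left out of the model.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (hypothetical names, not the qdisc implementation): choose
 * the band to dequeue from. Band 0 has the highest priority.
 *
 *   qlen[b]           packets waiting in band b
 *   band2queue[b]     hardware subqueue that band b feeds
 *   queue_stopped[q]  non-zero if the driver has stopped subqueue q
 *   multiqueue        non-zero if the user opted in to multiqueue mode
 *
 * Returns the band index, or -1 if no band can be dequeued.
 */
int prio_pick_band(const int *qlen, const int *band2queue,
                   const int *queue_stopped, int bands, int multiqueue)
{
    assert(qlen != NULL && band2queue != NULL && queue_stopped != NULL);
    assert(bands > 0);

    for (int b = 0; b < bands; b++) {
        if (qlen[b] == 0)
            continue;                      /* nothing queued here */
        if (multiqueue && queue_stopped[band2queue[b]])
            continue;                      /* driver stopped this subqueue */
        return b;                          /* first eligible band wins */
    }
    return -1;
}
```

With band 0 mapped to a stopped subqueue, strict priority (multiqueue == 0) still selects band 0, while multiqueue mode skips to band 1. If every subqueue is stopped, the function returns -1 although packets are queued, which is the situation a parent qdisc expecting "qlen > 0 means dequeue succeeds" would trip over.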
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote:
>> If kprobes is simply crappy and doesn't work properly for this, then I
>> could accept that. I'm not someone trying to get this info. So why can't
>> it be used? (not just for kpagemap, but for clear_refs and all that gunk
>> too).
>
> kprobes is good for looking at events, but bad for looking at state.
> Especially metric shitloads of state.

Why? Why is a kprobes trap significantly more expensive than a read
syscall?

>> Maybe. How about LRU? Reclaim performance is bad, and you want to work out
>> which pages keep going off the end of it, or which pages keep getting
>> written out via it, or whose pages are on the active list, forcing mine
>> out.
>
> Those are actually probably a good match for systemtap as they're all
> events.

Traverse the LRU? Which files do they belong to? What process maps them?

--
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote:
> On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>
>> Maybe. How about LRU? Reclaim performance is bad, and you want to work out
>> which pages keep going off the end of it, or which pages keep getting
>> written out via it, or whose pages are on the active list, forcing mine
>> out.
>
> I guess we have static analysis versus dynamic. The interfaces which
> Matt is proposing are suited to answering the question "what is my
> memory being used for" (static). They're unlikely to be useful for
> answering the question "what's happening in the VM" (dynamic).
> Systemtap is probably better for the dynamic analysis.

"what is my memory being used for *now*" ;)

> I guess one could generate an answer to the static question with
> systemtap, by accumulating running counts across the application
> lifetime and then snapshotting them. Sounds hard though.

Can't you just traverse arbitrary kernel data structures at a given
point in time, exactly like the /proc/ call is doing?

--
SUSE Labs, Novell Inc.
Re: [Kernel-discuss] Re: [PATCH 3/7] [RFC] Battery monitoring class
On Thu, Apr 12, 2007 at 09:51:12PM -0300, Henrique de Moraes Holschuh wrote:
> On Fri, 13 Apr 2007, Anton Vorontsov wrote:
> > Let's name attributes with mWh units as {min_,max_,design_,}energy,
> > and attributes with mAh units as {min_,max_,design_,}charge.
> > [...]
>
> > * Yup, I've read last discussion regarding batteries, and I've seen
> > objections against "charge" term, quoting Shem Multinymous:
> >
> > "And, for the reasons I explained earlier, I strongly suggest not using
> > the term "charge" except when referring to the action of charging.
> > Hence:
> > s/charge_rate/rate/; s/charge/capacity/"
> >
> > But let's think about it once again? We'll make things much cleaner
> > if we'll drop "capacity" at all.
>
> I stand with Shem on this one. The people behind the SBS specification
> seem to agree... that specification is aimed at *engineers* and still
> avoids the obvious trap of using "charge" due to its high potential for
> confusion.
>
> I don't even want to know how much of a mess the people writing applets
> would make of it...

:-(

Okay, term "charge" is out of scope, I guess. But can we use "capacity"
for xAh, and "energy" for xWh? I'm just trying to separate these terms
somehow, and avoid "_units" stuff.

>
> > > That said, you may need to use uWh and uAh instead of mAh and mWh, though.
> >
> > Not sure. Is there any existing chip that can report uAh/uWh? That is
> > great precision.
>
> The way things are going, it should be feasible for small embedded systems
> quite soon. Refer to the previous thread.

I see... is it also applicable to currents and voltages? I.e. should we
use uA and uV from the start?

--
Anton Vorontsov
email: [EMAIL PROTECTED]
backup email: [EMAIL PROTECTED]
irc://irc.freenode.org/bd2
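To make the distinction between the two quantities concrete: charge (Ah) and energy (Wh) are related through voltage, so with the micro-units proposed above a conversion needs no floating point. The sketch below is illustrative only; `charge_to_energy_uwh()` is a hypothetical helper, not part of the proposed battery class, and it assumes a single nominal voltage even though a real pack's voltage drops as it discharges.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a charge in uAh to an energy in uWh at a
 * fixed voltage given in uV.
 *
 *   energy [Wh] = charge [Ah] * voltage [V]
 *   => uWh = uAh * uV / 1000000
 *
 * 64-bit arithmetic keeps the intermediate product from overflowing for
 * realistic battery sizes.
 */
int64_t charge_to_energy_uwh(int64_t charge_uah, int64_t voltage_uv)
{
    assert(charge_uah >= 0 && voltage_uv >= 0);
    return charge_uah * voltage_uv / 1000000;
}
```

For example, a 4400 mAh pack at a nominal 10.8 V is 4400000 uAh and 10800000 uV, which gives 47520000 uWh, i.e. 47.52 Wh.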
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, Apr 12, 2007 at 06:57:23PM -0700, Andrew Morton wrote:
> I guess one could generate an answer to the static question with systemtap,
> by accumulating running counts across the application lifetime and then
> snapshotting them. Sounds hard though.

You'd have to do it from boot onward to get a complete system image.
One way to look at it is that systemtap can give you the derivative of
the information, and you have to integrate it.

--
Mathematics is the supreme nostalgia of our time.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote:
> On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote:
>>> Basically: to show what the hell's going on in the VM.
>>
>> kprobes / systemtap isn't good enough?
>
> It's not really a good match to the kprobes model. I'm not interested
> in events, per se. I don't want to need to know about every single
> alloc/free of N different varieties integrated from boot onward to
> build up an image of the state of the system. Instead, I want to take
> snapshots of the state of the VM.

Systemtap can't output a large set of values? Why can't you attach a
kprobe to a dummy syscall, and from there iterate over
pgdat/zone/memmap and output what you want? Actually I'm surprised that
kind of data querying facility isn't already in there (I haven't used
it seriously though).

> The main goal here is to be able to answer the question "where's my
> memory going?". Currently you can't really give a good answer to that
> question from userspace because of shared mappings, etc. There are
> lots of secondary questions that follow on very quickly from that,
> like "what parts of my shared mappings are or aren't shared, and
> why?", "what's actually in my application's working set?" and "how
> much of this crap can I ditch?".

I understand roughly what you want, and that you can't easily get it
from /proc currently. My question at this point is just why can we not
use systemtap.

--
SUSE Labs, Novell Inc.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 11:42:29AM +1000, Nick Piggin wrote: > >Instead, one says "what pages are being used by my application", then, for > > That includes unmapped pagecache being used by my application, doesn't it? > Maybe that's too hard to do via /proc so we forget about it... It'd be really nice to have a window into the pagecache too. But I for one couldn't come up with a sensible scheme for it. > >each of those pages "what is that page's state". So the first step is to > >collect all the pfns from /proc/$(pidof my-application)/pagemap and then to > >use those pfns to look the individual pages up in /proc/kpagemap. > > OK I realise you could do it that way, but systemtap can definitely be > used as a tool for understanding application behaviour in the context of > the kernel, I think? The purpose for it is so that various little bits > of deep kernel internals do not have to be exposed on a case by case basis. > > If kprobes is simply crappy and doesn't work properly for this, then I > could accept that. I'm not someone trying to get this info. So why can't > it be used? (not just for kpagemap, but for clear_refs and all that gunk > too). kprobes is good for looking at events, but bad for looking at state. Especially metric shitloads of state. > > If you really want to know "who is using page 123435" then you'd need to > > search /proc/*/pagemap. There are possibly legitimate reasons why an > > application developer would want to at least partially perform such an > > operation ("who am I sharing with"), but I doubt if it's the common case. > > Maybe. How about LRU? Reclaim performance is bad, and you want to work out > which pages keep going off the end of it, or which pages keep getting > written out via it, or whose pages are on the active list, forcing mine > out. Those are actually probably a good match for systemtap as they're all events. -- Mathematics is the supreme nostalgia of our time.
RE: [PATCH 2/3] NET: [UPDATED] Multiqueue network device support implementation.
> -Original Message- > From: Patrick McHardy [mailto:[EMAIL PROTECTED] > Sent: Thursday, April 12, 2007 5:16 PM > To: Waskiewicz Jr, Peter P > Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; > [EMAIL PROTECTED]; [EMAIL PROTECTED]; cramerj; > Kok, Auke-jan H; Leech, Christopher > Subject: Re: [PATCH 2/3] NET: [UPDATED] Multiqueue network > device support implementation. > > Peter P Waskiewicz Jr wrote: > > diff --git a/net/core/dev.c b/net/core/dev.c index 219a57f..3ce449e > > 100644 > > --- a/net/core/dev.c > > +++ b/net/core/dev.c > > @@ -1471,6 +1471,8 @@ gso: > > q = dev->qdisc; > > if (q->enqueue) { > > rc = q->enqueue(skb, q); > > + /* reset queue_mapping to zero */ > > + skb->queue_mapping = 0; > > > This must be done before enqueueing. At this point you don't > even have a valid reference to the skb anymore. Agreed, this is a transcription error on my part between my dev box and this tree. > > > @@ -3326,12 +3330,23 @@ struct net_device *alloc_netdev(int > sizeof_priv, const char *name, > > if (sizeof_priv) > > dev->priv = netdev_priv(dev); > > > > + alloc_size = (sizeof(struct net_device_subqueue) * queue_count); > > + > > + p = kzalloc(alloc_size, GFP_KERNEL); > > + if (!p) { > > + printk(KERN_ERR "alloc_netdev: Unable to > allocate queues.\n"); > > + return NULL; > > > Still leaks the device I explained this in a previous response, and you seemed to be ok with the explanation. Can you elaborate if this is still an issue? > > > diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c index > > 5cfe60b..6a38905 100644 > > --- a/net/sched/sch_prio.c > > +++ b/net/sched/sch_prio.c > > @@ -144,11 +152,17 @@ prio_dequeue(struct Qdisc* sch) > > struct Qdisc *qdisc; > > > > for (prio = 0; prio < q->bands; prio++) { > > - qdisc = q->queues[prio]; > > - skb = qdisc->dequeue(qdisc); > > - if (skb) { > > - sch->q.qlen--; > > - return skb; > > + /* Check if the target subqueue is available before > > +* pulling an skb. 
This way we avoid excessive requeues > > +* for slower queues. > > +*/ > > + if (!netif_subqueue_stopped(sch->dev, > q->band2queue[prio])) { > > + qdisc = q->queues[prio]; > > + skb = qdisc->dequeue(qdisc); > > + if (skb) { > > + sch->q.qlen--; > > + return skb; > > + } > > } > > } > > return NULL; > > @@ -200,6 +214,10 @@ static int prio_tune(struct Qdisc > *sch, struct rtattr *opt) > > struct prio_sched_data *q = qdisc_priv(sch); > > struct tc_prio_qopt *qopt = RTA_DATA(opt); > > int i; > > + int queue; > > + int qmapoffset; > > + int offset; > > + int mod; > > > > if (opt->rta_len < RTA_LENGTH(sizeof(*qopt))) > > return -EINVAL; > > @@ -242,6 +260,30 @@ static int prio_tune(struct Qdisc > *sch, struct rtattr *opt) > > } > > } > > } > > + /* setup queue to band mapping */ > > + if (q->bands < sch->dev->egress_subqueue_count) { > > + qmapoffset = 1; > > + mod = sch->dev->egress_subqueue_count; > > + } else { > > + mod = q->bands % sch->dev->egress_subqueue_count; > > + qmapoffset = q->bands / > sch->dev->egress_subqueue_count + > > + ((mod) ? 1 : 0); > > + } > > + > > + queue = 0; > > + offset = 0; > > + for (i = 0; i < q->bands; i++) { > > + q->band2queue[i] = queue; > > + if ( ((i + 1) - offset) == qmapoffset) { > > + queue++; > > + offset += qmapoffset; > > + if (mod) > > + mod--; > > + qmapoffset = q->bands / > > + sch->dev->egress_subqueue_count + > > + ((mod) ? 1 : 0); > > + } > > + } > > return 0; > > } > > > I stand by my point, this needs to be explicitly enabled by > the user since it changes the behaviour of prio on multiqueue > capable device. > The user can enable this by using TC filters. I do understand what you mean that PRIO's behavior somewhat changes because of how the queues turn off and on, but how is this different than today? Today, if the queue on the NIC shuts down, all the PRIO queues are down. This way it's actually helping get traffic out. 
I don't see how the user can control which queue is shut down; that is a function of how congested the network is. So if I can clarify what you're saying, are you asking that the user actually setup the band to queue mapping? Because if so, I don't see how that would help since queues today don't have any priority, and you would have no control which one stops
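To make the band-to-queue discussion easier to follow, here is the mapping loop from the quoted prio_tune() hunk lifted into a standalone function. This is only a sketch: `map_bands_to_queues`, `bands` and `nqueues` are names chosen for illustration, standing in for `q->bands` and `sch->dev->egress_subqueue_count`, and `nqueues` is assumed to be greater than zero.

```c
/*
 * Standalone copy of the band-to-queue loop from the quoted hunk.
 * 'band2queue' must have room for 'bands' entries; 'nqueues' must be > 0.
 */
void map_bands_to_queues(int bands, int nqueues, int *band2queue)
{
    int mod, qmapoffset;
    int queue = 0, offset = 0;

    if (bands < nqueues) {
        /* Fewer bands than queues: one band per queue. */
        qmapoffset = 1;
        mod = nqueues;
    } else {
        /* Spread bands over queues; the first 'mod' queues take one extra. */
        mod = bands % nqueues;
        qmapoffset = bands / nqueues + (mod ? 1 : 0);
    }

    for (int i = 0; i < bands; i++) {
        band2queue[i] = queue;
        if ((i + 1) - offset == qmapoffset) {
            queue++;
            offset += qmapoffset;
            if (mod)
                mod--;
            qmapoffset = bands / nqueues + (mod ? 1 : 0);
        }
    }
}
```

With 5 bands and 2 queues this assigns bands 0-2 to queue 0 and bands 3-4 to queue 1. With 3 bands and 4 queues each band gets its own queue.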
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 11:42:29 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > > > > > >>Andrew Morton wrote: > > >>>It *will* be viable. If the application wants to know if a page is dirty, > >>>it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's > >>>numerical offset when inspecting fields in /proc/kpagemap. If correctly > >>>designed, such a monitoring application will be able to report upon page > >>>flags which we haven't even thought up yet. > >> > >>Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. > >>Still seems like a basically hit and miss affair to just use flags. What if > >>you want to know the process mapping a page? With systemtap or something you > >>could walk the rmap structures. What if you want to look at pages along the > >>LRU list rather than per-pfn? What about connecting pages to inodes? > > > > > > Well hang on. This isn't a tool for understanding kernel behaviour. It's > > a tool for understanding application behaviour. > > > > So one doesn't ask "who is mapping that page" - that's a kernel developer > > thing. > > > > Instead, one says "what pages are being used by my application", then, for > > That includes unmapped pagecache being used by my application, doesn't it? > Maybe that's too hard to do via /proc so we forget about it... Yes, harder. I'm hoping that sampling of /proc/pid/io can be used to determine pagecache use sufficiently accurately. I know of one large hosting company who are using it ("BTW, we are making great use of taskstats!! Its GREAT!") > > > each of those pages "what is that page's state". So the first step is to > > collect all the pfns from /proc/$(pidof my-application)/pagemap and then to > > use those pfns to look the individual pages up in /proc/kpagemap.
> > OK I realise you could do it that way, but systemtap can definitely be > used as a tool for understanding application behaviour in the context of > the kernel, I think? The purpose for it is so that various little bits > of deep kernel internals do not have to be exposed on a case by case basis. > > If kprobes is simply crappy and doesn't work properly for this, then I > could accept that. I'm not someone trying to get this info. So why can't > it be used? (not just for kpagemap, but for clear_refs and all that gunk > too). > > > If you really want to know "who is using page 123435" then you'd need to > > search /proc/*/pagemap. There are possibly legitimate reasons why an > > application developer would want to at least partially perform such an > > operation ("who am I sharing with"), but I doubt if it's the common case. > > Maybe. How about LRU? Reclaim performance is bad, and you want to work out > which pages keep going off the end of it, or which pages keep getting > written out via it, or whose pages are on the active list, forcing mine > out. I guess we have static analysis versus dynamic. The interfaces which Matt is proposing are suited to answering the question "what is my memory being used for" (static). They're unlikely to be useful for answering the question "what's happening in the VM" (dynamic). Systemtap is probably better for the dynamic analysis. I guess one could generate an answer to the static question with systemtap, by accumulating running counts across the application lifetime and then snapshotting them. Sounds hard though.
Cheap lock for user mode processes release when process exits
Hi all, Maybe someone here knows better. I have several user-mode processes using shared mmap. There can be several reader processes and only one writer. Readers access the shared region frequently, the writer seldom. Naturally, a multi-reader/single-writer lock works best. I tried this with futex on 2.6.9-42.EL. However, if one of the processes is killed or exits, the lock doesn't get released. I can trap the signal to release the lock, but not all signals can be trapped (e.g. SIGKILL). Is there any way I can achieve this without a potential deadlock? Thanks, Michael
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 11:01:41AM +1000, Nick Piggin wrote: > >Basically: to show what the hell's going on in the VM. > > kprobes / systemtap isn't good enough? It's not really a good match to the kprobes model. I'm not interested in events, per se. I don't want to need to know about every single alloc/free of N different varieties integrated from boot onward to build up an image of the state of the system. Instead, I want to take snapshots of the state of the VM. The main goal here is to be able to answer the question "where's my memory going?". Currently you can't really give a good answer to that question from userspace because of shared mappings, etc. There are lots of secondary questions that follow on very quickly from that, like "what parts of my shared mappings are or aren't shared, and why?", "what's actually in my application's working set?" and "how much of this crap can I ditch?". -- Mathematics is the supreme nostalgia of our time.
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: Andrew Morton wrote: It *will* be viable. If the application wants to know if a page is dirty, it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet. Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. Still seems like a basically hit and miss affair to just use flags. What if you want to know the process mapping a page? With systemtap or something you could walk the rmap structures. What if you want to look at pages along the LRU list rather than per-pfn? What about connecting pages to inodes? Well hang on. This isn't a tool for understanding kernel behaviour. It's a tool for understanding applciation behaviour. So one doesn't ask "who is mapping that page" - that's a kernel developer thing. Instead, one says "what pages are being used by my application", then, for That includes unmapped pagecache being used by my application, doesn't it? Maybe that's too hard to do via /proc so we forget about it... each of those pages "what is that page's state". So the first step is to collect all the pfns from /proc/$(pidof my-application)/pagemap and then to use those pfns to look the individual pages up in /proc/kpagemap. OK I realise you could do it that way, but systemtap can definitely be used as a tool for understanding application behaviour in the context of the kernel, I think? The purpose for it is so that various little bits of deep kernel internals do not have to be exposed on a case by case basis. If kprobes is simply crappy and doesn't work properly for this, then I could accept that. I'm not someone trying to get this info. So why can't it be used? (not just for kpagemap, but for clear_refs and all that gunk too). 
> If you really want to know "who is using page 123435" then you'd need to > search /proc/*/pagemap. There are possibly legitimate reasons why an > application developer would want to at least partially perform such an > operation ("who am I sharing with"), but I doubt if it's the common case. Maybe. How about LRU? Reclaim performance is bad, and you want to work out which pages keep going off the end of it, or which pages keep getting written out via it, or whose pages are on the active list, forcing mine out. -- SUSE Labs, Novell Inc.
Re: [PATCH 0/4] i386 - pte update optimizations
Zachary Amsden wrote: Some PTE optimizations for native and paravirt-ops kernels; this provides a huge win for shadow mode hypervisors and gets rid of some unnecessary atomic instructions in native kernels, saving even more on UP by getting rid of implicit LOCK on xchg instruction. You do know that P6 and higher don't do locked bus references as long as the value is in the cache, right? -hpa
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 11:14:20 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > Andrew Morton wrote: > > On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > > + for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) { > + ppage = pfn_to_page(pfn); > + if (!ppage) { > + page[i] = 0; > + page[i + 1] = 0; > + } else { > + page[i] = ppage->flags; > + page[i + 1] = atomic_read(&ppage->_count); > + } > + } > >>> > >>> > >>>Not a good idea to expose raw flags in this manner - it changes at the drop > >>>of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber > >>>mapping to make this viable. > >> > >>I don't think it is viable because that makes the flags part of the > >>userspace ABI. > > > > > > It *will* be viable. If the application wants to know if a page is dirty, > > it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's > > numerical offset when inspecting fields in /proc/kpagemap. If correctly > > designed, such a monitoring application will be able to report upon page > > flags which we haven't even thought up yet. > > Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. > Still seems like a basically hit and miss affair to just use flags. What if > you want to know the process mapping a page? With systemtap or something you > could walk the rmap structures. What if you want to look at pages along the > LRU list rather than per-pfn? What about connecting pages to inodes? Well hang on. This isn't a tool for understanding kernel behaviour. It's a tool for understanding application behaviour. So one doesn't ask "who is mapping that page" - that's a kernel developer thing. Instead, one says "what pages are being used by my application", then, for each of those pages "what is that page's state". So the first step is to collect all the pfns from /proc/$(pidof my-application)/pagemap and then to use those pfns to look the individual pages up in /proc/kpagemap.
If you really want to know "who is using page 123435" then you'd need to search /proc/*/pagemap. There are possibly legitimate reasons why an application developer would want to at least partially perform such an operation ("who am I sharing with"), but I doubt if it's the common case. > > But I was going to say > that satisfying an Oracle requirement is a good reason _not_ to merge it ;) > hm, yes, there's plenty of precedent for that. > (I joke!) I akpm!
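The two-step lookup described in this message (pfns from the per-process pagemap, then per-page records from kpagemap) might look roughly like the sketch below. It works on in-memory arrays rather than the /proc files, since the on-disk layout was still under discussion. `struct kpm_record` and `count_multiref_pages` are hypothetical names. The (flags, count) pair mirrors the quoted loop, and treating a count above one as interesting is an assumption made for this example only.

```c
#include <stddef.h>

/* One per-page record as filled in by the quoted loop: flags word, then count. */
struct kpm_record {
    unsigned long flags;
    unsigned long count;
};

/*
 * Step 1 (not shown): collect the pfns backing the application from its pagemap.
 * Step 2: look each pfn up in a table indexed by pfn and count the pages
 * whose count field is greater than one.
 */
size_t count_multiref_pages(const unsigned long *pfns, size_t npfns,
                            const struct kpm_record *kpm, size_t kpm_len)
{
    size_t hits = 0;

    for (size_t i = 0; i < npfns; i++) {
        if (pfns[i] >= kpm_len)
            continue;           /* pfn outside the table: skip it */
        if (kpm[pfns[i]].count > 1)
            hits++;
    }
    return hits;
}
```

The reverse question ("who is using this page") has no such direct index, which is why it needs a search over every process's pagemap.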
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote: On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: + for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) { + ppage = pfn_to_page(pfn); + if (!ppage) { + page[i] = 0; + page[i + 1] = 0; + } else { + page[i] = ppage->flags; + page[i + 1] = atomic_read(&ppage->_count); + } + } Not a good idea to expose raw flags in this manner - it changes at the drop of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable. I don't think it is viable because that makes the flags part of the userspace ABI. It *will* be viable. If the application wants to know if a page is dirty, it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet. Ooh, you wanted a _runtime_ mapping of flags, yeah then I guess that works. Still seems like a basically hit and miss affair to just use flags. What if you want to know the process mapping a page? With systemtap or something you could walk the rmap structures. What if you want to look at pages along the LRU list rather than per-pfn? What about connecting pages to inodes? I thought this type of deep poking was the whole reason the probes thingies were merged. I'm saddened that they're no good for this. I thought it would be an ideal usage :( I wonder what they are needed for. Poking deeply into the kernel to provide information about kernel state. There are real-world needs for this, and the people who develop tools to process this information will have decent kernel understanding and will know that the file's contents may alter across kernel versions. It sure beats poking around in /dev/kmem. I doubt if there's a sensible way in which we can prettify this interface without losing information. But we should aim to make it as robust as possible against future kernel changes, of course.
And we should satisfy ourselves that all the required information has been made available. The fact that it will satisfy the Oracle requirement is encouraging. Yeah it is close, they need page_mapcount I think. But I was going to say that satisfying an Oracle requirement is a good reason _not_ to merge it ;) (I joke!) -- SUSE Labs, Novell Inc.
Re: [patch] convert aio event reap to use atomic-op instead of spin_lock
On 4/12/07, Ken Chen <[EMAIL PROTECTED]> wrote: On 4/12/07, Jeff Moyer <[EMAIL PROTECTED]> wrote: > I didn't see any response to Zach's request for code that actually > tests out the shared ring buffer. Do you have such code? Yes, I do. I was stress testing the code since last night. After 20+ hours of stress run with fio and aio-stress, now I'm posting it with confidence. I modified libaio's io_getevents to take advantage of new user level reap function. The feature is exported out via ring->compat_features. btw, is compat_feature supposed to be a version number or a bit mask? I think a bitmask makes more sense and is more flexible. Additional patch on the kernel side to export the new features. On top of patch posted at: http://marc.info/?l=linux-kernel&m=117636401818057&w=2 --- a/include/linux/aio.h +++ b/include/linux/aio.h @@ -138,8 +138,11 @@ #define init_sync_kiocb(x, filp) \ init_wait((&(x)->ki_wait)); \ } while (0) +#define AIO_RING_BASE 1 +#define AIO_RING_USER_REAP 2 + #define AIO_RING_MAGIC 0xa10a10a1 -#define AIO_RING_COMPAT_FEATURES 1 +#define AIO_RING_COMPAT_FEATURES (AIO_RING_BASE | AIO_RING_USER_REAP) #define AIO_RING_INCOMPAT_FEATURES 0 struct aio_ring { unsigned id; /* kernel internal index number */
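As a rough illustration of the bitmask reading of compat_features, userspace could test for a feature as below. The two constants are copied from the patch. `ring_has_features` is a hypothetical helper name used only in this sketch.

```c
/* Feature bits as defined in the patch above. */
#define AIO_RING_BASE       1u
#define AIO_RING_USER_REAP  2u

/* Returns 1 when every bit in 'wanted' is set in 'compat_features', else 0. */
int ring_has_features(unsigned compat_features, unsigned wanted)
{
    return (compat_features & wanted) == wanted;
}
```

Since the previous value of AIO_RING_COMPAT_FEATURES was 1, which equals AIO_RING_BASE, a ring from an unpatched kernel would still report the base feature and simply lack the AIO_RING_USER_REAP bit.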
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Matt Mackall wrote: On Fri, Apr 13, 2007 at 10:15:24AM +1000, Nick Piggin wrote: Andrew Morton wrote: On Thu, 12 Apr 2007 16:10:50 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote: + while (count > 0) { + chunk = min_t(size_t, count, PAGE_SIZE); + i = 0; + + if (pfn == -1) { + page[0] = 0; + page[1] = 0; + ((char *)page)[0] = (ntohl(1) != 1); OK. + ((char *)page)[1] = PAGE_SHIFT; OK. Shouldn't we just expose page size and endianness by other means? (another file or syscall). If I send you this file dumped from a random machine, you won't know what to make of it. That's a good reason ;) I'm planning to write a trivial server to sit on, say, my embedded target and spew this over the wire to a client. Not a good idea to expose raw flags in this manner - it changes at the drop of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable. I don't think it is viable because that makes the flags part of the userspace ABI. I wonder what they are needed for. Basically: to show what the hell's going on in the VM. kprobes / systemtap isn't good enough? -- SUSE Labs, Novell Inc.
Re: [patch] convert aio event reap to use atomic-op instead of spin_lock
On 4/12/07, Jeff Moyer <[EMAIL PROTECTED]> wrote: I didn't see any response to Zach's request for code that actually tests out the shared ring buffer. Do you have such code? Yes, I do. I was stress testing the code since last night. After 20+ hours of stress run with fio and aio-stress, now I'm posting it with confidence. I modified libaio's io_getevents to take advantage of new user level reap function. The feature is exported out via ring->compat_features. btw, is compat_feature suppose to be a version number or a bit mask? I think bitmask make more sense and more flexible. (warning: some lines are extremely long in the patch and my email client will probably mangle it badly). diff -Nurp libaio-0.3.104/src/io_getevents.c libaio-0.3.104-new/src/io_getevents.c --- libaio-0.3.104/src/io_getevents.c 2003-06-18 12:58:21.0 -0700 +++ libaio-0.3.104-new/src/io_getevents.c 2007-04-12 17:35:06.0 -0700 @@ -21,10 +21,13 @@ #include #include #include "syscall.h" +#include io_syscall5(int, __io_getevents_0_4, io_getevents, io_context_t, ctx, long, min_nr, long, nr, struct io_event *, events, struct timespec *, timeout) #define AIO_RING_MAGIC 0xa10a10a1 +#define AIO_RING_BASE 1 +#define AIO_RING_USER_REAP 2 /* Ben will hate me for this */ struct aio_ring { @@ -41,7 +44,11 @@ struct aio_ring { int io_getevents_0_4(io_context_t ctx, long min_nr, long nr, struct io_event * events, struct timespec * timeout) { + long i = 0, ret; + unsigned head; + struct io_event *evt_base; struct aio_ring *ring; + ring = (struct aio_ring*)ctx; if (ring==NULL || ring->magic != AIO_RING_MAGIC) goto do_syscall; @@ -49,9 +56,35 @@ int io_getevents_0_4(io_context_t ctx, l if (ring->head == ring->tail) return 0; } - + + if (!(ring->compat_features & AIO_RING_USER_REAP)) + goto do_syscall; + + if (min_nr > nr || min_nr < 0 || nr < 0) + return -EINVAL; + + evt_base = (struct io_event *) (ring + 1); + while (i < nr) { + head = ring->head; + if (head == ring->tail) + break; + + *events = evt_base[head & 
(ring->nr - 1)]; + if (head == cmpxchg(&ring->head, head, head + 1)) { + events++; + i++; + } + } + + if (i >= min_nr) + return i; + do_syscall: - return __io_getevents_0_4(ctx, min_nr, nr, events, timeout); + ret = __io_getevents_0_4(ctx, min_nr - i, nr - i, events, timeout); + if (ret >= 0) + return i + ret; + else + return i ? i : ret; } DEFSYMVER(io_getevents_0_4, io_getevents, 0.4)
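The user-level reap loop in this patch follows a copy-then-compare-and-swap pattern. The sketch below shows the same pattern with C11 atomics on a hypothetical miniature ring. `struct mini_ring` and `mini_ring_reap` are illustrative names, not libaio or kernel interfaces, and `atomic_compare_exchange_strong` stands in for the `cmpxchg` used in the patch. As in the patch, head and tail are assumed to be free-running counters and the slot count a power of two.

```c
#include <stdatomic.h>

/* Hypothetical miniature ring for illustration; not the kernel's struct aio_ring. */
struct mini_ring {
    _Atomic unsigned head;  /* next slot to consume, free-running */
    _Atomic unsigned tail;  /* next slot the producer fills, free-running */
    unsigned nr;            /* number of slots, must be a power of two */
    int *slots;
};

/*
 * Reap up to 'max' entries into 'out' without taking a lock.
 * The entry is copied first and only kept once the compare-and-swap on
 * 'head' succeeds; a consumer that loses the race re-reads head and retries.
 */
long mini_ring_reap(struct mini_ring *r, int *out, long max)
{
    long n = 0;

    while (n < max) {
        unsigned head = atomic_load(&r->head);
        unsigned next = head + 1;
        int value;

        if (head == atomic_load(&r->tail))
            break;                          /* ring is empty */
        value = r->slots[head & (r->nr - 1)];
        if (atomic_compare_exchange_strong(&r->head, &head, next))
            out[n++] = value;               /* this consumer claimed the slot */
    }
    return n;
}
```

Copying before the swap matters: once head has moved past a slot the producer may reuse it, so reading the slot afterwards could return a newer entry.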
Re: [Kernel-discuss] Re: [PATCH 3/7] [RFC] Battery monitoring class
On Fri, 13 Apr 2007, Anton Vorontsov wrote: > Let's name attributes with mWh units as {min_,max_,design_,}energy, > and attributes with mAh units as {min_,max_,design_,}charge. [...] > * Yup, I've read last discussion regarding batteries, and I've seen > objections against "charge" term, quoting Shem Multinymous: > > "And, for the reasons I explained earlier, I strongly suggest not using > the term "charge" except when referring to the action of charging. > Hence: > s/charge_rate/rate/; s/charge/capacity/" > > But lets think about it once again? We'll make things much cleaner > if we'll drop "capacity" at all. I stand with Shem on this one. The people behind the SBS specification seem to agree... that specification is aimed at *engineers* and still avoids the obvious trap of using "charge" due to its high potential for confusion. I don't even want to know how much of a mess the people writing applets would make of it... > > That said, you may need to use uWh and uAh instead of mAh and mWh, though. > > Not sure. Is there any existing chip that can report uAh/uWh? That is > great precision. The way things are going, it should be feasible for small embedded systems quite soon. Refer to the previous thread. -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, 13 Apr 2007 10:15:24 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote: > >>+ ((char *)page)[1] = PAGE_SHIFT; > > > > > > OK. > > Shouldn't we just expose page size and endianness by other means? (another > file or > syscall). I don't think so - this file exposes fairly deep kernel internals and that's unavoidable, really - it's *supposed* to do that. It is explicitly designed for monitoring kernel behaviour. So it needs special handling by userspace. Keeping the number of files which need such special handling to a minimum will keep the number of applications which are exposed to kernel changes to a minimum. > >>+ for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) { > >>+ ppage = pfn_to_page(pfn); > >>+ if (!ppage) { > >>+ page[i] = 0; > >>+ page[i + 1] = 0; > >>+ } else { > >>+ page[i] = ppage->flags; > >>+ page[i + 1] = atomic_read(&ppage->_count); > >>+ } > >>+ } > > > > > > Not a good idea to expose raw flags in this manner - it changes at the drop > > of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber > > mapping to make this viable. > > I don't think it is viable because that makes the flags part of the > userspace ABI. It *will* be viable. If the application wants to know if a page is dirty, it looks up "PG_dirty" in /proc/pg_foo-to-bitnumber and uses PG_dirty's numerical offset when inspecting fields in /proc/kpagemap. If correctly designed, such a monitoring application will be able to report upon page flags which we haven't even thought up yet. > I wonder what they are needed for. Poking deeply into the kernel to provide information about kernel state. There are real-world needs for this, and the people who develop tools to process this information will have decent kernel understanding and will know that the file's contents may alter across kernel versions. It sure beats poking around in /dev/kmem. I doubt if there's a sensible way in which we can prettify this interface without losing information.
But we should aim to make it as robust as possible against future kernel changes, of course. And we should satisfy ourselves that all the required information has been made available. The fact that it will satisfy the Oracle requirement is encouraging. Matt, these changes make the new field in /proc/pid/smaps redundant, don't they?
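The runtime flag lookup proposed in this message could be sketched as follows. Everything about the table is an assumption: the thread only names a placeholder file, /proc/pg_foo-to-bitnumber, so the "<flag-name> <bit-number>" text format, the helper names `flag_bit` and `page_flag_set`, and the bit numbers used when calling them are purely illustrative.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find 'name' in a text table of "<flag-name> <bit-number>" lines and
 * return its bit number, or -1 if the name is not listed.
 * The table format is an assumption made for this sketch.
 */
int flag_bit(const char *table, const char *name)
{
    const char *line = table;

    while (line && *line) {
        char flag[64];
        int bit;

        if (sscanf(line, "%63s %d", flag, &bit) == 2 && strcmp(flag, name) == 0)
            return bit;
        line = strchr(line, '\n');
        if (line)
            line++;             /* step past the newline to the next entry */
    }
    return -1;
}

/* Test one bit of a raw flags word taken from a per-page record. */
int page_flag_set(unsigned long flags, int bit)
{
    if (bit < 0 || bit >= (int)(sizeof(flags) * 8))
        return 0;               /* unknown or out-of-range bit */
    return (int)((flags >> bit) & 1UL);
}
```

A monitoring tool written this way never hard-codes a bit position, so a flag that moves or is added in a later kernel only changes the table it reads, not the tool.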
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Fri, Apr 13, 2007 at 10:15:24AM +1000, Nick Piggin wrote: > Andrew Morton wrote: > >On Thu, 12 Apr 2007 16:10:50 -0700 > >William Lee Irwin III <[EMAIL PROTECTED]> wrote: > > >>+ while (count > 0) { > >>+ chunk = min_t(size_t, count, PAGE_SIZE); > >>+ i = 0; > >>+ > >>+ if (pfn == -1) { > >>+ page[0] = 0; > >>+ page[1] = 0; > >>+ ((char *)page)[0] = (ntohl(1) != 1); > > > > > >OK. > > > > > >>+ ((char *)page)[1] = PAGE_SHIFT; > > > > > >OK. > > Shouldn't we just expose page size and endianness by other means? (another > file or > syscall). If I send you this file dumped from a random machine, you won't know what to make of it. I'm planning to write a trivial server to sit on, say, my embedded target and spew this over the wire to a client. > >Not a good idea to expose raw flags in this manner - it changes at the drop > >of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber > >mapping to make this viable. > > I don't think it is viable because that makes the flags part of the > userspace ABI. I wonder what they are needed for. Basically: to show what the hell's going on in the VM. -- Mathematics is the supreme nostalgia of our time.
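A client receiving such a dump from another machine would start by decoding the two header bytes set in the quoted hunk: byte 0 is nonzero when the producing host is little-endian (`ntohl(1) != 1`), and byte 1 holds PAGE_SHIFT. A minimal sketch, in which `struct kpm_header` and `kpm_parse_header` are hypothetical names and nothing beyond those two bytes is assumed about the file:

```c
#include <stddef.h>

/* Decoded form of the two header bytes set in the quoted hunk. */
struct kpm_header {
    int little_endian;        /* byte 0: 1 if the producing host is little-endian */
    unsigned long page_size;  /* derived from byte 1, the page shift */
};

/* Returns 0 on success, -1 if the buffer is too short or the shift is implausible. */
int kpm_parse_header(const unsigned char *buf, size_t len, struct kpm_header *out)
{
    if (len < 2 || buf[1] >= sizeof(unsigned long) * 8)
        return -1;
    out->little_endian = (buf[0] != 0);
    out->page_size = 1UL << buf[1];
    return 0;
}
```

With the byte order and page size known up front, the rest of the dump can be interpreted without any knowledge of the machine that produced it.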
[PATCH][RFC] Kill off legacy power management stuff.
just something i threw together, not in final form, but it represents tossing the legacy PM stuff. at the moment, the menuconfig entry for PM_LEGACY lists it as "DEPRECATED", while the help screen calls it "obsolete." that's a good sign that it's getting close to the time for it to go, and the removal is fairly straightforward, but there's no mention of its removal in the feature removal schedule file.

NOTE: this is not a working patch as it will fail on a MIPS or FR-V build, as i didn't remove the final vestiges from those two architectures. that would require simply killing off the remaining calls to pm_send_all(), that's all. (i think.)

anyway, this has been compile-tested on x86 with "make allyesconfig."

 Documentation/pm.txt         |  123 ---
 arch/i386/kernel/apm.c       |   27
 drivers/acpi/bus.c           |   14 --
 drivers/net/3c509.c          |    1
 drivers/serial/68328serial.c |   59 -
 include/linux/pm.h           |   70 ---
 include/linux/pm_legacy.h    |   41 --
 kernel/power/Kconfig         |   10 -
 kernel/power/Makefile        |    1
 kernel/power/pm.c            |  209 -
 10 files changed, 1 insertion(+), 554 deletions(-)

diff --git a/Documentation/pm.txt b/Documentation/pm.txt
index da8589a..d0fcfe2 100644
--- a/Documentation/pm.txt
+++ b/Documentation/pm.txt
@@ -36,93 +36,6 @@ system the associated daemon will exit gracefully.
   apmd:   http://worldvisions.ca/~apenwarr/apmd/
   acpid:  http://acpid.sf.net/
 
-Driver Interface -- OBSOLETE, DO NOT USE!
-----------------*************************
-
-Note: pm_register(), pm_access(), pm_dev_idle() and friends are
-obsolete. Please do not use them. Instead you should properly hook
-your driver into the driver model, and use its suspend()/resume()
-callbacks to do this kind of stuff.
-
-If you are writing a new driver or maintaining an old driver, it
-should include power management support. Without power management
-support, a single driver may prevent a system with power management
-capabilities from ever being able to suspend (safely).
-
-Overview:
-1) Register each instance of a device with "pm_register"
-2) Call "pm_access" before accessing the hardware.
-   (this will ensure that the hardware is awake and ready)
-3) Your "pm_callback" is called before going into a
-   suspend state (ACPI D1-D3) or after resuming (ACPI D0)
-   from a suspend.
-4) Call "pm_dev_idle" when the device is not being used
-   (optional but will improve device idle detection)
-5) When unloaded, unregister the device with "pm_unregister"
-
-/*
- * Description: Register a device with the power-management subsystem
- *
- * Parameters:
- *   type - device type (PCI device, system device, ...)
- *   id - instance number or unique identifier
- *   cback - request handler callback (suspend, resume, ...)
- *
- * Returns: Registered PM device or NULL on error
- *
- * Examples:
- *   dev = pm_register(PM_SYS_DEV, PM_SYS_VGA, vga_callback);
- *
- *   struct pci_dev *pci_dev = pci_find_dev(...);
- *   dev = pm_register(PM_PCI_DEV, PM_PCI_ID(pci_dev), callback);
- */
-struct pm_dev *pm_register(pm_dev_t type, unsigned long id, pm_callback cback);
-
-/*
- * Description: Unregister a device with the power management subsystem
- *
- * Parameters:
- *   dev - PM device previously returned from pm_register
- */
-void pm_unregister(struct pm_dev *dev);
-
-/*
- * Description: Unregister all devices with a matching callback function
- *
- * Parameters:
- *   cback - previously registered request callback
- *
- * Notes: Provided for easier porting from old APM interface
- */
-void pm_unregister_all(pm_callback cback);
-
-/*
- * Power management request callback
- *
- * Parameters:
- *   dev - PM device previously returned from pm_register
- *   rqst - request type
- *   data - data, if any, associated with the request
- *
- * Returns: 0 if the request is successful
- *          EINVAL if the request is not supported
- *          EBUSY if the device is now busy and cannot handle the request
- *          ENOMEM if the device was unable to handle the request due to memory
- *
- * Details: The device request callback will be called before the
- *          device/system enters a suspend state (ACPI D1-D3) or
- *          or after the device/system resumes from suspend (ACPI D0).
- *          For PM_SUSPEND, the ACPI D-state being entered is passed
- *          as the "data" argument to the callback. The device
- *          driver should save (PM_SUSPEND) or restore (PM_RESUME)
- *          device context when the request callback is called.
- *
- *          Once a driver returns 0 (success) from a suspend
- *          request, it should not process any further requests or
- *          access the device hardware until a call to "pm_access" is made.
- */
-typedef int (*pm_callback)(struct pm_dev *dev, pm_request_t rqst, void *data);
-
 Driver Details
 --------------
 This is just a quick Q as a stopgap
Re: [PATCH] make MADV_FREE lazily free memory
Rik van Riel wrote:
Nick Piggin wrote:

The lazy freeing is aimed at avoiding page faults on memory that is freed and later realloced, which is quite a common thing in many workloads.

I would be interested to see how it performs and what these workloads look like, although we do need to fix the basic glibc and madvise locking problems first.

The attached graph are results of running the MySQL sysbench workload on my quad core system. As you can see, performance with #threads == #cpus (4) almost doubles from 1070 transactions per second to 2014 transactions/second.

On the high end (16 threads on 4 cpus), performance increases from 778 transactions/second on vanilla to 1310 transactions/second.

I have also benchmarked running Ulrich's changed glibc on a vanilla kernel, which gives results somewhere in-between, but much closer to just the vanilla kernel.

Looks like the idle time issue is still biting for those guys. Hmm, maybe MySQL is actually _touching_ the memory inside a more critical lock, so the faults get tangled up on mmap_sem there. I wonder if making malloc call memset right afterwards would hide that ;) Or the madvise exclusive mmap_sem avoidance.

Seems like with perfect scaling we should get to the 2400 mark. It would be nice to be able to not degrade under load. Of course some of that will be MySQL scaling issues.

-- 
SUSE Labs, Novell Inc.
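For context on the free-then-reuse pattern being benchmarked, here is a small userspace sketch of what an allocator-like caller does. It is an illustration only, not the glibc change mentioned above: release_lazily and demo_free_and_reuse are hypothetical names, and the fallback to MADV_DONTNEED is an assumption made so the sketch also works where MADV_FREE is not defined or not accepted.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* MAP_ANONYMOUS and madvise() on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Tell the kernel the range is no longer needed; returns 0 on success. */
static int release_lazily(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
#ifdef MADV_FREE
	/* Lazy variant: pages may be reclaimed later, or kept if reused first. */
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
#endif
	/* Eager variant: pages are dropped now, the next touch faults. */
	return madvise(addr, len, MADV_DONTNEED);
}

/* Mimic "free, then realloc and touch" on one anonymous mapping. */
static int demo_free_and_reuse(void)
{
	const size_t len = 16 * 4096;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ok;

	if (buf == MAP_FAILED)
		return -1;
	memset(buf, 0xaa, len);		/* the memory is in use */
	if (release_lazily(buf, len) != 0) {
		munmap(buf, len);
		return -1;
	}
	buf[0] = 1;			/* reuse: the write must simply work */
	ok = (buf[0] == 1);
	munmap(buf, len);
	return ok ? 0 : -1;
}
```

The reuse write is where the two variants differ in cost: after MADV_DONTNEED it always takes a page fault, while a lazily freed page that has not been reclaimed yet can be reused without one.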
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote:
> On Thu, 12 Apr 2007 16:10:50 -0700
> William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>
> > On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote:
> > > This patch series introduces /proc/pid/pagemap and /proc/kpagemap,
> > > which allow detailed run-time examination of process memory usage at a
> > > page granularity.
> > > The first several patches whip the page-walking code introduced for
> > > /proc/pid/smaps and clear_refs into a more generic form, the next
> > > couple make those interfaces optional, and the last two introduce the
> > > new interfaces, also optional.
> >
> > This solves a real-life problem for Oracle system monitoring software
> > (specifically EM). Among the tasks it must carry out is determining
> > per-process memory footprint of a set of cooperating tasks (i.e. Oracle
> > processes). RSS is inadequate for this due to page sharing; this work
> > provides sufficient information to determine what EM needs.
>
> I'm still dying to see what the human-readable output from this
> thing looks like.

Still a work-in-progress. It's a monstrous amount of data and it basically requires a GUI to really get a handle on. Here's a couple apps I've been tinkering with (aka My First GTK Apps):

http://selenic.com/Screenshot-pagemap.png

That's a snapshot of a live-updating image of memory usage for a running process (Galeon). Each pixel is a page. Each 32x32 block is 4MB. Mappings are dark red. Pages that are actually faulted in are bright red. You can poke around in the memory map with the mouse and highlight mappings (blue). And pages that get faulted in flash green (hard to capture in a screenshot).

http://selenic.com/Screenshot-kpagemap.png

And that's a live-updating image of system-wide memory usage. Bright red are pages with a count of 1, dark red are pages with higher counts. Next is to visualize slab/page cache/buddy/active/lru data as well as highlight changing pages.

This isn't terribly interesting yet. It can tell you things about page cache usage and fragmentation and readahead and so on. But correlating across the two sources, we'll be able to show information like "what pages in a process are actually shared/active/lru/etc." You can take it even further by correlating the above data with symbol info from nm, /proc/pid/clear_refs, etc.

Also, something I immediately noticed on looking at the raw data (cat /proc/`pidof`/pagemap | hexdump -C | less):

002c8fd0  ff ff ff ff ff ff ff ff  ff ff ff ff 6d f8 03 00  |............m...|
002c8fe0  6c f8 03 00 b9 f8 03 00  6b f8 03 00 6a f8 03 00  |l.......k...j...|
002c8ff0  b8 f8 03 00 69 f8 03 00  68 f8 03 00 b7 f8 03 00  |....i...h.......|
002c9000  67 f8 03 00 66 f8 03 00  b6 f8 03 00 65 f8 03 00  |g...f.......e...|
002c9010  64 f8 03 00 b5 f8 03 00  63 f8 03 00 62 f8 03 00  |d.......c...b...|
002c9020  b4 f8 03 00 61 f8 03 00  60 f8 03 00 b3 f8 03 00  |....a...`.......|
002c9030  7f f8 03 00 7e f8 03 00  b2 f8 03 00 7d f8 03 00  |....~.......}...|
002c9040  7c f8 03 00 b1 f8 03 00  5f f8 03 00 5e f8 03 00  ||......._...^...|
002c9050  b0 f8 03 00 5d f8 03 00  5c f8 03 00 af f8 03 00  |....]...\.......|

Most of the consecutive page frames are allocated in descending order (6d 6c 6b 6a ...). That's pessimal for physical merging of block I/O. Given that we theoretically fixed this long-standing problem in 2.6 but it's obviously still happening, it's clear that a little more visibility into the VM would be useful.

-- 
Mathematics is the supreme nostalgia of our time.
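The descending-order observation can be checked mechanically rather than by eye. The following is a rough sketch under assumptions read off the hexdump above, not taken from the patch: each entry is a 32-bit little-endian page frame number, and an all-ones entry means the page is not present. The names read_le32, PFN_NOT_PRESENT and count_descending_pairs are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PFN_NOT_PRESENT 0xffffffffu	/* assumption: all-ones marks an unmapped page */

/* Assemble a 32-bit value from four little-endian bytes. */
static uint32_t read_le32(const unsigned char *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Count virtually adjacent pages whose physical frame number drops by
 * exactly one, i.e. runs that were handed out in descending order.
 */
static size_t count_descending_pairs(const unsigned char *buf, size_t entries)
{
	size_t hits = 0;

	assert(buf != NULL);
	for (size_t i = 0; i + 1 < entries; i++) {
		uint32_t cur = read_le32(buf + 4 * i);
		uint32_t next = read_le32(buf + 4 * (i + 1));

		if (cur == PFN_NOT_PRESENT || next == PFN_NOT_PRESENT)
			continue;
		if (next + 1 == cur)
			hits++;
	}
	return hits;
}
```

Fed the last entry of the first dump row and the four entries of the second (0x3f86d, 0x3f86c, 0x3f8b9, 0x3f86b, 0x3f86a), this counts two descending pairs, matching the 6d 6c / 6b 6a pattern pointed out in the message.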
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
William Lee Irwin III wrote:
On Thu, 12 Apr 2007 16:10:50 -0700
William Lee Irwin III <[EMAIL PROTECTED]> wrote:

This solves a real-life problem for Oracle system monitoring software (specifically EM). Among the tasks it must carry out is determining per-process memory footprint of a set of cooperating tasks (i.e. Oracle processes). RSS is inadequate for this due to page sharing; this work provides sufficient information to determine what EM needs.

On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote:

Not a good idea to expose raw flags in this manner - it changes at the drop of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable.

Not a good idea to use page->_count: page_count() will be more stable. Otherwise OK, I guess: the interpretation of the page refcount is unlikely to change much over time.

EM wants to determine page_mapcount() for the most part for the purposes of determining "uniquely attributable RSS" (my ca. 2004 nomenclature) or "proportional RSS" (mpm's more recent nomenclature); as things now stand it will have to infer them by maintaining a table of pfn's and mappings thereof, but at least that can be done with it.

I don't know whether you can easily determine page_mapcount with page_count and flags, though (count gives you an educated guess, but mapcount is the real thing).

page_mapcount sounds very reasonable to export. It is directly tied with the userspace concept of mapping pages. page_count doesn't seem very useful (and if you must have it, please use page_count), neither does page flags.

You could have a bit indicating whether the page is free or not (but that doesn't tell you much that meminfo or zoneinfo or buddyinfo does not). Dirty/writeback/referenced/uptodate maybe?... I'm stumped, what's flags for?

-- 
SUSE Labs, Novell Inc.
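To make the case for exporting page_mapcount concrete, here is a sketch of how a monitoring tool might turn per-page map counts into the two figures named above. It rests on assumptions: "uniquely attributable" is taken to mean pages mapped exactly once, "proportional" is taken to mean each page's size divided by its map count, and the map counts are assumed to have been collected already by some means. summarize_rss and struct rss_summary are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct rss_summary {
	unsigned long unique_bytes;		/* pages mapped exactly once */
	unsigned long proportional_bytes;	/* sum of page_size / mapcount */
};

/*
 * mapcount[i] is the number of mappings of the i-th resident page of one
 * process; 0 means the page is not mapped and is skipped.
 */
static struct rss_summary summarize_rss(const unsigned int *mapcount,
					size_t npages, unsigned long page_size)
{
	struct rss_summary s = { 0, 0 };

	assert(mapcount != NULL && page_size > 0);
	for (size_t i = 0; i < npages; i++) {
		if (mapcount[i] == 0)
			continue;
		if (mapcount[i] == 1)
			s.unique_bytes += page_size;
		/* integer division: the remainder is dropped in this sketch */
		s.proportional_bytes += page_size / mapcount[i];
	}
	return s;
}
```

For example, with 4096-byte pages and map counts 1, 2, 4, 0, 1 the unique figure is 8192 bytes and the proportional figure is 4096 + 2048 + 1024 + 4096 = 11264 bytes. Summing the proportional figure over a set of cooperating processes avoids counting shared pages more than once, which plain RSS cannot do.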
Re: intermittant petabyte usage reported with broadcom nic
Roland Dreier <[EMAIL PROTECTED]> writes:

> [Adding Michael Chan, who seems to look after bnx2, to the cc list]
>
> > To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of
> > them all as amd64s). Network card driver in use is the one defined by
> > CONFIG_BNX2. Kernel's monolithic.
>
> From a quick look at bnx2.c, it seems that the driver gives the NIC
> (firmware?) a block of memory to DMA stats into, and just reads from
> that memory in its get_stats method. So if you're seeing wonky stats
> from the NIC intermittently, my best guess would be that firmware is
> occasionally writing junk into the stats block.

When only the firmware is writing to that area, it could be put into its own page and then write-protected with change_page_attr(). That would catch any corruption coming from the rest of the kernel.

-Andi
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
Andrew Morton wrote:
On Thu, 12 Apr 2007 16:10:50 -0700
William Lee Irwin III <[EMAIL PROTECTED]> wrote:

+	while (count > 0) {
+		chunk = min_t(size_t, count, PAGE_SIZE);
+		i = 0;
+
+		if (pfn == -1) {
+			page[0] = 0;
+			page[1] = 0;
+			((char *)page)[0] = (ntohl(1) != 1);

OK.

+			((char *)page)[1] = PAGE_SHIFT;

OK.

Shouldn't we just expose page size and endianness by other means? (another file or syscall).

+		for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) {
+			ppage = pfn_to_page(pfn);
+			if (!ppage) {
+				page[i] = 0;
+				page[i + 1] = 0;
+			} else {
+				page[i] = ppage->flags;
+				page[i + 1] = atomic_read(&ppage->_count);
+			}
+		}

Not a good idea to expose raw flags in this manner - it changes at the drop of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable.

I don't think it is viable because that makes the flags part of the userspace ABI. I wonder what they are needed for.

-- 
SUSE Labs, Novell Inc.
Re: [PATCH 2/3] NET: [UPDATED] Multiqueue network device support implementation.
Peter P Waskiewicz Jr wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 219a57f..3ce449e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1471,6 +1471,8 @@ gso:
>  		q = dev->qdisc;
>  		if (q->enqueue) {
>  			rc = q->enqueue(skb, q);
> +			/* reset queue_mapping to zero */
> +			skb->queue_mapping = 0;

This must be done before enqueueing. At this point you don't even have a valid reference to the skb anymore.

> @@ -3326,12 +3330,23 @@ struct net_device *alloc_netdev(int sizeof_priv, const char *name,
>  	if (sizeof_priv)
>  		dev->priv = netdev_priv(dev);
>
> +	alloc_size = (sizeof(struct net_device_subqueue) * queue_count);
> +
> +	p = kzalloc(alloc_size, GFP_KERNEL);
> +	if (!p) {
> +		printk(KERN_ERR "alloc_netdev: Unable to allocate queues.\n");
> +		return NULL;

Still leaks the device

> diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
> index 5cfe60b..6a38905 100644
> --- a/net/sched/sch_prio.c
> +++ b/net/sched/sch_prio.c
> @@ -144,11 +152,17 @@ prio_dequeue(struct Qdisc* sch)
>  	struct Qdisc *qdisc;
>
>  	for (prio = 0; prio < q->bands; prio++) {
> -		qdisc = q->queues[prio];
> -		skb = qdisc->dequeue(qdisc);
> -		if (skb) {
> -			sch->q.qlen--;
> -			return skb;
> +		/* Check if the target subqueue is available before
> +		 * pulling an skb. This way we avoid excessive requeues
> +		 * for slower queues.
> +		 */
> +		if (!netif_subqueue_stopped(sch->dev, q->band2queue[prio])) {
> +			qdisc = q->queues[prio];
> +			skb = qdisc->dequeue(qdisc);
> +			if (skb) {
> +				sch->q.qlen--;
> +				return skb;
> +			}
>  		}
>  	}
>  	return NULL;
> @@ -200,6 +214,10 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
>  	struct prio_sched_data *q = qdisc_priv(sch);
>  	struct tc_prio_qopt *qopt = RTA_DATA(opt);
>  	int i;
> +	int queue;
> +	int qmapoffset;
> +	int offset;
> +	int mod;
>
>  	if (opt->rta_len < RTA_LENGTH(sizeof(*qopt)))
>  		return -EINVAL;
> @@ -242,6 +260,30 @@ static int prio_tune(struct Qdisc *sch, struct rtattr *opt)
>  			}
>  		}
>  	}
> +	/* setup queue to band mapping */
> +	if (q->bands < sch->dev->egress_subqueue_count) {
> +		qmapoffset = 1;
> +		mod = sch->dev->egress_subqueue_count;
> +	} else {
> +		mod = q->bands % sch->dev->egress_subqueue_count;
> +		qmapoffset = q->bands / sch->dev->egress_subqueue_count +
> +				((mod) ? 1 : 0);
> +	}
> +
> +	queue = 0;
> +	offset = 0;
> +	for (i = 0; i < q->bands; i++) {
> +		q->band2queue[i] = queue;
> +		if ( ((i + 1) - offset) == qmapoffset) {
> +			queue++;
> +			offset += qmapoffset;
> +			if (mod)
> +				mod--;
> +			qmapoffset = q->bands /
> +				sch->dev->egress_subqueue_count +
> +				((mod) ? 1 : 0);
> +		}
> +	}
>  	return 0;
>  }

I stand by my point, this needs to be explicitly enabled by the user since it changes the behaviour of prio on multiqueue capable device.
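To see what the quoted band-to-queue loop actually produces, it can be lifted out of prio_tune() into a standalone function. The following is a userspace transcription for experimentation only: map_bands_to_queues is a hypothetical name, plain ints stand in for q->bands and sch->dev->egress_subqueue_count, and nothing here is part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace transcription of the quoted mapping loop.
 * band2queue must have room for `bands` entries.
 */
static void map_bands_to_queues(int bands, int queues, int *band2queue)
{
	int queue = 0, offset = 0, qmapoffset, mod, i;

	assert(bands > 0 && queues > 0 && band2queue != NULL);

	if (bands < queues) {
		qmapoffset = 1;		/* one band per queue */
		mod = queues;
	} else {
		mod = bands % queues;	/* queues that get one extra band */
		qmapoffset = bands / queues + (mod ? 1 : 0);
	}

	for (i = 0; i < bands; i++) {
		band2queue[i] = queue;
		if ((i + 1) - offset == qmapoffset) {
			/* this queue is full, move on to the next one */
			queue++;
			offset += qmapoffset;
			if (mod)
				mod--;
			qmapoffset = bands / queues + (mod ? 1 : 0);
		}
	}
}
```

Stepping through this by hand, 3 bands on 2 queues come out as 0, 0, 1, 3 bands on 4 queues as 0, 1, 2, and 8 bands on 3 queues as 0, 0, 0, 1, 1, 1, 2, 2. Note that the example in the accompanying documentation patch lists band 1 on queue 1 for the 3-band, 2-queue case; this sketch only reflects the loop as quoted here, so one of the two may need a second look.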
[PATCH 1/3] NET: Multiqueue network device support documentation.
From: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>

Adding documentation for the new multiqueue API.

Signed-off-by: Peter P. Waskiewicz Jr <[EMAIL PROTECTED]>
Signed-off-by: Auke Kok <[EMAIL PROTECTED]>
---

 Documentation/networking/multiqueue.txt |   97 +++
 1 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
new file mode 100644
index 0000000..c32ed83
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,97 @@
+
+		HOWTO for multiqueue network device support
+		===========================================
+
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO for multiqueue devices
+
+
+Intro: Kernel support for multiqueue devices
+--------------------------------------------
+
+Kernel support for multiqueue devices is only an API that is presented to the
+netdevice layer for base drivers to implement. This feature is part of the
+core networking stack, and all network devices will be running on the
+multiqueue-aware stack. If a base driver only has one queue, then these
+changes are transparent to that driver.
+
+
+Section 2: Base driver requirements for implementing multiqueue support
+-----------------------------------------------------------------------
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device. The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today. Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational. netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+Finally, the base driver should indicate that it is a multiqueue device. The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization. Below is an example from e1000:
+
+#ifdef CONFIG_E1000_MQ
+	if ( (adapter->hw.mac.type == e1000_82571) ||
+	     (adapter->hw.mac.type == e1000_82572) ||
+	     (adapter->hw.mac.type == e1000_80003es2lan))
+		netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+
+
+Section 3: Qdisc support for multiqueue devices
+-----------------------------------------------
+
+Currently two qdiscs support multiqueue devices. The default qdisc, pfifo_fast,
+and the PRIO qdisc. The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb->queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
+
+pfifo_fast, being the default qdisc when a device is brought online, will not
+assign a queue mapping, therefore the skb will have a value of zero. We
+cannot assume anything about the device itself, how many queues it really has,
+etc. Therefore sending all traffic to queue 0 is the safest thing to do here.
+
+The PRIO qdisc naturally plugs into a multiqueue device. Upon load of the
+qdisc, PRIO will make a best-effort assignment of queue to PRIO band to evenly
+distribute traffic flows. The algorithm can be found in prio_tune() in
+net/sched/sch_prio.c. Once the association is made, any skb that is
+classified will have skb->queue_mapping set, which will allow the driver to
+properly queue skb's to multiple queues.
+
+
+Section 4: Brief howto using PRIO for multiqueue devices
+--------------------------------------------------------
+
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs. To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: prio
+
+This will create 3 bands, 0 being highest priority, and associate those bands
+to the queues on your NIC. Assuming eth0 has 2 Tx queues, the band mapping
+would look like:
+
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 1
+
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands. For example, ssh traffic will always try to
+go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0. ICMP traffic (pings) fall into the "normal"
+traffic classification, which is band 1. Therefore pings will be send out
+queue 1 on the NIC.
+
+The behavior of tc filters remains the same, where it will override TOS priority
+classification.
+
+
+Author: Peter P. Waskiewicz Jr. <[EMAIL PROTECTED]>
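The per-queue flow control described in the base driver section can be tried out in a small userspace model. This is a sketch under stated assumptions, not kernel code: one boolean per queue stands in for the XOFF bit that netif_{start|stop}_subqueue() manipulate, model_pick_band() imitates the "skip bands whose queue is stopped" check that the patched prio_dequeue() performs, and every model_* name is invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_QUEUES 8	/* arbitrary limit for this sketch */

/* One XOFF flag per TX queue, standing in for the per-subqueue state word. */
struct model_netdev {
	bool xoff[MODEL_MAX_QUEUES];
	size_t queue_count;
};

static void model_stop_subqueue(struct model_netdev *dev, size_t q)
{
	assert(q < MODEL_MAX_QUEUES);
	dev->xoff[q] = true;	/* the driver ran out of descriptors on q */
}

static void model_start_subqueue(struct model_netdev *dev, size_t q)
{
	assert(q < MODEL_MAX_QUEUES);
	dev->xoff[q] = false;	/* q can accept packets again */
}

static bool model_subqueue_stopped(const struct model_netdev *dev, size_t q)
{
	assert(q < MODEL_MAX_QUEUES);
	return dev->xoff[q];
}

/*
 * Return the highest-priority band that has packets pending and whose
 * backing queue is running, or -1 if nothing can be sent right now.
 */
static int model_pick_band(const struct model_netdev *dev,
			   const size_t *band2queue,
			   const unsigned int *backlog, size_t bands)
{
	for (size_t band = 0; band < bands; band++) {
		if (model_subqueue_stopped(dev, band2queue[band]))
			continue;
		if (backlog[band] > 0)
			return (int)band;
	}
	return -1;
}
```

With the band mapping from the howto (0, 1, 1) and traffic pending on every band, stopping queue 0 makes the model fall through to band 1 instead of stalling, which is the behaviour the per-queue state is meant to allow.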
[PATCH 2/3] NET: [UPDATED] Multiqueue network device support implementation.
From: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>

Update: Removed unnecessary whitespace removals. Reset skb->queue_mapping to zero prior to enqueueing to a qdisc. Fixed band2queue mapping algorithm for bands less than queues.

Added an API and associated supporting routines for multiqueue network devices. This allows network devices supporting multiple TX queues to configure each queue within the netdevice and manage each queue independently. Changes to the PRIO Qdisc also allow a user to map multiple flows to individual TX queues, taking advantage of each queue on the device.

Signed-off-by: Peter P. Waskiewicz Jr <[EMAIL PROTECTED]>
Signed-off-by: Auke Kok <[EMAIL PROTECTED]>
---

 include/linux/etherdevice.h |    3 +-
 include/linux/netdevice.h   |   62 ++-
 include/linux/skbuff.h      |    2 +
 net/core/dev.c              |   27 +++
 net/core/skbuff.c           |    3 ++
 net/ethernet/eth.c          |    9 +++---
 net/sched/sch_generic.c     |    3 +-
 net/sched/sch_prio.c        |   54 +
 8 files changed, 144 insertions(+), 19 deletions(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 745c988..446de39 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -39,7 +39,8 @@ extern void eth_header_cache_update(struct hh_cache *hh, struct net_device *dev
 extern int eth_header_cache(struct neighbour *neigh,
 			    struct hh_cache *hh);
 
-extern struct net_device *alloc_etherdev(int sizeof_priv);
+extern struct net_device *alloc_etherdev_mq(int sizeof_priv, int queue_count);
+#define alloc_etherdev(sizeof_priv) alloc_etherdev_mq(sizeof_priv, 1)
 static inline void eth_copy_and_sum (struct sk_buff *dest,
 				     const unsigned char *src,
 				     int len, int base)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 71fc8ff..f00b94a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -106,6 +106,14 @@ struct netpoll_info;
 #define MAX_HEADER (LL_MAX_HEADER + 48)
 #endif
 
+struct net_device_subqueue
+{
+	/* Give a control state for each queue. This struct may contain
+	 * per-queue locks in the future.
+	 */
+	unsigned long state;
+};
+
 /*
  * Network device statistics. Akin to the 2.0 ether stats but
  * with byte counters.
@@ -324,6 +332,7 @@ struct net_device
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
 #define NETIF_F_INTERNAL_STATS	8192	/* Use stats structure in net_device */
+#define NETIF_F_MULTI_QUEUE	16384	/* Has multiple TX/RX queues */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -534,6 +543,10 @@ struct net_device
 	struct device dev;
 	/* space for optional statistics and wireless sysfs groups */
 	struct attribute_group *sysfs_groups[3];
+
+	/* The TX queue control structures */
+	struct net_device_subqueue *egress_subqueue;
+	int egress_subqueue_count;
 };
 
 #define to_net_dev(d) container_of(d, struct net_device, dev)
@@ -675,6 +688,48 @@ static inline int netif_running(const struct net_device *dev)
 	return test_bit(__LINK_STATE_START, &dev->state);
 }
 
+/*
+ * Routines to manage the subqueues on a device. We only need start
+ * stop, and a check if it's stopped. All other device management is
+ * done at the overall netdevice level.
+ * Also test the device if we're multiqueue.
+ */
+static inline void netif_start_subqueue(struct net_device *dev, u16 queue_index)
+{
+	clear_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_stop_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	set_bit(__LINK_STATE_XOFF, &dev->egress_subqueue[queue_index].state);
+}
+
+static inline int netif_subqueue_stopped(const struct net_device *dev,
+					 u16 queue_index)
+{
+	return test_bit(__LINK_STATE_XOFF,
+			&dev->egress_subqueue[queue_index].state);
+}
+
+static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	if (test_and_clear_bit(__LINK_STATE_XOFF,
+			       &dev->egress_subqueue[queue_index].state))
+		__netif_schedule(dev);
+}
+
+static inline int netif_is_multiqueue(const struct net_device *dev)
+{
+	return (!!(NETIF_F_MULTI_QUEUE & dev->features));
+}
 
 /* Use this variant when it is known for sure that it
  * is executing from interrupt context.
@@ -968,8 +1023,11 @@ static inline void
[PATCH 3/3] NET: [e1000] Example implementation of multiqueue network device API
From: Peter P Waskiewicz Jr <[EMAIL PROTECTED]>

This patch is *not* intended to be integrated into any tree please. This is fulfilling a request to demonstrate the proposed multiqueue network device API in a driver. The necessary updates to the e1000 driver will come in a more official release. This is an as-is patch to this version of e1000, and should not be used outside of testing purposes only.

Signed-off-by: Peter P. Waskiewicz Jr <[EMAIL PROTECTED]>
---

 drivers/net/e1000/e1000.h         |    8 ++
 drivers/net/e1000/e1000_ethtool.c |   47 ++-
 drivers/net/e1000/e1000_main.c    |  164 -
 3 files changed, 194 insertions(+), 25 deletions(-)

diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index dd4b728..15e484e 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -168,6 +168,10 @@ struct e1000_buffer {
 	uint16_t next_to_watch;
 };
 
+struct e1000_queue_stats {
+	u64 packets;
+	u64 bytes;
+};
 
 struct e1000_ps_page { struct page *ps_page[PS_PAGE_BUFFERS]; };
 struct e1000_ps_page_dma { uint64_t ps_page_dma[PS_PAGE_BUFFERS]; };
@@ -188,9 +192,11 @@ struct e1000_tx_ring {
 	/* array of buffer information structs */
 	struct e1000_buffer *buffer_info;
 
+	spinlock_t tx_queue_lock;
 	spinlock_t tx_lock;
 	uint16_t tdh;
 	uint16_t tdt;
+	struct e1000_queue_stats tx_stats;
 
 	boolean_t last_tx_tso;
 };
@@ -218,6 +224,7 @@ struct e1000_rx_ring {
 
 	uint16_t rdh;
 	uint16_t rdt;
+	struct e1000_queue_stats rx_stats;
 };
 
 #define E1000_DESC_UNUSED(R) \
@@ -271,6 +278,7 @@ struct e1000_adapter {
 
 	/* TX */
 	struct e1000_tx_ring *tx_ring; /* One per active queue */
+	struct e1000_tx_ring **cpu_tx_ring;
 	unsigned int restart_queue;
 	unsigned long tx_queue_len;
 	uint32_t txd_cmd;
diff --git a/drivers/net/e1000/e1000_ethtool.c b/drivers/net/e1000/e1000_ethtool.c
index 6777887..fd466a1 100644
--- a/drivers/net/e1000/e1000_ethtool.c
+++ b/drivers/net/e1000/e1000_ethtool.c
@@ -105,7 +105,12 @@ static const struct e1000_stats e1000_gstrings_stats[] = {
 	{ "dropped_smbus", E1000_STAT(stats.mgpdc) },
 };
 
-#define E1000_QUEUE_STATS_LEN 0
+#define E1000_QUEUE_STATS_LEN \
+((struct e1000_adapter *)netdev->priv)->num_rx_queues > 1) ? \
+ ((struct e1000_adapter *)netdev->priv)->num_rx_queues : 0 ) + \
+ (struct e1000_adapter *)netdev->priv)->num_tx_queues > 1) ? \
+ ((struct e1000_adapter *)netdev->priv)->num_tx_queues : 0 ))) * \
+(sizeof(struct e1000_queue_stats) / sizeof(u64)))
 #define E1000_GLOBAL_STATS_LEN \
 	sizeof(e1000_gstrings_stats) / sizeof(struct e1000_stats)
 #define E1000_STATS_LEN (E1000_GLOBAL_STATS_LEN + E1000_QUEUE_STATS_LEN)
@@ -693,8 +698,10 @@ e1000_set_ringparam(struct net_device *netdev,
 		E1000_MAX_TXD : E1000_MAX_82544_TXD));
 	E1000_ROUNDUP(txdr->count, REQ_TX_DESCRIPTOR_MULTIPLE);
 
-	for (i = 0; i < adapter->num_tx_queues; i++)
+	for (i = 0; i < adapter->num_tx_queues; i++) {
 		txdr[i].count = txdr->count;
+		spin_lock_init(&adapter->tx_ring[i].tx_queue_lock);
+	}
 	for (i = 0; i < adapter->num_rx_queues; i++)
 		rxdr[i].count = rxdr->count;
 
@@ -1909,6 +1916,9 @@ e1000_get_ethtool_stats(struct net_device *netdev,
 		struct ethtool_stats *stats, uint64_t *data)
 {
 	struct e1000_adapter *adapter = netdev_priv(netdev);
+	u64 *queue_stat;
+	int stat_count = sizeof(struct e1000_queue_stats) / sizeof(u64);
+	int j, k;
 	int i;
 
 	e1000_update_stats(adapter);
@@ -1917,12 +1927,29 @@ e1000_get_ethtool_stats(struct net_device *netdev,
 		data[i] = (e1000_gstrings_stats[i].sizeof_stat ==
 			sizeof(uint64_t)) ? *(uint64_t *)p : *(uint32_t *)p;
 	}
+	if (adapter->num_tx_queues > 1) {
+		for (j = 0; j < adapter->num_tx_queues; j++) {
+			queue_stat = (u64 *)&adapter->tx_ring[j].tx_stats;
+			for (k = 0; k < stat_count; k++)
+				data[i + k] = queue_stat[k];
+			i += k;
+		}
+	}
+	if (adapter->num_rx_queues > 1) {
+		for (j = 0; j < adapter->num_rx_queues; j++) {
+			queue_stat = (u64 *)&adapter->rx_ring[j].rx_stats;
+			for (k = 0; k < stat_count; k++)
+				data[i + k] = queue_stat[k];
+			i += k;
+		}
+	}
 /* BUG_ON(i != E1000_STATS_LEN); */
 }
 
 static void
 e1000_get_strings(struct net_device *netdev, uint32_t stringset, uint8_t *data)
 {
+	struct e1000_adapter *adapter = netdev_priv(netdev);
 	uint8_t *p = data;
 	int i;
 
@@ -1937,6 +1964,22 @@ e1000_get_strings(struct net_device
[PATCH 0/3] [UPDATED]: Multiqueue network device support
This is a redesign and repost of the multiqueue network device support patches. The new API for base drivers allows multiqueue-capable devices to manage their individual queues in the network stack. The stack now handles both non-multiqueue and multiqueue devices on the same codepath. Also, allocation and deallocation of the queues is handled by the kernel instead of the driver.

Fixes have been integrated into this patchset based on community feedback.

A patched version of e1000 using the multiqueue API has also been included. NOTE that this version of e1000 is *only* for testing purposes, and is not intended to be integrated into the kernel at this time. It is only for demonstration purposes. The e1000 patch will only work with MAC types of 82571 and higher.

Documentation is also included describing in more detail how this works, as well as how a base driver can use the API to implement multiple queues.

These patches can also be pulled from my git repository at:

git-pull git://lost.foo-projects.org/~ppwaskie/git/net-2.6.22 mq

-- 
Peter P. Waskiewicz Jr. <[EMAIL PROTECTED]>
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
Andi Kleen wrote: > Yes, but then we should have seen more frequently, shouldn't we? I always > run with the stack overflow check enabled and I don't think I ever saw > warnings in inflate. > I guess the window is just while decompressing the root filesystem. Interrupts under Xen might be using a little more stack (~40-50 bytes?), but it's not a qualitative difference. It might have more to do with different timing. J
[PATCH] pnpbios_thread_init: don't use CLONE_SIGHAND
pnp_dock_thread() calls allow_signal() which plays with parent process's ->sighand. Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]> --- 2.6.21-rc5/drivers/pnp/pnpbios/core.c~3_pnp 2006-12-17 19:06:40.0 +0300 +++ 2.6.21-rc5/drivers/pnp/pnpbios/core.c 2007-04-13 03:44:34.0 +0400 @@ -589,7 +589,7 @@ static int __init pnpbios_thread_init(vo return 0; #ifdef CONFIG_HOTPLUG init_completion(_sem); - if (kernel_thread(pnp_dock_thread, NULL, CLONE_KERNEL) > 0) + if (kernel_thread(pnp_dock_thread, NULL, CLONE_FS | CLONE_FILES) > 0) unloading = 0; #endif return 0;
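The reasoning behind this patch (and the two lockd and usbatm patches that follow the same pattern) can be sketched with plain flag arithmetic. The block below is only an illustration, not kernel code: the DEMO_ names are hypothetical, the numeric values are the ones the Linux headers use for CLONE_FS, CLONE_FILES and CLONE_SIGHAND, and it assumes that CLONE_KERNEL at the time was defined as the union of those three flags. A thread created with CLONE_SIGHAND shares its creator's signal handler table, so allow_signal() in the new thread would also change the creator's handlers.

```c
#include <assert.h>

/* Hypothetical local copies of the flag values from the Linux headers. */
#define DEMO_CLONE_FS      0x00000200UL
#define DEMO_CLONE_FILES   0x00000400UL
#define DEMO_CLONE_SIGHAND 0x00000800UL

/* Assumption: CLONE_KERNEL was the union of the three flags above. */
#define DEMO_CLONE_KERNEL \
    (DEMO_CLONE_FS | DEMO_CLONE_FILES | DEMO_CLONE_SIGHAND)

/* Nonzero if a thread created with these flags would share the
 * creator's signal handler table (->sighand). */
static int shares_sighand(unsigned long flags)
{
    return (flags & DEMO_CLONE_SIGHAND) != 0;
}

/* What the patch effectively does: keep everything except CLONE_SIGHAND. */
static unsigned long drop_sighand(unsigned long flags)
{
    return flags & ~DEMO_CLONE_SIGHAND;
}
```

With these helpers, drop_sighand(DEMO_CLONE_KERNEL) yields exactly DEMO_CLONE_FS | DEMO_CLONE_FILES, which is the flag set the patch passes to kernel_thread().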
[PATCH] nlmclnt_recovery: don't use CLONE_SIGHAND
reclaimer() calls allow_signal() which plays with parent process's ->sighand. Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]> --- 2.6.21-rc5/fs/lockd/clntlock.c~1_lockd 2007-04-05 12:04:07.0 +0400 +++ 2.6.21-rc5/fs/lockd/clntlock.c 2007-04-13 03:20:51.0 +0400 @@ -153,7 +153,7 @@ nlmclnt_recovery(struct nlm_host *host) if (!host->h_reclaiming++) { nlm_get_host(host); __module_get(THIS_MODULE); - if (kernel_thread(reclaimer, host, CLONE_KERNEL) < 0) + if (kernel_thread(reclaimer, host, CLONE_FS | CLONE_FILES) < 0) module_put(THIS_MODULE); } }
[PATCH] usbatm_heavy_init: don't use CLONE_SIGHAND
usbatm_do_heavy_init() calls allow_signal() which plays with parent process's ->sighand. Signed-off-by: Oleg Nesterov <[EMAIL PROTECTED]> --- 2.6.21-rc5/drivers/usb/atm/usbatm.c~usbatm 2006-11-27 21:19:30.0 +0300 +++ 2.6.21-rc5/drivers/usb/atm/usbatm.c 2007-04-13 03:34:56.0 +0400 @@ -1019,7 +1019,7 @@ static int usbatm_do_heavy_init(void *ar static int usbatm_heavy_init(struct usbatm_data *instance) { - int ret = kernel_thread(usbatm_do_heavy_init, instance, CLONE_KERNEL); + int ret = kernel_thread(usbatm_do_heavy_init, instance, CLONE_FS | CLONE_FILES); if (ret < 0) { usb_err(instance, "%s: failed to create kernel_thread (%d)!\n", __func__, ret);
Re: [PATCH] [KERNEL-DOC] fix tex error when building pdfdocs
On Thu, 12 Apr 2007 22:38:42 +0200 Borislav Petkov wrote: > When building pdfdocs, the db2pdf converter bails out because of an > latex-reserved token - '#' - in the intermediary .tex file which ends up in a > conversion error with the following error message: > > > [15.0.32]) > ! Incomplete \iffalse; all text was ignored after line 8154. > > \fi > <*> kernel-hacking.tex > > > This is a rather arbitrary fix, so suggest away. Hi, I don't have a problem with the change, but I don't get that tex error either. Here is an extract from the .tex file: {\def\Element% {451}\def\ProcessingMode% {title-sosofo-mode}}\#if\endNode{}\endSeq{}\endLink{}\Seq% {}\Leader% {}.\endLeader{}\Link% > Signed-off-by: Borislav Petkov <[EMAIL PROTECTED]> > > Index: 21-rc6/Documentation/DocBook/kernel-hacking.tmpl > === > --- 21-rc6.orig/Documentation/DocBook/kernel-hacking.tmpl > +++ 21-rc6/Documentation/DocBook/kernel-hacking.tmpl > @@ -1138,7 +1138,7 @@ static struct block_device_operations op > > > > - if > + Prepocessor Conditionals > > > It is generally considered cleaner to use macros in header files > > - --- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code *** - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, 12 Apr 2007 16:10:50 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote: >> This solves a real-life problem for Oracle system monitoring software >> (specifically EM). Among the tasks it must carry out is determining >> per-process memory footprint of a set of cooperating tasks (i.e. Oracle >> processes). RSS is inadequate for this due to page sharing; this work >> provides sufficient information to determine what EM needs. On Thu, Apr 12, 2007 at 04:32:35PM -0700, Andrew Morton wrote: > Not a good idea to expose raw flags in this manner - it changes at the drop > of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber > mapping to make this viable. > Not a good idea to use page->_count: page_count() will be more stable. > Otherwise OK, I guess: the interpretation of the page refcount is unlikely > to change much over time. EM wants to determine page_mapcount() for the most part for the purposes of determining "uniquely attributable RSS" (my ca. 2004 nomenclature) or "proportional RSS" (mpm's more recent nomenclature); as things now stand it will have to infer them by maintaining a table of pfn's and mappings thereof, but at least that can be done with it. -- wli
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
On Friday 13 April 2007 01:20:40 Jan Engelhardt wrote: > > On Apr 12 2007 15:39, Jeremy Fitzhardinge wrote: > >Andi Kleen wrote: > >> Hmm, does Xen perhaps not use interrupt stacks? Normally 2.7k should be > >> still > >> green as long as there are not too many functions above/below it. > > > >That's a good point, I'll need to check that. Still, nearly 3k of stack! > > I bite. Would compressing the vmlinux binary with LZO or LZMA make an > improvement to the bootstrap uncompress stack usage? We don't care about the stack usage, as long as it doesn't overflow. It's a very limited piece of code that doesn't run on top or below other subsystems. -Andi
Re: [PATCH 0/30] Use menuconfig objects
On Fri, 13 Apr 2007 01:16:35 +0200 (MEST) Jan Engelhardt <[EMAIL PROTECTED]> wrote: > On Apr 12 2007 15:50, Andrew Morton wrote: > >On Tue, 10 Apr 2007 21:17:40 +0200 (MEST) > >Jan Engelhardt <[EMAIL PROTECTED]> wrote: > > > >> the following patch series turns some menus into menuconfigs, so they > >> can be disabled whilst "walking" through the parent menu > > > >So I merged the 23 of these which survived review and which do not > >intersect with other outstanding work. > > > >I don't think I have an opinion on whether the change is actually an > >improvement, and I don't get a clear sense of what others think. Shrug. > > > >If we're going to make this change, we should ensure that it is done > >kernel-wide, for UI consistency reasons. > > If time permits, I'll go through the rest of the menus I find > eligible for menuconfig-izing. OK. It's encouraging that Randy is on board. > Does it help to base them on -mm to work better with outstanding work? At this stage in the development cycle: 5444 files changed, 530428 insertions(+), 179401 deletions(-) yes, it helps quite a lot.
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
Jan Engelhardt wrote: > On Apr 12 2007 15:39, Jeremy Fitzhardinge wrote: > >> Andi Kleen wrote: >> >>> Hmm, does Xen perhaps not use interrupt stacks? Normally 2.7k should be >>> still >>> green as long as there are not too many functions above/below it. >>> >> That's a good point, I'll need to check that. Still, nearly 3k of stack! >> > > I bite. Would compressing the vmlinux binary with LZO or LZMA make an > improvement to the bootstrap uncompress stack usage? > Well, the thread started with my patch to fix inflate. The stack usage of LZO or LZMA decompressors will primarily depend on how they're implemented rather than any inherent property of the algorithms. J
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Thu, 12 Apr 2007 16:10:50 -0700 William Lee Irwin III <[EMAIL PROTECTED]> wrote: > On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote: > > This patch series introduces /proc/pid/pagemap and /proc/kpagemap, > > which allow detailed run-time examination of process memory usage at a > > page granularity. > > The first several patches whip the page-walking code introduced for > > /proc/pid/smaps and clear_refs into a more generic form, the next > > couple make those interfaces optional, and the last two introduce the > > new interfaces, also optional. > > This solves a real-life problem for Oracle system monitoring software > (specifically EM). Among the tasks it must carry out is determining > per-process memory footprint of a set of cooperating tasks (i.e. Oracle > processes). RSS is inadequate for this due to page sharing; this work > provides sufficient information to determine what EM needs. > > I'm still dying to see what the human-readable output from this thing looks like. > + * Each entry is a pair of unsigned longs representing the > + * corresponding physical page, the first containing the page flags > + * and the second containing the page use count. > + * > + * The first 4 bytes of this file form a simple header: > + * > + * first byte: 0 for big endian, 1 for little > + * second byte: page shift (eg 12 for 4096 byte pages) > + * third byte: entry size in bytes (currently either 4 or 8) > + * fourth byte: header size > > ... > > + while (count > 0) { > + chunk = min_t(size_t, count, PAGE_SIZE); > + i = 0; > + > + if (pfn == -1) { > + page[0] = 0; > + page[1] = 0; > + ((char *)page)[0] = (ntohl(1) != 1); OK. > + ((char *)page)[1] = PAGE_SHIFT; OK. > + ((char *)page)[2] = sizeof(unsigned long); OK. > + ((char *)page)[3] = KPMSIZE; OK. 
> + i = 2; > + pfn++; > + } > + > + for (; i < 2 * chunk / KPMSIZE; i += 2, pfn++) { > + ppage = pfn_to_page(pfn); > + if (!ppage) { > + page[i] = 0; > + page[i + 1] = 0; > + } else { > + page[i] = ppage->flags; > + page[i + 1] = atomic_read(>_count); > + } > + } Not a good idea to expose raw flags in this manner - it changes at the drop of a hat. We'd need to also expose the kernel's PG_foo-to-bitnumber mapping to make this viable. Not a good idea to use page->_count: page_count() will be more stable. Otherwise OK, I guess: the interpretation of the page refcount is unlikely to change much over time.
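For readers who want to see how a userspace tool might consume the 4-byte header described in the quoted comment (endianness, page shift, entry size, header size), here is a minimal sketch. It is not part of the patch: the struct and function names are hypothetical, and the validation rules (entry size must be 4 or 8, page shift must fit in an unsigned long) are assumptions drawn only from the comment text above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace view of the 4-byte /proc/kpagemap header. */
struct kpm_header {
    int little_endian;     /* first byte: 0 = big endian, 1 = little */
    unsigned page_shift;   /* second byte, e.g. 12 for 4096-byte pages */
    unsigned entry_size;   /* third byte: 4 or 8 per the comment */
    unsigned header_size;  /* fourth byte */
};

/* Parse the header; returns 0 on success, -1 if the bytes look invalid. */
static int kpm_parse_header(const uint8_t *buf, size_t len,
                            struct kpm_header *out)
{
    if (len < 4)
        return -1;
    if (buf[0] > 1)                          /* endianness flag is 0 or 1 */
        return -1;
    if (buf[1] >= sizeof(unsigned long) * CHAR_BIT)
        return -1;                           /* shift would overflow */
    if (buf[2] != 4 && buf[2] != 8)
        return -1;
    out->little_endian = buf[0];
    out->page_shift = buf[1];
    out->entry_size = buf[2];
    out->header_size = buf[3];
    return 0;
}

/* Page size implied by the header's page shift. */
static unsigned long kpm_page_size(const struct kpm_header *h)
{
    return 1UL << h->page_shift;
}

/* Same idea as the patch's (ntohl(1) != 1) test, done by inspecting
 * the first byte of a 32-bit 1 in memory. */
static int host_is_little_endian(void)
{
    const uint32_t one = 1;
    return *(const uint8_t *)&one == 1;
}
```

A reader would compare little_endian against host_is_little_endian() before interpreting the entries, and use entry_size to decide whether each flags/count value is 32 or 64 bits wide.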
Re: [PATCH resend][CRYPTO]: RSA algorithm patch
Hello, Next time, please do a reply-all so CC's aren't dropped. It seems you jumped halfway in, missing some background info, so I'll try to clarify some things. On Thu, April 12, 2007 23:28, David Wagner wrote: > Yes, Satyam Sharma is 100% correct. Unpadded RSA makes no sense. RSA is > not secure if you omit the padding. If you have a good reason why RSA > needs to be in the kernel for security reasons, then the padding has to be > in the kernel, too. Putting plain unpadded RSA in the kernel seems bogus. He is correct, I only argued that it can still be named RSA (which Satyam disputed), no matter what critical features are missing for a complete infrastructure. I don't know if you read the patch, but right now it's only a multi-precision integer implementation, useful to implement RSA. The rest, including the binary checking, is missing. We're pondering a bit about what, in the end, would be useful to have in or around the kernel. > I worry about the quality of this patch if it is using unpadded RSA. > This is pretty elementary stuff. No one should be implementing their > own crypto code unless they have considerable competence and knowledge > of cryptography. This elementary error leaves reason to be concerned > about whether the developer of this patch has the skills that are needed > to write this kind of code and get it right. As said above, the patch is only an MPI implementation, not RSA, and it includes none of the rest needed to make it useful, like a crypto API interface and padding. So we can't really judge the developer's skills or crypto knowledge. It does point out that having a hidden implementation can never foster much trust, as no one can read the code and judge if it's good or not. > People often take it personally when I tell them that they do are not > competent to write their own crypto code, but this is not a personal > attack.
It takes very specialized knowledge and considerable study > before one can write your own crypto implementation from scratch and > have a good chance that the result will be secure. People without > those skills shouldn't be writing their own crypto code, at least not > if security is important, because it's too easy to get something wrong. To a certain degree you're right, but the nice thing about open source is that people who know better can spot errors, and if those are fixed, it can happen that an "incompetent" person created something excellent. Maybe not at first, but in the end. (I suspect that a good coder with no crypto knowledge, but with feedback from experts, can implement something better than one expert with mediocre coding skills.) The code should be judged, not the people writing it. Besides, it isn't always that hard to get something secure, if things are kept simple and straightforward. E.g. writing a secure AES implementation isn't magic. RSA is much more complex though. (Rather ironic, as the theory behind RSA is simple, but the implementation hairy. With AES it's exactly the opposite. The coder doesn't need to understand the algebra behind it, knowing that it can be done with a simple table lookup is enough). In general the tricky part is around the crypto implementation itself, how it's used, key management, etc. (Though the border is vague, so maybe you included all that when saying "crypto implementation".) > (No, just reading Applied Cryptography is not good enough.) My experience > is that code that contains elementary errors like this is also likely > to contain more subtle errors that are harder to spot. In short, I'm > not getting warm fuzzies here. The code posted has no such errors, see above. Maybe the part that wasn't has, who knows. > And no, you can't just blithely push padding into user space and expect > that to make the security issues go away. 
If you are putting the > RSA exponentiation in the kernel because you don't trust user space, > then you have to put the padding in the kernel, too, otherwise you're > vulnerable to attack from evil user space code. True, but the code wasn't put into the kernel for security reasons. Why it was remains a bit of a mystery, but it looks like it was for convenience. Greetings, Indan
Re: [PATCH 0/30] Use menuconfig objects
Jan Engelhardt wrote: > Hi, > > On Apr 12 2007 16:07, Randy Dunlap wrote: >> On Thu, 12 Apr 2007 15:50:12 -0700 Andrew Morton wrote: >>> So I merged the 23 of these which survived review and which do not >>> intersect with other outstanding work. >>> >>> I don't think I have an opinion on whether the change is actually an >>> improvement, and I don't get a clear sense of what others think. Shrug. >> >> I like them, but then I have made & sent similar patches in the past. > > Would you like to go through remaining menus and make the patches? > Just that efforts are not needlessly duplicated again. And of course > for you to get your share if you desire so. Hi, I'm a bit too busy at the moment so please go ahead with it. -- ~Randy *** Remember to use Documentation/SubmitChecklist when testing your code ***
Re: intermittant petabyte usage reported with broadcom nic
[Adding Michael Chan, who seems to look after bnx2, to the cc list] > To clarify it's an Intel Dual Core Xeon (I just wound up as thinking of > them all as amd64s). Network card driver in use is the one defined by > CONFIG_BNX2. Kernel's monolithic. From a quick look at bnx2.c, it seems that the driver gives the NIC (firmware?) a block of memory to DMA stats into, and just reads from that memory in its get_stats method. So if you're seeing wonky stats from the NIC intermittently, my best guess would be that the firmware is occasionally writing junk into the stats block. - R.
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
On Apr 12 2007 15:39, Jeremy Fitzhardinge wrote: >Andi Kleen wrote: >> Hmm, does Xen perhaps not use interrupt stacks? Normally 2.7k should be still >> green as long as there are not too many functions above/below it. > >That's a good point, I'll need to check that. Still, nearly 3k of stack! I bite. Would compressing the vmlinux binary with LZO or LZMA make an improvement to the bootstrap uncompress stack usage? Jan
Re: intermittant petabyte usage reported with broadcom nic
On Thu, Apr 12, 2007 at 04:18:24PM -0700, Roland Dreier wrote: > > > Apr 11 22:14:02 ' eth0:220898233988841368 66750274000 0 > > > 0 86458738 52386430545 101089219 19931300 0 199313 > > > 0 ' > > > > Apr 11 22:15:02 ' eth0:17227454818 81381144000 0 > > > 0 0 33091307388 86658381000 0 0 0 > > > ' > > > But in fact I think you're saying that the numbers go bad, and then stay > > bad. > > Doesn't look like it -- one minute after the first hiccup the eth0 #s > look reasonable again. Yeah. Sorry for not making it clear. I included good values on either side of the bad one. -- "To the extent that we overreact, we proffer the terrorists the greatest tribute." - High Court Judge Michael Kirby
Re: [PATCH 0/30] Use menuconfig objects
Hi, On Apr 12 2007 15:50, Andrew Morton wrote: >On Tue, 10 Apr 2007 21:17:40 +0200 (MEST) >Jan Engelhardt <[EMAIL PROTECTED]> wrote: > >> the following patch series turns some menus into menuconfigs, so they >> can be disabled whilst "walking" through the parent menu > >So I merged the 23 of these which survived review and which do not >intersect with other outstanding work. > >I don't think I have an opinion on whether the change is actually an >improvement, and I don't get a clear sense of what others think. Shrug. > >If we're going to make this change, we should ensure that it is done >kernel-wide, for UI consistency reasons. If time permits, I'll go through the rest of the menus I find eligible for menuconfig-izing. Does it help to base them on -mm to work better with outstanding work? Thanks, Jan
Re: intermittant petabyte usage reported with broadcom nic
> > Apr 11 22:14:02 ' eth0:220898233988841368 66750274000 0 > >0 86458738 52386430545 101089219 19931300 0 199313 > > 0 ' > > Apr 11 22:15:02 ' eth0:17227454818 81381144000 0 > > 0 0 33091307388 86658381000 0 0 0 ' > But in fact I think you're saying that the numbers go bad, and then stay bad. Doesn't look like it -- one minute after the first hiccup the eth0 #s look reasonable again. - R.
Re: [PATCH 0/30] Use menuconfig objects
Hi, On Apr 12 2007 16:07, Randy Dunlap wrote: >On Thu, 12 Apr 2007 15:50:12 -0700 Andrew Morton wrote: >> >> So I merged the 23 of these which survived review and which do not >> intersect with other outstanding work. >> >> I don't think I have an opinion on whether the change is actually an >> improvement, and I don't get a clear sense of what others think. Shrug. > >I like them, but then I have made & sent similar patches in the past. Would you like to go through the remaining menus and make the patches? Just so that efforts are not needlessly duplicated again. And of course for you to get your share if you desire so. Thanks, Jan
Re: [patch 13/26] Xen-paravirt_ops: Consistently wrap paravirt ops callsites to make them patchable
From: Paul Mackerras <[EMAIL PROTECTED]> Date: Wed, 21 Mar 2007 11:03:14 +1100 > Linus Torvalds writes: > > > We should just do this natively. There's been several tests over the years > > saying that it's much more efficient to do sti/cli as a simple store, and > > handling the "oops, we got an interrupt while interrupts were disabled" as > > a special case. > > > > I have this dim memory that ARM has done it that way for a long time > > because it's so expensive to do a "real" cli/sti. > > > > And I think -rt does it for other reasons. It's just more flexible. > > 64-bit powerpc does this now as well. I was curious about this so I had a look. There appear to be three pieces of state used to manage this on powerpc, PACASOFTIRQEN(r13), PACAHARDIRQEN(r13) and the SOFTE() in the stackframe. Plus there is all of this complicated logic on trap entry and exit to manage these three values properly. local_irq_restore() doesn't look like a simple piece of code either. Logically it should be simple: update the software binary state, and if enabling, see if any interrupts came in while we were disabled so we can run them. Given all of that, is it really cheaper than just flipping the bit in the cpu control register? :-/
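The "logically simple" scheme described in the last paragraph (a software enabled flag, plus a replay on enable of anything that arrived while disabled) can be modelled in a few lines of userspace C. This is a toy sketch under stated assumptions, not the powerpc implementation: all names are hypothetical, there is no real hardware masking, and a single pending flag stands in for whatever bookkeeping a real kernel keeps.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of software-managed interrupt state (hypothetical names). */
struct soft_irq_state {
    bool enabled;   /* the software "interrupts enabled" flag */
    bool pending;   /* something arrived while soft-disabled */
    int handled;    /* how many times the handler has run */
};

static void handle_irq(struct soft_irq_state *s)
{
    s->handled++;
}

/* "cli" as a simple store: no hardware register is touched. */
static void soft_irq_disable(struct soft_irq_state *s)
{
    s->enabled = false;
}

/* Called when an interrupt is delivered. If we are soft-disabled,
 * only record it; multiple arrivals collapse into one pending flag
 * in this toy model. */
static void irq_arrives(struct soft_irq_state *s)
{
    if (s->enabled)
        handle_irq(s);
    else
        s->pending = true;
}

/* "sti": update the software state, then replay anything that came
 * in while we were disabled. */
static void soft_irq_enable(struct soft_irq_state *s)
{
    s->enabled = true;
    if (s->pending) {
        s->pending = false;
        handle_irq(s);
    }
}
```

In this model the common disable/enable pair costs two stores and one test. The complexity the message points at lives in the paths this sketch leaves out, such as trap entry and exit and keeping the hardware state consistent with the software flag.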
Re: 2.6.21-rc6-mm1 USB related boot hang
On Thu, 12 Apr 2007, Helge Hafting wrote: > Are you sure this is the correct patch - against 2.6.21-rc6-mm1 ? > Hunk 1 out of 1 failed . . . Well I am pretty sure: box:~/scratch # wget ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.21-rc6/2.6.21-rc6-mm1/2.6.21-rc6-mm1.bz2>/dev/null 2>&1 box:~/scratch # wget ftp://ftp.kernel.org/pub/linux/kernel/v2.6/linux-2.6.20.tar.bz2>/dev/null 2>&1 box:~/scratch # wget ftp://ftp.kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.21-rc6.bz2>/dev/null 2>&1 box:~/scratch # tar xf linux-2.6.20.tar.bz2 box:~/scratch # cd linux-2.6.20/ box:~/scratch/linux-2.6.20 # mv ../patch-2.6.21-rc6.bz2 . box:~/scratch/linux-2.6.20 # bunzip2 patch-2.6.21-rc6.bz2 box:~/scratch/linux-2.6.20 # patch -p1 < patch-2.6.21-rc6 >/dev/null 2>&1; echo $? 0 box:~/scratch/linux-2.6.20 # mv ../2.6.21-rc6-mm1.bz2 . box:~/scratch/linux-2.6.20 # bunzip2 2.6.21-rc6-mm1.bz2 box:~/scratch/linux-2.6.20 # patch -p1 < 2.6.21-rc6-mm1 >/dev/null 2>&1; echo $? 0 box:~/scratch/linux-2.6.20 # cat tmp.patch diff --git a/drivers/hid/usbhid/hid-core.c b/drivers/hid/usbhid/hid-core.c index 1ddca31..d930f62 100644 --- a/drivers/hid/usbhid/hid-core.c +++ b/drivers/hid/usbhid/hid-core.c @@ -1550,15 +1550,22 @@ static int __init hid_init(void) retval = hiddev_init(); if (retval) goto hiddev_init_fail; + printk(KERN_DEBUG "hid_init: before usb_register()\n"); retval = usb_register(_driver); + printk(KERN_DEBUG "hid_init: after usb_register(), retuned %d\n", retval); if (retval) goto usb_register_fail; info(DRIVER_VERSION ":" DRIVER_DESC); + printk(KERN_DEBUG "hid_init: returning 0\n"); + dump_stack(); return 0; usb_register_fail: + printk(KERN_DEBUG "hid_init: calling hiddev_exit()\n"); hiddev_exit(); hiddev_init_fail: + printk(KERN_DEBUG "hid_init: returning %d\n", retval); + dump_stack(); return retval; } box:~/scratch/linux-2.6.20 # patch -p1 < tmp.patch patching file drivers/hid/usbhid/hid-core.c box:~/scratch/linux-2.6.20 # So I guess you are operating on some 
broken version of the 2.6.21-rc6-mm1 codebase if you are getting rejects on this trivial patch. Anyway, based on the information you have provided in your later messages, it seems that it is probably not necessarily related to either USB or HID, as you are getting hangs at different stages of boot, depending on your local configuration/kernel version used. Is vanilla 2.6.21-rc6 ok? If so, would you have time to bisect the offending patch? Thanks, -- Jiri Kosina
Re: intermittant petabyte usage reported with broadcom nic
> Apr 9 06:19:04 ' eth0:14250798570591813804 2284720007938 1863800 > 18638 0 27375938 1556640980159 3345714490000 0 > 0 0 ' One odd thing is that crazy number 14250798570591813804 is c5c501cbc5c500ac in hex. I dunno what the significance of the 0xc5 bit pattern is though... The other line has 220898233988841368, which is 0x310c9c6006a7f98, not nearly so regular a pattern. I don't think I'm helping much...
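The decimal-to-hex observation above is easy to reproduce. The following sketch (helper names are hypothetical, nothing here comes from bnx2) formats a 64-bit counter as 16 hex digits and counts how often a given byte value occurs in it, which is one simple way to make a repeated pattern such as 0xc5 stand out.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a 64-bit counter as exactly 16 lowercase hex digits. */
static void counter_to_hex(uint64_t v, char out[17])
{
    snprintf(out, 17, "%016" PRIx64, v);
}

/* Count how many of the eight bytes of v are equal to `byte`. */
static int count_byte(uint64_t v, uint8_t byte)
{
    int n = 0;
    for (int i = 0; i < 8; i++) {
        if (((v >> (8 * i)) & 0xff) == byte)
            n++;
    }
    return n;
}
```

Applied to the two suspicious samples, the first value contains the byte 0xc5 four times, while the second contains it not at all, which matches the remark that only the first looks regular.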
[PATCH] [DEBUG] sd-sched: monitor dynamic priority levels of a running task
Hi, [ just in case it can be of some use to anybody ] target : 2.6.21-rc6-mm1 a very simplified but quite funny "toy" that [1] allows monitoring all the dynamic priority levels (it counts the number of hits per level) on which a given task (configured via proc) is running; # echo "pid" > /proc/sd_pid - to monitor a task with a given pid; # cat /proc/sd_slots - to dump statistics. [2] triggers a message when a task's *prio* and *static_prio* are out of sync, i.e. the current prio is not allowed by prio_matrix[USER_PRIO(static_prio)]. [ example: --- [1857] static: 35, slot: 3 - nice ] Maybe Con has something similar, but at least I haven't found anything on his website. E.g. for X all the following scenarios give different (obviously) patterns: (1) just occasional cpu users; (2) a cpu hog with the same static_prio; (3) a niced cpu hog. There are cases when [2] is triggered indeed. It's due to set_user_nice(). Con, is it a "feature"? [ explanation ] - In fact, all this "delta" calculation (delta = p->prio - old_prio) stuff is useless in set_user_prio() as effective_prio() returns just the old p->prio and, as a result, we have got p->prio = p->prio :) It makes sense to use delta = p->static_prio - old_static_prio; The p->prio will be recalculated as a result of enqueue_task -> __enqueue_task -> recalc_task_prio .. _but_ if the task is currently in the "active" array and its time_slice != 0 -- the old p->prio is not changed. So the task is queued taking into account the old_prio, although this slot can be prohibited by a new p->static_prio. It's only for the very first slot so one may call it err.. a feature (?)
-- Best regards, Dmitry Adamushko --- linux-2.6.21-rc6-mm1/kernel/sched-orig3.c 2007-04-11 14:48:19.0 +0200 +++ linux-2.6.21-rc6-mm1/kernel/sched.c 2007-04-12 16:13:12.0 +0200 @@ -260,6 +260,164 @@ struct rq { static DEFINE_PER_CPU(struct rq, runqueues); static DEFINE_MUTEX(sched_hotcpu_mutex); +#define DEBUG_SD_SLOTS +#ifdef DEBUG_SD_SLOTS + +#include + +static int sd_monitor_pid, sd_monitor_idx; +static unsigned long sd_slot_hits[PRIO_RANGE]; +static struct proc_dir_entry *sd_pid_dir, *sd_slots_dir; +static int sd_debug_done; + +static void init_debug_slots(void); + +static void reset_slot_hits(void) +{ + int i = 0; + + for ( ; i < PRIO_RANGE; i++) + sd_slot_hits[i] = 0; +} + +static inline void debug_check_slot_validity(struct task_struct *p) +{ + int sprio = USER_PRIO(p->static_prio), uprio = USER_PRIO(p->prio); + + /* SCHED_BATCH and rt tasks don't use prio_matrix so just skip them. */ + if (p->policy == SCHED_BATCH || rt_task(p)) + return; + + if (unlikely(!sd_debug_done)) + init_debug_slots(); + + if (sd_monitor_pid && p->pid == sd_monitor_pid) + ++sd_slot_hits[uprio]; + + if (test_bit(uprio, prio_matrix[sprio])) + printk(KERN_EMERG "--- [%d] static: %d, slot: %d - %s\n", + p->pid, sprio, uprio, p->comm); +} + +static int sd_pid_proc_read(char *page, char **start, off_t off, + int count, int *eof, void *data) +{ + char *p = page; + int len = 0; + + p += sprintf(p, "pid: %d\n", sd_monitor_pid); + +len = p - page - off; + +if (len <= off + count) +*eof = 1; +*start = page + off; +if (len > count) +len = count; +if (len < 0) +len = 0; + +return len; +} + +static int sd_pid_proc_write(struct file *file, const char __user *buffer, + unsigned long count, void *data) +{ + struct task_struct *task; +char *end, buf[16]; +long pid; +int n; + +n = count > sizeof(buf) - 1 ? sizeof(buf) - 1 : count; + +if (copy_from_user(buf, buffer, n)) +return -EFAULT; + +buf[n] = '\0'; +pid = simple_strtol(buf, , 0); + + /* Stop monitoring. 
*/ + if (!pid) { + sd_monitor_pid = 0; + goto out_exit; + } + + read_lock(_lock); + task = find_task_by_pid(pid); + + if (!task || task->policy == SCHED_BATCH || rt_task(task)) { + read_unlock(_lock); + + printk(KERN_EMERG "*** don't monitor SCHED_BATCH or Real-Time tasks ***\n"); + goto out_exit; + } + + sd_monitor_idx = USER_PRIO(task->static_prio); + read_unlock(_lock); + + reset_slot_hits(); + sd_monitor_pid = pid; + +out_exit: +return count; +} + +static int sd_slots_proc_read(char *page, char **start, off_t off, + int count, int *eof, void *data) +{ + int len = 0, i = 0; + char *p = page; + + if (!sd_monitor_pid) + goto out_exit; + + p += sprintf(p, " slot allowed hits\n\n"); + + for ( ; i < PRIO_RANGE; i++) + p += sprintf(p, "[ %d ] - %d : %lu \n", + i, !!test_bit(i, prio_matrix[sd_monitor_idx]), sd_slot_hits[i]); + +out_exit: +len = p - page - off; + +if (len <= off + count) +*eof = 1; +*start = page + off; +if (len > count) +len = count; +if (len < 0) +len = 0; + +
Re: intermittant petabyte usage reported with broadcom nic
On Fri, 13 Apr 2007 08:52:49 +1000 CaT <[EMAIL PROTECTED]> wrote:

> On Mon, Apr 02, 2007 at 12:13:00AM -0700, Andrew Morton wrote:
> > On Mon, 2 Apr 2007 11:43:19 +1000 CaT <[EMAIL PROTECTED]> wrote:
> >
> > > I take minute by minute snapshots of network traffic by sampling
> > > /proc/net/dev and most of the time everything works fine. Occasionally
> > > though I get petabyte byte traffic and corresponding packet traffic.
> >
> > How frequently?
> >
> > Are you able to provide some actual numbers (expected and actual values),
> > so we can look at the bit patterns?
>
> I have some now. These are raw lines from /proc/net/dev. In this case it's
> eth0 at 22:14 that chucked a wee wibbly.
>
> Apr 11 22:13:02 ' eth0:17227166357 81379716000 0 0 0 33090495625 86656584000 0 0 0 '
> Apr 11 22:13:02 ' eth1:30708022097 91219466000 0 0 0 122989582024 125073786000 0 0 0 '
> Apr 11 22:14:02 ' eth0:220898233988841368 66750274000 0 0 86458738 52386430545 101089219 19931300 0 199313 0 '

0x310_c9c6_006a_7f98

Not sure what to make of that.

> Apr 11 22:14:02 ' eth1:30708307787 91220183000 0 0 0 122989665004 125074344000 0 0 0 '
> Apr 11 22:15:02 ' eth0:17227454818 81381144000 0 0 0 33091307388 86658381000 0 0 0 '
> Apr 11 22:15:02 ' eth1:30708569308 91220742000 0 0 0 122989732601 125074712000 0 0 0 '
>
> On another server (same hardware except for 2ru case, more ram and more hds):
>
> Apr 9 06:18:05 ' eth0:1556640056941 3598105481000 0 0 0 2281147324747 3318270401000 0 0 0 '
> Apr 9 06:18:05 ' eth1:912389249044 1190286687000 0 0 0 642943095469 991257887000 0 0 0 '
> Apr 9 06:19:04 ' eth0:14250798570591813804 2284720007938 1863800 18638 0 27375938 1556640980159 3345714490000 0 0 0 '

0xc5c5_01cb_c5c5_00ac and 0x213_f3ec_ab02

The first one looks like trashed memory: it got overwritten by kernel
addresses. Except they're x86-32 kernel addresses, and you're running
an x86_64 64-bit kernel. hm.

I don't see any pattern here.

> Apr 9 06:19:04 ' eth1:912389281939 1190287072000 0 0 0 642943219035 991258183000 0 0 0 '
> Apr 9 06:20:05 ' eth0:1556643514710 3598121584000 0 0 0 2281154391794 3318284878000 0 0 0 '
> Apr 9 06:20:05 ' eth1:912389305767 1190287354000 0 0 0 642943273879 991258351000 0 0 0 '
>
> > > This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
> > > nics.
> >
> > What driver drives that? b44.c?
>
> To clarify it's an Intel Dual Core Xeon (I just wound up thinking of
> them all as amd64s). Network card driver in use is the one defined by
> CONFIG_BNX2. Kernel's monolithic.
>
> > We do perform racy 64-bit updates of some of the stats counters. But
> > that'll only affect 32-bit kernels and I'm assuming you're running a 64-bit
> > kernel on that AMD64 box (are you?)
>
> Yes. With 32bit compat for executables built in.

OK.

I was earlier assuming that you were seeing transient funny numbers. But in
fact I think you're saying that the numbers go bad, and then stay bad.
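For anyone wanting to repeat the bit-pattern inspection on their own samples, here is a minimal userspace sketch. The helper names `split_groups16` and `print_counter` are made up for this illustration; nothing here comes from the kernel or from the bnx2 driver. It only splits a 64-bit counter into four 16-bit groups so that odd-looking halves stand out.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Split a 64-bit counter into four 16-bit groups, most significant first. */
static void split_groups16(uint64_t value, uint16_t groups[4])
{
	assert(groups != NULL);
	for (int i = 0; i < 4; i++)
		groups[i] = (uint16_t)(value >> (48 - 16 * i));
}

/* Print a counter in decimal and as 0xhhhh_hhhh_hhhh_hhhh. */
static void print_counter(const char *name, uint64_t value)
{
	uint16_t g[4];

	split_groups16(value, g);
	printf("%s: %" PRIu64 " = 0x%04x_%04x_%04x_%04x\n", name, value,
	       (unsigned int)g[0], (unsigned int)g[1],
	       (unsigned int)g[2], (unsigned int)g[3]);
}
```

Feeding it the bad eth0 byte counter from Apr 11 (220898233988841368) gives the groups 0310, c9c6, 006a, 7f98, which agrees with the hex value quoted in the reply.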
Re: [PATCH 0/13] maps: pagemap, kpagemap, and related cleanups
On Tue, Apr 03, 2007 at 09:43:30PM -0500, Matt Mackall wrote:
> This patch series introduces /proc/pid/pagemap and /proc/kpagemap,
> which allow detailed run-time examination of process memory usage at a
> page granularity.
> The first several patches whip the page-walking code introduced for
> /proc/pid/smaps and clear_refs into a more generic form, the next
> couple make those interfaces optional, and the last two introduce the
> new interfaces, also optional.

This solves a real-life problem for Oracle system monitoring software
(specifically EM). Among the tasks it must carry out is determining
per-process memory footprint of a set of cooperating tasks (i.e. Oracle
processes). RSS is inadequate for this due to page sharing; this work
provides sufficient information to determine what EM needs.

-- wli
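As a rough illustration of why RSS overstates the footprint of cooperating processes: a page mapped by N processes is counted in full in each one's RSS. The sketch below assumes that some per-page map count can be obtained. The `mapcount` array is hypothetical input, not the actual format of /proc/pid/pagemap or /proc/kpagemap, and the 4096-byte page size is an assumption. Dividing each page by its map count is one possible accounting, not necessarily what EM does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumption: 4 KiB pages */

/* RSS-style accounting: every resident page counts in full. */
static uint64_t rss_bytes(size_t npages)
{
	return (uint64_t)npages * PAGE_SIZE_BYTES;
}

/* Proportional accounting: a page mapped by N processes counts as 1/N. */
static uint64_t proportional_bytes(const unsigned int *mapcount, size_t npages)
{
	uint64_t total = 0;

	for (size_t i = 0; i < npages; i++) {
		assert(mapcount[i] > 0); /* a resident page is mapped at least once */
		total += PAGE_SIZE_BYTES / mapcount[i];
	}
	return total;
}
```

With four resident pages mapped 1, 1, 2 and 4 times, RSS reports 16384 bytes while the proportional figure is 11264 bytes; summing RSS over all the sharing processes would count the shared pages several times over.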
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
On Thu, Apr 12, 2007 at 03:57:48PM -0700, Jeremy Fitzhardinge wrote:
> Matt Mackall wrote:
> > On Thu, Apr 12, 2007 at 01:50:54PM -0700, Jeremy Fitzhardinge wrote:
> >
> >> -#define HEAP_SIZE 0x3000
> >> +#define HEAP_SIZE 0x4000
> >>
> >
> > There are a bunch more of these that'll need fixing.
> >
>
> Like this?

I'm not sure what the story is with the platforms that bump this to
0x1, but this does get the rest of them.

Acked-by: Matt Mackall <[EMAIL PROTECTED]>

--
Mathematics is the supreme nostalgia of our time.
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
>>> (This was under Xen, but there's no reason it couldn't happen on bare
>>> hardware.)
>>>
>> Hmm, does Xen perhaps not use interrupt stacks?
>
> Looks like that's all done in do_IRQ, so it should be independent of
> whether it's Xen or not. And the stack overflow check is performed on
> the main stack, before switching to the interrupt stack.
>

Yeah, the do_IRQ thing is misleading because it makes you think the
interrupt caused an overflow when all it did was detect a near-overflow
condition. (The number printed is the amount of space left.)
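To make the "amount of space left" point concrete, here is a userspace sketch of that kind of check. It is only loosely modelled on the do_IRQ test being discussed; the 8 KiB stack size, the 64-byte bookkeeping area at the base of the stack, and the one-eighth warning threshold are all assumptions for illustration, and `stack_left` and `near_overflow` are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define THREAD_SIZE      8192u             /* assumed size of the kernel stack area */
#define THREAD_INFO_SIZE 64u               /* hypothetical bookkeeping struct at its base */
#define STACK_WARN       (THREAD_SIZE / 8) /* assumed warning threshold */

/*
 * The stack grows down towards the base of a THREAD_SIZE-aligned area,
 * so the offset of the stack pointer within that area, minus the
 * bookkeeping struct, is the number of bytes still free.
 */
static long stack_left(uintptr_t sp)
{
	assert((THREAD_SIZE & (THREAD_SIZE - 1)) == 0); /* mask trick needs a power of two */
	return (long)(sp & (THREAD_SIZE - 1)) - (long)THREAD_INFO_SIZE;
}

/* True when the free space has dropped below the warning threshold. */
static bool near_overflow(uintptr_t sp)
{
	return stack_left(sp) < (long)STACK_WARN;
}
```

The check only reports how close the interrupted context already was to the bottom of its stack; the interrupt itself has used almost nothing at the point where the test runs.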
Re: [PATCH 0/30] Use menuconfig objects
On Thu, 12 Apr 2007 15:50:12 -0700 Andrew Morton wrote:

> On Tue, 10 Apr 2007 21:17:40 +0200 (MEST)
> Jan Engelhardt <[EMAIL PROTECTED]> wrote:
>
> > the following patch series turns some menus into menuconfigs, so they
> > can be disabled whilst "walking" through the parent menu
>
> So I merged the 23 of these which survived review and which do not
> intersect with other outstanding work.
>
> I don't think I have an opinion on whether the change is actually an
> improvement, and I don't get a clear sense of what others think. Shrug.

I like them, but then I have made & sent similar patches in the past.

> If we're going to make this change, we should ensure that it is done
> kernel-wide, for UI consistency reasons.
>
> If nothing else happens, I guess I'll spray these patches at the relevant
> maintainers in a couple of weeks time, see what sticks.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
On Friday 13 April 2007 00:56:56 Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> >> (This was under Xen, but there's no reason it couldn't happen on bare
> >> hardware.)
> >>
> >
> > Hmm, does Xen perhaps not use interrupt stacks?
>
> Looks like that's all done in do_IRQ, so it should be independent of
> whether it's Xen or not. And the stack overflow check is performed on
> the main stack, before switching to the interrupt stack.

Yes, but then we should have seen it more frequently, shouldn't we?
I always run with the stack overflow check enabled and I don't think
I ever saw warnings in inflate.

Something must be different in the Xen setup. Dunno if it's a bug,
but such differences could cause more problems later.

-Andi
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
Matt Mackall wrote:
> On Thu, Apr 12, 2007 at 01:50:54PM -0700, Jeremy Fitzhardinge wrote:
>
>> -#define HEAP_SIZE 0x3000
>> +#define HEAP_SIZE 0x4000
>>
>
> There are a bunch more of these that'll need fixing.
>

Like this?

diff -r 2ad8a0729f26 arch/alpha/boot/misc.c
--- a/arch/alpha/boot/misc.c	Thu Apr 12 13:44:02 2007 -0700
+++ b/arch/alpha/boot/misc.c	Thu Apr 12 15:48:43 2007 -0700
@@ -98,7 +98,7 @@ static ulg free_mem_ptr;
 static ulg free_mem_ptr;
 static ulg free_mem_ptr_end;
 
-#define HEAP_SIZE 0x2000
+#define HEAP_SIZE 0x3000
 
 #include "../../../lib/inflate.c"
 
diff -r 2ad8a0729f26 arch/arm/boot/compressed/misc.c
--- a/arch/arm/boot/compressed/misc.c	Thu Apr 12 13:44:02 2007 -0700
+++ b/arch/arm/boot/compressed/misc.c	Thu Apr 12 15:48:43 2007 -0700
@@ -239,7 +239,7 @@ static ulg free_mem_ptr;
 static ulg free_mem_ptr;
 static ulg free_mem_ptr_end;
 
-#define HEAP_SIZE 0x2000
+#define HEAP_SIZE 0x3000
 
 #include "../../../../lib/inflate.c"
 
diff -r 2ad8a0729f26 arch/arm26/boot/compressed/misc.c
--- a/arch/arm26/boot/compressed/misc.c	Thu Apr 12 13:44:02 2007 -0700
+++ b/arch/arm26/boot/compressed/misc.c	Thu Apr 12 15:48:43 2007 -0700
@@ -182,7 +182,7 @@ static ulg free_mem_ptr;
 static ulg free_mem_ptr;
 static ulg free_mem_ptr_end;
 
-#define HEAP_SIZE 0x2000
+#define HEAP_SIZE 0x3000
 
 #include "../../../../lib/inflate.c"
 
diff -r 2ad8a0729f26 arch/x86_64/boot/compressed/misc.c
--- a/arch/x86_64/boot/compressed/misc.c	Thu Apr 12 13:44:02 2007 -0700
+++ b/arch/x86_64/boot/compressed/misc.c	Thu Apr 12 15:48:43 2007 -0700
@@ -189,7 +189,7 @@ static long free_mem_ptr;
 static long free_mem_ptr;
 static long free_mem_end_ptr;
 
-#define HEAP_SIZE 0x6000
+#define HEAP_SIZE 0x7000
 
 static char *vidmem = (char *)0xb8000;
 static int vidport;
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
Andi Kleen wrote:
>> (This was under Xen, but there's no reason it couldn't happen on bare
>> hardware.)
>>
>
> Hmm, does Xen perhaps not use interrupt stacks?

Looks like that's all done in do_IRQ, so it should be independent of
whether it's Xen or not. And the stack overflow check is performed on
the main stack, before switching to the interrupt stack.

J
Re: [PATCH resend][CRYPTO]: RSA algorithm patch
On Thu, April 12, 2007 23:13, Satyam Sharma wrote:
> But timing attacks are not exclusive to RSA / asymmetric
> cryptosystems. Such (side channel / timing / power measurement / bus
> access) attacks are possible against AES, etc too.

True, but those are often easier to protect, or are less vulnerable in
the first place. (E.g. it isn't very hard to make a constant time AES
implementation. The operations it does are independent of the key.)

> Of course, now we're really moving into a different realm -- I guess
> in security there is always a threshold, and you really needn't care
> beyond a particular threat perception level. I don't see how even the
> existing cryptoapi (or *any* security measure in the kernel for that
> matter) stands up to the kind of attacks we're talking about now.

True, and very specialized hardware is needed in such cases anyway, so
arguing that it's not the kernel's task to protect against such attacks
is valid. But they are interesting attacks, and people should be aware
of them, instead of blindly trusting any security measure (not implying
anyone here does, I mean in general).

>> > constant-time crypto implementations do take care of
>> > them, though I agree the GPG code too lacks that.
>>
>> That's because for side-channel attacks you need physical access to the
>> hardware, something for most machines means security is breached anyway.
>> But when this code is going to be used to sign things by embedded devices
>> (with a local, secret key), it can be important.
>>
>> For checking signatures the key is known and all this doesn't matter, but
>> we're talking about a common implementation. These are things to keep in mind.
>
> I think the original idea was to generate signatures at a centralized
> place (not on an embedded system) and only *verify* them using
> *public* keys on the embedded systems? For most common
> implementations, as I suggested, you only need bother yourself up to a
> certain security threshold.

Yes, but it depends on how the code is used. It is supposed to be
generic code, so whether someone wants to use it for signing or not is
an open question. So far it seems only signature checking is needed,
and that simplifies a lot, but if that isn't the case more questions
pop up, like where the security threshold should be. The user with the
tightest requirements more or less dictates the implementation.

All in all, to get anything merged at all in the kernel it seems at
least the following needs to happen:

- Future users speaking up and uniting.
- Figuring out their needs (so overlapping needs can make it into
  common code, and other decisions can be made, such as where the
  kernel and user space border should be.)
- Deciding on a commonly agreed security threshold, and making that
  explicit.
- Coding it all up and keeping it in sync with mainline.

I don't see this happening soon. But a good start would be if someone
who cares about this sets up a mailing list or website to collect all
users and information.

Good night,

Indan
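The "operations independent of the key" idea mentioned earlier in this message can be illustrated with something much smaller than AES or RSA. The sketch below is a byte-wise comparison whose loop always runs to the end, so the number of iterations depends on the length only and not on where the first mismatch is. The name `ct_equal` is made up for this example; it is not from the proposed patch or from the cryptoapi, and a compiler is free to transform the loop, so treat it as an illustration of the principle rather than a hardened routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare two buffers without an early exit: every byte is always
 * examined, and differences are accumulated with OR instead of
 * branching on them.
 * Returns 1 if the buffers are equal, 0 otherwise.
 */
static int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
{
	uint8_t diff = 0;

	assert(a != NULL && b != NULL);
	for (size_t i = 0; i < len; i++)
		diff |= (uint8_t)(a[i] ^ b[i]);
	return diff == 0;
}
```

A plain memcmp-style loop that returns at the first differing byte would, by contrast, take a time that hints at how many leading bytes matched, which is the sort of data-dependent behaviour a timing attack relies on.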
Re: intermittant petabyte usage reported with broadcom nic
On Mon, Apr 02, 2007 at 12:13:00AM -0700, Andrew Morton wrote:
> On Mon, 2 Apr 2007 11:43:19 +1000 CaT <[EMAIL PROTECTED]> wrote:
>
> > I take minute by minute snapshots of network traffic by sampling
> > /proc/net/dev and most of the time everything works fine. Occasionally
> > though I get petabyte byte traffic and corresponding packet traffic.
>
> How frequently?
>
> Are you able to provide some actual numbers (expected and actual values),
> so we can look at the bit patterns?

I have some now. These are raw lines from /proc/net/dev. In this case it's
eth0 at 22:14 that chucked a wee wibbly.

Apr 11 22:13:02 ' eth0:17227166357 81379716000 0 0 0 33090495625 86656584000 0 0 0 '
Apr 11 22:13:02 ' eth1:30708022097 91219466000 0 0 0 122989582024 125073786000 0 0 0 '
Apr 11 22:14:02 ' eth0:220898233988841368 66750274000 0 0 86458738 52386430545 101089219 19931300 0 199313 0 '
Apr 11 22:14:02 ' eth1:30708307787 91220183000 0 0 0 122989665004 125074344000 0 0 0 '
Apr 11 22:15:02 ' eth0:17227454818 81381144000 0 0 0 33091307388 86658381000 0 0 0 '
Apr 11 22:15:02 ' eth1:30708569308 91220742000 0 0 0 122989732601 125074712000 0 0 0 '

On another server (same hardware except for 2ru case, more ram and more hds):

Apr 9 06:18:05 ' eth0:1556640056941 3598105481000 0 0 0 2281147324747 3318270401000 0 0 0 '
Apr 9 06:18:05 ' eth1:912389249044 1190286687000 0 0 0 642943095469 991257887000 0 0 0 '
Apr 9 06:19:04 ' eth0:14250798570591813804 2284720007938 1863800 18638 0 27375938 1556640980159 3345714490000 0 0 0 '
Apr 9 06:19:04 ' eth1:912389281939 1190287072000 0 0 0 642943219035 991258183000 0 0 0 '
Apr 9 06:20:05 ' eth0:1556643514710 3598121584000 0 0 0 2281154391794 3318284878000 0 0 0 '
Apr 9 06:20:05 ' eth1:912389305767 1190287354000 0 0 0 642943273879 991258351000 0 0 0 '

> > This happens on an AMD64, dual core smp box with Broadcom NetXtreme II
> > nics.
>
> What driver drives that? b44.c?

To clarify it's an Intel Dual Core Xeon (I just wound up thinking of
them all as amd64s). Network card driver in use is the one defined by
CONFIG_BNX2. Kernel's monolithic.

> We do perform racy 64-bit updates of some of the stats counters. But
> that'll only affect 32-bit kernels and I'm assuming you're running a 64-bit
> kernel on that AMD64 box (are you?)

Yes. With 32bit compat for executables built in.

--
"To the extent that we overreact, we proffer the terrorists the
greatest tribute."
- High Court Judge Michael Kirby
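A sampling script can flag such readings automatically by checking each per-interval delta against what the link could physically carry. The following is a hedged userspace sketch: the function names are made up, and the 1 Gbit/s link speed used in the example is an assumption, since the thread does not state the actual speed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Upper bound on bytes a link of link_bps bits/second can move in interval_s seconds. */
static uint64_t max_bytes(uint64_t link_bps, uint64_t interval_s)
{
	return link_bps / 8 * interval_s;
}

/*
 * A byte-counter sample is plausible if it did not go backwards and
 * did not grow by more than the link could have carried.
 */
static bool delta_plausible(uint64_t prev, uint64_t cur,
			    uint64_t link_bps, uint64_t interval_s)
{
	assert(interval_s > 0);
	if (cur < prev)
		return false; /* counter went backwards: reset or corruption */
	return cur - prev <= max_bytes(link_bps, interval_s);
}
```

With the eth0 receive byte counts above and an assumed 1 Gbit/s link, the step from 22:13 to 22:14 is rejected (the delta is far beyond the 7.5 GB a minute allows), the step from 22:14 to 22:15 is rejected because the counter goes backwards, and comparing 22:13 directly with 22:15 passes.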
Re: [PATCH 0/30] Use menuconfig objects
On Tue, 10 Apr 2007 21:17:40 +0200 (MEST)
Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> the following patch series turns some menus into menuconfigs, so they
> can be disabled whilst "walking" through the parent menu

So I merged the 23 of these which survived review and which do not
intersect with other outstanding work.

I don't think I have an opinion on whether the change is actually an
improvement, and I don't get a clear sense of what others think. Shrug.

If we're going to make this change, we should ensure that it is done
kernel-wide, for UI consistency reasons.

If nothing else happens, I guess I'll spray these patches at the relevant
maintainers in a couple of weeks time, see what sticks.
Re: [patch] uninline remove/add_parent() APIs
I'm travelling this week (through Monday) and can't be of much immediate help on improving the situation or explaining it in great detail. Last week before I left home I was deep in some strange debugging and didn't get a chance to look up. There will be more of that, but I'll try to make some timely progress on answering all the backlog of correspondence about utrace too.

Thanks,
Roland
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
Andi Kleen wrote:
> Hmm, does Xen perhaps not use interrupt stacks? Normally 2.7k should be still
> green as long as there are not too many functions above/below it.
>

That's a good point, I'll need to check that. Still, nearly 3k of stack!

J
Re: sched_yield proposals/rationale
[EMAIL PROTECTED] wrote:

-Original Message-

Besides - but I guess you're aware of it - any randomized algorithms tend to drive benchmarkers and performance analysts crazy because their performance cannot be repeated. So it's usually better to avoid them unless there is really no alternative. That could already solve your concern from above.

Statistically speaking, it will give them (benchmarkers) the smoothest curve they've ever seen.

Please be aware that I'm just exploring options/insight here. It is not something I intend to push inside the mainline kernel. I just want to find reasonable and logical criticism as you and some others have provided already. Thanks for that!

And having gotten same, are you going to code up what appears to be a solution, based on this feedback?

I'm curious how well it would run poorly written programs, having recently worked with a company which seemed to have a whole part of purchasing dedicated to buying same. :-(

--
Bill Davidsen <[EMAIL PROTECTED]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot
Re: [PATCH UPDATE] deflate stack usage in lib/inflate.c
On Thu, Apr 12, 2007 at 01:50:54PM -0700, Jeremy Fitzhardinge wrote:
> -#define HEAP_SIZE 0x3000
> +#define HEAP_SIZE 0x4000

There are a bunch more of these that'll need fixing.

--
Mathematics is the supreme nostalgia of our time.