Re: What still uses the block layer?

2007-10-15 Thread Stefan Richter
[EMAIL PROTECTED] wrote:
> On Mon, 15 Oct 2007, Stefan Richter wrote:
>> Low-level networking drivers suggest a default interface name (per
>> interface or as a template like eth%d into which the networking core
>> inserts a lowest spare number).
...
>> Could low-level SCSI drivers provide similar name templates which give a
>> hint on the transport involved?
...
> one other option that could be considered (and I do realize I'm bringing
> up flame-bait here) is that drivers that have fixed addresses could
> offer up a device name that includes that address.
...

That's already implemented. :-) Transport drivers expose transport
specific information in sysfs; udev scripts examine it and create by-id
and by-path symlinks to device files of HDDs.  Not everybody agrees, but
many think that it's sensible to implement just mechanism in kernel and
leave policy to userspace.  My suggestion and the default network
interface names already violate this principle to a degree, but it can
still be implemented in a transport independent way, and userspace can
continue to create whatever names the user needs.
-- 
Stefan Richter
-=-=-=== =-=- =
http://arcgraph.de/sr/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
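
The name-template idea from the message above (a prefix plus the lowest spare number) can be sketched in user-space C. This is only an illustration, not the networking core's code: the function name alloc_name, the MAX_INDEX bound, and the use of a plain prefix instead of a full eth%d template are all assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

#define MAX_INDEX 32	/* arbitrary bound chosen for this sketch */

/*
 * Build "<prefix><n>" for the lowest n whose name is not in `taken`.
 * Returns n and leaves the name in `out`, or returns -1 if every
 * candidate below MAX_INDEX is already in use.
 */
static int alloc_name(const char *prefix, const char *const *taken,
		      size_t ntaken, char *out, size_t outlen)
{
	for (int i = 0; i < MAX_INDEX; i++) {
		size_t j;

		snprintf(out, outlen, "%s%d", prefix, i);
		for (j = 0; j < ntaken; j++)
			if (strcmp(out, taken[j]) == 0)
				break;		/* name already in use */
		if (j == ntaken)
			return i;		/* lowest spare number */
	}
	return -1;
}
```

With eth0, eth1 and eth3 taken, a call with the prefix "eth" would pick eth2. Userspace remains free to rename the result, which is the mechanism/policy split the message argues for.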


Re: [PATCH try #3] Input/Joystick Driver: add support AD7142 joystick driver

2007-10-15 Thread Bryan Wu
On 10/16/07, Dmitry Torokhov <[EMAIL PROTECTED]> wrote:
> On 10/15/07, Bryan Wu <[EMAIL PROTECTED]> wrote:
> > On 10/15/07, Dmitry Torokhov <[EMAIL PROTECTED]> wrote:
> > >
> > > Completion is just not a good abstraction here... Please use work
> > > abstraction and possibly a separate workqueue.
> >
> > Yes, I agree with you now, although I have a little concern about the
> > possibility of a big delay introduced by the workqueue.
> >
>
> Having a separate workqueue should isolate the driver from users
> hogging keventd. Otherwise the speed should be pretty much the same as
> with a kthread.
>

Does this driver need to create a new kthread instead of using keventd?
I think keventd might be sufficient for this driver.

> >
> > Thanks a lot for your kind review.
> > I will resend an updated patch later.
>
> Thank you for not getting frustrated with all my change requests.

Oh, your help is very useful. It encourages us to send out our drivers
to LKML.

> Btw,
> blackfin keypad driver is in my tree and should be in mainline once
> Linus does the pull I requested.
>

Thanks again, I noticed it was merged already.

-Bryan Wu


Re: [patch 0/2] Protect crashkernel against BSS overlap

2007-10-15 Thread Vivek Goyal
On Mon, Oct 15, 2007 at 01:50:42PM +0200, Bernhard Walle wrote:
> I observed the problem that even when you choose the default 16M as
> crashkernel base address and the kernel is very big, the reserved area may
> overlap with the kernel BSS. Currently, this is not checked at runtime, so the
> kernel just crashes when you load the panic kernel in the sys_kexec call.
> 
> These two patches check this at runtime. The patches are against current git,
> but with the patches
> 
> extended-crashkernel-command-line.patch
> extended-crashkernel-command-line-update.patch
> extended-crashkernel-command-line-comment-fix.patch
> 
> extended-crashkernel-command-line-improve-error-handling-in-parse_crashkernel_mem.patch
> use-extended-crashkernel-command-line-on-i386.patch
> use-extended-crashkernel-command-line-on-i386-update.patch
> use-extended-crashkernel-command-line-on-x86_64.patch
> use-extended-crashkernel-command-line-on-x86_64-update.patch
> use-extended-crashkernel-command-line-on-ia64.patch
> use-extended-crashkernel-command-line-on-ia64-fix.patch
> use-extended-crashkernel-command-line-on-ia64-update.patch
> use-extended-crashkernel-command-line-on-ppc64.patch
> use-extended-crashkernel-command-line-on-ppc64-update.patch
> use-extended-crashkernel-command-line-on-sh.patch
> use-extended-crashkernel-command-line-on-sh-update.patch
> 
> from -mm tree applied since they are marked to be merged in 2.6.24.
> 
> I know that the implementation of both patches is only x86 (i386 and x86-64),
> but if you agree that it's the way to go, I can add the BSS resource
> and the check for all other architectures that apply.
> 

Hi Bernhard,

Shouldn't the bootmem allocator be able to flag an error if we try to
reserve memory that is already reserved? I see that the bootmem
allocator currently prints a warning under CONFIG_DEBUG_BOOTMEM.

Wouldn't it be better if we also reserved the code, data and bss memory
using the bootmem allocator, so that when somebody tries to reserve
crashkernel memory and there is an overlap, the bootmem allocator
screams?

In the second patch, you are checking that the crash kernel reserved
memory lies beyond _end. That will make sure that there is no overlap
with kernel text, data or bss. I am wondering, then, why we need the
first patch and why we should register bss memory in the resources list.
The second patch would make sure that there is no overlap with crash
kernel memory, and kexec will not place any segment outside crashkernel
memory.

Thanks
Vivek
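
The two checks discussed in this message can be sketched in user-space C. This is a hedged illustration only: struct mem_range, ranges_overlap() and crash_region_ok() are names invented for the example, and the second helper assumes the crash kernel region is meant to sit above the running kernel image.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open physical address range [start, end). */
struct mem_range {
	uint64_t start;
	uint64_t end;
};

/* True if the two half-open ranges share at least one byte. */
static bool ranges_overlap(struct mem_range a, struct mem_range b)
{
	return a.start < b.end && b.start < a.end;
}

/*
 * The "beyond _end" style of check: the reservation is accepted only
 * if it starts at or after the end of the running kernel image, which
 * rules out overlap with text, data and bss in one comparison.
 */
static bool crash_region_ok(struct mem_range crash, uint64_t kernel_end)
{
	return crash.start >= kernel_end;
}
```

A generic overlap test is what an allocator-level check would need, since it has no notion of which reservation is "the kernel". The single comparison against the image end is enough when only the crash kernel reservation is being validated.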


2.6.23-git8 missing include file

2007-10-15 Thread Kamalesh Babulal
Hi,

The build fails with the following error:

  CC  drivers/usb/storage/scsiglue.o
  CC  drivers/usb/storage/protocol.o
  CC  drivers/usb/storage/transport.o
In file included from drivers/usb/storage/transport.c:53:
include/scsi/scsi_eh.h:79: error: field 'sense_sgl' has incomplete type
make[3]: *** [drivers/usb/storage/transport.o] Error 1
make[2]: *** [drivers/usb/storage] Error 2
make[1]: *** [drivers/usb] Error 2
make: *** [drivers] Error 2

This patch fixes the build failure

--- linux-2.6.23/include/scsi/scsi_cmnd.h   2007-10-16 09:58:30.0 +0530
+++ linux-2.6.23/include/scsi/~scsi_cmnd.h  2007-10-16 10:20:32.0 +0530
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct request;
 struct scatterlist;

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.


Re: [patch] forcedeth: fix the NAPI poll function

2007-10-15 Thread Jeff Garzik

Ingo Molnar wrote:
> * Jeff Garzik <[EMAIL PROTECTED]> wrote:
>
> > Two comments:
> >
> > 1) we have a vague definition of "RX work processed."  Due to error
> > conditions and goto's in that function, rx_processed_cnt may or may
> > not equal the number of packets actually processed.
> >
> > 2) man I dislike these inline C statement combinations (ranting at
> > original code style, not you).  I would much rather waste a few extra
> > lines of source code and make the conditions obvious:
> >
> >	while (... && (rx_processed_cnt < limit)) {
> >		rx_processed_cnt++;
> >
> >		...
> >	}
> >
> > or even
> >
> >	while (1) {
> >		...
> >		if (rx_processed_cnt == limit)
> >			break;
> >		rx_processed_cnt++;
> >	}
> >
> > The compiler certainly doesn't care, and IMO it prevents bugs.
>
> agreed. Do you have an uptodate patch/git-URI for the forcedeth rewrite
> you did? I can throw it into the testbed.


Branch 'fe-lock' of
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git

It works here locally, but at this very minute I am rewriting those 
changesets yet again :)


Jeff
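
The explicit-limit loop style discussed in this message can be sketched in user-space C. This is not the forcedeth code: struct rx_desc, its fields and rx_poll() are hypothetical names for the example. It also keeps a separate count of frames actually delivered, which is one possible answer to the "vague definition" concern.

```c
#include <stddef.h>

/* Hypothetical descriptor: `ready` = filled by hardware, `error` = bad frame. */
struct rx_desc {
	int ready;
	int error;
};

/*
 * Walk the ring from index 0, handling at most `limit` descriptors.
 * Returns the number of descriptors consumed; *good receives the
 * number of frames that were actually delivered.
 */
static int rx_poll(const struct rx_desc *ring, size_t ring_len,
		   int limit, int *good)
{
	int rx_processed_cnt = 0;
	size_t i = 0;

	*good = 0;
	while (i < ring_len && ring[i].ready) {
		if (rx_processed_cnt == limit)
			break;		/* budget exhausted, stated on its own line */
		rx_processed_cnt++;

		if (!ring[i].error)
			(*good)++;	/* only error-free frames count as delivered */
		i++;
	}
	return rx_processed_cnt;
}
```

Keeping the budget test and the increment as separate statements makes it easy to see that the return value counts descriptors consumed, while *good counts packets delivered.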





Re: linux-2.6.23-mm1 crashed

2007-10-15 Thread Dave Milter
On 10/14/07, Andrew Morton <[EMAIL PROTECTED]> wrote:
> I didn't notice that qemu was involved.  Does qemu have an emulator for the
> gdth hardware?
>

I don't think so; the kernel just probes whether the hardware exists, and hangs after that.


Re: [patch] forcedeth: fix the NAPI poll function

2007-10-15 Thread Ingo Molnar

* Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Two comments:
> 
> 1) we have a vague definition of "RX work processed."  Due to error 
> conditions and goto's in that function, rx_processed_cnt may or may 
> not equal the number of packets actually processed.
> 
> 2) man I dislike these inline C statement combinations (ranting at 
> original code style, not you).  I would much rather waste a few extra 
> lines of source code and make the conditions obvious:
> 
>	while (... && (rx_processed_cnt < limit)) {
>		rx_processed_cnt++;
> 
>		...
>	}
> 
> or even
> 
>	while (1) {
>		...
>		if (rx_processed_cnt == limit)
>			break;
>		rx_processed_cnt++;
>	}
> 
> The compiler certainly doesn't care, and IMO it prevents bugs.

agreed. Do you have an uptodate patch/git-URI for the forcedeth rewrite 
you did? I can throw it into the testbed.

Ingo


Re: usb

2007-10-15 Thread Greg KH
On Mon, Oct 15, 2007 at 10:37:04AM -0700, Yinghai Lu wrote:
> Greg,
> 
> from linus's git this morning..
> 
> ACPI: PCI Interrupt :00:02.1[B] -> Link [LUB2] -> GSI 21 (level, low) -> IRQ 21
> ehci_hcd :00:02.1: EHCI Host Controller
> ehci_hcd :00:02.1: new USB bus registered, assigned bus number 1
> ehci_hcd :00:02.1: debug port 1
> ehci_hcd :00:02.1: irq 21, io mem 0xdefbec00
> ehci_hcd :00:02.1: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
> usb usb1: configuration #1 chosen from 1 choice
> hub 1-0:1.0: USB hub found
> hub 1-0:1.0: 10 ports detected
> ACPI: PCI Interrupt :00:02.0[A] -> Link [LUB0] -> GSI 22 (level, low) -> IRQ 22
> ohci_hcd :00:02.0: OHCI Host Controller
> ohci_hcd :00:02.0: new USB bus registered, assigned bus number 2
> ohci_hcd :00:02.0: irq 22, io mem 0xdefbf000
> usb 1-1: new high speed USB device using ehci_hcd and address 2
> usb usb2: configuration #1 chosen from 1 choice
> hub 2-0:1.0: USB hub found
> hub 2-0:1.0: 10 ports detected
> USB Universal Host Controller Interface driver v3.0
> Initializing USB Mass Storage driver...
> usb 1-1: configuration #1 chosen from 1 choice
> usb 1-6: new high speed USB device using ehci_hcd and address 5
> usb 1-6: configuration #1 chosen from 1 choice
> hub 1-6:1.0: USB hub found
> hub 1-6:1.0: 4 ports detected
> sysfs: duplicate filename 'bInterfaceNumber' can not be created
> WARNING: at fs/sysfs/dir.c:425 sysfs_add_one()
> 
> Call Trace:
>  [] sysfs_add_one+0x54/0xbd
>  [] sysfs_add_file+0x50/0x81
>  [] sysfs_create_group+0x9a/0xf2
>  [] usb_create_sysfs_intf_files+0x32/0xc7
>  [] usb_set_configuration+0x49d/0x4c0
>  [] generic_probe+0x53/0x95
>  [] driver_probe_device+0xd3/0x150
>  [] __device_attach+0x0/0x5
>  [] bus_for_each_drv+0x40/0x71
>  [] device_attach+0x63/0x7a
>  [] bus_attach_device+0x2a/0x78
>  [] device_add+0x308/0x51e
>  [] usb_new_device+0x47/0x80
>  [] hub_thread+0x75a/0xb4a
>  [] autoremove_wake_function+0x0/0x2e
>  [] hub_thread+0x0/0xb4a
>  [] kthread+0x47/0x76
>  [] child_rip+0xa/0x12
>  [] kthread+0x0/0x76
>  [] child_rip+0x0/0x12
> 
> usb 2-3: new full speed USB device using ohci_hcd and address 2
> usb 2-3: configuration #1 chosen from 1 choice
> hub 2-3:1.0: USB hub found
> hub 2-3:1.0: 4 ports detected
> usb 2-5: new low speed USB device using ohci_hcd and address 3
> usb 2-5: configuration #1 chosen from 1 choice
> scsi15 : SCSI emulation for USB Mass Storage devices
> usb 2-3.1: new low speed USB device using ohci_hcd and address 4
> usb 2-3.1: configuration #1 chosen from 1 choice
> usb 2-3.4: new low speed USB device using ohci_hcd and address 5
> usb 2-3.4: configuration #1 chosen from 1 choice

Yes, I can duplicate this here now too.  Will work on this tomorrow to
track down...

thanks,

greg k-h


Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> >> make_page_uptodate() is the most hideous part I have run into.
> >> It has to know details about other layers to know what not
> >> to stomp on.  I think my incorrect simplification of this is what messed
> >> things up, last round.
> >
> > Not really, it's just named funny. That's just a minor utility
> > function that more or less does what it says it should do.
> >
> > The main problem is really that it's implementing a block device
> > whose data comes from its own buffercache :P. I think.
>
> Well to put it another way, make_page_uptodate() is the only
> place where we really need to know about the upper layers.
> Given that you can kill ramdisks by coding it as:
>
> static void make_page_uptodate(struct page *page)
> {
>   clear_highpage(page);
>   flush_dcache_page(page);
>   SetPageUptodate(page);
> }
>
> Something is seriously non-intuitive about that function if
> you understand the usual rules for how to use the page cache.

You're overwriting some buffers that were uptodate and dirty.
That would be expected to cause problems.


> The problem is that we support a case in the buffer cache
> where pages are partially uptodate and only the buffer_heads
> remember which parts are valid.  Assuming we are using them
> correctly.
>
> Having to walk through all of the buffer heads in make_page_uptodate
> seems to me to be a nasty layering violation in rd.c

Sure, but it's not just about the buffers. It's the pagecache
in general. It is supposed to be invisible to the device driver
and sitting above it, and yet it is taking the buffercache and
using it to pull its data out of.


> > I think it's worthwhile, given that we'd have a "real" looking
> > block device and minus these bugs.
>
> For testing purposes I think I can agree with that.

What non-testing uses does it have?


> >> Having a separate store would
> >> solve some of the problems, and probably remove the need
> >> for carefully specifying the ramdisk block size.  We would
> >> still need the magic restrictions on page allocations though
> >> and we would use them more often as the initial write to the
> >> ramdisk would not populate the pages we need.
> >
> > What magic restrictions on page allocations? Actually we have
> > fewer restrictions on page allocations because we can use
> > highmem!
>
> With the proposed rewrite yes.
>
> > And the lowmem buffercache pages that we currently pin
> > (unsuccessfully, in the case of this bug) are now completely
> > reclaimable. And all your buffer heads are now reclaimable.
>
> Hmm.  Good point.  So in net it should save memory even if
> it consumes a little more in the worst case.

Highmem systems would definitely like it. For others, yes, all
the duplicated pages should be able to get reclaimed if memory
gets tight, along with the buffer heads, so yeah footprint may
be a tad smaller.


> > If you mean GFP_NOIO... I don't see any problem. Block device
> > drivers have to allocate memory with GFP_NOIO; this may have
> > been considered magic or deep badness back when the code was
> > written, but it's pretty simple and accepted now.
>
> Well I always figured it was a bit rude allocating large amounts
> of memory GFP_NOIO but whatever.

You'd rather not, of course, but with dirty data limits now,
it doesn't matter much. (and I doubt anybody outside testing
is going to be hammering like crazy on rd).

Note that the buffercache based ramdisk driver is going to
also be allocating with GFP_NOFS if you're talking about a
filesystem writing to its metadata. In most systems, GFP_NOFS
isn't much different to GFP_NOIO.

We could introduce a mode which allocates pages up front
quite easily if it were a problem (which I doubt it ever would
be).



[BUG] 2.6.23-git8 kernel oops at __rb_rotate_left+0x7/0x70

2007-10-15 Thread Kamalesh Babulal
Hi,

While running kernbench on 2.6.23-git8, the following oops is produced:

Unable to handle kernel NULL pointer dereference at 0010 RIP: 
 [] __rb_rotate_left+0x7/0x70
PGD 31f7ad067 PUD 31f14d067 PMD 0 
Oops:  [1] SMP 
CPU 8 
Modules linked in: loop dm_mod md_mod sg
Pid: 6923, comm: slpd Not tainted 2.6.23-git8-autokern1 #1
RIP: 0010:[]  [] __rb_rotate_left+0x7/0x70
RSP: 0018:81031d083e90  EFLAGS: 00010086
RAX: 8106147550d0 RBX: 81033007b650 RCX: 
RDX:  RSI: 810330080808 RDI: 8106147550d0
RBP: 8106147550d0 R08: 81033007b650 R09: 81033007b650
R10: 8103300807e0 R11:  R12: 8106147550d0
R13: 810330080808 R14: 810330080780 R15: 0008
FS:  2ab70eae80a0() GS:8106146b5440() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 0010 CR3: 00031d08f000 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process slpd (pid: 6923, threadinfo 81031d082000, task 81031f31f100)
Stack:  8033f4aa 8106147550c0 81031d083ed0 8103300807e0
 81031f31f300 8022bc59 0008 0384
 81031d083f70 804b6dec 81031d61b0c0 1000
Call Trace:
 [] rb_insert_color+0x8a/0xf0
 [] put_prev_task_fair+0x49/0x60
 [] schedule+0xec/0x1d1
 [] vfs_read+0xc5/0x160
 [] sys_read+0x53/0x90
 [] sysret_careful+0xd/0x10


Code: 48 8b 51 10 49 83 e0 fc 48 85 d2 48 89 57 08 74 0c 48 8b 02 
RIP  [] __rb_rotate_left+0x7/0x70
 RSP 
CR2: 0010

-- 
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.23-git8-autokern1
# Mon Oct 15 19:35:23 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=20
# CONFIG_CPUSETS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_BLK_DEV_BSG is not set

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Processor type and features
#
# CONFIG_TICK_ONESHOT is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
# CONFIG_MTRR is not set
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y

Re: [PATCH 11/11] maps3: make page monitoring /proc file optional

2007-10-15 Thread David Rientjes
On Mon, 15 Oct 2007, Matt Mackall wrote:

> Index: l/init/Kconfig
> ===
> --- l.orig/init/Kconfig   2007-10-14 13:35:07.0 -0500
> +++ l/init/Kconfig2007-10-15 17:18:16.0 -0500
> @@ -571,6 +571,15 @@ config SLOB
>  
>  endchoice
>  
> +config PROC_PAGE_MONITOR
> + default y
> + bool "Enable /proc page monitoring" if EMBEDDED && PROC_FS && MMU
> + help
> +   Various /proc files exist to monitor process memory utilization:
> +   /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
> +   /proc/kpagecount, and /proc/kpageflags. Disabling these
> +  interfaces will reduce the size of the kernel by approximately 4kb.
> +
>  endmenu  # General setup
>  
>  config RT_MUTEXES

It's probably better not to include the text size savings since it will 
most likely be outdated at some time in the future.


Re: [linux-usb-devel] usb+sysfs: duplicate filename 'bInterfaceNumber'

2007-10-15 Thread Dave Young
On 10/16/07, Greg KH <[EMAIL PROTECTED]> wrote:
> On Mon, Oct 15, 2007 at 02:38:25PM -0400, Alan Stern wrote:
> > On Mon, 15 Oct 2007, Dave Young wrote:
> >
> > > On 10/14/07, Borislav Petkov <[EMAIL PROTECTED]> wrote:
> > > > Hi,
> > > >
> > > > i get the following warning on yesterday's git tree (v2.6.23-2840-g752097c):
> > > >
> > > > Oct 14 09:07:15 zmei kernel: [   49.368030] sysfs: duplicate filename 'bInterfaceNumber' can not be created
> > > > Oct 14 09:07:15 zmei kernel: [   49.368086] WARNING: at fs/sysfs/dir.c:425 sysfs_add_one()
> > > > Oct 14 09:07:15 zmei kernel: [   49.368134]  [] show_trace_log_lvl+0x1a/0x2f
> > > > Oct 14 09:07:15 zmei kernel: [   49.368220]  [] show_trace+0x12/0x14
> > > > Oct 14 09:07:15 zmei kernel: [   49.368300]  [] dump_stack+0x16/0x18
> > > > Oct 14 09:07:15 zmei kernel: [   49.368379]  [] sysfs_add_one+0x57/0xbc
> > > > Oct 14 09:07:15 zmei kernel: [   49.368461]  [] sysfs_add_file+0x49/0x71
> > > > Oct 14 09:07:15 zmei kernel: [   49.368541]  [] sysfs_create_group+0x86/0xe8
> > > > Oct 14 09:07:15 zmei kernel: [   49.368621]  [] usb_create_sysfs_intf_files+0x27/0x9b
> > > > Oct 14 09:07:15 zmei kernel: [   49.368704]  [] usb_set_configuration+0x454/0x466
> > > > Oct 14 09:07:15 zmei kernel: [   49.368787]  [] generic_probe+0x53/0x94
> > > > Oct 14 09:07:15 zmei kernel: [   49.368867]  [] usb_probe_device+0x35/0x3b
> > > > Oct 14 09:07:15 zmei kernel: [   49.368947]  [] driver_probe_device+0xcb/0x14f
> > > > Oct 14 09:07:15 zmei kernel: [   49.369039]  [] __device_attach+0x8/0xa
> > > > Oct 14 09:07:15 zmei kernel: [   49.369119]  [] bus_for_each_drv+0x3b/0x63
> > > > Oct 14 09:07:15 zmei kernel: [   49.369199]  [] device_attach+0x70/0x85
> > > > Oct 14 09:07:15 zmei kernel: [   49.369279]  [] bus_attach_device+0x29/0x77
> > > > Oct 14 09:07:15 zmei kernel: [   49.369359]  [] device_add+0x28c/0x445
> > > > Oct 14 09:07:15 zmei kernel: [   49.369439]  [] usb_new_device+0x44/0x82
> > > > Oct 14 09:07:15 zmei kernel: [   49.369519]  [] hub_thread+0x666/0x9c2
> > > > Oct 14 09:07:15 zmei kernel: [   49.369598]  [] kthread+0x3b/0x62
> > > > Oct 14 09:07:15 zmei kernel: [   49.369679]  [] kernel_thread_helper+0x7/0x10
> > > > Oct 14 09:07:15 zmei kernel: [   49.369759]  ===
> > > >
> > > > The usb hub in question is named 4-1:1.0 and it has an extension
> > > > connected to it which is used to activate the 2 usb connectors at the
> > > > side of the pc's monitor. Correct me if i'm wrong but from what i've
> > > > understood so far from reading the code, i think it adds the
> > > > bInterfaceNumber file after calling usb_create_sysfs_intf_files(intf).
> > > > However, the currently active usbhost interface alternate setting is
> > > > the only one active, so bInterfaceNumber exists already and therefore
> > > > the warning, but this is just a guess since i'm not that fluent in the
> > > > usb internals.
> > > Hi,
> > > I have encountered the same problem, which was reported in
> > > http://lkml.org/lkml/2007/9/29/45
> > >
> > > For the first one, "usbcore duplicated sysfs filename", I have submitted
> > > a patch to fix it.
> > >
> > > For the "bInterfaceNumber" one, I have no idea; the same problem still
> > > exists in the latest 23-mm1 tree.
> >
> > I have tried several times to duplicate this, most recently under
> > 2.6.23-mm1.  But nothing goes wrong; the error messages don't appear.
> >
> > You may have to do your own debugging.  Try adding printk statements to
> > usb_create_sysfs_intf_files() and usb_remove_sysfs_intf_files() so you
> > can tell when they get called.
>
> I finally duplicated this on one of my machines here at boot time, with
> USB built into the kernel.  I'll work tomorrow on tracking this down
> further...
Hi,
I added some printk messages, dump_stack calls and a few others; here is the
dmesg dump with debug info (lines beginning with "hidave"):

Linux version 2.6.23-mm1 ([EMAIL PROTECTED]) (gcc version 3.4.6) #4 SMP
PREEMPT Tue Oct 16 11:14:10 CST 2007
BIOS-provided physical RAM map:
 BIOS-e820:  - 000a (usable)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 3fe88c00 (usable)
 BIOS-e820: 3fe88c00 - 3fe8ac00 (ACPI NVS)
 BIOS-e820: 3fe8ac00 - 3fe8cc00 (ACPI data)
 BIOS-e820: 3fe8cc00 - 4000 (reserved)
 BIOS-e820: f000 - f400 (reserved)
 BIOS-e820: fec0 - fed00400 (reserved)
 BIOS-e820: fed2 - feda (reserved)
 BIOS-e820: fee0 - fef0 (reserved)
 BIOS-e820: ffb0 - 0001 (reserved)
126MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000fe710

Re: [RFC] cpuset update_cgroup_cpus_allowed

2007-10-15 Thread Paul Jackson
> Will do - I just wanted to get this quickly out to show the idea
> that I was working on.

Ok - good.

In the final analysis, I'll take whatever works ;).

I'll lobby for keeping the code "simple" (a subjective metric) and poke
what holes I can in things, and propose what alternatives I can muster.

But so long as setting a cpuset's 'cpus' in 2.6.24 leads, whether by my
historical "rewrite the pid to its own 'tasks' file" hack, or by a
proper solution such as you have advocated, or by some other scheme
or hack, to updating the cpus_allowed of each task in that cpuset, then
I'm ok.

Right now, that goal is not met, with the cgroup patches lined up in
*-mm for what will become 2.6.24.

We're getting short of time to fix this.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [PATCH 7/11] maps3: move clear_refs code to task_mmu.c

2007-10-15 Thread David Rientjes
On Mon, 15 Oct 2007, Matt Mackall wrote:

> Index: l/fs/proc/task_mmu.c
> ===
> --- l.orig/fs/proc/task_mmu.c 2007-10-14 13:38:43.0 -0500
> +++ l/fs/proc/task_mmu.c  2007-10-14 13:39:00.0 -0500
> @@ -324,19 +324,47 @@ static int show_smap(struct seq_file *m,
>  
>  static struct mm_walk clear_refs_walk = { .pmd_entry = clear_refs_pte_range };
>  
> -void clear_refs_smap(struct mm_struct *mm)
> +static ssize_t clear_refs_write(struct file *file, const char __user *buf,
> + size_t count, loff_t *ppos)
>  {
> + struct task_struct *task;
> + char buffer[13], *end;

The #define for PROC_NUMBUF will need to be moved from fs/proc/base.c to 
include/linux/proc_fs.h and used here instead of hardcoding it.
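
A user-space sketch of the pattern under review, with the buffer sized by a named constant rather than a literal. The value 13 simply mirrors the hardcoded size in the quoted hunk; parse_small_number() is a hypothetical helper, and memcpy() and strtol() stand in for whatever the kernel code uses to copy and parse the user buffer.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define PROC_NUMBUF 13	/* mirrors the literal 13 in the quoted hunk */

/*
 * Copy at most PROC_NUMBUF - 1 bytes of `buf` into a NUL-terminated
 * local buffer and parse it as a decimal integer. Returns 0 and stores
 * the value on success, or -EINVAL if no digits were found.
 */
static int parse_small_number(const char *buf, size_t count, long *value)
{
	char buffer[PROC_NUMBUF];
	char *end;

	memset(buffer, 0, sizeof(buffer));
	if (count > sizeof(buffer) - 1)
		count = sizeof(buffer) - 1;	/* always leave room for the NUL */
	memcpy(buffer, buf, count);

	*value = strtol(buffer, &end, 10);
	if (end == buffer)
		return -EINVAL;
	return 0;
}
```

Because every size in the function is derived from sizeof(buffer), changing the constant in one shared header keeps all users consistent, which is the point of the review comment.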


Re: [PATCH] Allow kconfig to accept overrides

2007-10-15 Thread Rob Landley
On Monday 15 October 2007 11:29:58 pm Sam Ravnborg wrote:
> Hi Rob & Jan.
>
> On Fri, Oct 12, 2007 at 11:44:08PM +0200, Jan Engelhardt wrote:
> > Allow config variables in .config to override earlier ones in the same
> > file. In other words,
> >
> > # CONFIG_SECURITY is not set
> > CONFIG_SECURITY=y
> >
> > will activate it. This makes it a bit easier to do
> >
> > (cat original-config myconfig myconfig2 ... >.config)
> >
> > and run menuconfig as expected.
>
> How far is this from the miniconfig functionality?
> Is it the same or can we achieve the miniconfig support
> by extending Jan's patch?
>
> See: http://lkml.org/lkml/2007/10/12/391

Way way back (2.6.10 or thereabouts) I first did a miniconfig via running 
allnoconfig, concatenating a miniconfig to the result, and running "make 
oldconfig" on that.  This concatenation method had two main problems:

1) Around 2.6.15 the kconfig infrastructure changed so the first instance of a
symbol won rather than the last.  It looks like this patch just sets
the behavior back to what we had in 2.6.14 and earlier.

2) When a symbol activates new subsymbols (opening a new menu, for example),
those dependent symbols would be activated at their oldconfig default values,
not their allnoconfig default values.  This meant there might be a valid 
configuration that you couldn't specify without saying "symbol=n" to turn 
some of them off in your miniconfig, which is something a miniconfig should 
never have to do.  (This happens when allnoconfig and oldconfig are run in 
two separate passes.  The oldconfig pass uses the wrong default values for 
newly enabled symbols.  Menuconfig has the same defaults as oldconfig, which 
are _not_ the same defaults as allnoconfig.)

Note that the infrastructure I'm using to _read_ miniconfig files is just a
repurposing of the existing KCONFIG_ALLCONFIG as applied to allnoconfig.
That's in kconfig already, has been since 2.6.15-ish, and works fine.  The 
syntax is nonobvious (two patches from me to improve said syntax and add some 
error checking were rejected), but the functionality is there and easy enough 
to trigger:

  make allnoconfig KCONFIG_ALLCONFIG=mini.conf

That expands a mini.conf into a .config, and does the other setup necessary.  
(You can feed that O= to build out of tree, or ARCH= to build another 
architecture...  Anything you can currently do with allnoconfig.)

It's the "shrinking a .config into a mini.conf" side of things that uses a 
hideous shell script that's not in the tree:
http://landley.net/hg/firmware/raw-file/tip/sources/toys/miniconfig.sh

To use it:
  make ARCH=arm defconfig
  mv .config tempname
  ARCH=arm ./miniconfig.sh tempname
  ls -l mini.config

(Obviously, the ARCH=arm is optional and you don't have to start with 
defconfig.)

If I had unlimited spare time I might teach kconfig to automatically write a 
mini.conf every time it writes a .config, and have it use whichever was newer 
in the update commands (oldconfig/menuconfig/etc).  But after two rejected 
patches on this topic already, with the shell script meeting my needs, that's 
impressively far down on my todo list.

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.


Re: [RFC] cpuset update_cgroup_cpus_allowed

2007-10-15 Thread Paul Menage
On 10/15/07, Paul Jackson <[EMAIL PROTECTED]> wrote:
> > currently against an older kernel
>
> ah .. which older kernel?

2.6.18, but I can do a version against 2.6.23-mm1.

> +   if (!retval) {
> +   cpus_allowed = cpuset_cpus_allowed(p);
> +   if (!cpus_subset(new_mask, cpus_allowed)) {
> +   /*
> +* We must have raced with a concurrent cpuset
> +* update. Just reset the cpus_allowed to the
> +* cpuset's cpus_allowed
> +*/
> +   new_mask = cpus_allowed;
>
> This narrows the race, perhaps sufficiently, but I don't see that it
> guarantees closure.  Memory accesses to two different locations are not
> guaranteed to be ordered across nodes, as best I recall.  The second
> line above, that rereads the cpuset cpus_allowed, could get an old
> value, in essence.
>
>     cpuset update task          sched_setaffinity task
>     ------------------          ----------------------
>
>     A. write cpuset [Q]         V. read cpuset [Q]
>     B. read task [P]            W. check ok
>     C. write task [P]           X. write task [P]
>                                 Y. reread cpuset [Q]
>                                 Z. check ok again
>
> Two memory locations:
> [P] the cpus_allowed mask in the task_struct of the
> task doing the sched_setaffinity call.
> [Q] the cpus_allowed mask in the cpuset of the cpuset
> to which the sched_setaffinity task is attached.
>
> Even though, from the perspective of location [P], both B. and C.
> happened before X., still from the perspective of location [Q] the
> rereading in Y. could return the value the cpuset cpus_allowed had
> before the write in A.  This could result in a task running with
> a cpus_allowed that was totally outside its cpuset's cpus_allowed.

But cpuset_cpus_allowed() synchronizes on callback_mutex. So I assert
this race isn't an issue.
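
A rough userspace sketch of that argument, with pthread mutexes standing in
for callback_mutex.  Every name here (toy_mask, toy_cpuset_update() and so
on) is invented for illustration and is not kernel code; it only shows the
shape of the claim, namely that a reread done under the same mutex as the
update cannot return the pre-update value once the update has completed.
It is not a proof that the patch closes the race.

```c
#include <assert.h>
#include <pthread.h>

/* Toy stand-in: a CPU mask is just a set of bits in an unsigned long. */
typedef unsigned long toy_mask;

static pthread_mutex_t toy_callback_mutex = PTHREAD_MUTEX_INITIALIZER;
static toy_mask toy_cpuset_mask = 0xfUL;    /* location [Q] */

/* Step A: the cpuset update writes [Q] while holding the mutex. */
void toy_cpuset_update(toy_mask new_mask)
{
    pthread_mutex_lock(&toy_callback_mutex);
    toy_cpuset_mask = new_mask;
    pthread_mutex_unlock(&toy_callback_mutex);
}

/* Steps V and Y: reads of [Q] take the same mutex. */
toy_mask toy_cpuset_cpus_allowed(void)
{
    toy_mask mask;

    pthread_mutex_lock(&toy_callback_mutex);
    mask = toy_cpuset_mask;
    pthread_mutex_unlock(&toy_callback_mutex);
    return mask;
}

/* Steps X, Y, Z: write the task mask [P], then recheck it against [Q]. */
toy_mask toy_set_affinity(toy_mask *task_mask, toy_mask requested)
{
    toy_mask allowed;

    assert(task_mask);
    *task_mask = requested;                 /* X. write task [P] */
    allowed = toy_cpuset_cpus_allowed();    /* Y. reread cpuset [Q] */
    if ((requested & ~allowed) != 0)        /* Z. not a subset any more? */
        *task_mask = allowed;               /* reset to the cpuset's mask */
    return *task_mask;
}
```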

>
> I will grant that this is a narrow window.  I won't lose much sleep
> over it.
>
> >  - uses a priority heap to pick the processes to act on, based on start time
>
> This adds a fair bit of code and complexity, relative to my patch.
> This I do lose more sleep over.  There has to be a compelling
> reason for doing this.

My plan was to hide this inside cgroup_iter_* so that users didn't
have to hold the cssgroup_lock across the entire iteration.

>
> The point that David raises, regarding the interaction of this with
> hotplug, seems to be a compelling reason for doing -something-
> different than my patch proposal.
>
> I don't know yet if it compels us to this much code, however.
>
> Any chance you could provide a patch that works against cgroups?
>

Will do - I just wanted to get this out quickly to show the idea
that I was working on.

Paul


Re: [PATCH] sched: Rationalize sys_sched_rr_get_interval()

2007-10-15 Thread Peter Williams

Jarek Poplawski wrote:
> On 13-10-2007 03:29, Peter Williams wrote:
>> Jarek Poplawski wrote:
>>> On 12-10-2007 00:23, Peter Williams wrote:
>>> ...
>>>> The reason I was going that route was for modularity (which helps
>>>> when adding plugsched patches).  I'll submit a revised patch for
>>>> consideration.
>>> ...
>>>
>>> IMHO, it looks like modularity could suck here:
>>>
>>> +static unsigned int default_timeslice_fair(struct task_struct *p)
>>> +{
>>> +	return NS_TO_JIFFIES(sysctl_sched_min_granularity);
>>> +}
>>>
>>> If it's needed for outside and sched_fair will use something else
>>> (to avoid double conversion) this could be misleading. Shouldn't
>>> this be kind of private and return something usable for the class
>>> mainly?
>>
>> This is supplying data for a system call not something for internal use
>> by the class.  As far as the sched_fair class is concerned this is just
>> a (necessary - because it's needed by a system call) diversion.
>
> So, now all is clear: this is the misleading case!
>
>>> Why anything else than sched_fair should care about this?
>>
>> sched_fair doesn't care so if nothing else does why do we even have
>> sys_sched_rr_get_interval()?  Is this whole function an anachronism that
>> can be expunged?  I'm assuming that the reason it exists is that there
>> are user space programs that use this system call.  Am I correct in this
>> assumption?  Personally, I can't think of anything it would be useful
>> for other than satisfying curiosity.
>
> Since this is for some special aim (not default for most classes, at
> least not for sched_fair) I'd suggest to change names:
> default_timeslice_fair() and .default_timeslice to something like eg.:
> rr_timeslice_fair() and .rr_timeslice or rr_interval_fair() and
> .rr_interval (maybe with this "default" before_"rr_" if necessary).
>
> On the other hand man (2) sched_rr_get_interval mentions that:
> "The identified process should be running under the SCHED_RR
> scheduling policy".
>
> Also this place seems to say about something simpler:
> http://www.gnu.org/software/libc/manual/html_node/Basic-Scheduling-Functions.html
>
> So, I still doubt sched_fair's "notion" of timeslices should be
> necessary here.

As do I.  Even more so now that you've shown me the man page for 
sched_rr_get_interval().


I'd suggest that we modify sched_rr_get_interval() to return -EINVAL 
(with *interval set to zero) if the target task is not SCHED_RR.  That 
way we can save a lot of unnecessary code.  I'll work on a patch. 
Unless you want to do it?
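
For illustration only, the proposed semantics can be mimicked from userspace
with a small wrapper around the existing POSIX calls sched_getscheduler()
and sched_rr_get_interval().  The wrapper name rr_interval_ns() is made up
for this sketch; it is not the kernel patch, just the intended behaviour
(-EINVAL and a zero interval unless the task is SCHED_RR).

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical userspace wrapper mirroring the proposed semantics:
 * succeed only for SCHED_RR tasks, otherwise return -EINVAL with the
 * interval set to zero.  Returns 0 on success, a negative errno otherwise.
 */
int rr_interval_ns(pid_t pid, long long *interval_ns)
{
    struct timespec ts;
    int policy;

    assert(interval_ns);
    *interval_ns = 0;

    policy = sched_getscheduler(pid);    /* pid 0 means the calling task */
    if (policy < 0)
        return -errno;
    if (policy != SCHED_RR)
        return -EINVAL;                  /* not round-robin: no timeslice */

    if (sched_rr_get_interval(pid, &ts) < 0)
        return -errno;

    *interval_ns = (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    return 0;
}
```

Calling rr_interval_ns(0, &ns) from an ordinary SCHED_OTHER process would
then be expected to fail with -EINVAL and leave ns at zero.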




> Sorry for too harsh words.


I didn't consider them harsh.

Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [PATCH 5/11] maps3: use pagewalker in clear_refs and smaps

2007-10-15 Thread David Rientjes
On Mon, 15 Oct 2007, Matt Mackall wrote:

> Use the generic pagewalker for smaps and clear_refs
> 
> Signed-off-by: Matt Mackall <[EMAIL PROTECTED]>

Acked-by: David Rientjes <[EMAIL PROTECTED]>


Re: [PATCH 4/11] maps3: introduce a generic page walker

2007-10-15 Thread David Rientjes
On Mon, 15 Oct 2007, Matt Mackall wrote:

> Introduce a general page table walker
> 
> Signed-off-by: Matt Mackall <[EMAIL PROTECTED]>
> 
> Index: l/include/linux/mm.h
> ===
> --- l.orig/include/linux/mm.h 2007-10-09 17:37:59.0 -0500
> +++ l/include/linux/mm.h  2007-10-10 11:46:37.0 -0500
> @@ -773,6 +773,17 @@ unsigned long unmap_vmas(struct mmu_gath
>   struct vm_area_struct *start_vma, unsigned long start_addr,
>   unsigned long end_addr, unsigned long *nr_accounted,
>   struct zap_details *);
> +
> +struct mm_walk {
> + int (*pgd_entry)(pgd_t *, unsigned long, unsigned long, void *);
> + int (*pud_entry)(pud_t *, unsigned long, unsigned long, void *);
> + int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
> + int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
> + int (*pte_hole) (unsigned long, unsigned long, void *);
> +};
> +
> +int walk_page_range(struct mm_struct *, unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private);

The struct mm_walk * can be qualified as const.
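
To show what that buys, here is a small userspace analogue (a sketch under
my own assumptions, not the patch): a two-level table walker that takes its
callback table as a pointer to const, so a caller can keep the table in
read-only data.  All toy_* names and the fan-out of 4 are invented for this
example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LEAF_ENTRIES 4UL    /* hypothetical fan-out of a leaf table */

/* Callback table; the walker never modifies it, so it is taken as const. */
struct toy_walk {
    int (*entry)(int *slot, unsigned long addr, void *private);
    int (*hole)(unsigned long start, unsigned long end, void *private);
};

/*
 * Walk slots [addr, end) of a two-level table.  top[i] points to a leaf
 * table of TOY_LEAF_ENTRIES ints, or is NULL for a hole.  A non-zero
 * callback return stops the walk and is handed back to the caller.
 */
int toy_walk_range(int **top, unsigned long addr, unsigned long end,
                   const struct toy_walk *walk, void *private)
{
    int err = 0;

    assert(top && walk);
    while (addr < end && !err) {
        unsigned long idx = addr / TOY_LEAF_ENTRIES;
        unsigned long next = (idx + 1) * TOY_LEAF_ENTRIES;

        if (next > end)
            next = end;

        if (!top[idx]) {
            /* No leaf table here: report the whole gap once. */
            if (walk->hole)
                err = walk->hole(addr, next, private);
            addr = next;
            continue;
        }
        for (; addr < next && !err; addr++)
            if (walk->entry)
                err = walk->entry(&top[idx][addr % TOY_LEAF_ENTRIES],
                                  addr, private);
    }
    return err;
}

/* Example callbacks: count present slots and slots covered by holes. */
struct toy_counts {
    unsigned long present, missing;
};

static int toy_count_entry(int *slot, unsigned long addr, void *private)
{
    (void)slot;
    (void)addr;
    ((struct toy_counts *)private)->present++;
    return 0;
}

static int toy_count_hole(unsigned long start, unsigned long end,
                          void *private)
{
    ((struct toy_counts *)private)->missing += end - start;
    return 0;
}

/* Because the walker takes a const pointer, this table can be const too. */
static const struct toy_walk toy_counter = {
    .entry = toy_count_entry,
    .hole  = toy_count_hole,
};
```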

> Index: l/mm/pagewalk.c
> ===
> --- /dev/null 1970-01-01 00:00:00.0 +
> +++ l/mm/pagewalk.c   2007-10-10 11:46:37.0 -0500
> @@ -0,0 +1,120 @@
> +#include 
> +#include 
> +#include 
> +
> +static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> +   struct mm_walk *walk, void *private)
> +{
> + pte_t *pte;
> + int err = 0;
> +
> + pte = pte_offset_map(pmd, addr);
> + do {
> + err = walk->pte_entry(pte, addr, addr, private);
> + if (err)
> +break;
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> +
> + pte_unmap(pte);
> + return err;
> +}
> +
> +static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> +   struct mm_walk *walk, void *private)
> +{
> + pmd_t *pmd;
> + unsigned long next;
> + int err = 0;
> +
> + pmd = pmd_offset(pud, addr);
> + do {
> + next = pmd_addr_end(addr, end);
> + if (pmd_none_or_clear_bad(pmd)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> + if (err)
> + break;
> + continue;
> + }
> + if (walk->pmd_entry)
> + err = walk->pmd_entry(pmd, addr, next, private);
> + if (!err && walk->pte_entry)
> + err = walk_pte_range(pmd, addr, next, walk, private);
> + if (err)
> + break;
> + } while (pmd++, addr = next, addr != end);
> +
> + return err;
> +}
> +
> +static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> +   struct mm_walk *walk, void *private)
> +{
> + pud_t *pud;
> + unsigned long next;
> + int err = 0;
> +
> + pud = pud_offset(pgd, addr);
> + do {
> + next = pud_addr_end(addr, end);
> + if (pud_none_or_clear_bad(pud)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> + if (err)
> + break;
> + continue;
> + }
> + if (walk->pud_entry)
> + err = walk->pud_entry(pud, addr, next, private);
> + if (!err && (walk->pmd_entry || walk->pte_entry))
> + err = walk_pmd_range(pud, addr, next, walk, private);
> + if (err)
> + break;
> + } while (pud++, addr = next, addr != end);
> +
> + return err;
> +}
> +
> +/*
> + * walk_page_range - walk a memory map's page tables with a callback
> + * @mm - memory map to walk
> + * @addr - starting address
> + * @end - ending address
> + * @walk - set of callbacks to invoke for each level of the tree
> + * @private - private data passed to the callback function
> + *
> + * Recursively walk the page table for the memory area in a VMA, calling
> + * a callback for every bottom-level (PTE) page table.
> + */
> +int walk_page_range(struct mm_struct *mm,
> + unsigned long addr, unsigned long end,
> + struct mm_walk *walk, void *private)
> +{
> + pgd_t *pgd;
> + unsigned long next;
> + int err = 0;
> +
> + if (addr >= end)
> + return err;

unlikely?

> +
> + pgd = pgd_offset(mm, addr);
> + do {
> + next = pgd_addr_end(addr, end);
> + if (pgd_none_or_clear_bad(pgd)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, next, private);
> + if (err)

Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Eric W. Biederman
Nick Piggin <[EMAIL PROTECTED]> writes:

>>
>> make_page_uptodate() is the most hideous part I have run into.
>> It has to know details about other layers to know what not
>> to stomp.  I think my incorrect simplification of this is what messed
>> things up, last round.
>
> Not really, it's just named funny. That's just a minor utility
> function that more or less does what it says it should do.
>
> The main problem is really that it's implementing a block device
> whose data comes from its own buffercache :P. I think.

Well to put it another way, make_page_uptodate() is the only
place where we really need to know about the upper layers.
Given that you can kill ramdisks by coding it as:

static void make_page_uptodate(struct page *page)
{
	clear_highpage(page);
	flush_dcache_page(page);
	SetPageUptodate(page);
}

Something is seriously non-intuitive about that function if
you understand the usual rules for how to use the page cache.

The problem is that we support a case in the buffer cache
where pages are partially uptodate and only the buffer_heads
remember which parts are valid.  Assuming we are using them
correctly.

Having to walk through all of the buffer heads in make_page_uptodate
seems to me to be a nasty layering violation in rd.c

>> > I guess it's not nice
>> > for operating on the pagecache from its request_fn, but the
>> > alternative is to duplicate pages for backing store and buffer
>> > cache (actually that might not be a bad alternative really).
>>
>> Cool. Triple buffering :)  Although I guess that would only
>> apply to metadata these days.
>
> Double buffering. You no longer serve data out of your buffer
> cache.  All filesystem data was already double buffered anyway,
> so we'd be just losing out on one layer of savings for metadata.

Yep we are in agreement there.

> I think it's worthwhile, given that we'd have a "real" looking
> block device and minus these bugs.

For testing purposes I think I can agree with that.

>> Having a separate store would 
>> solve some of the problems, and probably remove the need
>> for carefully specifying the ramdisk block size.  We would
>> still need the magic restrictions on page allocations though
>> and we would use them more often as the initial write to the
>> ramdisk would not populate the pages we need.
>
> What magic restrictions on page allocations? Actually we have
> fewer restrictions on page allocations because we can use
> highmem! 

With the proposed rewrite yes.

> And the lowmem buffercache pages that we currently pin
> (unsuccessfully, in the case of this bug) are now completely
> reclaimable. And all your buffer heads are now reclaimable.

Hmm.  Good point.  So in net it should save memory even if
it consumes a little more in the worst case.


> If you mean GFP_NOIO... I don't see any problem. Block device
> drivers have to allocate memory with GFP_NOIO; this may have
> been considered magic or deep badness back when the code was
> written, but it's pretty simple and accepted now.

Well I always figured it was a bit rude allocating large amounts
of memory GFP_NOIO but whatever.

>> A very ugly bit seems to be the fact that we assume we can
>> dereference bh->b_data without any special magic which
>> means the ramdisk must live in low memory on 32bit machines.
>
> Yeah but that's not rd.c. You need to rewrite the buffer layer
> to fix that (see fsblock ;)).

I'm not certain which way we should go.  Take fsblock and run it
in parallel until everything is converted or use fsblock as a
prototype and once we have figured out which way we should go
convert struct buffer_head into struct fsblock one patch at a time.

I'm inclined to think we should evolve the buffer_head.

Eric







Re: What still uses the block layer?

2007-10-15 Thread david

On Mon, 15 Oct 2007, Greg KH wrote:

> On Mon, Oct 15, 2007 at 10:04:01PM -0600, Matthew Wilcox wrote:
>> On Mon, Oct 15, 2007 at 07:54:22PM -0700, [EMAIL PROTECTED] wrote:
>>> do PCI devices reorder their bus numbers spontaneously, or only if you
>>> change the hardware?
>>
>> The only system I've had that reordered PCI bus numbers was when I had a
>> partitionable system and changed the partitioning.  Not quite "change
>> the hardware", but neither was it "spontaneous".  It was certainly
>> unexpected (for me).
>>
>> Greg probably has quite different examples.
>
> Changing the hardware (adding a new PCI device or removing one) is the
> most common time this happens.  But I have seen reports of this
> happening when you upgrade/downgrade BIOS versions, and, in some
> oops-we-messed-up cases, when we changed things in the kernel.

BIOS upgrades qualify as changing hardware (or close to it)

oops-we-messed-up cases of kernel changes don't justify 'best effort'
naming, it's a regression that needs to be fixed.

now the other example given of docking a laptop is closer to reasonable
(and is definitely a reason to have 'best effort' naming as an option),
but that's still a relatively special case, and it _is_ definitely
changing the hardware


David Lang


Re: linux-2.6.23-git3: Many sysfs-related warnings in dmesg

2007-10-15 Thread Greg KH
On Sat, Oct 13, 2007 at 09:26:32PM +0200, Rafael J. Wysocki wrote:
> Hi,
> 
> There are many traces like this in my dmesg from 2.6.23-git3 (they don't
> appear for vanilla 2.6.23):
> 
> <4>sysfs: duplicate filename 'ethxx1' can not be created
> WARNING: at /home/rafael/src/linux-2.6/fs/sysfs/dir.c:425 sysfs_add_one()
> 
> Call Trace:
>  [] sysfs_add_one+0x5c/0xc9
>  [] sysfs_create_link+0xd1/0x12c
>  [] device_rename+0x17a/0x1db
>  [] dev_change_name+0x114/0x20c
>  [] dev_ifsioc+0x204/0x2d0
>  [] dev_ioctl+0x520/0x633
>  [] sk_alloc+0x37/0x10c
>  [] up_read+0x9/0xb
>  [] sock_ioctl+0x1fe/0x20c
>  [] do_ioctl+0x2a/0x77
>  [] vfs_ioctl+0x251/0x26e
>  [] sys_ioctl+0x5f/0x83
>  [] system_call+0x7e/0x83
> 
> net ethxx1: device_rename: sysfs_create_symlink failed (-17)
> sysfs: duplicate filename 'eth1' can not be created
> 
> Everything seems to work, but this just looks fishy.

This is a userspace program renaming your network device to a name that
is already in use.  What distro and release is this?

thanks,

greg k-h


Re: OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 14:38, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Tuesday 16 October 2007 13:55, Eric W. Biederman wrote:

> > I don't follow your logic. We don't need SWAP > RAM in order to swap
> > effectively, IMO.
>
> The steady state of a system that is heavily and usably swapping but
> not thrashing is that all of the pages in RAM are in the swap cache,
> at least that used to be the case.

Yeah, it works better in 2.6 (and, IIRC later 2.4 kernels).


> > I don't know if there is a causal relationship there. I mean, I
> > think it's been a long time since thrashing was ever a viable mode
> > of operation, right?
>
> Right.  But swapping heavily has been a viable mode of operation
> and the vast gap in disk random IO performance seems to have
> hurt significantly.

Or, just not improved as fast as everything else is improving.
There isn't too much the kernel can do about that. It just
relatively changes the point at which you'd consider "swapping
heavily", right?


> To be very clear, it used to be possible to run a problem at a little
> below full speed with the disk pegged with swap traffic, and I did this
> regularly when I started out with linux.

I can do this now. In make -jhuge tests for example, you can get
a 4GB, 4 core machine to max out a disk with swapping and still
have 0 idle time. Of course you can also go past that point and
your idle time comes up. That's not new though.


> > Maybe desktops just have less need for swapping now, so nobody sees
> > it much until something goes _really_ bad. When I'm using my 256MB
> > machine, unused stuff goes to swap.
>
> There is a bit of truth in the fact that there is less need for
> swapping now.  At the same time however swapping simply does not
> work well right now, and I'm not at all certain why.
>
> >> the disk for is very limited.   I wonder if we could figure out
> >> how to push and pull 1M or bigger chunks into and out of swap?
> >
> > Pulling in 1MB pages can really easily end up compounding the
> > thrashing problem unless you're very sure a significant amount
> > of it will be used.
>
> It's a hard call.  The I/O time for 1MB of contiguous disk data
> is about the I/O time of 512 bytes of contiguous disk data.

And if you're thrashing, then by definition you need to throw
out 1MB of your working set in order to read it in.


> >> I don't know if swap has actually worked since vmscan stopped
> >> going over the virtual addresses.
> >
> > I do, and it does ;)
>
> Really?  Not just the pushing of unused stuff into swap.

We had several bugs and things that caused swapping performance
regressions vs 2.4 in earlyish 2.6. After those were fixed, we're
pretty competitive with 2.4 in some basic tests I was using. I
haven't run them for a fair while, so something might have broken
since then, I don't know.


Re: [linux-usb-devel] usb+sysfs: duplicate filename 'bInterfaceNumber'

2007-10-15 Thread Greg KH
On Mon, Oct 15, 2007 at 02:38:25PM -0400, Alan Stern wrote:
> On Mon, 15 Oct 2007, Dave Young wrote:
> 
> > On 10/14/07, Borislav Petkov <[EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > i get the following warning on yesterday's git tree (v2.6.23-2840-g752097c):
> > >
> > > Oct 14 09:07:15 zmei kernel: [   49.368030] sysfs: duplicate filename 'bInterfaceNumber' can not be created
> > > Oct 14 09:07:15 zmei kernel: [   49.368086] WARNING: at fs/sysfs/dir.c:425 sysfs_add_one()
> > > Oct 14 09:07:15 zmei kernel: [   49.368134]  [] show_trace_log_lvl+0x1a/0x2f
> > > Oct 14 09:07:15 zmei kernel: [   49.368220]  [] show_trace+0x12/0x14
> > > Oct 14 09:07:15 zmei kernel: [   49.368300]  [] dump_stack+0x16/0x18
> > > Oct 14 09:07:15 zmei kernel: [   49.368379]  [] sysfs_add_one+0x57/0xbc
> > > Oct 14 09:07:15 zmei kernel: [   49.368461]  [] sysfs_add_file+0x49/0x71
> > > Oct 14 09:07:15 zmei kernel: [   49.368541]  [] sysfs_create_group+0x86/0xe8
> > > Oct 14 09:07:15 zmei kernel: [   49.368621]  [] usb_create_sysfs_intf_files+0x27/0x9b
> > > Oct 14 09:07:15 zmei kernel: [   49.368704]  [] usb_set_configuration+0x454/0x466
> > > Oct 14 09:07:15 zmei kernel: [   49.368787]  [] generic_probe+0x53/0x94
> > > Oct 14 09:07:15 zmei kernel: [   49.368867]  [] usb_probe_device+0x35/0x3b
> > > Oct 14 09:07:15 zmei kernel: [   49.368947]  [] driver_probe_device+0xcb/0x14f
> > > Oct 14 09:07:15 zmei kernel: [   49.369039]  [] __device_attach+0x8/0xa
> > > Oct 14 09:07:15 zmei kernel: [   49.369119]  [] bus_for_each_drv+0x3b/0x63
> > > Oct 14 09:07:15 zmei kernel: [   49.369199]  [] device_attach+0x70/0x85
> > > Oct 14 09:07:15 zmei kernel: [   49.369279]  [] bus_attach_device+0x29/0x77
> > > Oct 14 09:07:15 zmei kernel: [   49.369359]  [] device_add+0x28c/0x445
> > > Oct 14 09:07:15 zmei kernel: [   49.369439]  [] usb_new_device+0x44/0x82
> > > Oct 14 09:07:15 zmei kernel: [   49.369519]  [] hub_thread+0x666/0x9c2
> > > Oct 14 09:07:15 zmei kernel: [   49.369598]  [] kthread+0x3b/0x62
> > > Oct 14 09:07:15 zmei kernel: [   49.369679]  [] kernel_thread_helper+0x7/0x10
> > > Oct 14 09:07:15 zmei kernel: [   49.369759]  ===
> > >
> > > The usb hub in question is named 4-1:1.0 and it has an extension connected to it
> > > which is used to activate the 2 usb connectors at the side of the pc's monitor.
> > > Correct me if i'm wrong but from what i've understood so far from reading the code,
> > > i think, it adds the bInterfaceNumber-file after calling usb_create_sysfs_intf_files(intf).
> > > However, the currently active usbhost interface alternate setting is the only one active
> > > so the bInterfaceNumber exists already and therefore the warning, but this is
> > > just a guess since i'm not that fluent in the usb internals.
> > Hi,
> > I have encountered the same problem which was  reported in
> > http://lkml.org/lkml/2007/9/29/45
> > 
> > For the first one "usbcore duplicated sysfs filename" , I have submitted
> > a patch to fix it.
> > 
> > For the "bInterfaceNumber" one, I have no idea, the same problem still
> > exists in the latest 23-mm1 tree.
> 
> I have tried several times to duplicate this, most recently under 
> 2.6.23-mm1.  But nothing goes wrong; the error messages don't appear.
> 
> You may have to do your own debugging.  Try adding printk statements to
> usb_create_sysfs_intf_files() and usb_remove_sysfs_intf_files() so you
> can tell when they get called.

I finally duplicated this on one of my machines here at boot time, with
USB built into the kernel.  I'll work tomorrow on tracking this down
further...

thanks,

greg k-h


[patch] lockdep: fixup the inode dir annotation

2007-10-15 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > please pull the lockdep tree from:
> > 
> >  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep.git 
> >  v2.6.24-lockdep
> 
> Hmm. I'm now getting
> 
>   WARNING: at kernel/lockdep.c:700 look_up_lock_class()

it triggered here too - the patch from Peter below was tested overnight 
and seems to do the trick for me.

Ingo

->
Subject: lockdep: fixup the inode dir annotation

A slight oversight tripped the lockdep debugging code: each lockdep
class should have but a single init site.

Rearrange the code to make this true.

Signed-off-by: Peter Zijlstra <[EMAIL PROTECTED]>
Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 fs/inode.c |   18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

Index: linux/fs/inode.c
===
--- linux.orig/fs/inode.c
+++ linux/fs/inode.c
@@ -568,16 +568,16 @@ EXPORT_SYMBOL(new_inode);
 void unlock_new_inode(struct inode *inode)
 {
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-   struct file_system_type *type = inode->i_sb->s_type;
-   /*
-* ensure nobody is actually holding i_mutex
-*/
-   mutex_destroy(&inode->i_mutex);
-   mutex_init(&inode->i_mutex);
-   if (inode->i_mode & S_IFDIR)
+   if (inode->i_mode & S_IFDIR) {
+   struct file_system_type *type = inode->i_sb->s_type;
+
+   /*
+* ensure nobody is actually holding i_mutex
+*/
+   mutex_destroy(&inode->i_mutex);
+   mutex_init(&inode->i_mutex);
lockdep_set_class(&inode->i_mutex, &type->i_mutex_dir_key);
-   else
-   lockdep_set_class(&inode->i_mutex, &type->i_mutex_key);
+   }
 #endif
/*
 * This is special!  We do not need the spinlock


Re: OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread Eric W. Biederman
[EMAIL PROTECTED] writes:

>
> on some kernel versions you are correct about needing swap > ram, but on current
> versions you are not. the swap space gets allocated as needed, and re-used as
> needed (I don't know the mechanism of this, but I remember the last time this
> changed from vm=max(ram,swap) to vm=ram+swap)

I don't think I can recall a linux kernel that required swap > ram.
However for serious swapping under linux having swap > ram was very
useful and pretty much a requirement for a workload that involved
swapping heavily (not thrashing).

>> I have not heard of many people swapping and not thrashing lately.
>> I think part of the problem is that we do random access to the swap
>> partition which makes us seek limited.  And since the number of
>> seeks per unit time has been increasing at a linear or slower rate
>> that if we are doing random disk I/O then the amount we can use
>> the disk for is very limited.   I wonder if we could figure out
>> how to push and pull 1M or bigger chunks into and out of swap?
>
> it has been noted by many people that linux is very slow to pull things back
> into ram from swap, significantly slower than simple seek limiting would seem
> to account for.

Yes.  It may be the large amount of random access (my current guess)
or it may be something else.

I'm wondering if I should build an application with a configurable
data set and working set that can be used for swap testing.  I don't
think it would be very hard and it might help sort through some of
the swap performance problems.
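
A minimal sketch of what the core of such a test program might look like.
This is only an assumption about its shape: the function name
swap_exercise() and its parameters are invented here, and a real test would
also need timing and a less regular access pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Allocate `data_bytes` and write to every page once (the data set),
 * then write to the first `working_bytes` of it (the working set) for
 * `passes` rounds.  Returns the number of page touches performed, or 0
 * on bad arguments or allocation failure.
 */
unsigned long swap_exercise(size_t data_bytes, size_t working_bytes,
                            unsigned int passes)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned char *data;
    unsigned long touches = 0;
    unsigned int i;
    size_t off;

    if (page <= 0 || data_bytes == 0 || working_bytes > data_bytes)
        return 0;

    data = malloc(data_bytes);
    if (!data)
        return 0;

    /* Populate the whole data set so every page has to live somewhere. */
    for (off = 0; off < data_bytes; off += (size_t)page) {
        data[off] = 1;
        touches++;
    }

    /* Keep only the working set hot; the rest becomes a swap candidate. */
    for (i = 0; i < passes; i++) {
        for (off = 0; off < working_bytes; off += (size_t)page) {
            data[off]++;
            touches++;
        }
    }

    free((void *)data);
    return touches;
}
```

Sizing the data set above RAM and the working set just below it would give
the "swapping heavily but not thrashing" case; pushing the working set past
RAM would give the thrashing case.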

Eric





Re: OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread Eric W. Biederman
Nick Piggin <[EMAIL PROTECTED]> writes:

> On Tuesday 16 October 2007 13:55, Eric W. Biederman wrote:
>> Nick Piggin <[EMAIL PROTECTED]> writes:
>
>> > How much swap do you have configured? You really shouldn't configure
>> > so much unless you do want the kernel to actually use it all, right?
>>
>> No.
>>
>> There are three basic swapping scenarios.
>> - Pushing unused data out of ram
>> - Swapping
>> - Thrashing
>>
>> To effectively swap you need SWAP > RAM because after a little while of
>> swapping all of your pages in RAM should be assigned a location in the
>> page cache.
>
> I don't follow your logic. We don't need SWAP > RAM in order to swap
> effectively, IMO.

The steady state of a system that is heavily and usably swapping but
not thrashing is that all of the pages in RAM are in the swap cache,
at least that used to be the case.

>> I have not heard of many people swapping and not thrashing lately.
>> I think part of the problem is that we do random access to the swap
>> partition which makes us seek limited.  And since the number of
>> seeks per unit time has been increasing at a linear or slower rate
>> that if we are doing random disk I/O then the amount we can use
>
> I don't know if there is a causal relationship there. I mean, I
> think it's been a long time since thrashing was ever a viable mode
> of operation, right?

Right.  But swapping heavily has been a viable mode of operation
and the vast gap in disk random IO performance seems to have
hurt significantly.

To be very clear, it used to be possible to run a problem at a little
below full speed with the disk pegged with swap traffic, and I did this
regularly when I started out with linux.

> Maybe desktops just have less need for swapping now, so nobody sees
> it much until something goes _really_ bad. When I'm using my 256MB
> machine, unused stuff goes to swap.

There is a bit of truth in the fact that there is less need for
swapping now.  At the same time however swapping simply does not
work well right now, and I'm not at all certain why.

>> the disk for is very limited.   I wonder if we could figure out
>> how to push and pull 1M or bigger chunks into and out of swap?
>
> Pulling in 1MB pages can really easily end up compounding the
> thrashing problem unless you're very sure a significant amount
> of it will be used.

It's a hard call.  The I/O time for 1MB of contiguous disk data
is about the I/O time of 512 bytes of contiguous disk data.
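
A rough back-of-the-envelope sketch of that claim follows. The drive parameters (8 ms seek, 4 ms rotational latency, 60 MB/s sequential transfer) are assumptions picked for illustration, not measurements, and io_time_ms is a hypothetical helper name:

```python
def io_time_ms(size_bytes, seek_ms=8.0, rotational_ms=4.0,
               transfer_mb_per_s=60.0):
    """Estimate one contiguous read: positioning time plus transfer time.

    All default parameters are assumed values for a rotating disk.
    """
    transfer_ms = size_bytes / (transfer_mb_per_s * 1_000_000) * 1000.0
    return seek_ms + rotational_ms + transfer_ms


small = io_time_ms(512)            # one sector
large = io_time_ms(1024 * 1024)    # one 1MB chunk

print(f"512 bytes: {small:.2f} ms")
print(f"1MB:       {large:.2f} ms")
print(f"ratio:     {large / small:.2f}x the time for 2048x the data")
```

Under these assumptions the 1MB read costs only a small multiple of the 512-byte read, because positioning dominates both; whether the extra data is actually useful is the separate question raised above.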

>> I don't know if swap has actually worked since we vmscan stopped
>> going over the virtual addresses.
>
> I do, and it does ;)

Really?  Not just the pushing of unused stuff into swap.


>> > Because if we're not really conservative about OOM killing, then the
>> > user who actually really did want to use all the swap they configured
>> > gets angry when we kill their jobs without using it all.
>>
>> I totally agree. The fact that the OOM killer started is a sign that
>> the system was completely overwhelmed and nothing better could happen.
>>
>> In this case my gut feel says limiting the total number of processes
>> would have been much more effective then anything at all to do with
>> swap. make -j reminds me of the classic fork bomb.
>
> Yep.
>
>
>> > Would an oom-kill-someone-now sysrq be of help, I wonder?
>>
>> Well we have SAQ which should kill everything on your current VT
>> which should include X and all of it's children.
>
> Which is exactly what you don't want to do if you've just forkbombed
> yourself. I missed the fact that we now have a manual oom kill...

You probably have a point there.

Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Allow kconfig to accept overrides

2007-10-15 Thread Sam Ravnborg
Hi Rob & Jan.

On Fri, Oct 12, 2007 at 11:44:08PM +0200, Jan Engelhardt wrote:
> 
> Allow config variables in .config to override earlier ones in the same
> file. In other words,
> 
>   # CONFIG_SECURITY is not defined
>   CONFIG_SECURITY=y
> 
> will activate it. This makes it a bit easier to do
> 
>   (cat original-config myconfig myconfig2 ... >.config)
> 
> and run menuconfig as expected.
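
As a rough illustration of the override behaviour described in the quoted patch (not the kconfig implementation itself, which is C code under scripts/kconfig), a last-assignment-wins merge could be sketched in Python like this. The function name merge_config is hypothetical, and the parser accepts both the usual "is not set" comment form and the "is not defined" wording used above:

```python
import re

# "CONFIG_FOO=value" assignments.
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
# "# CONFIG_FOO is not set" markers (also accepting "is not defined").
_UNSET = re.compile(r"^# (CONFIG_\w+) is not (?:set|defined)$")


def merge_config(lines):
    """Return {symbol: value}; a later line overrides an earlier one.

    A symbol that is explicitly unset maps to None.
    """
    values = {}
    for raw in lines:
        line = raw.strip()
        match = _SET.match(line)
        if match:
            values[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            values[match.group(1)] = None
    return values


fragments = [
    "# CONFIG_SECURITY is not defined",
    "CONFIG_SECURITY=y",
]
print(merge_config(fragments))  # {'CONFIG_SECURITY': 'y'}
```

Concatenating several fragments and feeding them through such a merge gives the same result as the `cat ... >.config` recipe would under the proposed semantics.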


How far is this from the miniconfig functionality?
Is it the same or can we achieve the miniconfig support
by extending Jan's patch?

See: http://lkml.org/lkml/2007/10/12/391

Sam



Re: What still uses the block layer?

2007-10-15 Thread Greg KH
On Mon, Oct 15, 2007 at 10:04:01PM -0600, Matthew Wilcox wrote:
> On Mon, Oct 15, 2007 at 07:54:22PM -0700, [EMAIL PROTECTED] wrote:
> > do PCI devices reorder their bus numbers spontaniously, or only if you 
> > change the hardware?
> 
> The only system I've had that reordered PCI bus numbers was when I had a
> partitionable system and changed the partitioning.  Not quite "change
> the hardware", but neither was it "spontaneous".  It was certainly
> unexpected (for me).
> 
> Greg probably has quite different examples.

Changing the hardware (adding a new PCI device or removing one) is the
most common time this happens.  But I have seen reports of this
happening when you upgrade/downgrade BIOS versions, and, in some
oops-we-messed-up cases, when we changed things in the kernel.

thanks,

greg k-h


Re: OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 13:55, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:

> > How much swap do you have configured? You really shouldn't configure
> > so much unless you do want the kernel to actually use it all, right?
>
> No.
>
> There are three basic swapping scenarios.
> - Pushing unused data out of ram
> - Swapping
> - Thrashing
>
> To effectively swap you need SWAP > RAM because after a little while of
> swapping all of your pages in RAM should be assigned a location in the
> page cache.

I don't follow your logic. We don't need SWAP > RAM in order to swap
effectively, IMO.


> I have not heard of many people swapping and not thrashing lately.
> I think part of the problem is that we do random access to the swap
> partition which makes us seek limited.  And since the number of
> seeks per unit time has been increasing at a linear or slower rate
> that if we are doing random disk I/O then the amount we can use

I don't know if there is a causal relationship there. I mean, I
think it's been a long time since thrashing was ever a viable mode
of operation, right?

Maybe desktops just have less need for swapping now, so nobody sees
it much until something goes _really_ bad. When I'm using my 256MB
machine, unused stuff goes to swap.


> the disk for is very limited.   I wonder if we could figure out
> how to push and pull 1M or bigger chunks into and out of swap?

Pulling in 1MB pages can really easily end up compounding the
thrashing problem unless you're very sure a significant amount
of it will be used.


> I don't know if swap has actually worked since we vmscan stopped
> going over the virtual addresses.

I do, and it does ;)


> > Because if we're not really conservative about OOM killing, then the
> > user who actually really did want to use all the swap they configured
> > gets angry when we kill their jobs without using it all.
>
> I totally agree. The fact that the OOM killer started is a sign that
> the system was completely overwhelmed and nothing better could happen.
>
> In this case my gut feel says limiting the total number of processes
> would have been much more effective then anything at all to do with
> swap. make -j reminds me of the classic fork bomb.

Yep.


> > Would an oom-kill-someone-now sysrq be of help, I wonder?
>
> Well we have SAQ which should kill everything on your current VT
> which should include X and all of it's children.

Which is exactly what you don't want to do if you've just forkbombed
yourself. I missed the fact that we now have a manual oom kill...


Re: [PATCH] Make m68k cross compile like every other architecture.

2007-10-15 Thread Sam Ravnborg
On Mon, Oct 15, 2007 at 07:31:54PM -0500, Rob Landley wrote:
> On Monday 15 October 2007 3:25:35 pm Geert Uytterhoeven wrote:
> > 64-bit parisc tests if /usr/bin/hppa64-linux-gnu- exists.
> > If yes, it sets CROSS_COMPILE to hppa64-linux-gnu-.
> > If no, it sets CROSS_COMPILE to hppa64-linux-
> >
> > 32-bit parisc unconditionally sets CROSS_COMPILE to hppa-linux-.
> >
> > This still breaks Rob's setup if his compiler is called differently.
> 
> Another thing to take into account is that kconfig was recently changed to 
> save ARCH and CROSS_COMPILE in the .config file:
> 
> http://lwn.net/Articles/253889/
The patch is postponed by one merge window.
It caused trouble I had not foreseen, which needs some attention first.
I plan to have it ready for the next merge window.

Sam


Re: What still uses the block layer?

2007-10-15 Thread Arjan van de Ven
On Mon, 15 Oct 2007 22:04:01 -0600
Matthew Wilcox <[EMAIL PROTECTED]> wrote:

> On Mon, Oct 15, 2007 at 07:54:22PM -0700, [EMAIL PROTECTED] wrote:
> > do PCI devices reorder their bus numbers spontaniously, or only if
> > you change the hardware?
> 
> The only system I've had that reordered PCI bus numbers was when I
> had a partitionable system and changed the partitioning.  Not quite
> "change the hardware", but neither was it "spontaneous".  It was
> certainly unexpected (for me).
> 

a very common one is booting your laptop docked (a real dock, not just
a port extender) versus non-docked


Re: What still uses the block layer?

2007-10-15 Thread david

On Mon, 15 Oct 2007, Matthew Wilcox wrote:


> On Mon, Oct 15, 2007 at 07:54:22PM -0700, [EMAIL PROTECTED] wrote:
>> do PCI devices reorder their bus numbers spontaniously, or only if you
>> change the hardware?
>
> The only system I've had that reordered PCI bus numbers was when I had a
> partitionable system and changed the partitioning.  Not quite "change
> the hardware", but neither was it "spontaneous".  It was certainly
> unexpected (for me).

Ok, I would class that as the equivalent of 'changing the hardware'.

> Greg probably has quite different examples.

I would definitely be interested in hearing some of them. Greg's comment
makes it sound like this is something that (with modern hardware) could
happen to anyone at any time (which, if true, would be sufficient to
require 'best effort' naming of devices for everything), while my
experience is that if the hardware is static (i.e. you don't plug in or
unplug PCI devices) the numbering of existing PCI devices and buses is
static. And while I understand that consumer distros want to have
everything 'best effort' named to make it easier for users, I disagree
that this should force everyone to use 'best effort' when there are many
situations where it's unnecessary overhead and chances for errors.


David Lang


Re: OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread david

On Mon, 15 Oct 2007, Eric W. Biederman wrote:


> Nick Piggin <[EMAIL PROTECTED]> writes:
>
>> How much swap do you have configured? You really shouldn't configure
>> so much unless you do want the kernel to actually use it all, right?
>
> No.
>
> There are three basic swapping scenarios.
> - Pushing unused data out of ram
> - Swapping
> - Thrashing
>
> To effectively swap you need SWAP > RAM because after a little while of
> swapping all of your pages in RAM should be assigned a location in the
> page cache.

on some kernel versions you are correct about needing swap > ram, but on
current versions you are not. the swap space gets allocated as needed, and
re-used as needed (I don't know the mechanism of this, but I remember the
last time this changed from vm=max(ram,swap) to vm=ram+swap)
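
To make the two accounting rules mentioned above concrete, here is a small sketch. The function name and the example sizes are made up for illustration; it only restates the arithmetic and says nothing about which kernel versions used which rule:

```python
def vm_capacity_mb(ram_mb, swap_mb, model):
    """Total virtual memory under the two accounting rules named above.

    model "max": vm = max(ram, swap)  (swap must exceed RAM to add anything)
    model "sum": vm = ram + swap      (every megabyte of swap adds capacity)
    """
    if model == "max":
        return max(ram_mb, swap_mb)
    if model == "sum":
        return ram_mb + swap_mb
    raise ValueError(f"unknown model: {model!r}")


# Assumed example machine: 2048 MB of RAM with 1024 MB of swap.
print(vm_capacity_mb(2048, 1024, "max"))  # 2048
print(vm_capacity_mb(2048, 1024, "sum"))  # 3072
```

Under the "max" rule the 1024 MB of swap in this example buys nothing, which is where the old "swap > ram" advice comes from; under the "sum" rule it adds its full size.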



> I have not heard of many people swapping and not thrashing lately.
> I think part of the problem is that we do random access to the swap
> partition which makes us seek limited.  And since the number of
> seeks per unit time has been increasing at a linear or slower rate
> that if we are doing random disk I/O then the amount we can use
> the disk for is very limited.   I wonder if we could figure out
> how to push and pull 1M or bigger chunks into and out of swap?

it has been noted by many people that linux is very slow to pull things
back into ram from swap, significantly slower than simple seek limiting
would seem to account for.


David Lang


Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 13:14, Eric W. Biederman wrote:
> Nick Piggin <[EMAIL PROTECTED]> writes:
> > On Monday 15 October 2007 19:16, Andrew Morton wrote:
> >> On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
> >> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> >> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> >> > > unmaintained, so decided to sent the patch to you :-).
> >> > > I have CCed Ted, who did work on the code in the 90s. I found no
> >> > > current email address of Chad Page.
> >> >
> >> > This really needs to be fixed...
> >>
> >> rd.c is fairly mind-boggling vfs abuse.
> >
> > Why do you say that? I guess it is _different_, by necessity(?)
> > Is there anything that is really bad?
>
> make_page_uptodate() is the most hideous part I have run into.
> It has to know details about other layers to know what not
> to stomp.  I think my incorrect simplification of this is what messed
> things up, last round.

Not really, it's just named funny. That's just a minor utility
function that more or less does what it says it should do.

The main problem is really that it's implementing a block device
whose data comes from its own buffercache :P. I think.


> > I guess it's not nice
> > for operating on the pagecache from its request_fn, but the
> > alternative is to duplicate pages for backing store and buffer
> > cache (actually that might not be a bad alternative really).
>
> Cool. Triple buffering :)  Although I guess that would only
> apply to metadata these days.

Double buffering. You no longer serve data out of your buffer
cache. All filesystem data was already double buffered anyway,
so we'd be just losing out on one layer of savings for metadata.
I think it's worthwhile, given that we'd have a "real" looking
block device and minus these bugs.


> Having a separate store would
> solve some of the problems, and probably remove the need
> for carefully specifying the ramdisk block size.  We would
> still need the magic restrictions on page allocations though
> and we would use them more often as the initial write to the
> ramdisk would not populate the pages we need.

What magic restrictions on page allocations? Actually we have
fewer restrictions on page allocations because we can use
highmem! And the lowmem buffercache pages that we currently pin
(unsuccessfully, in the case of this bug) are now completely
reclaimable. And all your buffer heads are now reclaimable.

If you mean GFP_NOIO... I don't see any problem. Block device
drivers have to allocate memory with GFP_NOIO; this may have
been considered magic or deep badness back when the code was
written, but it's pretty simple and accepted now.


> A very ugly bit seems to be the fact that we assume we can
> dereference bh->b_data without any special magic which
> means the ramdisk must live in low memory on 32bit machines.

Yeah but that's not rd.c. You need to rewrite the buffer layer
to fix that (see fsblock ;)).


Re: What still uses the block layer?

2007-10-15 Thread Matthew Wilcox
On Mon, Oct 15, 2007 at 07:54:22PM -0700, [EMAIL PROTECTED] wrote:
> do PCI devices reorder their bus numbers spontaniously, or only if you 
> change the hardware?

The only system I've had that reordered PCI bus numbers was when I had a
partitionable system and changed the partitioning.  Not quite "change
the hardware", but neither was it "spontaneous".  It was certainly
unexpected (for me).

Greg probably has quite different examples.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


Re: OOM killer gripe (was Re: What still uses the block layer?)

2007-10-15 Thread Eric W. Biederman
Nick Piggin <[EMAIL PROTECTED]> writes:

> On Monday 15 October 2007 18:04, Rob Landley wrote:
>> On Sunday 14 October 2007 8:45:03 pm Theodore Tso wrote:
>
>> > > excuse for conflating different categories of devices in the first
>> > > place.
>> >
>> > See the thinkpad Ultrabay drive example above.
>>
>> Last week I drove my laptop so deep into swap (with a "make -j" on qemu)
>> that after half an hour trying to repaint my kmail window, it locked solid.
>> Again.  You'd think the oom killer would come to the rescue, but it didn't.
>> Maybe Ubuntu disabled it.  I have _2_gigs_ of ram in this sucker, on a
>> stock Ubuntu 7.04 install (with the "upgrade all" tab pressed a few times),
>> and yet I managed to make it swap itself to death one more time.
>>
>> Virtual memory isn't perfect.  I've _always_ been able to come up with
>> examples where it just doesn't work for me.  This doesn't mean VM
>> overcommit should be abolished, because it's useful more often than not.
>
> I hate to go completely offtopic here, but disks are so incredibly
> slow when compared to RAM that there is really nothing the kernel
> can do about this. Presumably the job will finish, given infinite
> time.
>
> How much swap do you have configured? You really shouldn't configure
> so much unless you do want the kernel to actually use it all, right?

No.

There are three basic swapping scenarios.
- Pushing unused data out of ram
- Swapping 
- Thrashing

To effectively swap you need SWAP > RAM because after a little while of
swapping all of your pages in RAM should be assigned a location in the
page cache.

I have not heard of many people swapping and not thrashing lately.
I think part of the problem is that we do random access to the swap
partition, which makes us seek limited.  And since the number of
seeks per unit time has been increasing at a linear or slower rate,
if we are doing random disk I/O then the amount we can use
the disk for is very limited.  I wonder if we could figure out
how to push and pull 1M or bigger chunks into and out of swap?

I don't know if swap has actually worked since vmscan stopped
going over the virtual addresses.

> Because if we're not really conservative about OOM killing, then the
> user who actually really did want to use all the swap they configured
> gets angry when we kill their jobs without using it all.

I totally agree. The fact that the OOM killer started is a sign that
the system was completely overwhelmed and nothing better could happen.

In this case my gut feel says limiting the total number of processes
would have been much more effective than anything at all to do with
swap. make -j reminds me of the classic fork bomb.

> Would an oom-kill-someone-now sysrq be of help, I wonder?

Well we have SAQ which should kill everything on your current VT
which should include X and all of its children.

Eric


[PATCH] powerpc64 vDSO: linker script indentation

2007-10-15 Thread Roland McGrath

This cleans up the formatting in the vDSO linker script, mostly just the
use of whitespace.  It's intended to approximate the kernel standard
conventions for indenting C, treating elements of the linker script about
like initialized variable definitions.

Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
CC: Sam Ravnborg <[EMAIL PROTECTED]>
---
 arch/powerpc/kernel/vdso64/vdso64.lds.S |  225 +--
 1 files changed, 122 insertions(+), 103 deletions(-)

diff --git a/arch/powerpc/kernel/vdso64/vdso64.lds.S b/arch/powerpc/kernel/vdso64/vdso64.lds.S
index 2d70f35..932b3fd 100644
--- a/arch/powerpc/kernel/vdso64/vdso64.lds.S
+++ b/arch/powerpc/kernel/vdso64/vdso64.lds.S
@@ -10,100 +10,114 @@ ENTRY(_start)
 
 SECTIONS
 {
-  . = VDSO64_LBASE + SIZEOF_HEADERS;
-  .hash   : { *(.hash) }   :text
-  .gnu.hash   : { *(.gnu.hash) }
-  .dynsym : { *(.dynsym) }
-  .dynstr : { *(.dynstr) }
-  .gnu.version: { *(.gnu.version) }
-  .gnu.version_d  : { *(.gnu.version_d) }
-  .gnu.version_r  : { *(.gnu.version_r) }
-
-  .note  : { *(.note.*) }  :text   :note
-
-  . = ALIGN (16);
-  .text   :
-  {
-*(.text .stub .text.* .gnu.linkonce.t.*)
-*(.sfpr .glink)
-  }:text
-  PROVIDE (__etext = .);
-  PROVIDE (_etext = .);
-  PROVIDE (etext = .);
-
-  . = ALIGN(8);
-  __ftr_fixup : {
-*(__ftr_fixup)
-  }
-
-  . = ALIGN(8);
-  __fw_ftr_fixup : {
-*(__fw_ftr_fixup)
-  }
-
-  /* Other stuff is appended to the text segment: */
-  .rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
-  .rodata1: { *(.rodata1) }
-  .eh_frame_hdr   : { *(.eh_frame_hdr) }   :text   :eh_frame_hdr
-  .eh_frame   : { KEEP (*(.eh_frame)) }:text
-  .gcc_except_table   : { *(.gcc_except_table) }
-
-  .opd   ALIGN(8) : { KEEP (*(.opd)) }
-  .got  ALIGN(8) : { *(.got .toc) }
-  .rela.dyn ALIGN(8) : { *(.rela.dyn) }
-
-  .dynamic: { *(.dynamic) }:text   :dynamic
-
-  _end = .;
-  PROVIDE (end = .);
-
-  /* Stabs debugging sections are here too
-   */
-  .stab  0 : { *(.stab) }
-  .stabstr   0 : { *(.stabstr) }
-  .stab.excl 0 : { *(.stab.excl) }
-  .stab.exclstr  0 : { *(.stab.exclstr) }
-  .stab.index0 : { *(.stab.index) }
-  .stab.indexstr 0 : { *(.stab.indexstr) }
-  .comment   0 : { *(.comment) }
-  /* DWARF debug sections.
- Symbols in the DWARF debugging sections are relative to the beginning
- of the section so we begin them at 0.  */
-  /* DWARF 1 */
-  .debug  0 : { *(.debug) }
-  .line   0 : { *(.line) }
-  /* GNU DWARF 1 extensions */
-  .debug_srcinfo  0 : { *(.debug_srcinfo) }
-  .debug_sfnames  0 : { *(.debug_sfnames) }
-  /* DWARF 1.1 and DWARF 2 */
-  .debug_aranges  0 : { *(.debug_aranges) }
-  .debug_pubnames 0 : { *(.debug_pubnames) }
-  /* DWARF 2 */
-  .debug_info 0 : { *(.debug_info .gnu.linkonce.wi.*) }
-  .debug_abbrev   0 : { *(.debug_abbrev) }
-  .debug_line 0 : { *(.debug_line) }
-  .debug_frame0 : { *(.debug_frame) }
-  .debug_str  0 : { *(.debug_str) }
-  .debug_loc  0 : { *(.debug_loc) }
-  .debug_macinfo  0 : { *(.debug_macinfo) }
-  /* SGI/MIPS DWARF 2 extensions */
-  .debug_weaknames 0 : { *(.debug_weaknames) }
-  .debug_funcnames 0 : { *(.debug_funcnames) }
-  .debug_typenames 0 : { *(.debug_typenames) }
-  .debug_varnames  0 : { *(.debug_varnames) }
-
-  /DISCARD/ : { *(.note.GNU-stack) }
-  /DISCARD/ : { *(.branch_lt) }
-  /DISCARD/ : { *(.data .data.* .gnu.linkonce.d.*) }
-  /DISCARD/ : { *(.bss .sbss .dynbss .dynsbss) }
+   . = VDSO64_LBASE + SIZEOF_HEADERS;
+
+   .hash   : { *(.hash) }  :text
+   .gnu.hash   : { *(.gnu.hash) }
+   .dynsym : { *(.dynsym) }
+   .dynstr : { *(.dynstr) }
+   .gnu.version: { *(.gnu.version) }
+   .gnu.version_d  : { *(.gnu.version_d) }
+   .gnu.version_r  : { *(.gnu.version_r) }
+
+   .note   : { *(.note.*) }:text   :note
+
+   . = ALIGN(16);
+   .text   : {
+   *(.text .stub .text.* .gnu.linkonce.t.*)
+   *(.sfpr .glink)
+   }   :text
+   PROVIDE(__etext = .);
+   PROVIDE(_etext = .);
+   PROVIDE(etext = .);
+
+   . = ALIGN(8);
+   __ftr_fixup : { *(__ftr_fixup) }
+
+   . = ALIGN(8);
+   __fw_ftr_fixup  : { *(__fw_ftr_fixup) }
+
+   /*
+* Other stuff is appended to the text segment:
+*/
+   .rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
+   .rodata1: { *(.rodata1) }
+
+   .eh_frame_hdr   : { *(.eh_frame_hdr) }  :text   :eh_frame_hdr
+   .eh_frame   : { KEEP (*(.eh_frame)) }   :text
+   .gcc_except_table : { *(.gcc_except_table) }
+
+   .opd ALIGN(8)   : { KEEP (*(.opd)) }
+   .got ALIGN(8)   : 

[PATCH] powerpc32 vDSO: linker script indentation

2007-10-15 Thread Roland McGrath

This cleans up the formatting in the vDSO linker script, mostly just the
use of whitespace.  It's intended to approximate the kernel standard
conventions for indenting C, treating elements of the linker script about
like initialized variable definitions.

Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
CC: Sam Ravnborg <[EMAIL PROTECTED]>
---
 arch/powerpc/kernel/vdso32/vdso32.lds.S |  219 +--
 1 files changed, 118 insertions(+), 101 deletions(-)

diff --git a/arch/powerpc/kernel/vdso32/vdso32.lds.S b/arch/powerpc/kernel/vdso32/vdso32.lds.S
index 26e138c..9352ab5 100644
--- a/arch/powerpc/kernel/vdso32/vdso32.lds.S
+++ b/arch/powerpc/kernel/vdso32/vdso32.lds.S
@@ -1,130 +1,147 @@
-
 /*
  * This is the infamous ld script for the 32 bits vdso
  * library
  */
 #include 
 
-/* Default link addresses for the vDSOs */
 OUTPUT_FORMAT("elf32-powerpc", "elf32-powerpc", "elf32-powerpc")
 OUTPUT_ARCH(powerpc:common)
 ENTRY(_start)
 
 SECTIONS
 {
-  . = VDSO32_LBASE + SIZEOF_HEADERS;
-  .hash   : { *(.hash) }   :text
-  .gnu.hash   : { *(.gnu.hash) }
-  .dynsym : { *(.dynsym) }
-  .dynstr : { *(.dynstr) }
-  .gnu.version: { *(.gnu.version) }
-  .gnu.version_d  : { *(.gnu.version_d) }
-  .gnu.version_r  : { *(.gnu.version_r) }
-
-  .note  : { *(.note.*) }  :text   :note
-
-  . = ALIGN (16);
-  .text :
-  {
-*(.text .stub .text.* .gnu.linkonce.t.*)
-  }
-  PROVIDE (__etext = .);
-  PROVIDE (_etext = .);
-  PROVIDE (etext = .);
-
-  . = ALIGN(8);
-  __ftr_fixup : {
-*(__ftr_fixup)
-  }
+   . = VDSO32_LBASE + SIZEOF_HEADERS;
+
+   .hash   : { *(.hash) }  :text
+   .gnu.hash   : { *(.gnu.hash) }
+   .dynsym : { *(.dynsym) }
+   .dynstr : { *(.dynstr) }
+   .gnu.version: { *(.gnu.version) }
+   .gnu.version_d  : { *(.gnu.version_d) }
+   .gnu.version_r  : { *(.gnu.version_r) }
+
+   .note   : { *(.note.*) }:text   :note
+
+   . = ALIGN(16);
+   .text   : {
+   *(.text .stub .text.* .gnu.linkonce.t.*)
+   }
+   PROVIDE(__etext = .);
+   PROVIDE(_etext = .);
+   PROVIDE(etext = .);
+
+   . = ALIGN(8);
+   __ftr_fixup : { *(__ftr_fixup) }
 
 #ifdef CONFIG_PPC64
-  . = ALIGN(8);
-  __fw_ftr_fixup : {
-*(__fw_ftr_fixup)
-  }
+   . = ALIGN(8);
+   __fw_ftr_fixup  : { *(__fw_ftr_fixup) }
 #endif
 
-  /* Other stuff is appended to the text segment: */
-  .rodata  : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
-  .rodata1 : { *(.rodata1) }
-
-  .eh_frame_hdr: { *(.eh_frame_hdr) }  :text   :eh_frame_hdr
-  .eh_frame: { KEEP (*(.eh_frame)) }   :text
-  .gcc_except_table: { *(.gcc_except_table) }
-  .fixup   : { *(.fixup) }
-
-  .dynamic : { *(.dynamic) }   :text   :dynamic
-  .got : { *(.got) }
-  .plt : { *(.plt) }
-
-  _end = .;
-  __end = .;
-  PROVIDE (end = .);
-
-
-  /* Stabs debugging sections are here too
-   */
-  .stab 0 : { *(.stab) }
-  .stabstr 0 : { *(.stabstr) }
-  .stab.excl 0 : { *(.stab.excl) }
-  .stab.exclstr 0 : { *(.stab.exclstr) }
-  .stab.index 0 : { *(.stab.index) }
-  .stab.indexstr 0 : { *(.stab.indexstr) }
-  .comment 0 : { *(.comment) }
-  .debug 0 : { *(.debug) }
-  .line 0 : { *(.line) }
-
-  .debug_srcinfo 0 : { *(.debug_srcinfo) }
-  .debug_sfnames 0 : { *(.debug_sfnames) }
-
-  .debug_aranges 0 : { *(.debug_aranges) }
-  .debug_pubnames 0 : { *(.debug_pubnames) }
-
-  .debug_info 0 : { *(.debug_info .gnu.linkonce.wi.*) }
-  .debug_abbrev 0 : { *(.debug_abbrev) }
-  .debug_line 0 : { *(.debug_line) }
-  .debug_frame 0 : { *(.debug_frame) }
-  .debug_str 0 : { *(.debug_str) }
-  .debug_loc 0 : { *(.debug_loc) }
-  .debug_macinfo 0 : { *(.debug_macinfo) }
-
-  .debug_weaknames 0 : { *(.debug_weaknames) }
-  .debug_funcnames 0 : { *(.debug_funcnames) }
-  .debug_typenames 0 : { *(.debug_typenames) }
-  .debug_varnames 0 : { *(.debug_varnames) }
-
-  /DISCARD/ : { *(.note.GNU-stack) }
-  /DISCARD/ : { *(.data .data.* .gnu.linkonce.d.* .sdata*) }
-  /DISCARD/ : { *(.bss .sbss .dynbss .dynsbss) }
+   /*
+* Other stuff is appended to the text segment:
+*/
+   .rodata : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
+   .rodata1: { *(.rodata1) }
+
+   .eh_frame_hdr   : { *(.eh_frame_hdr) }  :text   :eh_frame_hdr
+   .eh_frame   : { KEEP (*(.eh_frame)) }   :text
+   .gcc_except_table : { *(.gcc_except_table) }
+   .fixup  : { *(.fixup) }
+
+   .dynamic: { *(.dynamic) }   :text   :dynamic
+   .got: { *(.got) }
+   .plt: { *(.plt) }
+
+   _end = .;
+   __end = .;
+   PROVIDE(end = .);
+
+   /*
+* Stabs debugging sections are here too.
+*/
+   

[PATCH] SH vDSO: linker script indentation

2007-10-15 Thread Roland McGrath

This cleans up the formatting in the vDSO linker script, mostly just the
use of whitespace.  It's intended to approximate the kernel standard
conventions for indenting C, treating elements of the linker script about
like initialized variable definitions.

Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
CC: Sam Ravnborg <[EMAIL PROTECTED]>
---
 arch/sh/kernel/vsyscall/vsyscall.lds.S |   77 +--
 1 files changed, 42 insertions(+), 35 deletions(-)

diff --git a/arch/sh/kernel/vsyscall/vsyscall.lds.S b/arch/sh/kernel/vsyscall/vsyscall.lds.S
index b13c3d4..c9bf2af 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.lds.S
+++ b/arch/sh/kernel/vsyscall/vsyscall.lds.S
@@ -17,45 +17,52 @@ ENTRY(__kernel_vsyscall);
 
 SECTIONS
 {
-  . = SIZEOF_HEADERS;
+   . = SIZEOF_HEADERS;
 
-  .hash   : { *(.hash) }   :text
-  .gnu.hash   : { *(.gnu.hash) }
-  .dynsym : { *(.dynsym) }
-  .dynstr : { *(.dynstr) }
-  .gnu.version: { *(.gnu.version) }
-  .gnu.version_d  : { *(.gnu.version_d) }
-  .gnu.version_r  : { *(.gnu.version_r) }
+   .hash   : { *(.hash) }  :text
+   .gnu.hash   : { *(.gnu.hash) }
+   .dynsym : { *(.dynsym) }
+   .dynstr : { *(.dynstr) }
+   .gnu.version: { *(.gnu.version) }
+   .gnu.version_d  : { *(.gnu.version_d) }
+   .gnu.version_r  : { *(.gnu.version_r) }
 
-  /* This linker script is used both with -r and with -shared.
- For the layouts to match, we need to skip more than enough
- space for the dynamic symbol table et al.  If this amount
- is insufficient, ld -shared will barf.  Just increase it here.  */
-  . = 0x400;
+   /*
+* This linker script is used both with -r and with -shared.
+* For the layouts to match, we need to skip more than enough
+* space for the dynamic symbol table et al.  If this amount
+* is insufficient, ld -shared will barf.  Just increase it here.
+*/
+   . = 0x400;
 
-  .text   : { *(.text) }   :text =0x90909090
-  .note  : { *(.note.*) }  :text :note
-  .eh_frame_hdr   : { *(.eh_frame_hdr) }   :text :eh_frame_hdr
-  .eh_frame   : { KEEP (*(.eh_frame)) }:text
-  .dynamic: { *(.dynamic) }:text :dynamic
-  .useless: {
-   *(.got.plt) *(.got)
-   *(.data .data.* .gnu.linkonce.d.*)
-   *(.dynbss)
-   *(.bss .bss.* .gnu.linkonce.b.*)
-  }:text
+   .text   : { *(.text) }  :text   =0x90909090
+   .note   : { *(.note.*) }:text   :note
+   .eh_frame_hdr   : { *(.eh_frame_hdr ) } :text   :eh_frame_hdr
+   .eh_frame   : { KEEP (*(.eh_frame)) }   :text
+   .dynamic: { *(.dynamic) }   :text   :dynamic
+   .useless: {
+ *(.got.plt) *(.got)
+ *(.data .data.* .gnu.linkonce.d.*)
+ *(.dynbss)
+ *(.bss .bss.* .gnu.linkonce.b.*)
+   }   :text
 }
 
 /*
+ * Very old versions of ld do not recognize this name token; use the constant.
+ */
+#define PT_GNU_EH_FRAME 0x6474e550
+
+/*
  * We must supply the ELF program headers explicitly to get just one
  * PT_LOAD segment, and set the flags explicitly to make segments read-only.
  */
 PHDRS
 {
-  text PT_LOAD FILEHDR PHDRS FLAGS(5); /* PF_R|PF_X */
-  dynamic PT_DYNAMIC FLAGS(4); /* PF_R */
-  note PT_NOTE FLAGS(4); /* PF_R */
-  eh_frame_hdr 0x6474e550; /* PT_GNU_EH_FRAME, but ld doesn't match the name */
+   text            PT_LOAD FILEHDR PHDRS FLAGS(5); /* PF_R|PF_X */
+   dynamic         PT_DYNAMIC FLAGS(4);            /* PF_R */
+   note            PT_NOTE FLAGS(4);               /* PF_R */
+   eh_frame_hdr    PT_GNU_EH_FRAME;
 }
 
 /*
@@ -63,12 +70,12 @@ PHDRS
  */
 VERSION
 {
-  LINUX_2.6 {
-global:
-   __kernel_vsyscall;
-   __kernel_sigreturn;
-   __kernel_rt_sigreturn;
+   LINUX_2.6 {
+   global:
+   __kernel_vsyscall;
+   __kernel_sigreturn;
+   __kernel_rt_sigreturn;
 
-local: *;
-  };
+   local: *;
+   };
 }
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Killing a network connection

2007-10-15 Thread Stefan Monnier
> There is a /proc/sys/net/ipv4/ip_dynaddr sysctl in 2.6.21.

Actually, it does look promising, thanks.


Stefan


Re: Killing a network connection

2007-10-15 Thread Stefan Monnier
>> The main use for me is to deal with dangling connections due to taking
>> network interfaces up with different IP addresses (typically the wlan0
>> interface where the IP is different because I've moved from one AP to
>> another).  Of course, maybe there's another way to solve this particular
>> problem, in which case I'd like to hear about it as well.

> Long ago I did a 2.4 patch that solved exactly this problem. It introduced
> a new ifconfig flag "dynamic" and when a dynamic address went down
> all TCP connections originating from it were killed. It's still available
> in older SUSE releases. I might post a forward port later.

Actually, I'm pretty happy sometimes with the current behavior: if the
interface goes down and back up with the same AP within a short enough time,
it typically gets the same IP and the router's NAT table still has the TCP
connection live and things "just work".

So I'd want to kill the connections not when the interface goes down, but
when it comes back up with a different IP.


Stefan


[PATCH] ia64 vDSO: linker script indentation

2007-10-15 Thread Roland McGrath

This cleans up the formatting in the vDSO linker script, mostly just the
use of whitespace.  It's intended to approximate the kernel standard
conventions for indenting C, treating elements of the linker script about
like initialized variable definitions.

Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
CC: Sam Ravnborg <[EMAIL PROTECTED]>
---
 arch/ia64/kernel/gate.lds.S |  135 +++
 1 files changed, 72 insertions(+), 63 deletions(-)

diff --git a/arch/ia64/kernel/gate.lds.S b/arch/ia64/kernel/gate.lds.S
index 6d19833..44817d9 100644
--- a/arch/ia64/kernel/gate.lds.S
+++ b/arch/ia64/kernel/gate.lds.S
@@ -1,7 +1,8 @@
 /*
- * Linker script for gate DSO.  The gate pages are an ELF shared object prelinked to its
- * virtual address, with only one read-only segment and one execute-only segment (both fit
- * in one page).  This script controls its layout.
+ * Linker script for gate DSO.  The gate pages are an ELF shared object
+ * prelinked to its virtual address, with only one read-only segment and
+ * one execute-only segment (both fit in one page).  This script controls
+ * its layout.
  */
 
 
@@ -9,72 +10,80 @@
 
 SECTIONS
 {
-  . = GATE_ADDR + SIZEOF_HEADERS;
-
-  .hash: { *(.hash) }  :readable
-  .gnu.hash: { *(.gnu.hash) }
-  .dynsym  : { *(.dynsym) }
-  .dynstr  : { *(.dynstr) }
-  .gnu.version : { *(.gnu.version) }
-  .gnu.version_d   : { *(.gnu.version_d) }
-  .gnu.version_r   : { *(.gnu.version_r) }
-  .dynamic : { *(.dynamic) }   :readable :dynamic
-
-  /*
-   * This linker script is used both with -r and with -shared.  For the layouts to match,
-   * we need to skip more than enough space for the dynamic symbol table et al.  If this
-   * amount is insufficient, ld -shared will barf.  Just increase it here.
-   */
-  . = GATE_ADDR + 0x500;
-
-  .data.patch  : {
-   __start_gate_mckinley_e9_patchlist = .;
-   *(.data.patch.mckinley_e9)
-   __end_gate_mckinley_e9_patchlist = .;
-
-   __start_gate_vtop_patchlist = .;
-   *(.data.patch.vtop)
-   __end_gate_vtop_patchlist = .;
-
-   __start_gate_fsyscall_patchlist = .;
-   *(.data.patch.fsyscall_table)
-   __end_gate_fsyscall_patchlist = .;
-
-   __start_gate_brl_fsys_bubble_down_patchlist = .;
-   *(.data.patch.brl_fsys_bubble_down)
-   __end_gate_brl_fsys_bubble_down_patchlist = .;
-  }  :readable
-  .IA_64.unwind_info   : { *(.IA_64.unwind_info*) }
-  .IA_64.unwind: { *(.IA_64.unwind*) } :readable :unwind
+   . = GATE_ADDR + SIZEOF_HEADERS;
+
+   .hash   : { *(.hash) }  :readable
+   .gnu.hash   : { *(.gnu.hash) }
+   .dynsym : { *(.dynsym) }
+   .dynstr : { *(.dynstr) }
+   .gnu.version: { *(.gnu.version) }
+   .gnu.version_d  : { *(.gnu.version_d) }
+   .gnu.version_r  : { *(.gnu.version_r) }
+
+   .dynamic: { *(.dynamic) }   :readable   :dynamic
+
+   /*
+* This linker script is used both with -r and with -shared.  For
+* the layouts to match, we need to skip more than enough space for
+* the dynamic symbol table et al.  If this amount is insufficient,
+* ld -shared will barf.  Just increase it here.
+*/
+   . = GATE_ADDR + 0x500;
+
+   .data.patch : {
+   __start_gate_mckinley_e9_patchlist = .;
+   *(.data.patch.mckinley_e9)
+   __end_gate_mckinley_e9_patchlist = .;
+
+   __start_gate_vtop_patchlist = .;
+   *(.data.patch.vtop)
+   __end_gate_vtop_patchlist = .;
+
+   __start_gate_fsyscall_patchlist = .;
+   *(.data.patch.fsyscall_table)
+   __end_gate_fsyscall_patchlist = .;
+
+   __start_gate_brl_fsys_bubble_down_patchlist = .;
+   *(.data.patch.brl_fsys_bubble_down)
+   __end_gate_brl_fsys_bubble_down_patchlist = .;
+   }   :readable
+
+   .IA_64.unwind_info  : { *(.IA_64.unwind_info*) }
+   .IA_64.unwind   : { *(.IA_64.unwind*) } :readable   :unwind
 #ifdef HAVE_BUGGY_SEGREL
-  .text (GATE_ADDR + PAGE_SIZE): { *(.text) *(.text.*) }   
:readable
+  

Re: [PATCH] Map volume and brightness events on thinkpads

2007-10-15 Thread Jesse Barnes
On Monday, October 15, 2007 2:07 pm Henrique de Moraes Holschuh wrote:
> We should fix the backlight class to be more useful and support
> poll() or somesuch, for userspace to track the backlight level in a
> resource-friendly way for OSD (the only sane thing to do on an IBM
> thinkpad with such events). And an ALSA mixer to provide a proper
> path to the thinkpad-acpi volume functionality is also in my schedule
> for 2.6.25.
>
> As for Lenovo thinkpads, brightness control is to be processed by the
> ACPI video module, so brightness hot keys are not to be reported by
> default there either.  I am not so sure about the volume keys, but
> your patch touches the IBM keymap *and* you provide no testing
> information for the various Lenovo models, so I have to NAK it as
> well until more information is available.

No, on Lenovo (and in general actually) the firmware should *not* touch 
the backlight.  Otherwise if another driver touches it the driver and 
firmware will be out of sync, causing unexpected and undesirable 
behavior.  We intend to fix this for the Intel driver at least 
(requiring both ACPI video driver and gfx driver updates), others will 
probably follow eventually.

Jesse


Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Eric W. Biederman
Nick Piggin <[EMAIL PROTECTED]> writes:

> On Monday 15 October 2007 19:16, Andrew Morton wrote:
>> On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <[EMAIL PROTECTED]> wrote:
>> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
>> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
>> > > unmaintained, so I decided to send the patch to you :-).
>> > > I have CCed Ted, who did work on the code in the 90s. I found no
>> > > current email address of Chad Page.
>> >
>> > This really needs to be fixed...
>>
>> rd.c is fairly mind-boggling vfs abuse.
>
> Why do you say that? I guess it is _different_, by necessity(?)
> Is there anything that is really bad?

make_page_uptodate() is the most hideous part I have run into.
It has to know details about other layers to know what not
to stomp on.  I think my incorrect simplification of this is what
messed things up last round.

> I guess it's not nice
> for operating on the pagecache from its request_fn, but the
> alternative is to duplicate pages for backing store and buffer
> cache (actually that might not be a bad alternative really).

Cool. Triple buffering :)  Although I guess that would only
apply to metadata these days.   Having a separate store would
solve some of the problems, and probably remove the need
for carefully specifying the ramdisk block size.  We would
still need the magic restrictions on page allocations though,
and we would use them more often, as the initial write to the
ramdisk would not populate the pages we need.

A very ugly bit seems to be the fact that we assume we can
dereference bh->b_data without any special magic which
means the ramdisk must live in low memory on 32bit machines.

Eric


Re: What still uses the block layer?

2007-10-15 Thread david

On Mon, 15 Oct 2007, Stefan Richter wrote:


Matthew Wilcox wrote:

On Mon, Oct 15, 2007 at 04:26:04AM -0500, Rob Landley wrote:

Combining USB and IDE into the same /dev/sd? namespace makes enumerating the
IDE devices much harder than in the traditional "/dev/hdb doesn't move
without a screwdriver" model.  The merger creates a new problem for IDE, one
which didn't exist before: the addition or removal of other unrelated types
of devices may change this device's location next boot.  It may be possible
to add additional complication to the system to compensate, but what was the
advantage of merging the namespaces in the first place?


It's not something anyone particularly set out to do, it's just how
it worked out.  It was justified by saying "ok, this goes from a 99%
solution to a 96% solution, but there's 100% solution called uuids".
I don't particularly agree with this line of argumentation, but it did
hold sway.


Low-level networking drivers suggest a default interface name (per
interface or as a template like eth%d into which the networking core
inserts a lowest spare number).  Userspace can rename interfaces, but
nevertheless it's nice to have different default kernel names for
ethernet, wlan etc..

Could low-level SCSI drivers provide similar name templates which give a
hint on the transport involved?  It's a bit more difficult than with
networking interfaces though because
 - SCSI devices can have sd, sr, st, osst, ch, sg interfaces,
 - SCSI device files share a namespace with all other device files.

E.g.
/dev/sd-ide-b   - second IDE HDD,
/dev/sd-iscsi-e - fifth iSCSI direct access device,
/dev/sr-sata-0  - first SATA CD-ROM,
/dev/sr-usb-0   - a USB CD-ROM,
/dev/st-fw-0    - a FireWire tape drive,
/dev/sda        - a device whose transport driver didn't propose a name

Of course the really interesting names will still be provided by
udev-generated symlinks.
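
The "lowest spare number" template scheme described above can be sketched in userspace C. This is only an illustration under assumptions: alloc_name() and its parameters are hypothetical and are not the networking core's interface, and a transport-flavoured template such as "sd-usb-%d" is made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Userspace sketch, not kernel code: fill a template containing "%d"
 * (for example "eth%d") with the lowest number that does not collide
 * with a name already listed in taken[].
 * Returns 0 with the name in out, or -1 if no name could be produced.
 */
static int alloc_name(const char *tmpl, const char *const taken[],
                      size_t ntaken, int max_units,
                      char *out, size_t outlen)
{
    const char *slot;

    assert(tmpl && taken && out && outlen > 0);

    /* Locate the "%d" by hand so the template is never used as a format. */
    slot = strstr(tmpl, "%d");
    if (!slot)
        return -1;

    for (int unit = 0; unit < max_units; unit++) {
        size_t i;
        int n = snprintf(out, outlen, "%.*s%d%s",
                         (int)(slot - tmpl), tmpl, unit, slot + 2);

        if (n < 0 || (size_t)n >= outlen)
            return -1;              /* name would not fit */

        for (i = 0; i < ntaken; i++)
            if (strcmp(taken[i], out) == 0)
                break;              /* already in use, try next number */

        if (i == ntaken)
            return 0;               /* first free name found */
    }
    return -1;
}
```

With "eth0", "eth1" and "eth3" taken, the template "eth%d" yields "eth2". A different template starts its own numbering at 0, which is how a per-transport default name would stay separate from the plain sd/sr/st names.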


this is a nice option, and since most of the existing userspace code is 
looking for /dev/sd*, /dev/sr*, etc this should be able to work for new 
installs with no userspace changes. Since it would break existing installs 
it would need to be optional.


one other option that could be considered (and I do realize I'm bringing 
up flame-bait here) is that drivers that have fixed addresses could offer 
up a device name that include that address.
i.e. depending on the config option a device could show up as either sda, 
sd-scsi-a, sd-scsi-0:0:0:0, or even sd-scsi-


if the driver or bus doesn't have a real numbering, it wouldn't invent a 
fake one (which is a big problem with most of the prior suggestions that 
have tried to offer a numbering option), it would just offer the most 
specific information it has.


David Lang


Re: What still uses the block layer?

2007-10-15 Thread david

On Mon, 15 Oct 2007, Greg KH wrote:


On Mon, Oct 15, 2007 at 05:08:36AM -0500, Rob Landley wrote:

On Monday 15 October 2007 4:06:20 am Julian Calaby wrote:

On 10/15/07, Rob Landley <[EMAIL PROTECTED]> wrote:

I note that the eth0 and eth1 names are dynamically assigned on a first
come first serve basis (like scsi).  This never causes me a problem
because the driver loading order is constant, and once you figure out
that eth0 is gigabit and eth1 is the 80211g it _stays_ that way across
reboots, reliably. Yeah, it's a heuristic.  Hands up everybody relying on
such a heuristic in the real world.


Umm, not quite, from my experiences with pre-production wireless
drivers, (another story, another time) fancy stuff is being done in
udev to make sure that your gigabit card is always assigned to eth0.


I remember building a 2.4 kernel, statically linking in all the drivers, and
getting the ethernet devices showing up in a reliable order for years.  Where
does the need for fancy stuff come in?


Because PCI devices reorder their bus numbers all the time.  And we have
ethernet devices hanging off of USB connections now (yes, even built-in
to the machine), and we have network connections on other hot-pluggable
busses (remember, PCI is hot pluggable.)


do PCI devices reorder their bus numbers spontaneously, or only if you 
change the hardware?



So, the distros need to name network devices in a persistent way, that
is why the distros now do this.  If you don't like the distro doing it,
complain to them, it's not a kernel issue :)


I have, at least the response was to tell me how to kill this 'feature' 
even if they won't change it.


David Lang



Re: What still uses the block layer?

2007-10-15 Thread david

On Mon, 15 Oct 2007, Theodore Tso wrote:


On Mon, Oct 15, 2007 at 03:04:00AM -0500, Rob Landley wrote:


just
as Ethernet and PPP interfaces really are fundamentally the same
thing.


They're the same thing?

Do you mean that on a system with both, going:
  ifconfig eth1 66.92.53.140
  ifconfig ppp 192.168.0.42

Would be functionally equivalent to:
  ifconfig eth1 192.168.0.42
  ifconfig ppp 66.92.53.140


No, of course not.  But we don't have separate IP stacks for ethernet
and ppp devices.  And how we connect to a host via ssh makes no
difference whether we accessed it via Ethernet or PPP.  And I would
argue that how we address a filesystem should also make no difference
depending on the path to hard drive.


I think a close analogy would be that after a partition is mounted you 
don't need to know the path to the hard drive, and that is already true 
today. when you mount a drive (or assign an IP address to a network 
interface) the path to the device not only matters, it's critical.



By the way, ethernet cards contain a unique MAC address.  Hard
drives do not seem to, or if they do it's not being consistently
exposed in a way I can find.


You can pull a Model and Serial number via hdparm -i, but it's not as
easy to manipulate as a fixed-length MAC address.  That's why people
tend to use filesystem UUID's.


More to the point, with SATA, hot plugging has been designed in, so
probing order is not going to be well defined,


The spec may define the capability to hotplug, but your average
laptop doesn't offer the capability to hotplug anything into its
SATA controllers.  The hard drive is screwed in (due to the
portability part of laptopness), all the controllers wired onto the
motherboard are accounted for, none are exposed externally.  What
_is_ exposed externally is USB, and if you want to add an extra hard
drive you can buy a cheap USB one at Fry's.


That may be true for laptops today, but Linux doesn't run just on
servers.  You can easily get home servers with hot-swap SATA bays.  My
home fileserver, which is a white box I purchased on my own nickel,
NOT IBM big iron, has 3TB of raw storage for less than $10,000 a year
ago.  Today, that amount of home storage with hot-swap SATA drives and
a battery-backed hardware RAID controller could probably be purchased
for about half that price.


I also have a 3TB raid I built at home, it uses 3ware cards and a dozen 
300G IDE drives. since the 3ware driver is classified as SCSI if a drive 
fails all the other drives get renumbered on the next boot and it's 
painful to figure out which drive has a problem. I have to reboot and go 
into the 3ware BIOS to figure out which drive isn't reporting. This system 
also has an adaptec raid card in it and an adaptec regular SCSI card. The 
fact that these three cards take different drivers, and so the order of 
detection changes the drive numbering is a real pain when I'm installing a 
new distro onto it. once I get it installed I compile my own monolithic 
kernel and this problem stops because the kernel linking order determines 
the detection order.


this replaced a 1.2TB raid that I just about filled up, and then started 
having age-related drive failures on. It used 8 160G IDE drives, and when I 
had problems with a drive it was easy to see that /dev/hdk was missing 
from the set, and I was still able to have a removable drive bay for 
/dev/hdc that I could hook my tivo drive into (on a reboot for safety) and 
not have things go haywire if I left the bay empty (or switched off) when 
I booted.


this may not be hundreds of drives, but it should be enough to show that I 
have experienced the pain that some people claim is the reason all of this 
must be dynamic with a userspace helper to sort it all out. My take is 
that adding the userspace helper and not enumerating things that are easy 
to enumerate is making things worse, not better.



And even for laptops, if you need the performance, you can get Cardbus
cards that will allow you to connect eSATA drives to your laptop at
Fry's.

So even if you ignore "big data center" interconnects like FC, this
problem exists even for commodity grade SATA devices.


but these are separate SATA buses. while you could run into ordering 
issues if you hook multiple devices to one bus, you should be able to have 
no ordering issues if you don't have more than one device of a type on any 
one bus (you could have a SATA hard drive on the internal PCI controller, 
and another one on the Cardbus controller, but if you always order 
directly connected devices before cardbus connected devices they will 
always show up in the same order)



It's necessary for IBM big iron to do this.  It's generally not
necessary for laptops or embedded systems to do this if they
distinguish between _types_ of devices, which is something they
until recently did for the types of devices I was interested in, and
something they _stopped_ doing when everything got merged into the
scsi layer, and I 

Re: More Large blocksize benchmarks

2007-10-15 Thread David Chinner
On Mon, Oct 15, 2007 at 08:22:31PM -0400, Chris Mason wrote:
> Hello everyone,
> 
> I'm stealing the cc list and reviving an old thread because I've
> finally got some numbers to go along with the Btrfs variable blocksize
> feature.  The basic idea is to create a read/write interface to
> map a range of bytes on the address space, and use it in Btrfs for all
> metadata operations (file operations have always been extent based).
> 
> So, instead of casting buffer_head->b_data to some structure, I read and
> write at offsets in a struct extent_buffer.  The extent buffer is very
> small and backed by an address space, and I get large block sizes the
> same way file_write gets to write to 16k at a time, by finding the
> appropriate page in the address space.  This is an oversimplification
> since I try to cache these mapping decisions to avoid using too much
> CPU, but hopefully you get the idea.
> 
> The advantage to this approach is the changes are all inside Btrfs.  No
> extra kernel patches were required.
> 
> Dave reported that XFS saw much higher write throughput with large
> blocksizes, but so far I'm seeing the most benefits during reads.

Apples to oranges, Chris ;)

btrfs linearises writes due to its COW behaviour, and this trades
off read speed. i.e. we take more seeks to read data so we can keep
the write speed high. By using large blocks, you're reducing the
number of seeks needed to find anything, and hence the read speed
will increase. Write speed will be pretty much unchanged because
btrfs does linear writes no matter the block size.

XFS doesn't linearise writes and optimises its layout for a large
number of disks and a low number of seeks on reads - the opposite
of btrfs. Hence large block sizes reduce the number of writes XFS
needs to write a given set of data+metadata and hence write speed
increases much more than the read speed (until you get to large tree
traversals).

The basic conclusion is that different filesystems will benefit in
different ways with large block sizes.
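
To put rough numbers on the "fewer seeks with larger blocks" argument, here is a back-of-the-envelope sketch in C. Everything in it is an assumption for illustration: tree_levels() is a hypothetical helper, the 32-byte entry size is made up, and neither btrfs nor XFS is claimed to use these figures.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of tree levels needed to index nr_items entries when every
 * block holds block_size / entry_size entries.  On a cold cache each
 * level costs roughly one seek.  Sizes are illustrative only.
 */
static unsigned tree_levels(uint64_t block_size, uint64_t entry_size,
                            uint64_t nr_items)
{
    uint64_t fanout;
    uint64_t reach = 1;     /* entries addressable with the levels so far */
    unsigned levels = 0;

    assert(entry_size > 0);
    fanout = block_size / entry_size;
    assert(fanout >= 2);

    while (reach < nr_items) {
        /* Saturate instead of overflowing for very large trees. */
        reach = (reach > UINT64_MAX / fanout) ? UINT64_MAX : reach * fanout;
        levels++;
    }
    return levels;
}
```

Under these assumptions, indexing 10^9 entries takes 5 levels with 4k blocks (fanout 128) and 3 levels with 64k blocks (fanout 2048), so a lookup touches fewer blocks as the block size grows.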

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFC] cpuset update_cgroup_cpus_allowed

2007-10-15 Thread Paul Jackson
> currently against an older kernel 

ah .. which older kernel?

I tried it against the broken out 2.6.23-rc8-mm2 patch set,
inserting it before the task-containersv11-* patches, but
that blew up on me - three rejected hunks.

Any chance of getting this against a current cgroup (aka
container) kernel?

Could you use the diff --show-c-function option when composing
patches - they're easier to read that way - thanks.

+   if (!retval) {
+   cpus_allowed = cpuset_cpus_allowed(p);
+   if (!cpus_subset(new_mask, cpus_allowed)) {
+   /*
+* We must have raced with a concurrent cpuset
+* update. Just reset the cpus_allowed to the
+* cpuset's cpus_allowed
+*/
+   new_mask = cpus_allowed;

This narrows the race, perhaps sufficiently, but I don't see that it
guarantees closure.  Memory accesses to two different locations are not
guaranteed to be ordered across nodes, as best I recall.  The second
line above, that rereads the cpuset cpus_allowed, could get an old
value, in essence.

cpuset update task          sched_setaffinity task
------------------          ----------------------

A. write cpuset [Q]         V. read cpuset [Q]
B. read task [P]            W. check ok
C. write task [P]           X. write task [P]
                            Y. reread cpuset [Q]
                            Z. check ok again

Two memory locations:
[P] the cpus_allowed mask in the task_struct of the
task doing the sched_setaffinity call.
[Q] the cpus_allowed mask in the cpuset of the cpuset
to which the sched_setaffinity task is attached.

Even though, from the perspective of location [P], both B. and C.
happened before X., still from the perspective of location [Q] the
rereading in Y. could return the value the cpuset cpus_allowed had
before the write in A.  This could result in a task running with
a cpus_allowed that was totally outside its cpuset's cpus_allowed.

I will grant that this is a narrow window.  I won't lose much sleep
over it.
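
For reference, the recheck in the quoted hunk boils down to a subset test with a fallback. A minimal single-threaded sketch, assuming a 64-bit bitmap stands in for cpumask_t (the type and both helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for cpumask_t: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_sketch_t;

/* True if every CPU in a is also in b (what cpus_subset() tests). */
static int mask_subset(cpumask_sketch_t a, cpumask_sketch_t b)
{
    return (a & ~b) == 0;
}

/*
 * Steps Y and Z from the table: after writing the task mask, reread
 * the cpuset mask and fall back to it if the requested mask is no
 * longer inside it.  Returns the mask the task should end up with.
 */
static cpumask_sketch_t recheck_affinity(cpumask_sketch_t new_mask,
                                         cpumask_sketch_t cpuset_allowed)
{
    assert(cpuset_allowed != 0);    /* a cpuset always has some CPU */

    if (!mask_subset(new_mask, cpuset_allowed))
        new_mask = cpuset_allowed;
    return new_mask;
}
```

The sketch deliberately says nothing about ordering: it is only as good as the cpuset_allowed value passed in, which is exactly the problem with step Y possibly returning a stale value.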

>  - uses a priority heap to pick the processes to act on, based on start time

This adds a fair bit of code and complexity, relative to my patch.
This I do lose more sleep over.  There has to be a compelling
reason for doing this.

The point that David raises, regarding the interaction of this with
hotplug, seems to be a compelling reason for doing -something-
different than my patch proposal.

I don't know yet if it compels us to this much code, however.

Any chance you could provide a patch that works against cgroups?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [RFC] cpuset update_cgroup_cpus_allowed

2007-10-15 Thread Paul Jackson
> Yet by not doing any locking here to prevent a cpu from being 
> hot-unplugged, you can race and allow the hot-unplug event to happen 
> before calling set_cpus_allowed().  That makes this entire function a 
> no-op with set_cpus_allowed() returning -EINVAL for every call, which 
> isn't caught, and no error is reported to userspace.

Good point ... hmmm ...

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


[RFC] [PATCH 2/2] capabilities: implement 64-bit capabilities

2007-10-15 Thread Serge E. Hallyn
From 7dd503c612afcb86b3165602ab264e2e9493b4bf Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <[EMAIL PROTECTED]>
Date: Mon, 15 Oct 2007 20:57:52 -0400
Subject: [RFC] [PATCH 2/2] capabilities: implement 64-bit capabilities

We are out of capabilities in the 32-bit capability fields, and
several users could make use of additional capabilities.
Convert the capabilities to 64-bits, change the capability
version number accordingly, and convert the file capability
code to handle both 32-bit and 64-bit file capability xattrs.

It might seem desirable to also implement back-compatibility
to read 32-bit caps from userspace, but that becomes
problematic with capget, as capget could return valid info
for processes not using high bits, but would have to return
-EINVAL for those which were.

So with this patch, libcap would need to be updated to make
use of capset/capget.
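
A userspace sketch of reading both xattr layouts may help reviewers picture the on-disk side. It is not the patch's cap_from_disk(): the struct, function names and the rule that revision and size must agree are assumptions for illustration, and the numeric constants are copied as they appear in the diff below.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Constants as shown in the diff; sizes are 3*4 and 2*8+4 bytes. */
#define SK_REVISION_MASK   0xFF00u
#define SK_REVISION_1      0x0100u
#define SK_REVISION_2      0x0200u
#define SK_FLAG_EFFECTIVE  0x01u
#define SK_CAPS_SZ_1       12u
#define SK_CAPS_SZ_2       20u

struct file_caps_sketch {
    int      effective;
    uint64_t permitted;
    uint64_t inheritable;
};

/* Decode an nbytes-wide little-endian integer without alignment needs. */
static uint64_t get_le(const unsigned char *p, size_t nbytes)
{
    uint64_t v = 0;

    for (size_t i = 0; i < nbytes; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Returns 0 on success, -EINVAL on unknown size or revision. */
static int caps_from_blob(const unsigned char *blob, size_t size,
                          struct file_caps_sketch *out)
{
    uint32_t magic_etc;
    size_t width;

    assert(blob && out);

    if (size != SK_CAPS_SZ_1 && size != SK_CAPS_SZ_2)
        return -EINVAL;

    magic_etc = (uint32_t)get_le(blob, 4);
    switch (magic_etc & SK_REVISION_MASK) {
    case SK_REVISION_1:
        if (size != SK_CAPS_SZ_1)
            return -EINVAL;
        width = 4;              /* 32-bit permitted/inheritable */
        break;
    case SK_REVISION_2:
        if (size != SK_CAPS_SZ_2)
            return -EINVAL;
        width = 8;              /* 64-bit permitted/inheritable */
        break;
    default:
        return -EINVAL;
    }

    out->effective   = (magic_etc & SK_FLAG_EFFECTIVE) != 0;
    out->permitted   = get_le(blob + 4, width);
    out->inheritable = get_le(blob + 4 + width, width);
    return 0;
}
```

Decoding from a byte buffer rather than overlaying a struct also sidesteps padding: on ABIs that align 64-bit integers to 8 bytes, a struct with one 32-bit field followed by two 64-bit fields occupies 24 bytes, not the 20 bytes stored in the xattr.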

Signed-off-by: Serge E. Hallyn <[EMAIL PROTECTED]>
---
 fs/proc/array.c|6 +++---
 include/linux/capability.h |   29 -
 security/commoncap.c   |   37 +
 3 files changed, 52 insertions(+), 20 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 3f4d824..c8ea46d 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -288,9 +288,9 @@ static inline char *task_sig(struct task_struct *p, char *buffer)
 
 static inline char *task_cap(struct task_struct *p, char *buffer)
 {
-return buffer + sprintf(buffer, "CapInh:\t%016x\n"
-   "CapPrm:\t%016x\n"
-   "CapEff:\t%016x\n",
+return buffer + sprintf(buffer, "CapInh:\t%016lx\n"
+   "CapPrm:\t%016lx\n"
+   "CapEff:\t%016lx\n",
cap_t(p->cap_inheritable),
cap_t(p->cap_permitted),
cap_t(p->cap_effective));
diff --git a/include/linux/capability.h b/include/linux/capability.h
index bb017ed..a3da4b9 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -29,7 +29,7 @@ struct task_struct;
library since the draft standard requires the use of malloc/free
etc.. */
 
-#define _LINUX_CAPABILITY_VERSION  0x19980330
+#define _LINUX_CAPABILITY_VERSION  0x20071015
 
 typedef struct __user_cap_header_struct {
__u32 version;
@@ -37,29 +37,40 @@ typedef struct __user_cap_header_struct {
 } __user *cap_user_header_t;
 
 typedef struct __user_cap_data_struct {
-__u32 effective;
-__u32 permitted;
-__u32 inheritable;
+__u64 effective;
+__u64 permitted;
+__u64 inheritable;
 } __user *cap_user_data_t;
 
 #define XATTR_CAPS_SUFFIX "capability"
 #define XATTR_NAME_CAPS XATTR_SECURITY_PREFIX XATTR_CAPS_SUFFIX
 
-#define XATTR_CAPS_SZ (3*sizeof(__le32))
+#define XATTR_CAPS_SZ_1 (3*sizeof(__le32))
+#define XATTR_CAPS_SZ_2 (2*sizeof(__le64) + sizeof(__le32))
 #define VFS_CAP_REVISION_MASK  0xFF00
 #define VFS_CAP_REVISION_1 0x0100
+#define VFS_CAP_REVISION_2 0x0200
 
-#define VFS_CAP_REVISION   VFS_CAP_REVISION_1
+#define VFS_CAP_REVISION   VFS_CAP_REVISION_2
+#define XATTR_CAPS_SZ  XATTR_CAPS_SZ_2
 
 #define VFS_CAP_FLAGS_MASK ~VFS_CAP_REVISION_MASK
 #define VFS_CAP_FLAGS_EFFECTIVE0x01
 
-struct vfs_cap_data {
+struct vfs_cap_data_v1 {
__u32 magic_etc;  /* Little endian */
__u32 permitted;/* Little endian */
__u32 inheritable;  /* Little endian */
 };
 
+struct vfs_cap_data_v2 {
+   __u32 magic_etc;  /* Little endian */
+   __u64 permitted;/* Little endian */
+   __u64 inheritable;  /* Little endian */
+};
+
+typedef struct vfs_cap_data_v2 vfs_cap_data;
+
 #ifdef __KERNEL__
 
 /* #define STRICT_CAP_T_TYPECHECKS */
@@ -67,12 +78,12 @@ struct vfs_cap_data {
 #ifdef STRICT_CAP_T_TYPECHECKS
 
 typedef struct kernel_cap_struct {
-   __u32 cap;
+   __u64 cap;
 } kernel_cap_t;
 
 #else
 
-typedef __u32 kernel_cap_t;
+typedef __u64 kernel_cap_t;
 
 #endif
 
diff --git a/security/commoncap.c b/security/commoncap.c
index 542bbe9..2cca843 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -190,25 +190,46 @@ int cap_inode_killpriv(struct dentry *dentry)
return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
 }
 
-static inline int cap_from_disk(struct vfs_cap_data *caps,
+union vfs_cap_union {
+   struct vfs_cap_data_v1 v1;
+   struct vfs_cap_data_v2 v2;
+};
+
+static inline int cap_from_disk(union vfs_cap_union *caps,
struct linux_binprm *bprm,
int size)
 {
__u32 magic_etc;
 
-   if (size != XATTR_CAPS_SZ)
+   if (size != XATTR_CAPS_SZ_1 && size != XATTR_CAPS_SZ_2)
return -EINVAL;
 
-   magic_etc = le32_to_cpu(caps->magic_etc);
+   magic_etc = le32_to_cpu(caps->v1.magic_etc);
 
switch ((magic_etc & VFS_CAP_REVISION_MASK)) {
-   case 

[PATCH 1/2 -mm] capabilities: clean up file capability reading

2007-10-15 Thread Serge E. Hallyn
This patch is a simple cleanup which should probably be
applied to -mm (assuming I haven't messed it up).  The next
patch is an experimental patch which will require userspace
support and is just RFC at this point.

From 9fc0782de6e1287aaeebe8ad653b008f09b22c11 Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <[EMAIL PROTECTED]>
Date: Mon, 15 Oct 2007 17:33:24 -0400
Subject: [PATCH 1/2] capabilities: clean up file capability reading

Simplify the vfs_cap_data structure.

Also fix get_file_caps which was declaring
__le32 v1caps[XATTR_CAPS_SZ] on the stack, but
XATTR_CAPS_SZ is already * sizeof(__le32).
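
The sizing mistake is easy to see in a standalone sketch. The names below are stand-ins (uint32_t for __le32, CAPS_SZ_SKETCH for XATTR_CAPS_SZ), not the kernel definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors XATTR_CAPS_SZ: a size in bytes, not an element count. */
#define CAPS_SZ_SKETCH (3 * sizeof(uint32_t))

struct caps_sketch {
    uint32_t magic_etc;
    uint32_t permitted;
    uint32_t inheritable;
};

/* Old declaration: a byte count used as the number of elements. */
static size_t oversized_buffer_bytes(void)
{
    uint32_t v1caps[CAPS_SZ_SKETCH];    /* 12 elements of 4 bytes each */

    return sizeof(v1caps);
}

/* New declaration: the struct is exactly as large as the xattr. */
static size_t struct_buffer_bytes(void)
{
    struct caps_sketch incaps;

    assert(sizeof(incaps) == CAPS_SZ_SKETCH);
    return sizeof(incaps);
}
```

The array form reserves 48 bytes of stack for a 12-byte xattr; the struct form reserves 12. The oversized buffer was harmless for correctness, just wasteful and misleading.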

Signed-off-by: Serge E. Hallyn <[EMAIL PROTECTED]>
---
 include/linux/capability.h |6 ++
 security/commoncap.c   |   23 +++
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/include/linux/capability.h b/include/linux/capability.h
index 7a8d7ad..bb017ed 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -56,10 +56,8 @@ typedef struct __user_cap_data_struct {
 
 struct vfs_cap_data {
__u32 magic_etc;  /* Little endian */
-   struct {
-   __u32 permitted;/* Little endian */
-   __u32 inheritable;  /* Little endian */
-   } data[1];
+   __u32 permitted;/* Little endian */
+   __u32 inheritable;  /* Little endian */
 };
 
 #ifdef __KERNEL__
diff --git a/security/commoncap.c b/security/commoncap.c
index 43f9027..542bbe9 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -190,7 +190,8 @@ int cap_inode_killpriv(struct dentry *dentry)
return inode->i_op->removexattr(dentry, XATTR_NAME_CAPS);
 }
 
-static inline int cap_from_disk(__le32 *caps, struct linux_binprm *bprm,
+static inline int cap_from_disk(struct vfs_cap_data *caps,
+   struct linux_binprm *bprm,
int size)
 {
__u32 magic_etc;
@@ -198,7 +199,7 @@ static inline int cap_from_disk(__le32 *caps, struct linux_binprm *bprm,
if (size != XATTR_CAPS_SZ)
return -EINVAL;
 
-   magic_etc = le32_to_cpu(caps[0]);
+   magic_etc = le32_to_cpu(caps->magic_etc);
 
switch ((magic_etc & VFS_CAP_REVISION_MASK)) {
case VFS_CAP_REVISION:
@@ -206,8 +207,8 @@ static inline int cap_from_disk(__le32 *caps, struct linux_binprm *bprm,
bprm->cap_effective = true;
else
bprm->cap_effective = false;
-   bprm->cap_permitted = to_cap_t( le32_to_cpu(caps[1]) );
-   bprm->cap_inheritable = to_cap_t( le32_to_cpu(caps[2]) );
+   bprm->cap_permitted = to_cap_t( le32_to_cpu(caps->permitted) );
+   bprm->cap_inheritable = to_cap_t( le32_to_cpu(caps->inheritable) );
return 0;
default:
return -EINVAL;
@@ -219,7 +220,7 @@ static int get_file_caps(struct linux_binprm *bprm)
 {
struct dentry *dentry;
int rc = 0;
-   __le32 v1caps[XATTR_CAPS_SZ];
+   struct vfs_cap_data incaps;
struct inode *inode;
 
if (bprm->file->f_vfsmnt->mnt_flags & MNT_NOSUID) {
@@ -232,8 +233,14 @@ static int get_file_caps(struct linux_binprm *bprm)
if (!inode->i_op || !inode->i_op->getxattr)
goto out;
 
-   rc = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, ,
-   XATTR_CAPS_SZ);
+   rc = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS, NULL, 0);
+   if (rc > 0) {
+   if (rc == XATTR_CAPS_SZ)
+   rc = inode->i_op->getxattr(dentry, XATTR_NAME_CAPS,
+   , XATTR_CAPS_SZ);
+   else
+   rc = -EINVAL;
+   }
if (rc == -ENODATA || rc == -EOPNOTSUPP) {
/* no data, that's ok */
rc = 0;
@@ -242,7 +249,7 @@ static int get_file_caps(struct linux_binprm *bprm)
if (rc < 0)
goto out;
 
-   rc = cap_from_disk(v1caps, bprm, rc);
+   rc = cap_from_disk(, bprm, rc);
if (rc)
printk(KERN_NOTICE "%s: cap_from_disk returned %d for %s\n",
__FUNCTION__, rc, bprm->filename);
-- 
1.5.1.1.GIT
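
To illustrate the stack-size point from the changelog above, here is a
minimal userspace sketch.  It assumes the size constant is already expressed
in bytes, as three 32-bit words; le32_t, SKETCH_XATTR_CAPS_SZ,
sketch_vfs_cap_data and both helper functions are stand-ins invented for this
example, not the kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t le32_t;		/* stand-in for the kernel's __le32 */

/* Assumption: the constant is a byte count, i.e. already * sizeof(le32_t). */
#define SKETCH_XATTR_CAPS_SZ (3 * sizeof(le32_t))

/* Flattened layout, mirroring the simplified structure in the patch. */
struct sketch_vfs_cap_data {
	le32_t magic_etc;
	le32_t permitted;
	le32_t inheritable;
};

/* Old style: an array with one element per *byte* of the xattr. */
size_t old_buffer_bytes(void)
{
	le32_t v1caps[SKETCH_XATTR_CAPS_SZ];

	return sizeof(v1caps);
}

/* New style: one structure that is exactly as large as the xattr. */
size_t new_buffer_bytes(void)
{
	struct sketch_vfs_cap_data incaps;

	return sizeof(incaps);
}
```

Under these assumptions the old declaration reserves four times as much
stack as the attribute needs, while the structure matches it exactly.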



Re: [PATCH 2/11] maps3: introduce task_size_of for all arches

2007-10-15 Thread David Rientjes
On Mon, 15 Oct 2007, Dave Hansen wrote:

> diff -puN include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-mips/processor.h
> --- lxc/include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches 2007-10-15 17:29:22.0 -0700
> +++ lxc-dave/include/asm-mips/processor.h 2007-10-15 17:34:12.0 -0700
> @@ -45,6 +45,8 @@ extern unsigned int vced_count, vcei_cou
>   * space during mmap's.
>   */
>  #define TASK_UNMAPPED_BASE   (PAGE_ALIGN(TASK_SIZE / 3))
> +#define TASK_SIZE_OF(tsk)\
> + (test_tsk_thread_flag(tak, TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
>  #endif
>  
>  #ifdef CONFIG_64BIT

tak needs to be tsk.

> @@ -65,6 +67,8 @@ extern unsigned int vced_count, vcei_cou
>  #define TASK_UNMAPPED_BASE   \
>   (test_thread_flag(TIF_32BIT_ADDR) ? \
>   PAGE_ALIGN(TASK_SIZE32 / 3) : PAGE_ALIGN(TASK_SIZE / 3))
> +#define TASK_SIZE_OF(tsk)\
> + (test_tsk_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
>  #endif
>  
>  #define NUM_FPU_REGS 32

test_tsk_thread_flag() takes two arguments.
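
Something along these lines would address both comments.  This is only a
userspace sketch: struct task_sketch, the flag helper body and all numeric
values are made up for illustration and are not the MIPS definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel definitions; values are invented. */
#define TIF_32BIT_ADDR	0
#define TASK_SIZE32	0x7fff8000ULL
#define TASK_SIZE	0x10000000000ULL

struct task_sketch {
	unsigned long flags;
};

/* Two arguments: the task to inspect and the flag bit to test. */
static inline bool test_tsk_thread_flag(const struct task_sketch *tsk, int flag)
{
	return (tsk->flags >> flag) & 1UL;
}

/* The macro parameter is spelled consistently and is passed through. */
#define TASK_SIZE_OF(tsk)	\
	(test_tsk_thread_flag(tsk, TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
```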


Re: [PATCH 1/11] maps3: add proportional set size accounting in smaps

2007-10-15 Thread David Rientjes
On Mon, 15 Oct 2007, Matt Mackall wrote:

> > The pss is going to need accessor functions, preferably inlined, and the 
> > comment adjusted stating that all accesses should be through those 
> > functions and not directly to the mem_size_stats struct.
> > 
> > static inline u64 pss_up(unsigned long pss)
> > {
> > return pss << PSS_DIV_BITS;
> > }
> > 
> > static inline unsigned long pss_down(u64 pss)
> > {
> > return pss >> PSS_DIV_BITS;
> > }
> 
> I think that's overkill for something that has exactly one use of each.
> 

There's no overkill at all, the current uses are already accessed with 
these bitshifts so there's no overhead when using an inlined function 
instead.

To correctly access the pss, these bitshifts are required because the 
decision was made to use the lower PSS_DIV_BITS for rounding.  Thus, you 
need to include accessor functions so that they are always accessed 
correctly now and in the future.
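
As a rough userspace sketch of what I mean: the value of PSS_DIV_BITS below
is an assumption, and pss_account_page() is a hypothetical helper that is
not part of the patch.  The cast in pss_up() widens before shifting so that
the sketch also behaves on 32-bit longs.

```c
#include <stdint.h>

#define PSS_DIV_BITS 12		/* assumed value, for illustration only */

/* Scale a byte count up into the fixed-point representation. */
static inline uint64_t pss_up(unsigned long pss)
{
	return (uint64_t)pss << PSS_DIV_BITS;
}

/* Scale the fixed-point representation back down to bytes. */
static inline unsigned long pss_down(uint64_t pss)
{
	return (unsigned long)(pss >> PSS_DIV_BITS);
}

/* Hypothetical helper: add one page's share, split among mapcount mappers. */
static inline void pss_account_page(uint64_t *pss, unsigned long page_size,
				    unsigned int mapcount)
{
	*pss += pss_up(page_size) / mapcount;
}
```

With four 4096-byte pages each shared three ways, dividing unscaled values
gives 4 * 1365 = 5460, while accumulating through the accessors and scaling
down once at the end gives 5461, which is closer to the exact 5461.33.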

David


nmi_watchdog on x86_64

2007-10-15 Thread Yinghai Lu
I just found that on the ck804- and mcp55-based AMD servers I have on hand:
nmi_watchdog=1 doesn't work
but nmi_watchdog=2 does work

With =1, it says: IOAPIC 8259A virtual wire mode...

Did nmi_watchdog=1 work on any other amd64 platform?

YH


Re: What still uses the block layer?

2007-10-15 Thread david

On Tue, 16 Oct 2007, Neil Brown wrote:


On Monday October 15, [EMAIL PROTECTED] wrote:

Therefore it is best to not have stable single-number naming schemes
for any devices on any machines.  Why?  Because it ensures there will
not be any second class citizens.


This is where we disagree.  The existence of devices you cannot stably
enumerate does not eliminate the existence of devices you trivially can.


No, but it dramatically reduces the value of being able to enumerate
those devices.


this is the point of disagreement. the devices you can trivially enumerate 
can be handled easily and trivially, the ones that you can't may require 
more complex things to handle them, but that depends on the situation. If 
you only have one USB drive on a system you don't need to worry about what 
order USB hotplug events come in if you can just say 'the first USB 
drive'. mixing the different types of devices into one namespace 
complicates things in a couple of ways.


1. devices that used to have stable names no longer have stable names 
without extra effort.


2. having multiple separate unstable namespaces with one name in each of 
them looks to the user like a stable namespace, since the instability 
never comes into play. combining these into a single namespace loses 
this stability




Pulling out the "IBM numa cluster with multiple SAS enclosures _and_ firewire"
infrastructure to find the root partition on my hard drive may be good for
the IBM numa clusters, but only at the expense of complicating this part of
my laptop's infrastructure by an order of magnitude, and making embedded
systems nearly impossible to put together.  If "one size fits all" were true,
my cell phone would be running Red Hat Enterprise.


If some devices that are even reasonably common (e.g. IDE drives) are
stable, then some application developers or system integrators will
work under the assumption of stability and whatever they build will
break when you try it on different hardware.


So you break the IDE drives to get laptop users to debug the Niagara set?  The


Breaking old behaviour is always bad... My computers with IDE
interfaces still see stable "/dev/hda" devices.  Are you saying the
devices that used to be "hda" are now "sdb" ??  Maybe there is a
.config option...


yes, this changed. If you run your IDE drives with the PATA drivers of 
libata they show up as sdX, and are subject to the same detection order 
issues as any other sd device.



solution is to make the easy cases hard?


Is it really that hard?


Note that stable names are still a very real option.  udev provides
several.  /dev/disk/by-path/XXX will be stable for lots of "screwed
in" devices.  /dev/disk/by-id will be stable for devices that report a
unique id. etc.


Here it's

  ls /dev/disk/by-path/
  pci-:00:1f.2-scsi-0:0:0:0pci-:00:1f.2-scsi-0:0:0:0-part4
  pci-:00:1f.2-scsi-0:0:0:0-part1  pci-:00:1f.2-scsi-0:0:0:0-part5
  pci-:00:1f.2-scsi-0:0:0:0-part2  pci-:00:1f.2-scsi-0:0:0:0-part6
  pci-:00:1f.2-scsi-0:0:0:0-part3  pci-:00:1f.2-scsi-1:0:0:0

And this is an improvement?


Depends on your metric.

"Easy to type" - I guess /dev/hda1 wins hands down.
"Can be used in a script or config file and is guaranteed always to
work until a screwdriver is used to change that device or its
controller"
 I think
 /dev/disk/by-path/pci-:00:1f.2-scsi-0:0:0:0-part1
is quite acceptable.
What is your metric?


does it have to be one or the other? /dev/hda1 succeeded on both metrics.



The difference between IDE, SATA, SCSI and even USB is peripheral for
the large majority of uses, and I think maintaining the distinction in
the major/minor number or in the "primary" /dev name is - for the
above reasons - more of a cost than a value.


Is your definition of "the large majority of uses" where ncr Voyager, the
Amiga, and current macintosh laptops are all one use each, or is your
definition of "the large majority of uses" the one where each "use" is an
installation, of which there are millions of PCs (and even more ARM cell
phones), and something like three instances of Voyager?


My definition of "the large majority of uses" is "mkfs, fsck, mount,
fdisk, system-install-process".

Different people differentiate devices in different ways.  A system
integrator might know about the hardware path.  An end user might know
about drive brands or sizes.  A casual user might just think "internal
or external".  The kernel cannot support all these different
approaches to naming.  It really is best if it uses arbitrary names,
and provides access to descriptions that the user can choose between.
udev facilitates this with links in /dev/disk/.  A system install can
facilitate this even more by reporting size/manufacturer information etc.


but is the possibility of wanting different options really sufficient 
reason to eliminate every stable option? right now the /dev names are 
essentially random without external help. why couldn't they be stable 

Re: [PATCH RFC 2/2] paravirt: clean up lazy mode handling

2007-10-15 Thread Glauber de Oliveira Costa
On 10/12/07, Jeremy Fitzhardinge <[EMAIL PROTECTED]> wrote:
> [ Changes since last post: fixed up lguest ]
>
> Currently, the set_lazy_mode pv_op is overloaded with 5 functions:
>  1. enter lazy cpu mode
>  2. leave lazy cpu mode
>  3. enter lazy mmu mode
>  4. leave lazy mmu mode
>  5. flush pending batched operations
>
> This complicates each paravirt backend, since it needs to deal with
> all the possible state transitions, handling flushing, etc. In
> particular, flushing is quite distinct from the other 4 functions, and
> seems to just cause complication.
>
> This patch removes the set_lazy_mode operation, and adds "enter" and
> "leave" lazy mode operations on mmu_ops and cpu_ops.  All the logic
> associated with enter and leaving lazy states is now in common code
> (basically BUG_ONs to make sure that no mode is current when entering
> a lazy mode, and make sure that the mode is current when leaving).
> Also, flush is handled in a common way, by simply leaving and
> re-entering the lazy mode.
>
> The result is that the Xen, lguest and VMI lazy mode implementations
> are much simpler.
>
> Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
> Cc: Andi Kleen <[EMAIL PROTECTED]>
> Cc: Zach Amsden <[EMAIL PROTECTED]>
> Cc: Rusty Russell <[EMAIL PROTECTED]>
> Cc: Avi Kivity <[EMAIL PROTECTED]>
> Cc: Anthony Liguory <[EMAIL PROTECTED]>
> Cc: "Glauber de Oliveira Costa" <[EMAIL PROTECTED]>
> Cc: "Nakajima, Jun" <[EMAIL PROTECTED]>
>
> ---
>  arch/i386/kernel/paravirt.c |   58 +++
>  arch/i386/kernel/vmi.c  |   45 +++--
>  arch/i386/xen/enlighten.c   |   44 ++--
>  arch/i386/xen/mmu.c |2 -
>  arch/i386/xen/multicalls.h  |2 -
>  arch/i386/xen/xen-ops.h |7 -
>  drivers/lguest/lguest.c |   34 -
>  include/asm-i386/paravirt.h |   52 --
>  8 files changed, 140 insertions(+), 104 deletions(-)
>
> ===
> --- a/arch/i386/kernel/paravirt.c
> +++ b/arch/i386/kernel/paravirt.c
> @@ -266,6 +266,49 @@ int paravirt_disable_iospace(void)
> }
>
> return ret;
> +}
> +
> +static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
> +
> +static inline void enter_lazy(enum paravirt_lazy_mode mode)
> +{
> +   BUG_ON(x86_read_percpu(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
> +   BUG_ON(preemptible());

Wouldn't it be better to WARN_ON and simply not enter lazy mode?
It does not sound like a fatal condition.

> +void paravirt_leave_lazy(enum paravirt_lazy_mode mode)
> +{
> +   BUG_ON(x86_read_percpu(paravirt_lazy_mode) != mode);
> +   BUG_ON(preemptible());

Although this one seems like a fatal condition ;-)
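
To make the suggestion concrete, here is a rough userspace sketch of a
warn-and-refuse enter path.  It is not the patch's code: the single global
stands in for the per-cpu variable, fprintf() stands in for WARN_ON(), the
enum values are assumed, and the *_sketch names are made up.

```c
#include <stdbool.h>
#include <stdio.h>

enum paravirt_lazy_mode {
	PARAVIRT_LAZY_NONE,
	PARAVIRT_LAZY_MMU,
	PARAVIRT_LAZY_CPU,
};

/* Single-CPU stand-in for the per-cpu lazy mode variable. */
static enum paravirt_lazy_mode lazy_mode_sketch = PARAVIRT_LAZY_NONE;

enum paravirt_lazy_mode get_lazy_mode_sketch(void)
{
	return lazy_mode_sketch;
}

/* Warn and refuse instead of crashing; returns true if the mode was entered. */
bool enter_lazy_sketch(enum paravirt_lazy_mode mode)
{
	if (lazy_mode_sketch != PARAVIRT_LAZY_NONE) {
		fprintf(stderr, "warning: already in lazy mode %d\n",
			(int)lazy_mode_sketch);
		return false;
	}
	lazy_mode_sketch = mode;
	return true;
}

/* Leaving a mode that is not current stays an error (BUG_ON in the patch). */
bool leave_lazy_sketch(enum paravirt_lazy_mode mode)
{
	if (lazy_mode_sketch != mode)
		return false;
	lazy_mode_sketch = PARAVIRT_LAZY_NONE;
	return true;
}
```

The caller would then just carry on without batching when the enter call is
refused.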

> +void paravirt_enter_lazy_mmu(void)
> +{
> +   enter_lazy(PARAVIRT_LAZY_MMU);
> +}
> +
> +void paravirt_leave_lazy_mmu(void)
> +{
> +   paravirt_leave_lazy(PARAVIRT_LAZY_MMU);
> +}
> +
> +void paravirt_enter_lazy_cpu(void)
> +{
> +   enter_lazy(PARAVIRT_LAZY_CPU);
> +}
> +
> +void paravirt_leave_lazy_cpu(void)
> +{
> +   paravirt_leave_lazy(PARAVIRT_LAZY_CPU);
> +}
> +
> +enum paravirt_lazy_mode paravirt_get_lazy_mode(void)
> +{
> +   return x86_read_percpu(paravirt_lazy_mode);
>  }

I am concerned that this is 32-bit specific.
We could wrap it here, but the best solution may be just to define
this macro for 64-bit as well, so that everyone benefits. So yes, it
is a concern, but I don't think anything should be changed in this
patch to address it, so... ok ;-)

-- 
Glauber de Oliveira Costa.
"Free as in Freedom"
http://glommer.net

"The less confident you are, the more serious you have to act."


Re: [PATCH] git scsi misc include fix

2007-10-15 Thread Paul Jackson
James wrote:
> In that case, the correct fix
> is actually to move the scatterlist include from scsi_error.c (where the
> scatterlist was originally used locally) into scsi_eh.h, like this.

I suspect you're correct, yes.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [git pull] lockdep for v2.6.24

2007-10-15 Thread Linus Torvalds


On Mon, 15 Oct 2007, Peter Zijlstra wrote:
> 
> please pull the lockdep tree from:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep.git 
> v2.6.24-lockdep

Hmm. I'm now getting

WARNING: at kernel/lockdep.c:700 look_up_lock_class()

Call Trace:
 [] __lock_acquire+0x15f/0xc92
 [] do_lookup+0x83/0x1b0
 [] lock_acquire+0x5a/0x73
 [] do_lookup+0x83/0x1b0
 [] debug_mutex_lock_common+0x16/0x23
 [] mutex_lock_nested+0x10c/0x2b0
 [] do_lookup+0x83/0x1b0
 [] __link_path_walk+0x924/0xde9
 [] link_path_walk+0x58/0xe0
 [] _spin_unlock+0x17/0x20
 [] get_unused_fd_flags+0x115/0x126
 [] do_path_lookup+0x1ae/0x229
 [] __path_lookup_intent_open+0x56/0x96
 [] open_namei+0x7d/0x66c
 [] do_filp_open+0x1c/0x38
 [] _spin_unlock+0x17/0x20
 [] get_unused_fd_flags+0x115/0x126
 [] do_sys_open+0x46/0xc3
 [] system_call+0x7e/0x83

which seems to be new..

Linus


Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces

2007-10-15 Thread Dave Hansen
On Mon, 2007-10-15 at 19:58 -0500, Matt Mackall wrote:
> 
> > For the bits that we want to export, we could also add the unoptimized
> > access functions for any that don't already have them:
> > 
> > #define __ClearPageReserved(page)   __clear_bit(PG_reserved, &(page)->flags)
> 
> Confused. Why are we interested in clear? 

We're not.  I just grabbed a random line to show the non-atomic
accessors.  Any actual one we'd need to add would be:

#define __PageBuddy(page) __test_bit(PG_buddy, &(page)->flags)

It looks like we don't have any of these non-atomic ones for plain
__PageFoo().  So, we'd have to add them for each one that we wanted.
Still not much work, and still satisfies the "grep test". :)
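
Roughly what I have in mind, as a userspace sketch only.  The bit numbers,
struct page_sketch and the helper names are invented for the example (the
leading double underscore is dropped because such names are reserved outside
the kernel).

```c
#include <stdbool.h>

/* Made-up bit numbers; the real ones live in the kernel's page flag list. */
enum {
	PG_reserved_sketch = 2,
	PG_buddy_sketch = 5,
};

struct page_sketch {
	unsigned long flags;
};

/* Plain, non-atomic read of one bit: no locked instruction, no barrier. */
static inline bool test_bit_nonatomic(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

/* One greppable accessor per exported bit. */
#define PageBuddy_nonatomic(page) \
	test_bit_nonatomic(PG_buddy_sketch, &(page)->flags)
#define PageReserved_nonatomic(page) \
	test_bit_nonatomic(PG_reserved_sketch, &(page)->flags)
```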

-- Dave



Re: [PATCH] git scsi misc include fix

2007-10-15 Thread James Bottomley
On Mon, 2007-10-15 at 17:08 -0700, Paul Jackson wrote:
> James wrote:
> > The requirement for struct scatterlist is the same
> > before and after the git scsi-misc patch. 
> 
> Not so.  The git-scsi-misc.patch in 2.6.23-mm1 clearly adds the line:
> 
> struct scatterlist sense_sgl;
> 
> as part of the added struct scsi_eh_save in scsi/scsi_eh.h.
> 
> This bit me while I was doing a bisection on 2.6.23-mm1, for another
> problem, in git-sched, which is discussed in the lkml thread:
> 
> git-sched patch won't boot on SN arch, 2.6.23-mm1
> 
> This is using sn2_defconfig.  The full 2.6.23-mm1 patch set builds ok,
> because another patch, git-block.patch as I recall, includes
> scatterlist.h some other way, but for the following range of patches in
> 2.6.23-mm1, on the configuration sn2_defconfig, the build is broken,
> due to 'struct scatterlist' being an incomplete type:
> 
> git-scsi-misc.patch
> git-scsi-misc-include-fix.patch
> git-scsi-misc-fixup.patch
> qla2xxx-printk-fixes.patch
> pci-error-recovery-symbios-scsi-base-support.patch
> pci-error-recovery-symbios-scsi-first-failure.patch
> nsp32_restart_autoscsi-remove-error-check.patch
> scsi-send-media-state-change-modification-events.patch
> scsi-early-detection-of-medium-not-present-updated.patch
> mptbase-reset-ioc-initiator-during-pci-resume.patch
> scsi-use-notifier-chain-for-asynchronous-event.patch
> initio-fix-conflict-when-loading-driver.patch
> git-block.patch
> 
> > it should also fail with vanilla 2.6.23
> 
> I don't know about the vanilla 2.6.23 case.

Ah, right, sorry ... on the ball now.  I thought you were saying that
the scsi_error.c compilation was failing.  In that case, the correct fix
is actually to move the scatterlist include from scsi_error.c (where the
scatterlist was originally used locally) into scsi_eh.h, like this.

James

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index d29f846..ebaca4c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
diff --git a/include/scsi/scsi_eh.h b/include/scsi/scsi_eh.h
index 44224ba..d21b891 100644
--- a/include/scsi/scsi_eh.h
+++ b/include/scsi/scsi_eh.h
@@ -1,6 +1,8 @@
 #ifndef _SCSI_SCSI_EH_H
 #define _SCSI_SCSI_EH_H
 
+#include 
+
 #include 
 struct scsi_device;
 struct Scsi_Host;




Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces

2007-10-15 Thread Matt Mackall
On Mon, Oct 15, 2007 at 05:49:10PM -0700, Dave Hansen wrote:
> On Mon, 2007-10-15 at 19:35 -0500, Matt Mackall wrote:
> > Perhaps we need something like:
> > 
> > flags = page->flags;
> > userflags = 
> >   FLAG_BIT(USER_REFERENCED, flags & PG_referenced) |
> >   ...
> > 
> > etc. for the flags we want to export. This will let us change to
> > 
> >  FLAG_BIT(USER_SLAB, PageSlab(page)) |
> > 
> > if we make a virtual slab bit.
> 
> Yeah, that looks like a pretty sane scheme.  Do we want to be any more
> abstract about it?  Perhaps instead of USER_SLAB, it should be
> USER_KERNEL_INTERNAL, or USER_KERNEL_USE.  The slab itself is going away
> as we speak. :)

Perhaps. SLUB is still "a slab-based allocator". SLOB isn't, but I
intend to start making it use PG_slab shortly anyway.
 
> > And it shows up in grep.
> > 
> > Unfortunately, i386 test_bit is an asm inline and not a macro so we
> > can't hope for the compiler to fold up a bunch of identity bit
> > mappings for us. 
> 
> For the bits that we want to export, we could also add the unoptimized
> access functions for any that don't already have them:
> 
> #define __ClearPageReserved(page)   __clear_bit(PG_reserved, &(page)->flags)

Confused. Why are we interested in clear?

-- 
Mathematics is the supreme nostalgia of our time.


[PATCH] sc1200 pci cleanup, resume improvement, bug fix

2007-10-15 Thread Jeff Garzik
This patch accomplishes the following goals:

* kill the 'pci_enable_device ret val not checked' warning

* eliminate the incorrect mucking with pci_dev::current_state

via the following changes:

* [minor bug fix] eliminate pci_set_power_state() call in resume,
  pci_enable_device() does so for us.

* [bug fix] do not touch dev->current_state, pci_set_power_state()
  and other PCI layer functions manage this for us.

* [minor bug fix, warning fix] check pci_enable_device() ret val in
  resume, and do not bring up interfaces if it fails (which it might)

* eliminate lookup_pci_dev(), a needless loop over a global list, by
  storing our associated hwif in a struct allocated at probe time.

* introduce __ide_setup_pci_device() to facilitate making PCI drivers
  aware of the hwifs created during IDE generic probe.

Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>
---
WARNING WARNING WARNING

This is a drop-n-run patch, created ultimately because I was annoyed at
the [quite valid] pci_enable_device() build warning.

If someone likes this, please "take ownership" of the patch.

WARNING WARNING WARNING

 drivers/ide/pci/sc1200.c |   72 ++-
 drivers/ide/setup-pci.c  |   12 +++
 include/linux/ide.h  |2 +
 3 files changed, 59 insertions(+), 27 deletions(-)

diff --git a/drivers/ide/pci/sc1200.c b/drivers/ide/pci/sc1200.c
index ee0e3f5..fa61550 100644
--- a/drivers/ide/pci/sc1200.c
+++ b/drivers/ide/pci/sc1200.c
@@ -41,6 +41,12 @@
 #define PCI_CLK_66 0x02
 #define PCI_CLK_33A0x03
 
+#define SC1200_IFS 2
+
+struct sc1200_ifs {
+   ide_hwif_t  *iface[SC1200_IFS];
+};
+
 static unsigned short sc1200_get_pci_clock (void)
 {
unsigned char chip_id, silicon_revision;
@@ -274,22 +280,6 @@ static void sc1200_set_pio_mode(ide_drive_t *drive, const u8 pio)
 }
 
 #ifdef CONFIG_PM
-static ide_hwif_t *lookup_pci_dev (ide_hwif_t *prev, struct pci_dev *dev)
-{
-   int h;
-
-   for (h = 0; h < MAX_HWIFS; h++) {
-   ide_hwif_t *hwif = _hwifs[h];
-   if (prev) {
-   if (hwif == prev)
-   prev = NULL;// found previous, now look for next match
-   } else {
-   if (hwif && hwif->pci_dev == dev)
-   return hwif;// found next match
-   }
-   }
-   return NULL;// not found
-}
 
 typedef struct sc1200_saved_state_s {
__u32   regs[4];
@@ -298,7 +288,9 @@ typedef struct sc1200_saved_state_s {
 
 static int sc1200_suspend (struct pci_dev *dev, pm_message_t state)
 {
-   ide_hwif_t  *hwif = NULL;
+   struct sc1200_ifs   *ifs = pci_get_drvdata(dev);
+   ide_hwif_t  *hwif;
+   int i;
 
printk("SC1200: suspend(%u)\n", state.event);
 
@@ -308,9 +300,14 @@ static int sc1200_suspend (struct pci_dev *dev, pm_message_t state)
//
// Loop over all interfaces that are part of this PCI device:
//
-   while ((hwif = lookup_pci_dev(hwif, dev)) != NULL) {
+   for (i = 0; i < SC1200_IFS; i++) {
sc1200_saved_state_t*ss;
unsigned intbasereg, r;
+
+   hwif = ifs->iface[i];
+   if (!hwif)
+   continue;
+
//
// allocate a permanent save area, if not already allocated
//
@@ -337,23 +334,31 @@ static int sc1200_suspend (struct pci_dev *dev, pm_message_t state)
 
pci_disable_device(dev);
pci_set_power_state(dev, pci_choose_state(dev, state));
-   dev->current_state = state.event;
return 0;
 }
 
 static int sc1200_resume (struct pci_dev *dev)
 {
-   ide_hwif_t  *hwif = NULL;
+   struct sc1200_ifs   *ifs = pci_get_drvdata(dev);
+   ide_hwif_t  *hwif;
+   int rc, i;
+
+   rc = pci_enable_device(dev);
+   if (rc)
+   return rc;
 
-   pci_set_power_state(dev, PCI_D0);   // bring chip back from sleep state
-   dev->current_state = PM_EVENT_ON;
-   pci_enable_device(dev);
//
// loop over all interfaces that are part of this pci device:
//
-   while ((hwif = lookup_pci_dev(hwif, dev)) != NULL) {
+   for (i = 0; i < SC1200_IFS; i++) {
unsigned intbasereg, r;
-   sc1200_saved_state_t*ss = (sc1200_saved_state_t *)hwif->config_data;
+   sc1200_saved_state_t*ss;
+
+   hwif = ifs->iface[i];
+   if (!hwif)
+   continue;
+
+   ss = (sc1200_saved_state_t *)hwif->config_data;
 
//
// Restore timing registers:  this may be unnecessary if BIOS also does it
@@ -411,7 +416,22 

Re: [rfc][patch 3/3] x86: optimise barriers

2007-10-15 Thread Nick Piggin
On Mon, Oct 15, 2007 at 11:10:00AM +0200, Jarek Poplawski wrote:
> On Mon, Oct 15, 2007 at 10:09:24AM +0200, Nick Piggin wrote:
> ...
> > Has performance really been much problem for you? (even before the
> > lfence instruction, when you theoretically had to use a locked op)?
> > I mean, I'd struggle to find a place in the Linux kernel where there
> > is actually a measurable difference anywhere... and we're pretty
> > performance critical and I think we have a reasonable amount of lockless
> > code (I guess we may not have a lot of tight computational loops, though).
> > I'd be interested to know what, if any, application had found these
> > barriers to be problematic...
> 
> I'm not performance-words at all, so I can't help you, sorry. But, I
> understand people who care about this, and think there is a popular
> conviction barriers and locked instructions are costly, so I'm
> surprised there is any "problem" now with finding these gains...

It's more expensive than nothing, sure. However in real code, algorithmic
complexity, cache misses and cacheline bouncing tend to be much bigger
issues.

I can't think of a place in the kernel where smp_rmb matters _that_ much.
seqlocks maybe (timers, dcache lookup), vmscan... Obviously removing the
lfence is not going to hurt. Maybe we even gain 0.01% performance in
someone's workload.

Also, remember: if loads are already in-order, then lfence is a noop,
right? (in practice it seems to have to do a _little_ bit of work, but
it's like a dozen cycles).
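
For reference, this is roughly where the read barrier sits in a
seqlock-style reader, written as a userspace sketch with C11 atomics rather
than the kernel primitives.  It assumes a single writer; struct seq_pair and
the function names are invented for the example, and the acquire fence only
plays the role that smp_rmb() plays in the kernel.

```c
#include <stdatomic.h>

struct seq_pair {
	atomic_uint seq;	/* odd while a write is in progress */
	atomic_int a;
	atomic_int b;
};

void seq_pair_init(struct seq_pair *p)
{
	atomic_init(&p->seq, 0);
	atomic_init(&p->a, 0);
	atomic_init(&p->b, 0);
}

/* Single writer assumed: bump to odd, publish the data, bump to even. */
void seq_write(struct seq_pair *p, int a, int b)
{
	unsigned int s = atomic_load_explicit(&p->seq, memory_order_relaxed);

	atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* write barrier */
	atomic_store_explicit(&p->a, a, memory_order_relaxed);
	atomic_store_explicit(&p->b, b, memory_order_relaxed);
	atomic_store_explicit(&p->seq, s + 2, memory_order_release);
}

/* Reader retries until it sees the same even sequence before and after. */
void seq_read(struct seq_pair *p, int *a, int *b)
{
	unsigned int s1, s2;

	do {
		s1 = atomic_load_explicit(&p->seq, memory_order_acquire);
		*a = atomic_load_explicit(&p->a, memory_order_relaxed);
		*b = atomic_load_explicit(&p->b, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);	/* read barrier */
		s2 = atomic_load_explicit(&p->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);
}
```

The read barrier is one operation among several loads and a possible retry,
which is why I would expect its cost to disappear in the noise for most
callers.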


> > The thing is that those documents are not defining what a particular
> > implementation does, but how the architecture is defined (ie. what
> > must some arbitrary software/hardware provide and what may it expect).
> 
> I'm not sure this is the right way to tell it. If there is no
> distinction between what is and what could be, how can I believe in
> similar Alpha or Itanium stuff? IMHO, these manuals sometimes look
> like they describe some real hardware mechanisms, and sometimes they
> mention about possible changes and reserved features too. So, when
> they don't mention you could think it's a present behavior.

No. Why are you reading that much into it? I know for a fact that the
actual implementations of some non-x86 architectures have stronger
ordering than their ISA requires. It's nothing to do with you "believing"
how the hardware works. That's not what the document is describing (directly).


> > It's pretty natural that Intel started out with a weaker guarantee
> > than their CPUs of the time actually supported, and tightened it up
> > after (presumably) deciding not to implement such relaxed semantics
> > for the foreseeable future.
> 
> As a matter of fact it's not natural for me at all. I expected the
> other direction, and I still doubt programmers' intentions could be
> "automatically" predicted good enough, so IMHO, it's not for long.

Really? Consider the consequences if, instead of releasing this latest
document tightening consistency, Intel found that out of order loads
were worth 5% more performance and implemented them in their next chip.
The chip could be completely backwards compatible, but all your old code
would break, because it was broken to begin with (because it was outside
the spec).

IMO Intel did exactly the right thing from an engineering perspective,
and so did Linux to always follow the spec.


Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces

2007-10-15 Thread Dave Hansen
On Mon, 2007-10-15 at 19:35 -0500, Matt Mackall wrote:
> Perhaps we need something like:
> 
> flags = page->flags;
> userflags = 
>   FLAG_BIT(USER_REFERENCED, flags & PG_referenced) |
>   ...
> 
> etc. for the flags we want to export. This will let us change to
> 
>  FLAG_BIT(USER_SLAB, PageSlab(page)) |
> 
> if we make a virtual slab bit.

Yeah, that looks like a pretty sane scheme.  Do we want to be any more
abstract about it?  Perhaps instead of USER_SLAB, it should be
USER_KERNEL_INTERNAL, or USER_KERNEL_USE.  The slab itself is going away
as we speak. :)
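
Just to pin the idea down, a userspace sketch of the translation step.  All
bit numbers, the USER_* values and export_page_flags() are hypothetical, and
FLAG_BIT() is only one possible definition since none has been posted.  Note
that the condition here tests a mask built from the bit number.

```c
#include <stdint.h>

/* Hypothetical internal bit numbers; they may change between kernels. */
enum {
	PG_REFERENCED_SKETCH = 2,
	PG_SLAB_SKETCH = 7,
};

/* Hypothetical exported bit numbers; these would be the stable ABI. */
enum {
	USER_REFERENCED = 0,
	USER_SLAB = 1,
};

/* Set the exported bit if and only if the condition is non-zero. */
#define FLAG_BIT(user_bit, cond) \
	((uint64_t)((cond) ? 1 : 0) << (user_bit))

uint64_t export_page_flags(unsigned long flags)
{
	return FLAG_BIT(USER_REFERENCED,
			flags & (1UL << PG_REFERENCED_SKETCH)) |
	       FLAG_BIT(USER_SLAB,
			flags & (1UL << PG_SLAB_SKETCH));
}
```

Swapping the second condition for a predicate such as PageSlab(page) would
not change what userspace sees.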

> And it shows up in grep.
> 
> Unfortunately, i386 test_bit is an asm inline and not a macro so we
> can't hope for the compiler to fold up a bunch of identity bit
> mappings for us. 

For the bits that we want to export, we could also add the unoptimized
access functions for any that don't already have them:

#define __ClearPageReserved(page)   __clear_bit(PG_reserved, &(page)->flags)

Anybody changing bit behavior will certainly go check all of the
callers, such as ClearPageReserved() *and* __ClearPageReserved().

-- Dave



Re: WANTED: kernel projects for CS students

2007-10-15 Thread david

On Mon, 15 Oct 2007, Zan Lynx wrote:


On Sun, 2007-10-14 at 19:01 -0400, Rik van Riel wrote:

The kernel newbies community often gets inquiries from CS students who
need a project for their studies and would like to do something with
the Linux kernel, but would also like their code to be useful to the
community afterwards.

In order to make it easier for them, I am trying to put together a
page with projects that:
- Are self contained enough that the students can implement the
  project by themselves, since that is often a university requirement.
- Are self contained enough that Linux could merge the code (maybe
  with additional changes) after the student has been working on it
  for a few months.
- Are large enough to qualify as a student project, luckily there is
  flexibility here since we get inquiries for anything from 6 week
  projects to 6 month projects.

If you have ideas on what projects would be useful, please add them
to this page (or email me):

http://kernelnewbies.org/KernelProjects


How about this in the Device Mapper raid-1/mirror code?
/* FIXME: add read balancing */

That comment has been in there for many releases.  I've wanted read
balancing for several servers and had all sorts of ideas about it, like
adding functions to the underlying device queues to return a "queuing
cost" to determine which is the best queue to add the read request.  I
think that could work better for queues like CFQ than the MD
closest-head.

An implementation would also need to be benchmarked against the MD
raid-1.

Along with the time to submit it to LKML, get it reviewed and polish it
up, it might make a good student project.
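
A rough user-space sketch of the "queuing cost" idea described above. Everything in it is an assumption made for illustration: the struct, the function names, and the cost formula (seek distance plus a fixed penalty per queued request) are hypothetical and are not taken from the device-mapper or MD code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical weight of one already-queued request, in "sectors". */
#define SKETCH_QUEUE_PENALTY 64UL

/* Hypothetical per-mirror state a queue could report. */
struct sketch_mirror {
    unsigned int queued;      /* requests already waiting on this mirror */
    unsigned long head_pos;   /* last sector this mirror serviced */
};

/* Assumed cost model: seek distance plus a penalty per queued request. */
static unsigned long sketch_cost(const struct sketch_mirror *m,
                                 unsigned long sector)
{
    unsigned long dist = m->head_pos > sector ? m->head_pos - sector
                                              : sector - m->head_pos;

    return dist + (unsigned long)m->queued * SKETCH_QUEUE_PENALTY;
}

/* Return the index of the mirror with the lowest cost for this read. */
static size_t sketch_pick_mirror(const struct sketch_mirror *mirrors,
                                 size_t n, unsigned long sector)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (sketch_cost(&mirrors[i], sector) <
            sketch_cost(&mirrors[best], sector))
            best = i;
    return best;
}
```

A pure closest-head policy corresponds to setting the penalty to zero; the penalty term is what lets a busy but nearby mirror lose to an idle one.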


another couple of raid enhancements would be:

1. teach the system that a raid456 stripe is handled most efficiently if 
treated as a single block of data


by this I mean that if you read one block from the stripe the system reads 
the entire stripe, so it should take this into account when doing 
read-ahead and not always throw away most of the data it read because it's 
outside the current readahead window (if nothing else, look at putting it 
on the tail of the LRU list instead of just forgetting it)


if you write one block of the stripe the system must read the stripe, then 
update two blocks of the stripe (the data block and the parity block), but 
if you are going to write the entire stripe out you can ignore whatever's 
there and just calculate the parity block from the data you are writing. 
this should make writing to a raid456 stripe as fast as writing to a raid0 
stripe (well, almost, you have one more block to write).
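
To make the difference between the two write paths concrete, here is a minimal user-space sketch of XOR parity as used by raid5. It is not the md code: the function names and the tiny block size are made up for illustration, and raid6's second syndrome is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 8   /* illustrative only; real chunks are far larger */

/*
 * Full-stripe write: parity is the XOR of the new data blocks alone.
 * "data" holds ndata blocks back to back. Nothing is read from disk.
 */
static void parity_full_stripe(const uint8_t *data, size_t ndata,
                               uint8_t parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (size_t d = 0; d < ndata; d++)
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[d * BLOCK_SIZE + i];
}

/*
 * Single-block update (read-modify-write): the old data block and the
 * old parity have to be read first so that the old contribution can be
 * XORed out and the new one XORed in.
 */
static void parity_update_one(const uint8_t old_data[BLOCK_SIZE],
                              const uint8_t new_data[BLOCK_SIZE],
                              uint8_t parity[BLOCK_SIZE])
{
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```

Both paths end with the same parity block; the full-stripe path simply gets there without the two reads.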



2. not directly a kernel project, create userspace tools that make 
managing raid and partitioning on linux as easy as the zfs tools



3. there is currently the ability to grow a raid56 array by adding a disk, 
but there is not the ability to take a raid5 array, add a disk and make 
the result a raid6 array.


David Lang


Re: More Large blocksize benchmarks

2007-10-15 Thread Christoph Lameter
On Mon, 15 Oct 2007, Chris Mason wrote:

> Dave reported that XFS saw much higher write throughput with large
> blocksizes, but so far I'm seeing the most benefits during reads.

Dave's tests were done with an early large blocksize patchset that had 
issues with readahead. More recent versions have the fixes by Fengguang 
that address the issue.


Re: [PATCH 2/11] maps3: introduce task_size_of for all arches

2007-10-15 Thread Dave Hansen
David,

All of your comments looked pretty valid to me.  I've refreshed that
patch.

I haven't even compile-tested this so there may be some fat fingering
somewhere.  I'll run compile tests on it now.

-- Dave


For the /proc/<pid>/pagemap code[1], we need to be able to query how
much virtual address space a particular task has.  The trick is
that we do it through /proc and can't use TASK_SIZE since it
references "current" on some arches.  The process opening the
/proc file might be a 32-bit process opening a 64-bit process's
pagemap file.

x86_64 already has a TASK_SIZE_OF() macro:

#define TASK_SIZE_OF(child) ((test_tsk_thread_flag(child, TIF_IA32)) ? \
				IA32_PAGE_OFFSET : TASK_SIZE64)

I'd like to have that for other architectures.  So, add it
for all the architectures that actually use "current" in 
their TASK_SIZE.  For the others, just add a quick #define
in sched.h to use plain old TASK_SIZE.

1. http://www.linuxworld.com/news/2007/042407-kernel.html

- MIPS portion from Ralf Baechle <[EMAIL PROTECTED]>

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Ralf Baechle <[EMAIL PROTECTED]>
Signed-off-by: Matt Mackall <[EMAIL PROTECTED]>
---

 lxc-dave/include/asm-ia64/processor.h    |    3 ++-
 lxc-dave/include/asm-mips/processor.h    |    4 ++++
 lxc-dave/include/asm-parisc/processor.h  |    3 ++-
 lxc-dave/include/asm-powerpc/processor.h |    3 ++-
 lxc-dave/include/asm-s390/processor.h    |    3 ++-
 lxc-dave/include/linux/sched.h           |    4 ++++
 6 files changed, 16 insertions(+), 4 deletions(-)

diff -puN include/asm-ia64/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-ia64/processor.h
--- lxc/include/asm-ia64/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches	2007-10-15 17:29:22.0 -0700
+++ lxc-dave/include/asm-ia64/processor.h	2007-10-15 17:29:22.0 -0700
@@ -31,7 +31,8 @@
  * each (assuming 8KB page size), for a total of 8TB of user virtual
  * address space.
  */
-#define TASK_SIZE  (current->thread.task_size)
+#define TASK_SIZE_OF(tsk)  ((tsk)->thread.task_size)
+#define TASK_SIZE  TASK_SIZE_OF(current)
 
 /*
  * This decides where the kernel will search for a free chunk of vm
diff -puN include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-mips/processor.h
--- lxc/include/asm-mips/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches	2007-10-15 17:29:22.0 -0700
+++ lxc-dave/include/asm-mips/processor.h	2007-10-15 17:34:12.0 -0700
@@ -45,6 +45,8 @@ extern unsigned int vced_count, vcei_cou
  * space during mmap's.
  */
 #define TASK_UNMAPPED_BASE (PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_SIZE_OF(tsk)  \
+   (test_tsk_thread_flag(tsk, TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
 #endif
 
 #ifdef CONFIG_64BIT
@@ -65,6 +67,8 @@ extern unsigned int vced_count, vcei_cou
 #define TASK_UNMAPPED_BASE \
(test_thread_flag(TIF_32BIT_ADDR) ? \
PAGE_ALIGN(TASK_SIZE32 / 3) : PAGE_ALIGN(TASK_SIZE / 3))
+#define TASK_SIZE_OF(tsk)  \
+   (test_tsk_thread_flag(tsk, TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
 #endif
 
 #define NUM_FPU_REGS   32
diff -puN include/asm-parisc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-parisc/processor.h
--- lxc/include/asm-parisc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches	2007-10-15 17:29:22.0 -0700
+++ lxc-dave/include/asm-parisc/processor.h	2007-10-15 17:31:39.0 -0700
@@ -32,7 +32,8 @@
 #endif
 #define current_text_addr() ({ void *pc; current_ia(pc); pc; })
 
-#define TASK_SIZE   (current->thread.task_size)
+#define TASK_SIZE_OF(tsk)   ((tsk)->thread.task_size)
+#define TASK_SIZE  TASK_SIZE_OF(current)
 #define TASK_UNMAPPED_BASE  (current->thread.map_base)
 
 #define DEFAULT_TASK_SIZE32(0xFFF0UL)
diff -puN include/asm-powerpc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches include/asm-powerpc/processor.h
--- lxc/include/asm-powerpc/processor.h~PATCH_2_11_maps3-_introduce_task_size_of_for_all_arches	2007-10-15 17:29:22.0 -0700
+++ lxc-dave/include/asm-powerpc/processor.h	2007-10-15 17:32:00.0 -0700
@@ -99,8 +99,9 @@ extern struct task_struct *last_task_use
  */
 #define TASK_SIZE_USER32 (0x0001UL - (1*PAGE_SIZE))
 
-#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
+#define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_32BIT) ? \
TASK_SIZE_USER32 : TASK_SIZE_USER64)
+#define TASK_SIZE TASK_SIZE_OF(current)
 
 /* This decides where the kernel will search for a free chunk of vm
  * space during mmap's.
diff -puN 

Re: WANTED: kernel projects for CS students

2007-10-15 Thread david

On Mon, 15 Oct 2007, Mark Gross wrote:


On Sun, Oct 14, 2007 at 07:01:28PM -0400, Rik van Riel wrote:

The kernel newbies community often gets inquiries from CS students who
need a project for their studies and would like to do something with
the Linux kernel, but would also like their code to be useful to the
community afterwards.

In order to make it easier for them, I am trying to put together a
page with projects that:
- Are self contained enough that the students can implement the
  project by themselves, since that is often a university requirement.
- Are self contained enough that Linux could merge the code (maybe
  with additional changes) after the student has been working on it
  for a few months.
- Are large enough to qualify as a student project, luckily there is
  flexibility here since we get inquiries for anything from 6 week
  projects to 6 month projects.

If you have ideas on what projects would be useful, please add them
to this page (or email me):

http://kernelnewbies.org/KernelProjects


Is there already a make config option that will do a good job at setting
a default .config file based on what is already running on a system?

I get tired of trimming down my .config for my laptop build so it takes
less than 30min to build a kernel.

Bonus credit to additional "expert" options (like those powertop puts
out) for target uses, laptop, HPC, home file share, embedded targets

Oh, and let's make the expert configs easily extensible.


another config thing that would be nice would be to take something like 
Rob Landley's miniconfig tool and make it work well enough to be 
integrated (it creates a version of .config that only contains the things 
that need to be set, not everything that's at a default that doesn't make 
any difference)


David Lang


Re: [PATCH 10/11] maps3: add /proc/kpagecount and /proc/kpageflags interfaces

2007-10-15 Thread Matt Mackall
On Mon, Oct 15, 2007 at 04:34:57PM -0700, Dave Hansen wrote:
> On Mon, 2007-10-15 at 18:11 -0500, Matt Mackall wrote:
> > > Could we just have /proc/kpagereferenced?  Is there a legitimate need
> > > for other flags to be visible?
> > 
> > Referenced, dirty, uptodate, lru, active, slab, writeback, reclaim,
> > and buddy all look like they might be interesting to me from the point
> > of view of watching what's happening in the VM graphically in
> > real-time.
> 
> This is true, but it forces a lot of logic from the kernel to be run in
> userspace to figure out what is going on.  Looking at mainline today:
> 
> #define PG_reclaim  17  /* To be reclaimed asap */
> ...
> #define PG_readahead  PG_reclaim /* Reminder to do async read-ahead */
> 
> All of a sudden, to figure out which flag it actually is, we need to
> have all of the logic that the kernel does.  
> 
> Does this establish a fixed user<->kernel ABI that will keep us from
> doing this in the future:
> 
> -#define PG_slab  7  /* slab debug (Suparna wants this) */
> +#define PG_slab  14  /* slab debug (Suparna wants this) */
> 
> Or, even something like this:
> 
> -#define PageSlab(page)  test_bit(PG_slab, &(page)->flags)
> +#define PageSlab(page)  (!PageLRU(page) && !PageHighmem(page))

Yeah, there are a bunch of flags that aren't mutually exclusive and we
could probably recover a few.

> If we actually had several (or even still one file) that exposed this
> state, independent of the actual content of page->flags, I think we'd be
> better off.  I think that's the difference between a fun, super-useful
> debugging feature and one that can stay in mainline and have
> applications stay using it (without breaking) for a long time.
> 
> The flags you listed are things that I would imagine will always exist,
> logically.  But, we might not always have a specific page flag for pages
> under writeback or in the buddy list for that matter.  PG_buddy isn't
> that old.  Perhaps that would be better abstracted to something like
> page_in_main_allocator().

Perhaps we need something like:

flags = page->flags;
userflags = 
  FLAG_BIT(USER_REFERENCED, flags & PG_referenced) |
  ...

etc. for the flags we want to export. This will let us change to

 FLAG_BIT(USER_SLAB, PageSlab(page)) |

if we make a virtual slab bit.

And it shows up in grep.

Unfortunately, i386 test_bit is an asm inline and not a macro so we
can't hope for the compiler to fold up a bunch of identity bit
mappings for us.
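
A self-contained sketch of what such a mapping could look like, assuming FLAG_BIT(bit, cond) simply places a truth value at an exported bit position. The bit numbers, the USER_* names, and the helper functions here are hypothetical and stand in for whatever the kernel would actually define.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical internal bit numbers; these may change between kernels. */
enum { PG_referenced = 2, PG_dirty = 4, PG_slab = 7 };

/* Hypothetical exported bit numbers; these would be the stable ABI. */
enum { USER_REFERENCED = 0, USER_DIRTY = 1, USER_SLAB = 2 };

/* Place a boolean condition at the given exported bit position. */
#define FLAG_BIT(user_bit, cond) \
    (((uint64_t)((cond) ? 1 : 0)) << (user_bit))

/* Plain-C stand-in for test_bit() on a flags word. */
static int sketch_test_flag(unsigned long flags, int bit)
{
    return (int)((flags >> bit) & 1UL);
}

/*
 * Translate internal flags to the exported layout. Each line can later
 * be swapped for an arbitrary predicate without changing the ABI.
 */
static uint64_t sketch_export_flags(unsigned long flags)
{
    return FLAG_BIT(USER_REFERENCED, sketch_test_flag(flags, PG_referenced)) |
           FLAG_BIT(USER_DIRTY,      sketch_test_flag(flags, PG_dirty)) |
           FLAG_BIT(USER_SLAB,       sketch_test_flag(flags, PG_slab));
}
```

Because userspace only ever sees the USER_* positions, renumbering PG_slab or replacing it with a computed predicate touches one line of the translation and nothing else.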


-- 
Mathematics is the supreme nostalgia of our time.


Re: [PATCH] Make m68k cross compile like every other architecture.

2007-10-15 Thread Rob Landley
On Monday 15 October 2007 3:25:35 pm Geert Uytterhoeven wrote:
> 64-bit parisc tests if /usr/bin/hppa64-linux-gnu- exists.
> If yes, it sets CROSS_COMPILE to hppa64-linux-gnu-.
> If no, it sets CROSS_COMPILE to hppa64-linux-
>
> 32-bit parisc unconditionally sets CROSS_COMPILE to hppa-linux-.
>
> This still breaks Rob's setup if his compiler is called differently.

Another thing to take into account is that kconfig was recently changed to 
save ARCH and CROSS_COMPILE in the .config file:

http://lwn.net/Articles/253889/

Presumably that means you'd only have to specify your arch and cross compiler 
during config, and then if you re-used that config it would re-use those 
settings.  But the existing makefile discards anything that isn't explicitly 
overridden on the make command line at each stage of the build.

It seems to me any fix should only reset CROSS_COMPILE if there isn't already 
a value for it.  (Otherwise there's a potentially subtle bug where a year 
from now you might have "m68k-linux-gnu-gcc" and "m68k-linux-gnu-pcc" and 
want to compare the results of building with the different compilers.)

I still lean towards considering any attempt to cross compile without setting 
CROSS_COMPILE an error, and not guessing at what the user meant.  But perhaps 
that's just personal preference...

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.


Re: [PATCH 11/11] maps3: make page monitoring /proc file optional

2007-10-15 Thread Matt Mackall
On Tue, Oct 16, 2007 at 10:03:39AM +1000, Rusty Russell wrote:
> On Tuesday 16 October 2007 08:51:17 Jeremy Fitzhardinge wrote:
> > Dave Hansen wrote:
> > > On Mon, 2007-10-15 at 17:26 -0500, Matt Mackall wrote:
> > >> +config PROC_PAGE_MONITOR
> > >> +   default y
> > >> +   bool "Enable /proc page monitoring" if EMBEDDED && PROC_FS && MMU
> > >> +   help
> > >> +     Various /proc files exist to monitor process memory utilization:
> > >> +     /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
> > >> +     /proc/kpagecount, and /proc/kpageflags.  Disabling these
> > >> +     interfaces will reduce the size of the kernel by approximately 4kb.
> > >
> > > How about pulling the EMBEDDED off there?  I certainly want it for
> > > non-embedded reasons. ;)
> >
> > That means it will only bother asking you if you've set EMBEDDED;
> > otherwise it's always on.
> 
> But it's at the least confusing.  Surely this option should depend on MMU and 
> PROC_FS, and the prompt depend on EMBEDDED?
> 
> That might be implied by the Kconfig layout, but AFAICT this patch removed
> the explicit MMU dependency.
> 
> Rusty.

Wasn't this your patch? You're right, it ought to say "depends PROC_FS
&& MMU". Will fix.

-- 
Mathematics is the supreme nostalgia of our time.


[PATCH 3/4] docbook: fix usb content

2007-10-15 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

Fix USB docbook warnings.

Warning(linux-2.6.23-git8//include/linux/usb/gadget.h:487): No description found for parameter 'g'
Warning(linux-2.6.23-git8//include/linux/usb/gadget.h:506): No description found for parameter 'g'

Warning(linux-2.6.23-git8//drivers/usb/core/hub.c:1416): No description found for parameter 'usb_dev'

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 drivers/usb/core/hub.c |6 +-
 include/linux/usb/gadget.h |4 ++--
 2 files changed, 7 insertions(+), 3 deletions(-)

--- linux-2.6.23-git8.orig/include/linux/usb/gadget.h
+++ linux-2.6.23-git8/include/linux/usb/gadget.h
@@ -481,7 +481,7 @@ static inline void *get_gadget_data (str
 
 /**
  * gadget_is_dualspeed - return true iff the hardware handles high speed
- * @gadget: controller that might support both high and full speeds
+ * @g: controller that might support both high and full speeds
  */
 static inline int gadget_is_dualspeed(struct usb_gadget *g)
 {
@@ -497,7 +497,7 @@ static inline int gadget_is_dualspeed(st
 
 /**
  * gadget_is_otg - return true iff the hardware is OTG-ready
- * @gadget: controller that might have a Mini-AB connector
+ * @g: controller that might have a Mini-AB connector
  *
  * This is a runtime test, since kernels with a USB-OTG stack sometimes
  * run on boards which only have a Mini-B (or Mini-A) connector.
--- linux-2.6.23-git8.orig/drivers/usb/core/hub.c
+++ linux-2.6.23-git8/drivers/usb/core/hub.c
@@ -1407,7 +1407,11 @@ fail:
 
 
 /**
- * Similar to usb_disconnect()
+ * usb_deauthorize_device - deauthorize a device (usbcore-internal)
+ * @usb_dev: USB device
+ *
+ * Move the USB device to a very basic state where interfaces are disabled
+ * and the device is in fact unconfigured and unusable.
  *
  * We share a lock (that we have) with device_del(), so we need to
  * defer its call.


[PATCH 4/4] docbook: fix filesystems content

2007-10-15 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

Fix filesystems docbook warnings.

Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'name'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'mode'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'parent'
Warning(linux-2.6.23-git8//fs/debugfs/file.c:241): No description found for parameter 'value'
Warning(linux-2.6.23-git8//include/linux/jbd.h:404): No description found for parameter 'h_lockdep_map'

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 fs/debugfs/file.c   |   41 +++--
 include/linux/jbd.h |1 +
 2 files changed, 36 insertions(+), 6 deletions(-)

--- linux-2.6.23-git8.orig/fs/debugfs/file.c
+++ linux-2.6.23-git8/fs/debugfs/file.c
@@ -227,15 +227,24 @@ DEFINE_SIMPLE_ATTRIBUTE(fops_x16, debugf
 
 DEFINE_SIMPLE_ATTRIBUTE(fops_x32, debugfs_u32_get, debugfs_u32_set, "0x%08llx\n");
 
-/**
- * debugfs_create_x8 - create a debugfs file that is used to read and write an unsigned 8-bit value
- * debugfs_create_x16 - create a debugfs file that is used to read and write an unsigned 16-bit value
- * debugfs_create_x32 - create a debugfs file that is used to read and write an unsigned 32-bit value
+/*
+ * debugfs_create_x{8,16,32} - create a debugfs file that is used to read and write an unsigned {8,16,32}-bit value
  *
- * These functions are exactly the same as the above functions, (but use a hex
- * output for the decimal challenged) for details look at the above unsigned
+ * These functions are exactly the same as the above functions (but use a hex
+ * output for the decimal challenged). For details look at the above unsigned
  * decimal functions.
  */
+
+/**
+ * debugfs_create_x8 - create a debugfs file that is used to read and write an unsigned 8-bit value
+ * @name: a pointer to a string containing the name of the file to create.
+ * @mode: the permission that the file should have
+ * @parent: a pointer to the parent dentry for this file.  This should be a
+ *  directory dentry if set.  If this parameter is %NULL, then the
+ *  file will be created in the root of the debugfs filesystem.
+ * @value: a pointer to the variable that the file should read to and write
+ * from.
+ */
 struct dentry *debugfs_create_x8(const char *name, mode_t mode,
 struct dentry *parent, u8 *value)
 {
@@ -243,6 +252,16 @@ struct dentry *debugfs_create_x8(const c
 }
 EXPORT_SYMBOL_GPL(debugfs_create_x8);
 
+/**
+ * debugfs_create_x16 - create a debugfs file that is used to read and write an unsigned 16-bit value
+ * @name: a pointer to a string containing the name of the file to create.
+ * @mode: the permission that the file should have
+ * @parent: a pointer to the parent dentry for this file.  This should be a
+ *  directory dentry if set.  If this parameter is %NULL, then the
+ *  file will be created in the root of the debugfs filesystem.
+ * @value: a pointer to the variable that the file should read to and write
+ * from.
+ */
 struct dentry *debugfs_create_x16(const char *name, mode_t mode,
 struct dentry *parent, u16 *value)
 {
@@ -250,6 +269,16 @@ struct dentry *debugfs_create_x16(const 
 }
 EXPORT_SYMBOL_GPL(debugfs_create_x16);
 
+/**
+ * debugfs_create_x32 - create a debugfs file that is used to read and write an unsigned 32-bit value
+ * @name: a pointer to a string containing the name of the file to create.
+ * @mode: the permission that the file should have
+ * @parent: a pointer to the parent dentry for this file.  This should be a
+ *  directory dentry if set.  If this parameter is %NULL, then the
+ *  file will be created in the root of the debugfs filesystem.
+ * @value: a pointer to the variable that the file should read to and write
+ * from.
+ */
 struct dentry *debugfs_create_x32(const char *name, mode_t mode,
 struct dentry *parent, u32 *value)
 {
--- linux-2.6.23-git8.orig/include/linux/jbd.h
+++ linux-2.6.23-git8/include/linux/jbd.h
@@ -372,6 +372,7 @@ struct jbd_revoke_table_s;
  * @h_sync: flag for sync-on-close
  * @h_jdata: flag to force data journaling
  * @h_aborted: flag indicating fatal error on handle
+ * @h_lockdep_map: lockdep info for debugging lock problems
  **/
 
 /* Docbook can't yet cope with the bit fields, but will leave the documentation


[PATCH 2/4] docbook: fix libata content

2007-10-15 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

Fix libata docbook warnings.

Warning(linux-2.6.23-git8//drivers/ata/libata-scsi.c:3251): No description found for parameter 'dev'

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 drivers/ata/libata-scsi.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.23-git8.orig/drivers/ata/libata-scsi.c
+++ linux-2.6.23-git8/drivers/ata/libata-scsi.c
@@ -3239,7 +3239,7 @@ static void ata_scsi_handle_link_detach(
 
 /**
  * ata_scsi_media_change_notify - send media change event
- * @atadev: Pointer to the disk device with media change event
+ * @dev: Pointer to the disk device with media change event
  *
  * Tell the block layer to send a media change notification
  * event.


[PATCH 1/4] docbook: fix kernel-api content

2007-10-15 Thread Randy Dunlap
From: Randy Dunlap <[EMAIL PROTECTED]>

Fix kernel-api docbook warnings.

Warning(linux-2.6.23-git8//drivers/message/fusion/mptscsih.c:2618): No description found for parameter 'sc'

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 drivers/message/fusion/mptscsih.c |   10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

--- linux-2.6.23-git8.orig/drivers/message/fusion/mptscsih.c
+++ linux-2.6.23-git8/drivers/message/fusion/mptscsih.c
@@ -2605,14 +2605,10 @@ mptscsih_set_scsi_lookup(MPT_ADAPTER *io
 }
 
 /**
- * SCPNT_TO_LOOKUP_IDX
- *
- * search's for a given scmd in the ScsiLookup[] array list
- *
+ * SCPNT_TO_LOOKUP_IDX - searches for a given scmd in the ScsiLookup[] array list
  * @ioc: Pointer to MPT_ADAPTER structure
- * @scmd: scsi_cmnd pointer
- *
- **/
+ * @sc: scsi_cmnd pointer
+ */
 static int
 SCPNT_TO_LOOKUP_IDX(MPT_ADAPTER *ioc, struct scsi_cmnd *sc)
 {


More Large blocksize benchmarks

2007-10-15 Thread Chris Mason
Hello everyone,

I'm stealing the cc list and reviving an old thread because I've
finally got some numbers to go along with the Btrfs variable blocksize
feature.  The basic idea is to create a read/write interface to
map a range of bytes on the address space, and use it in Btrfs for all
metadata operations (file operations have always been extent based).

So, instead of casting buffer_head->b_data to some structure, I read and
write at offsets in a struct extent_buffer.  The extent buffer is very
small and backed by an address space, and I get large block sizes the
same way file_write gets to write to 16k at a time, by finding the
appropriate page in the address space.  This is an oversimplification
since I try to cache these mapping decisions to avoid using too much
CPU, but hopefully you get the idea.
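
As an illustration of the access pattern described above, here is a minimal user-space sketch of reading and writing a byte range that is backed by separate pages. It is not the Btrfs code: the struct, the function names, and the tiny page size are assumptions chosen so that copies visibly cross a page boundary, and the mapping cache mentioned above is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16   /* tiny on purpose; real pages are 4K+ */

/* Hypothetical stand-in: one logical buffer backed by separate pages. */
struct sketch_extent_buffer {
    unsigned char **pages;    /* pages[i] points at SKETCH_PAGE_SIZE bytes */
    size_t len;               /* total length of the buffer in bytes */
};

/* Copy between a flat user buffer and the paged buffer, page by page. */
static int sketch_copy(const struct sketch_extent_buffer *eb,
                       unsigned char *flat, size_t start, size_t len,
                       int to_pages)
{
    if (start > eb->len || len > eb->len - start)
        return -1;                      /* range falls outside the buffer */

    while (len > 0) {
        size_t page = start / SKETCH_PAGE_SIZE;
        size_t off = start % SKETCH_PAGE_SIZE;
        size_t chunk = SKETCH_PAGE_SIZE - off;   /* bytes left in this page */

        if (chunk > len)
            chunk = len;
        if (to_pages)
            memcpy(eb->pages[page] + off, flat, chunk);
        else
            memcpy(flat, eb->pages[page] + off, chunk);
        flat += chunk;
        start += chunk;
        len -= chunk;
    }
    return 0;
}

/* Read len bytes at logical offset start into dst. */
static int sketch_read(const struct sketch_extent_buffer *eb, void *dst,
                       size_t start, size_t len)
{
    return sketch_copy(eb, dst, start, len, 0);
}

/* Write len bytes from src at logical offset start. */
static int sketch_write(const struct sketch_extent_buffer *eb,
                        const void *src, size_t start, size_t len)
{
    /* Cast away const: sketch_copy only reads from flat when to_pages=1. */
    return sketch_copy(eb, (unsigned char *)src, start, len, 1);
}
```

Callers never see a pointer into the backing store, which is what allows the logical block size to differ from the page size.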

The advantage to this approach is the changes are all inside Btrfs.  No
extra kernel patches were required.

Dave reported that XFS saw much higher write throughput with large
blocksizes, but so far I'm seeing the most benefits during reads.

The next step is a bunch more benchmarks.  I've done the first round
and posted it here:

http://oss.oracle.com/~mason/blocksizes/

The Btrfs code makes it relatively easy to experiment, and so this may
be a good step toward figuring out if some automagic solution is worth
it in general.  I can even use different sizes for nodes and leaves,
although I haven't done much testing at all there yet.

-chris



Re: LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)

2007-10-15 Thread Nick Piggin
On Tue, Oct 16, 2007 at 12:08:01AM +0200, Mikulas Patocka wrote:
> > On Mon, 15 Oct 2007 22:47:42 +0200 (CEST)
> > Mikulas Patocka <[EMAIL PROTECTED]> wrote:
> > 
> > > > According to latest memory ordering specification documents from
> > > > Intel and AMD, both manufacturers are committed to in-order loads
> > > > from cacheable memory for the x86 architecture. Hence, smp_rmb()
> > > > may be a simple barrier.
> > > >
> > > > http://developer.intel.com/products/processor/manuals/318147.pdf 
> > > > http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf
> > > 
> > > Hi
> > > 
> > > I'm just wondering about one thing --- what is LFENCE instruction
> > > good for?
> > > 
> > > SFENCE is for enforcing ordering in write-combining buffers (it
> > > > doesn't make sense in write-back cache mode).
> > > > MFENCE is for preventing stores from moving past loads.
> > > 
> > > But what is LFENCE for? I read the above documents and they already
> > > say that CPUs have ordered loads.
> > > 
> > 
> > The cpus also have an explicit set of instructions that deliberately do 
> > unordered stores/loads, and s/lfence etc are mostly designed for those.
> 
> I know about unordered stores (movnti & similar) --- they basically use 
> write-combining method on memory that is normally write-back --- and they 
> need sfence. But which instruction does an unordered load and needs 
> lfence?

Also, for non-wb memory. I don't think the Intel document referenced
says anything about this, but the AMD document says that loads can pass
loads (page 8, rule b).

This is why our rmb() is still an lfence.



Re: [PATCH 1/11] maps3: add proportional set size accounting in smaps

2007-10-15 Thread Matt Mackall
On Mon, Oct 15, 2007 at 04:36:38PM -0700, David Rientjes wrote:
> On Mon, 15 Oct 2007, Matt Mackall wrote:
> 
> > Index: l/fs/proc/task_mmu.c
> > ===
> > --- l.orig/fs/proc/task_mmu.c   2007-10-14 13:35:31.0 -0500
> > +++ l/fs/proc/task_mmu.c2007-10-14 13:36:56.0 -0500
> > @@ -122,6 +122,27 @@ struct mem_size_stats
> > unsigned long private_clean;
> > unsigned long private_dirty;
> > unsigned long referenced;
> > +
> > +   /*
> > +* Proportional Set Size(PSS): my share of RSS.
> > +*
> > +* PSS of a process is the count of pages it has in memory, where each
> > +* page is divided by the number of processes sharing it.  So if a
> > +* process has 1000 pages all to itself, and 1000 shared with one other
> > +* process, its PSS will be 1500.   - Matt Mackall, lwn.net
> > +*/
> > +   u64   pss;
> > +   /*
> > +* To keep (accumulated) division errors low, we adopt 64bit pss and
> > +* use some low bits for division errors. So (pss >> PSS_DIV_BITS)
> > +* would be the real byte count.
> > +*
> > +* A shift of 12 before division means(assuming 4K page size):
> > +*  - 1M 3-user-pages add up to 8KB errors;
> > +*  - supports mapcount up to 2^24, or 16M;
> > +*  - supports PSS up to 2^52 bytes, or 4PB.
> > +*/
> > +#define PSS_DIV_BITS   12
> >  };
> >  
> 
> I know this gets moved again in the eighth patch of the series, but the 
> #define still has no place inside the struct definition.

Agreed.
 
> The pss is going to need accessor functions, preferably inlined, and the 
> comment adjusted stating that all accesses should be through those 
> functions and not directly to the mem_size_stats struct.
> 
>   static inline u64 pss_up(unsigned long pss)
>   {
>   return pss << PSS_DIV_BITS;
>   }
> 
>   static inline unsigned long pss_down(u64 pss)
>   {
>   return pss >> PSS_DIV_BITS;
>   }

I think that's overkill for something that has exactly one use of each.
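
For readers following the arithmetic, here is a small user-space sketch of the fixed-point accumulation the patch comment describes. The function names are hypothetical and a 4K page size is assumed, as in the comment. One detail worth noting if helpers like the ones proposed above were ever used: the widening to 64 bits has to happen before the left shift, otherwise a 32-bit unsigned long could overflow.

```c
#include <assert.h>
#include <stdint.h>

#define PSS_DIV_BITS 12
#define SKETCH_PAGE_SIZE 4096ULL   /* assumes 4K pages, as the comment does */

/*
 * Add one page's share to the accumulator. The page size is already a
 * 64-bit value here, so the shift is done in 64 bits and only then
 * divided by the number of processes mapping the page.
 */
static void sketch_pss_add_page(uint64_t *pss, unsigned int mapcount)
{
    assert(mapcount > 0);
    *pss += (SKETCH_PAGE_SIZE << PSS_DIV_BITS) / mapcount;
}

/* Drop the fractional bits to get the PSS in bytes. */
static uint64_t sketch_pss_bytes(uint64_t pss)
{
    return pss >> PSS_DIV_BITS;
}
```

With 1000 private pages and 1000 pages shared with one other process, this yields the byte equivalent of 1500 pages, matching the example in the patch comment.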

-- 
Mathematics is the supreme nostalgia of our time.


Re: [RFC] cpuset update_cgroup_cpus_allowed

2007-10-15 Thread Paul Menage

Paul Jackson wrote:

Paul M wrote:

Here's an alternative for consideration, below.


I don't see the alternative -- I just see my patch, with the added
blurbage:

  #12 - /usr/local/google/home/menage/kernel9/linux/kernel/cpuset.c 
  # action=edit type=text

Should I be increasing my caffeine intake?



Bah. Trying again:


Here's an alternative for consideration, below. The main differences are:

- currently against an older kernel with pre-cgroup cpusets, so it uses 
tasklist_lock and do_each_thread(); a cgroup version would use cgroup iterators 
as yours does

- solves the race between sched_setaffinity() and update_cpumask() by having 
sched_setaffinity() check for changes to cpuset_cpus_allowed() after doing 
set_cpus_allowed()

- guarantees to only act on each process once (so guarantees forward progress, 
in the absence of fork bombs), and could be adapted to handle fork bombs too

- uses a priority heap to pick the processes to act on, based on start time

- uses lock_cpu_hotplug() to avoid races with CPU hotplug; sadly I think this 
is gone in more recent kernels, so some other synchronization would be needed
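
To make the heap behaviour concrete, here is a userspace sketch of a bounded insert with the semantics described in the patch's kerneldoc: once full, the heap keeps the smallest max elements and hands back whichever element did not fit. It is not the code from lib/prio_heap.c. Returning NULL while there is still room is an assumption, and int_gt() is a hypothetical comparator for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed-capacity max-heap of pointers; field names follow the patch. */
struct ptr_heap {
	void **ptrs;
	int max;
	int size;
};

/* Hypothetical comparator for the example: compares the pointed-to ints. */
static int int_gt(void *a, void *b)
{
	return *(int *)a > *(int *)b;
}

/*
 * Insert p. While there is room, p is added and NULL is returned
 * (assumed convention). Once full, the heap keeps the 'max' smallest
 * elements seen so far and returns the one that did not fit, which
 * may be p itself.
 */
static void *heap_insert(struct ptr_heap *heap, void *p,
			 int (*gt)(void *, void *))
{
	void **ptrs = heap->ptrs;
	void *res;
	int pos;

	assert(heap->max > 0);

	if (heap->size < heap->max) {
		/* Room left: append, then sift up toward the root. */
		pos = heap->size++;
		while (pos > 0 && gt(p, ptrs[(pos - 1) / 2])) {
			ptrs[pos] = ptrs[(pos - 1) / 2];
			pos = (pos - 1) / 2;
		}
		ptrs[pos] = p;
		return NULL;
	}

	/* Full: a new maximum is rejected without touching the heap. */
	if (gt(p, ptrs[0]))
		return p;

	/* Otherwise evict the root and sift p down from the top. */
	res = ptrs[0];
	pos = 0;
	for (;;) {
		int left = 2 * pos + 1;
		int right = left + 1;
		int big = pos;
		void *cand = p;

		if (left < heap->size && gt(ptrs[left], cand)) {
			big = left;
			cand = ptrs[left];
		}
		if (right < heap->size && gt(ptrs[right], cand))
			big = right;
		if (big == pos)
			break;
		ptrs[pos] = ptrs[big];
		pos = big;
	}
	ptrs[pos] = p;
	return res;
}
```

Ordering by start time means each pass collects the oldest not-yet-processed tasks, so a later pass can skip everything at or before the newest start time already handled.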



   Cause writes to cpuset "cpus" file to update cpus_allowed for
   member tasks:

   - collect batches of tasks under tasklist_lock and then call
 set_cpus_allowed() on them outside the lock (since this can
 sleep).

   - add a simple generic priority heap type to allow efficient
 collection of batches of tasks to be processed without
 duplicating or missing any tasks in subsequent batches.

   - avoid races with hotplug events via lock_cpu_hotplug()

   - make "cpus" file update a no-op if the mask hasn't changed

   - fix race between update_cpumask() and sched_setaffinity() by
 making sched_setaffinity() post-check that it's not
 running on any cpus outside cpuset_cpus_allowed().



include/linux/prio_heap.h |   56 +
kernel/cpuset.c   |  103 --
kernel/sched.c|   13 +
lib/Makefile  |2
lib/prio_heap.c   |   68 ++
5 files changed, 238 insertions(+), 4 deletions(-)
--- /dev/null   1969-12-31 16:00:00.0 -0800
+++ linux/include/linux/prio_heap.h 2007-10-12 16:43:27.0 -0700
@@ -0,0 +1,56 @@
+#ifndef _LINUX_PRIO_HEAP_H
+#define _LINUX_PRIO_HEAP_H
+
+/*
+ * Simple insertion-only static-sized priority heap containing
+ * pointers, based on CLR, chapter 7
+ */
+
+#include 
+
+/**
+ * struct ptr_heap - simple static-sized priority heap
+ * @ptrs - pointer to data area
+ * @max - max number of elements that can be stored in @ptrs
+ * @size - current number of valid elements in @ptrs (in the range 0..@max)
+ */
+struct ptr_heap {
+   void **ptrs;
+   int max;
+   int size;
+};
+
+/**
+ * heap_init - initialize an empty heap with a given memory size
+ * @heap: the heap structure to be initialized
+ * @size: amount of memory to use in bytes
+ * @gfp_mask: mask to pass to kmalloc()
+ */
+extern int heap_init(struct ptr_heap *heap, size_t size, gfp_t gfp_mask);
+
+/**
+ * heap_free - release a heap's storage
+ * @heap: the heap structure whose data should be released
+ */
+void heap_free(struct ptr_heap *heap);
+
+/**
+ * heap_insert - insert a value into the heap and return any overflowed value
+ * @heap: the heap to be operated on
+ * @p: the pointer to be inserted
+ * @gt: comparison operator, which should implement "greater than"
+ *
+ * Attempts to insert the given value into the priority heap. If the
+ * heap is full prior to the insertion, then the resulting heap will
+ * consist of the smallest @max elements of the original heap and the
+ * new element; the greatest element will be removed from the heap and
+ * returned. Note that the returned element will be the new element
+ * (i.e. no change to the heap) if the new element is greater than all
+ * elements currently in the heap.
+ */
+extern void *heap_insert(struct ptr_heap *heap, void *p,
+int (*gt)(void *, void *));
+
+
+
+#endif /* _LINUX_PRIO_HEAP_H */
 linux/kernel/cpuset.c
--- linux/kernel/cpuset.c   2007-10-05 17:46:09.0 -0700
+++ linux/kernel/cpuset.c   2007-10-12 16:24:49.0 -0700
@@ -37,6 +37,7 @@
#include 
#include 
#include 
+#include 
#include 
#include 
#include 
@@ -839,6 +840,36 @@
unlock_cpu_hotplug();
}

+static int inline started_after_time(struct task_struct *t1,
+struct timespec *time,
+struct task_struct *t2)
+{
+   int start_diff = timespec_compare(&t1->start_time, time);
+   if (start_diff > 0) {
+   return 1;
+   } else if (start_diff < 0) {
+   return 0;
+   } else {
+   /*
+* Arbitrarily, if two processes started at the same
+* time, we'll 

Re: [RFC] cpuset update_cgroup_cpus_allowed

2007-10-15 Thread Paul Jackson
Paul M wrote:
> Here's an alternative for consideration, below.

I don't see the alternative -- I just see my patch, with the added
blurbage:

  #12 - /usr/local/google/home/menage/kernel9/linux/kernel/cpuset.c 
  # action=edit type=text

Should I be increasing my caffeine intake?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: LFENCE instruction (was: [rfc][patch 3/3] x86: optimise barriers)

2007-10-15 Thread H. Peter Anvin

Mikulas Patocka wrote:
> I know about unordered stores (movnti & similar) --- they basically use
> write-combining method on memory that is normally write-back --- and they
> need sfence. But which one instruction does unordered load and needs
> lfence?

PREFETCHNTA.

-hpa


[PATCH] ecryptfs: clean up attribute mess

2007-10-15 Thread Greg KH
It isn't that hard to add simple kset attributes, so don't go through
all the gyrations of creating your own object type and show and store
functions.  Just use the functions that are already present.  This makes
things much simpler.

Note, the version_str string violates the "one value per file" rule for
sysfs.  I suggest changing this now (individual files per type supported
is one suggested way.)


Cc: Michael A. Halcrow <[EMAIL PROTECTED]>
Cc: Michael C. Thompson <[EMAIL PROTECTED]>
Cc: Tyler Hicks <[EMAIL PROTECTED]>
Signed-off-by: Greg Kroah-Hartman <[EMAIL PROTECTED]>

---
 fs/ecryptfs/main.c |   88 -
 1 file changed, 20 insertions(+), 68 deletions(-)

--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -689,58 +689,14 @@ static int ecryptfs_init_kmem_caches(voi
return 0;
 }
 
-struct ecryptfs_obj {
-   char *name;
-   struct list_head slot_list;
-   struct kobject kobj;
-};
-
-struct ecryptfs_attribute {
-   struct attribute attr;
-   ssize_t(*show) (struct ecryptfs_obj *, char *);
-   ssize_t(*store) (struct ecryptfs_obj *, const char *, size_t);
-};
-
-static ssize_t
-ecryptfs_attr_store(struct kobject *kobj,
-   struct attribute *attr, const char *buf, size_t len)
-{
-   struct ecryptfs_obj *obj = container_of(kobj, struct ecryptfs_obj,
-   kobj);
-   struct ecryptfs_attribute *attribute =
-   container_of(attr, struct ecryptfs_attribute, attr);
-
-   return (attribute->store ? attribute->store(obj, buf, len) : 0);
-}
+static decl_subsys(ecryptfs, NULL, NULL);
 
-static ssize_t
-ecryptfs_attr_show(struct kobject *kobj, struct attribute *attr, char *buf)
-{
-   struct ecryptfs_obj *obj = container_of(kobj, struct ecryptfs_obj,
-   kobj);
-   struct ecryptfs_attribute *attribute =
-   container_of(attr, struct ecryptfs_attribute, attr);
-
-   return (attribute->show ? attribute->show(obj, buf) : 0);
-}
-
-static struct sysfs_ops ecryptfs_sysfs_ops = {
-   .show = ecryptfs_attr_show,
-   .store = ecryptfs_attr_store
-};
-
-static struct kobj_type ecryptfs_ktype = {
-   .sysfs_ops = &ecryptfs_sysfs_ops
-};
-
-static decl_subsys(ecryptfs, &ecryptfs_ktype, NULL);
-
-static ssize_t version_show(struct ecryptfs_obj *obj, char *buff)
+static ssize_t version_show(struct kset *kset, char *buff)
 {
return snprintf(buff, PAGE_SIZE, "%d\n", ECRYPTFS_VERSIONING_MASK);
 }
 
-static struct ecryptfs_attribute sysfs_attr_version = __ATTR_RO(version);
+static struct subsys_attribute version_attr = __ATTR_RO(version);
 
 static struct ecryptfs_version_str_map_elem {
u32 flag;
@@ -753,7 +709,7 @@ static struct ecryptfs_version_str_map_e
{ECRYPTFS_VERSIONING_XATTR, "metadata in extended attribute"}
 };
 
-static ssize_t version_str_show(struct ecryptfs_obj *obj, char *buff)
+static ssize_t version_str_show(struct kset *kset, char *buff)
 {
int i;
int remaining = PAGE_SIZE;
@@ -780,34 +736,33 @@ out:
return total_written;
 }
 
-static struct ecryptfs_attribute sysfs_attr_version_str = __ATTR_RO(version_str);
+static struct subsys_attribute version_attr_str = __ATTR_RO(version_str);
+
+static struct attribute *attributes[] = {
+   &version_attr.attr,
+   &version_attr_str.attr,
+   NULL,
+};
+
+static struct attribute_group attr_group = {
+   .attrs = attributes,
+};
 
 static int do_sysfs_registration(void)
 {
int rc;
 
-   if ((rc = subsystem_register(&ecryptfs_subsys))) {
-   printk(KERN_ERR
-  "Unable to register ecryptfs sysfs subsystem\n");
-   goto out;
-   }
-   rc = sysfs_create_file(&ecryptfs_subsys.kobj,
-  &sysfs_attr_version.attr);
+   rc = subsystem_register(&ecryptfs_subsys);
if (rc) {
printk(KERN_ERR
-  "Unable to create ecryptfs version attribute\n");
-   subsystem_unregister(&ecryptfs_subsys);
+  "Unable to register ecryptfs sysfs subsystem\n");
goto out;
}
-   rc = sysfs_create_file(&ecryptfs_subsys.kobj,
-  &sysfs_attr_version_str.attr);
+   rc = sysfs_create_group(&ecryptfs_subsys.kobj, &attr_group);
if (rc) {
printk(KERN_ERR
-  "Unable to create ecryptfs version_str attribute\n");
-   sysfs_remove_file(&ecryptfs_subsys.kobj,
- &sysfs_attr_version.attr);
+  "Unable to create ecryptfs version attributes\n");
subsystem_unregister(&ecryptfs_subsys);
-   goto out;
}
 out:
return rc;
@@ -815,10 +770,7 @@ out:
 
 static void do_sysfs_unregistration(void)
 {
-   sysfs_remove_file(&ecryptfs_subsys.kobj,
- &sysfs_attr_version.attr);
-   sysfs_remove_file(&ecryptfs_subsys.kobj,
- &sysfs_attr_version_str.attr);
+   

Re: [git pull] scheduler updates for v2.6.24

2007-10-15 Thread Nick Piggin
On Tuesday 16 October 2007 00:17, Ingo Molnar wrote:
> Linus, please pull the latest scheduler git tree from:
>
>git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git
>
> It contains lots of scheduler updates from lots of people - hopefully
> the last big one for quite some time. Most of the focus was on
> performance (both micro-performance and scalability/balancing), but
> there's the fair-scheduling feature now Kconfig selectable too. Find the
> shortlog below.

Nice work...

However it's a pity all the balancing stuff got wildly changed
in 2.6.23 and then somewhat changed back again now.

Despite appearances, a lot of those things weren't actually
*completely* arbitrary values. I fear that it will make finding
performance regressions harder than it should have...

Anyway.



Re: [PATCH] git scsi misc include fix

2007-10-15 Thread Paul Jackson
James wrote:
> The requirement for struct scatterlist is the same
> before and after the git scsi-misc patch.

Not so.  The git-scsi-misc.patch in 2.6.23-mm1 clearly adds the line:

struct scatterlist sense_sgl;

as part of the added struct scsi_eh_save in scsi/scsi_eh.h.

This bit me while I was doing a bisection on 2.6.23-mm1, for another
problem, in git-sched, which is discussed in the lkml thread:

git-sched patch won't boot on SN arch, 2.6.23-mm1

This is using sn2_defconfig.  The full 2.6.23-mm1 patch set builds ok,
because another patch, git-block.patch as I recall, includes
scatterlist.h some other way, but for the following range of patches in
2.6.23-mm1, on the configuration sn2_defconfig, the build is broken,
due to 'struct scatterlist' being an incomplete type:

git-scsi-misc.patch
git-scsi-misc-include-fix.patch
git-scsi-misc-fixup.patch
qla2xxx-printk-fixes.patch
pci-error-recovery-symbios-scsi-base-support.patch
pci-error-recovery-symbios-scsi-first-failure.patch
nsp32_restart_autoscsi-remove-error-check.patch
scsi-send-media-state-change-modification-events.patch
scsi-early-detection-of-medium-not-present-updated.patch
mptbase-reset-ioc-initiator-during-pci-resume.patch
scsi-use-notifier-chain-for-asynchronous-event.patch
initio-fix-conflict-when-loading-driver.patch
git-block.patch

> it should also fail with vanilla 2.6.23

I don't know about the vanilla 2.6.23 case.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401


Re: [PATCH RFC 2/2] paravirt: clean up lazy mode handling

2007-10-15 Thread Rusty Russell
On Saturday 13 October 2007 06:40:36 Jeremy Fitzhardinge wrote:
> [ Changes since last post: fixed up lguest ]

This is really nice.  Thanks Jeremy!

This will conflict a little with my own churn (file movement), but no great 
drama if it goes in soon.

Acked-by: Rusty Russell <[EMAIL PROTECTED]>


Re: [PATCH 11/11] maps3: make page monitoring /proc file optional

2007-10-15 Thread Rusty Russell
On Tuesday 16 October 2007 08:51:17 Jeremy Fitzhardinge wrote:
> Dave Hansen wrote:
> > On Mon, 2007-10-15 at 17:26 -0500, Matt Mackall wrote:
> >> +config PROC_PAGE_MONITOR
> >> +   default y
> >> +   bool "Enable /proc page monitoring" if EMBEDDED && PROC_FS && MMU
> >> +   help
> >> +     Various /proc files exist to monitor process memory utilization:
> >> +     /proc/pid/smaps, /proc/pid/clear_refs, /proc/pid/pagemap,
> >> +     /proc/kpagecount, and /proc/kpageflags. Disabling these
> >> +     interfaces will reduce the size of the kernel by approximately 4kb.
> >
> > How about pulling the EMBEDDED off there?  I certainly want it for
> > non-embedded reasons. ;)
>
> That means it will only bother asking you if you've set EMBEDDED;
> otherwise it's always on.

But it's at the least confusing.  Surely this option should depend on MMU and 
PROC_FS, and the prompt depend on EMBEDDED?

That might be implied by the Kconfig layout, but AFAICT this patch removed the 
explicit MMU dependency.

Rusty.



Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...

2007-10-15 Thread David Miller
From: David Brownell <[EMAIL PROTECTED]>
Date: Mon, 15 Oct 2007 16:39:10 -0700

> > Bad news, even with the rwsem after a lot more testing I can still
> > trigger the hang in ohci_hub_control() :-(
> >
> > I think we need to go back to considering the total serialization
> > approach to this problem.
> 
> We shouldn't need that.  What happens if you add an msleep(5)
> before ehci-hcd::ehci_run() drops ehci_cf_port_reset_rwsem?

What happens is the heisenbug will go away for another week.

> The theory there being that the switch triggered by setting CF
> doesn't take effect instantaneously, contrary to the effective
> assumption of that code.  A delay of 5 msec seems like it should
> be more than enough, but that's kind of a guess ... it's good to
> keep that low, since unfortunately that's in the critical path
> for OLPC "resume from idle".

I want to help with this, but if I even breathe on the kernel the bug
goes away.  The race just gets harder to trigger, and if we just keep
adding things it'll make the problem go away but for the absolutely
wrong reasons.

The only way we will provably fix this is to make sure EHCI initializes
fully, first, regardless of kernel config or what userland does.

Also, David, you haven't done anything with the feedback I gave to the
most recent revision of the OHCI hub reset anti-wedge patch.  You
removed the debug logging when the outer-loop timeout expires, and I
asked that you put that back so that if it happens there is some
chance to know that this is what happened.  If it's not supposed to
happen, there is no harm in putting the debugging log message there
so that if the impossible does happen we find out about it.

I really don't think it's appropriate for that bug fix to sit yet
another week.


Re: [PATCH] git scsi misc include fix

2007-10-15 Thread Andrew Morton
On Mon, 15 Oct 2007 19:35:30 -0400
James Bottomley <[EMAIL PROTECTED]> wrote:

> On Sat, 2007-10-13 at 22:35 -0700, Paul Jackson wrote:
> > From: Paul Jackson <[EMAIL PROTECTED]>
> > 
> > The added line in scsi_eh.h:
> > struct scatterlist sense_sgl;
> > fails to compile, with the error:
> > field 'sense_sgl' has incomplete type
> > unless scatterlist.h happens to be included
> > somehow already ... which it isn't always.
> > 
> > So include scatterlist.h in scsi_eh.h directly.
> > 
> > Signed-off-by: Paul Jackson <[EMAIL PROTECTED]>
> > 
> > ---
> > 
> > This patch goes after the patch 'git-scsi-misc.patch'
> > 
> >  include/scsi/scsi_eh.h |1 +
> >  1 file changed, 1 insertion(+)
> > 
> > --- 2.6.23-mm1.orig/include/scsi/scsi_eh.h  2007-10-13 01:13:26.568876534 -0700
> > +++ 2.6.23-mm1/include/scsi/scsi_eh.h   2007-10-13 01:31:32.911855338 -0700
> > @@ -2,6 +2,7 @@
> >  #define _SCSI_SCSI_EH_H
> >  
> >  #include 
> > +#include <linux/scatterlist.h>
> >  struct scsi_device;
> >  struct Scsi_Host;
> 
> 
> I've added linux-scsi which should be cc'd on all SCSI issues.
> 
> I don't quite believe this, though.  The requirement for struct
> scatterlist is the same before and after the git scsi-misc patch.  If
> the compile fails with git-scsi-misc because of a missing scatterlist
> include, it should also fail with vanilla 2.6.23 without the git
> patch ... could you see if you can find out why it doesn't?
> 

git-scsi-misc adds this:

struct scsi_eh_save {
int result;
enum dma_data_direction data_direction;
unsigned char cmd_len;
unsigned char cmnd[MAX_COMMAND_SIZE];

void *buffer;
unsigned bufflen;
unsigned short use_sg;
int resid;

struct scatterlist sense_sgl;
};

which will not compile unless the includer has earlier included
scatterlist.h.
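
A standalone sketch of the rule at work, for anyone following along. The struct bodies are stand-ins: the fields given to struct scatterlist here are made up, and eh_save_ptr / eh_save_ok are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>

/* Forward declaration only: 'struct scatterlist' is incomplete here. */
struct scatterlist;

/* Fine: a pointer to an incomplete type has a known size. */
struct eh_save_ptr {
	struct scatterlist *sense_sgl;
};

/*
 * Not fine at this point: embedding by value needs the full definition.
 *
 *	struct eh_save_bad { struct scatterlist sense_sgl; };
 *
 * gcc rejects that with "field 'sense_sgl' has incomplete type".
 */

/* Stand-in for what scatterlist.h provides; the fields are invented. */
struct scatterlist {
	unsigned long page_link;
	unsigned int offset;
	unsigned int length;
};

/* With the definition visible, embedding by value compiles. */
struct eh_save_ok {
	int result;
	struct scatterlist sense_sgl;
};

/* Compile-time check that the embedded member really is laid out. */
static_assert(sizeof(struct eh_save_ok) >= sizeof(struct scatterlist),
	      "embedded member must fit inside its container");
```

Whether a given .c file builds therefore depends only on what it happened to include before scsi_eh.h, which is why the failure shows up on some configs and not others.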


Re: Killing a network connection

2007-10-15 Thread Bodo Eggert
Andi Kleen <[EMAIL PROTECTED]> wrote:
> Stefan Monnier <[EMAIL PROTECTED]> writes:

>> The main use for me is to deal with dangling connections due to taking
>> network interfaces up with different IP addresses (typically the wlan0
>> interface where the IP is different because I've moved from one AP to
>> another).  Of course, maybe there's another way to solve this particular
>> problem, in which case I'd like to hear about it as well.
> 
> Long ago I did a 2.4 patch that solved exactly this problem. It introduced
> a new ifconfig flag "dynamic" and when a dynamic address went down
> all TCP connections originating from it were killed. It's still available
> in older SUSE releases. I might post a forward port later.

There is a /proc/sys/net/ipv4/ip_dynaddr sysctl in 2.6.21.
-- 
If at first you don't succeed, call it version 1.0 



Re: What still uses the block layer?

2007-10-15 Thread Julian Calaby
[adding back CCs which were dropped because I'm stupid - sorry!]

On 10/16/07, Rob Landley <[EMAIL PROTECTED]> wrote:
> On Monday 15 October 2007 5:27:55 am Julian Calaby wrote:
> > On 10/15/07, Rob Landley <[EMAIL PROTECTED]> wrote:
> > > On Monday 15 October 2007 4:06:20 am Julian Calaby wrote:
> > > > On 10/15/07, Rob Landley <[EMAIL PROTECTED]> wrote:
> > > > > I note that the eth0 and eth1 names are dynamically assigned on a
> > > > > first come first serve basis (like scsi).  This never causes me a
> > > > > problem because the driver loading order is constant, and once you
> > > > > figure out that eth0 is gigabit and eth1 is the 80211g it _stays_
> > > > > that way across reboots, reliably. Yeah, it's a heuristic.  Hands up
> > > > > everybody relying on such a heuristic in the real world.
> > > >
> > > > Umm, not quite, from my experiences with pre-production wireless
> > > > drivers, (another story, another time) fancy stuff is being done in
> > > > udev to make sure that your gigabit card is always assigned to eth0.
> > >
> > > I remember building a 2.4 kernel, statically linking in all the drivers,
> > > and getting the ethernet devices showing up in a reliable order for
> > > years.  Where does the need for fancy stuff come in?
> >
> > I remember that too. In fact, I have had no issues with network card
> > enumeration order, outside my own inexperience and stupidity.
> >
> > However, this sort of thing is needed now because of the various types
> > of hotpluggable networking devices, e.g. USB 802.11 cards, USB
> > ethernet cards, PCMCIA, etc.
>
> I thought the strategy was just to scan the hotpluggable busses after the
> non-hotpluggable busses.

My (practical) experience is that I couldn't guarantee which card was
which. (I remember once where it changed over a kernel re-compile) So
my solution, before Debian's persistent naming scheme appeared, was to
check it after every new kernel and make sure my config matched up
with the names of the physical interfaces.

> > And yes, PCMCIA worked fine for ages, but
> > usually you'd never have more than one PCMCIA network card.
>
> Still don't, but presumably the slots are scanned in a reliable order so if
> the cards are always present on bootup in the same slots, they'd stay in that
> order.

Well, yes and no. My gut feeling is that it's probed like PCI cards
are. They're initialised when the drivers are loaded, and not before,
as such, there are no guarantees which card will be initialised first.
- and anyway, what happens if you plug them in in a different order?

> > Personally, I use 2 different usb network cards, and I'm quite
> > comforted to know that the 802.11a one is always wlan0, and the
> > 802.11b/g one is always wlan1.
>
> So if I have a USB 100baseT adapter, and I boot with it plugged in, it'll
> potentially come before my built-in wireless card due to ordering based on
> device type?

Ok, firstly the 100baseT adapter will be named something like ethX,
the wireless card will most likely be named something like wlanX.

Now let's say your laptop has a built in ethernet card.

So, we'll assume a modular kernel, with the module "usbnet" for the
usb card and "e100" for the onboard card:

If the "usbnet" module is loaded first, then initially, according to
the kernel, the usb card will be eth0 and the built in one eth1.

Now let's assume that, on the PCI bus, the USB controller is in a
lower slot number than the network card. (highly likely, given that
the network card is most likely external to the chipset of the laptop)
It's pretty likely that the USB controller will have its module
loaded first, before the built in network card. At this point, it'll
send out hotplug events for all its children (root hubs, etc.) and
eventually an event will be sent out for the usb network card. Now, at
this point, it's impossible to say which one will claim eth0 first.

Now, in my case, with my two wireless cards, what happens if I plug
the 802.11b/g one in first? If this fancy renaming didn't happen, it'd
end up with the name wlan0 and, hence, try to connect to the network
which the 802.11a one is supposed to connect to.

This is not a good thing.

I also have to make the point that this has been happening all over
the kernel, well before I started using it. Video4Linux and DVB
devices can be USB, and the order the /dev/videoX nodes appear in is
determined by the plugging order. IRDA cards, sound cards, usb
devices, framebuffers, mice, keyboards, loopback devices, etc. all
have the same "issue". (and annoyingly, they all have different ways
of getting around it, or not)

And to make one final point, getting right back to the initial parts
of the discussion, at the end of the day, your SATA disk, IDE disk,
USB disk and the CF card in your camera are all mass storage devices -
they all work in a fairly similar way. You want to mount filesystems
from all of them, and when you run low level tools, like parted or
whatever, you want them all to behave in 

Re: [patch 1/4] Linux Kernel Markers - Architecture Independent Code

2007-10-15 Thread Roland McGrath
> I think the main issue with the solution you propose is that it doesn't
> deal with markers in modules, am I right ?

My suggestion applies as well to modules as anything else.  
What "like Module.symvers" means is something like:

name1   vmlinux %s
name2   fs/nfs/nfs  %d

All the modules built by the same kernel build go into this one file.

Modules packaged separately for the same kernel could provide additional
files of the same kind.

> I will soon come with a marker iterator and a module that provides a
> userspace -and in kernel- interface to enable/disable markers. Actually,
> I already have the code ready in my LTTng snapshots. I can provide a
> link if you want to have a look.

That's clearly straightforward to do given the basic markers data structures.

It does not address the need for an offline list of markers available in a
particular kernel build or set of modules that you are not running right now.
The approach now available for that is grovelling through the markers data
structures extracted from vmlinux and .ko ELF files offline.  That is more
work than one should have to do, and has lots of problems with coping with
different packaging details, etc.


Thanks,
Roland


Re: [stable] [PATCH 00/12] xen/paravirt_ops patches for 2.6.24

2007-10-15 Thread Zachary Amsden
On Tue, 2007-10-16 at 00:03 +0200, Andi Kleen wrote:
> > Subject: [PATCH 12/12] xfs: eagerly remove vmap mappings to avoid
> > upsetting Xen
> 
> This should probably be done unconditionally because it's an undefined
> dangerous condition everywhere.

Should be done unconditionally.  One could remap the underlying physical
space to include an MMIO region, and speculative reads from the
cacheable virtual mapping of that region could move the robot arm,
destroying the world.

Zach



Re: [PATCH 2/11] maps3: introduce task_size_of for all arches

2007-10-15 Thread David Rientjes
On Mon, 15 Oct 2007, Matt Mackall wrote:

> Index: l/include/asm-mips/processor.h
> ===
> --- l.orig/include/asm-mips/processor.h   2007-10-09 17:37:58.0 -0500
> +++ l/include/asm-mips/processor.h2007-10-10 11:46:30.0 -0500
> @@ -45,6 +45,8 @@ extern unsigned int vced_count, vcei_cou
>   * space during mmap's.
>   */
>  #define TASK_UNMAPPED_BASE   (PAGE_ALIGN(TASK_SIZE / 3))
> +#define TASK_SIZE_OF(tsk)\
> + (test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
>  #endif
>  
>  #ifdef CONFIG_64BIT
> @@ -65,6 +67,8 @@ extern unsigned int vced_count, vcei_cou
>  #define TASK_UNMAPPED_BASE   \
>   (test_thread_flag(TIF_32BIT_ADDR) ? \
>   PAGE_ALIGN(TASK_SIZE32 / 3) : PAGE_ALIGN(TASK_SIZE / 3))
> +#define TASK_SIZE_OF(tsk)\
> + (test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
>  #endif
>  
>  #define NUM_FPU_REGS 32

These need to use test_tsk_thread_flag(tsk, TIF_32BIT_ADDR).
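
To spell out the difference, a userspace sketch with stand-in definitions. struct task, current_task, the flag bit and both size constants are invented for the illustration; only the shape of the two macros mirrors the quoted hunk and the suggested fix.

```c
#include <assert.h>

/* Userspace stand-ins; the real kernel types and values differ. */
#define TIF_32BIT_ADDR	0
#define TASK_SIZE32	0x7fff8000ULL
#define TASK_SIZE	0x10000000000ULL

struct task {
	unsigned long flags;
};

static struct task *current_task;	/* stand-in for 'current' */

/* Tests a flag on the task that was passed in. */
static int test_tsk_thread_flag(struct task *tsk, int flag)
{
	return (tsk->flags >> flag) & 1;
}

/* Tests a flag on the *current* task, whatever the caller had in mind. */
static int test_thread_flag(int flag)
{
	return test_tsk_thread_flag(current_task, flag);
}

/* Shape of the quoted hunk: the 'tsk' argument is silently ignored. */
#define TASK_SIZE_OF_BROKEN(tsk) \
	(test_thread_flag(TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)

/* Shape of the suggested fix: consult the task that was passed in. */
#define TASK_SIZE_OF(tsk) \
	(test_tsk_thread_flag(tsk, TIF_32BIT_ADDR) ? TASK_SIZE32 : TASK_SIZE)
```

The broken form only gives the right answer when tsk happens to be current, which is exactly the case a /proc reader inspecting another process does not have.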

> Index: l/include/asm-parisc/processor.h
> ===
> --- l.orig/include/asm-parisc/processor.h 2007-10-09 17:36:49.0 -0500
> +++ l/include/asm-parisc/processor.h  2007-10-10 11:46:30.0 -0500
> @@ -32,7 +32,8 @@
>  #endif
>  #define current_text_addr() ({ void *pc; current_ia(pc); pc; })
>  
> -#define TASK_SIZE   (current->thread.task_size)
> +#define TASK_SIZE_OF(tsk)   ((tsk)->thread.task_size)
> +#define TASK_SIZE (current->thread.task_size)
>  #define TASK_UNMAPPED_BASE  (current->thread.map_base)
>  
>  #define DEFAULT_TASK_SIZE32  (0xFFF0UL)

TASK_SIZE_OF() should be defined in terms of TASK_SIZE, just like it is 
for ia64.

> Index: l/include/asm-powerpc/processor.h
> ===
> --- l.orig/include/asm-powerpc/processor.h2007-10-09 17:37:58.0 -0500
> +++ l/include/asm-powerpc/processor.h 2007-10-10 11:46:30.0 -0500
> @@ -99,7 +99,9 @@ extern struct task_struct *last_task_use
>   */
>  #define TASK_SIZE_USER32 (0x0001UL - (1*PAGE_SIZE))
>  
> -#define TASK_SIZE (test_thread_flag(TIF_32BIT) ? \
> +#define TASK_SIZE  (test_thread_flag(TIF_32BIT) ? \
> + TASK_SIZE_USER32 : TASK_SIZE_USER64)
> +#define TASK_SIZE_OF(tsk) (test_tsk_thread_flag(tsk, TIF_32BIT) ? \
>   TASK_SIZE_USER32 : TASK_SIZE_USER64)
>  

Same.

>  /* This decides where the kernel will search for a free chunk of vm
> Index: l/include/asm-s390/processor.h
> ===
> --- l.orig/include/asm-s390/processor.h   2007-10-09 17:37:58.0 -0500
> +++ l/include/asm-s390/processor.h2007-10-10 11:46:30.0 -0500
> @@ -75,6 +75,8 @@ extern struct task_struct *last_task_use
>  
>  # define TASK_SIZE   (test_thread_flag(TIF_31BIT) ? \
>   (0x8000UL) : (0x400UL))
> +# define TASK_SIZE_OF(tsk)   (test_tsk_thread_flag(tsk, TIF_31BIT) ? \
> + (0x8000UL) : (0x400UL))
>  # define TASK_UNMAPPED_BASE  (TASK_SIZE / 2)
>  # define DEFAULT_TASK_SIZE   (0x400UL)
>  

Same.


[PATCH] sched domain sysctl: free kstrdup allocations

2007-10-15 Thread Milton Miller
The procnames for the cpu and domain were allocated via kstrdup and so
should also be freed.   The names for the files are static, but we
can differentiate them by the presence of the proc_handler.  If a
kstrdup (of < 32 characters) fails the sysctl code will not see the
procname or remaining table entries, but any child tables and names
will be reclaimed upon free.

Signed-off-by: Milton Miller <[EMAIL PROTECTED]>
--- 

Hi Ingo.

It occurred to me this morning that the procname field was dynamically
allocated and needed to be freed.  I started to put in break statements
when allocation failed but it was approaching 50% error handling code.

I came up with this alternative of looping while entry->mode is set and
checking proc_handler instead of ->table.  Alternatively, the string
version of the domain name and cpu number could be stored in the structs.

I verified by compiling with CONFIG_DEBUG_SLAB and checking the allocation
counts after taking a cpuset exclusive and back.

milton

Index: kernel/kernel/sched.c
===
--- kernel.orig/kernel/sched.c  2007-10-15 12:21:38.0 -0500
+++ kernel/kernel/sched.c   2007-10-15 12:22:12.0 -0500
@@ -5290,11 +5290,20 @@ static struct ctl_table *sd_alloc_ctl_en
 
 static void sd_free_ctl_entry(struct ctl_table **tablep)
 {
-   struct ctl_table *entry = *tablep;
+   struct ctl_table *entry;
 
-   for (entry = *tablep; entry->procname; entry++)
+   /*
+* In the intermediate directories, both the child directory and
+* procname are dynamically allocated and could fail but the mode
+* will always be set.  In the lowest directory the names are
+* static strings and all have proc handlers.
+*/
+   for (entry = *tablep; entry->mode; entry++) {
if (entry->child)
sd_free_ctl_entry(&entry->child);
+   if (entry->proc_handler == NULL)
+   kfree(entry->procname);
+   }
 
kfree(*tablep);
*tablep = NULL;


Re: What still uses the block layer?

2007-10-15 Thread Neil Brown
On Monday October 15, [EMAIL PROTECTED] wrote:
> > Therefore it is best to not have stable single-number naming schemes
> > for any devices on any machines.  Why?  Because it ensures there will
> > not be any second class citizens.
> 
> This is where we disagree.  The existence of devices you cannot stably 
> enumerate does not eliminate the existence of devices you trivially can.

No, but it dramatically reduces the value of being able to enumerate
those devices.

> 
> Pulling out the "IBM numa cluster with multiple SAS enclosures _and_ 
> firewire" 
> infrastructure to find the root partition on my hard drive may be good for 
> the IBM numa clusters, but only at the expense of complicating this part of 
> my laptop's infrastructure by an order of magnitude, and making embedded 
> systems nearly impossible to put together.  If "one size fits all" were true, 
> my cell phone would be running Red Hat Enterprise.
> 
> > If some devices that are even reasonably common (e.g. IDE drives) are
> > stable, then some application developers or system integrators will
> > work under the assumption of stability and whatever they build will
> > break when you try it on different hardware.
> 
> So you break the IDE drives to get laptop users to debug the Niagara set?  The 

Breaking old behaviour is always bad... My computers with IDE
interfaces still see stable "/dev/hda" devices.  Are you saying the
devices that used to be "hda" are now "sdb" ??  Maybe there is a
.config option...

> solution is to make the easy cases hard?

Is it really that hard?

> > Note that stable names are still a very real option.  udev provides
> > several.  /dev/disk-by-path/XXX will be stable for lots of "screwed
> > in" devices.  /dev/disk-by-id will be stable for devices that report a
> > unique id. etc.
> 
> Here it's
> 
>   ls /dev/disk/by-path/
>   pci-:00:1f.2-scsi-0:0:0:0        pci-:00:1f.2-scsi-0:0:0:0-part4
>   pci-:00:1f.2-scsi-0:0:0:0-part1  pci-:00:1f.2-scsi-0:0:0:0-part5
>   pci-:00:1f.2-scsi-0:0:0:0-part2  pci-:00:1f.2-scsi-0:0:0:0-part6
>   pci-:00:1f.2-scsi-0:0:0:0-part3  pci-:00:1f.2-scsi-1:0:0:0
> 
> And this is an improvement?

Depends on your metric.

"Easy to type" - I guess /dev/hda1 wins hands down.
"Can be used in a script or config file and is guaranteed always to
work until a screwdriver is used to change that device or its
controller"
  I think
  /dev/disk/by-path/pci-:00:1f.2-scsi-0:0:0:0-part1
is quite acceptable.
What is your metric?


> 
> > The difference between IDE, SATA, SCSI and even USB is peripheral for
> > the large majority of uses, and I think maintaining the distinction in
> > the major/minor number or in the "primary" /dev name is - for the
> > above reasons - more of a cost than a value.
> 
> Is your definition of "the large majority of uses" where ncr Voyager, the 
> Amiga, and current macintosh laptops are all one use each, or is your 
> definition of "the large majority of uses" the one where each "use" is an 
> installation, of which there are millions of PCs (and even more ARM cell 
> phones), and something like three instances of Voyager?

My definition of "the large majority of uses" is "mkfs, fsck, mount,
fdisk, system-install-process".

Different people differentiate devices in different ways.  A system
integrator might know about the hardware path.  An end user might know
about drive brands or sizes.  A casual user might just think "internal
or external".  The kernel cannot support all these different
approaches to naming.  It really is best if it uses arbitrary names,
and provides access to descriptions that the user can choose between.
udev facilitates this with links in /dev/disk/.  A system install can
facilitate this even more by reporting size/manufacturer information etc.

> 
> I realize that both views are valid.  This is why the US has a house and a 
> senate, and filters things through both views.  My gripe is that forcing my 
> laptop to look at my USB devices to find my SATA hard drive is aligned with 
> only one of those viewpoints, and completely opposed to the other.

I'm guessing you are talking about mount-by-uuid? This effectively has
to look at the filesystem of all devices to discover which one has the
correct UUID, though it can cache the information for efficiency.

Maybe it is just an implementation issue.  Suppose that every time a
device were discovered, it were examined to see what was stored on it,
and this information was stored in a cache.
Then to find a particular filesystem to mount, you just look in the
cache and if the info isn't there yet, just wait or fail as
appropriate. 
Then we don't "look at my USB devices to find my SATA hard drive" but
rather "look at each device as it is attached to find out what is in
it", which seems like a sensible thing to do...
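
The cache idea above can be sketched in a few lines of C. This is only a toy model under stated assumptions: the `demo_cache` structure, its fixed slot count and both function names are hypothetical, and a real implementation would also have to handle device removal and duplicate UUIDs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_CACHE_SLOTS 16

struct demo_cache_entry {
	char uuid[37];    /* textual filesystem UUID, NUL-terminated */
	char devname[32]; /* arbitrary kernel name, e.g. "sda1" */
};

struct demo_cache {
	struct demo_cache_entry slot[DEMO_CACHE_SLOTS];
	size_t used;
};

/*
 * Called once per device as it is attached and examined.
 * Returns 0 on success, -1 if the cache is full or a string is too long.
 */
static int demo_cache_add(struct demo_cache *c, const char *uuid,
			  const char *devname)
{
	struct demo_cache_entry *e;

	assert(c && uuid && devname);
	if (c->used == DEMO_CACHE_SLOTS ||
	    strlen(uuid) >= sizeof(e->uuid) ||
	    strlen(devname) >= sizeof(e->devname))
		return -1;
	e = &c->slot[c->used++];
	strcpy(e->uuid, uuid);
	strcpy(e->devname, devname);
	return 0;
}

/* Mount-time lookup: no device is probed here, only the cache is read. */
static const char *demo_cache_lookup(const struct demo_cache *c,
				     const char *uuid)
{
	size_t i;

	assert(c && uuid);
	for (i = 0; i < c->used; i++)
		if (strcmp(c->slot[i].uuid, uuid) == 0)
			return c->slot[i].devname;
	return NULL; /* not seen yet: the caller waits or fails */
}
```

The point of the split is that the per-device work happens at attach time, so finding the root filesystem never has to touch an unrelated USB device.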

> 
> An approach that makes things much easier on laptops is seen to hurt big 
> iron, 
> not because it the approach itself has a direct negative impact on big 

Lots of disk activity on resume from s2ram

2007-10-15 Thread Johan Brannlund
Hi. I've noticed that with recent kernels (starting somewhere between 
2.6.20 and 2.6.22) I sometimes get *lots* of disk activity on resume from 
suspend to ram. About 2/3 of the time, the system resumes normally but in 
the remaining 1/3 of the time, the hard drive light stays on almost solid 
and the machine is very, very slow to respond. The only way I've found to 
reliably recover from this is if I can get to a command prompt fast 
enough and do "shutdown -r now". There's not much cpu activity at all, it 
just seems to be disk io that's killing interactivity. This also happens 
if I resume just from an empty gnome desktop with no applications 
running, so I don't think it's due to swapping.

Some kernels affected: 2.6.22, 2.6.23+hrt patches, Ubuntu Gutsy kernel 
(2.6.22-14). These are all 64-bit kernels running on an HP nx6125 laptop 
- single-core Turion 64 processor, 1 gig of ram. To the best of my 
recollection, this problem did *not* appear with 2.6.20.

I put some dmesg output from the vm block dumping and some vmstat output 
at http://nullinfinity.org/tmp/s2ram/ . The dmesg logs are from the same 
resume, just a little while apart. The vmstat is from a different resume.

Any workarounds, patches, tips for further debugging etc would be 
appreciated.

- Johan



Re: [Linux-usb-users] OHCI root_port_reset() deadly loop...

2007-10-15 Thread David Brownell
> Bad news, even with the rwsem after a lot more testing I can still
> trigger the hang in ohci_hub_control() :-(
>
> I think we need to go back to considering the total serialization
> approach to this problem.

We shouldn't need that.  What happens if you add an msleep(5)
before ehci-hcd::ehci_run() drops ehci_cf_port_reset_rwsem?

The theory there being that the switch triggered by setting CF
doesn't take effect instantaneously, contrary to the effective
assumption of that code.  A delay of 5 msec seems like it should
be more than enough, but that's kind of a guess ... it's good to
keep that low, since unfortunately that's in the critical path
for OLPC "resume from idle".

- Dave



Re: [PATCH 1/11] maps3: add proportional set size accounting in smaps

2007-10-15 Thread David Rientjes
On Mon, 15 Oct 2007, Matt Mackall wrote:

> Index: l/fs/proc/task_mmu.c
> ===
> --- l.orig/fs/proc/task_mmu.c 2007-10-14 13:35:31.0 -0500
> +++ l/fs/proc/task_mmu.c  2007-10-14 13:36:56.0 -0500
> @@ -122,6 +122,27 @@ struct mem_size_stats
>   unsigned long private_clean;
>   unsigned long private_dirty;
>   unsigned long referenced;
> +
> + /*
> +  * Proportional Set Size(PSS): my share of RSS.
> +  *
> +  * PSS of a process is the count of pages it has in memory, where each
> +  * page is divided by the number of processes sharing it.  So if a
> +  * process has 1000 pages all to itself, and 1000 shared with one other
> +  * process, its PSS will be 1500.   - Matt Mackall, lwn.net
> +  */
> + u64   pss;
> + /*
> +  * To keep (accumulated) division errors low, we adopt 64bit pss and
> +  * use some low bits for division errors. So (pss >> PSS_DIV_BITS)
> +  * would be the real byte count.
> +  *
> +  * A shift of 12 before division means(assuming 4K page size):
> +  *  - 1M 3-user-pages add up to 8KB errors;
> +  *  - supports mapcount up to 2^24, or 16M;
> +  *  - supports PSS up to 2^52 bytes, or 4PB.
> +  */
> +#define PSS_DIV_BITS 12
>  };
>  

I know this gets moved again in the eighth patch of the series, but the 
#define still has no place inside the struct definition.

The pss is going to need accessor functions, preferably inlined, and the 
comment adjusted stating that all accesses should be through those 
functions and not directly to the mem_size_stats struct.

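/* Scale a whole count up into the fixed-point PSS representation. */
static inline u64 pss_up(unsigned long pss)
{
	/* Widen first so the shift cannot overflow a 32-bit unsigned long. */
	return (u64)pss << PSS_DIV_BITS;
}

/* Drop the fractional bits to recover the whole count. */
static inline unsigned long pss_down(u64 pss)
{
	return pss >> PSS_DIV_BITS;
}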
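
To make the fixed-point arithmetic in the quoted comment concrete, here is a small user-space sketch that works through the "1000 private pages plus 1000 pages shared with one other process" example. It assumes 4K pages and a 12-bit shift as in the comment; the `DEMO_` macros and `demo_` functions are hypothetical names for illustration, not the ones in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT   12 /* assumes 4K pages, as in the quoted comment */
#define DEMO_PSS_DIV_BITS 12 /* low bits kept to limit division error */

/* Add one resident page mapped by `mapcount` processes to a PSS total. */
static void demo_pss_add_page(uint64_t *pss, unsigned int mapcount)
{
	assert(pss && mapcount > 0);
	/* Shift before dividing so the remainder is not lost immediately. */
	*pss += ((uint64_t)1 << (DEMO_PAGE_SHIFT + DEMO_PSS_DIV_BITS)) / mapcount;
}

/* Drop the fractional bits to get the PSS in whole bytes. */
static uint64_t demo_pss_bytes(uint64_t pss)
{
	return pss >> DEMO_PSS_DIV_BITS;
}
```

Feeding in 1000 pages with a mapcount of 1 and 1000 pages with a mapcount of 2 gives 1500 pages' worth of bytes, matching the figure in the comment.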

