date:20070911

Re: Socket-related problem in x86_64 Kernel (2.6.16.53-0.8-smp)?

2007-09-11 Thread Ulrich Windl

On 11 Sep 2007 at 17:04, Al Viro wrote:

> On Tue, Sep 11, 2007 at 05:54:38PM +0200, Ulrich Windl wrote:
> 
> > If not, any clues on debugging/tracing? There's a 
> > /usr/src/linux/Documentation/oops-tracing.txt, but no "segfault-tracing".
> 
> That would be because it has fsck-all to do with the kernel.  Get the
> coredump, then use gdb to deal with it.

Ok, but why is the message there at all? I think in Windows/XP the offending
code and the registers are shown on such occasions. I'd say either drop the
message or improve it. It's also difficult to find the code after the program
is gone, due to the mapping of shared libraries. I did manage to get a core
dump of the application, however, and I modified some code. I'll report once
I have results.
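
For reference, a minimal sketch of the first step in Al's suggestion, getting a core dump at all. The core size limit is often 0 by default, in which case a crashing process leaves nothing for gdb. Assuming the hard limit permits it, a program can raise its own soft limit at startup. The name `enable_core_dumps` is hypothetical and does not come from this thread.

```c
#include <sys/resource.h>

/* Hypothetical helper, for illustration only.
 * Raise the soft core-file size limit to the hard limit so that a
 * segfault can leave a core file behind.
 * Returns 0 on success, -1 on error. */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0)
        return -1;
    lim.rlim_cur = lim.rlim_max;    /* the soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &lim) == 0 ? 0 : -1;
}
```

Called early in main(), this should let a later segfault produce a core file, which can then be loaded with gdb together with the program binary to see the faulting code and registers.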

Maybe it's "mea culpa" for my program, but powersaved and slapd are still to be 
examined.

Regards,
Ulrich

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [announce] CFS-devel, performance improvements

2007-09-11 Thread Mike Galbraith

On Tue, 2007-09-11 at 22:04 +0200, Ingo Molnar wrote:
> fresh back from the Kernel Summit, Peter Zijlstra and I are pleased to 
> announce the latest iteration of the CFS scheduler development tree. Our 
> main focus has been on simplifications and performance - and as part of 
> that we've also picked up some ideas from Roman Zippel's 'Really Fair 
> Scheduler' patch as well and integrated them into CFS. We'd like to ask 
> people to give these patches a good workout, especially with an eye on 
> any interactivity regressions.

Initial test-drive looks good here, but I do see a regression.  First
the good news.

fairtest2 is perfect, more perfect than ever seen before in fact.  Mixed
interval sleepers/hog looks fine as well (can't say perfect due to
startup differences with the various proggies, but cpu% looks perfect).
Amarok song switch time under hefty kbuild load is fine as well.  I
haven't done heavy multimedia testing yet, but will give it a more
thorough workout later (errands).

The regression:  I see some GUI lurch, easily reproducible by running a
make -j5 and moving the mouse in a circle... perceptible (100ms or so)
lurches not present in rc5. 

-Mike

Re: [BUGFIX] x86_64: NX bit handling in change_page_attr

2007-09-11 Thread Huang, Ying

On Tue, 2007-09-11 at 20:23 -0700, Andrew Morton wrote:
> On Fri, 17 Aug 2007 13:28:38 +0800 "Huang, Ying" <[EMAIL PROTECTED]> wrote:
> 
> > This patch fixes a bug of change_page_attr/change_page_attr_addr on
> > Intel x86_64 CPU. After changing page attribute to be executable with
> > these functions, the page remains un-executable on Intel x86_64
> > CPU. Because on Intel x86_64 CPU, only if the "NX" bits of all four
> > level page tables are cleared, the corresponding page is executable
> > (refer to section 4.13.2 of Intel 64 and IA-32 Architectures Software
> > Developer's Manual). So, the bug is fixed through clearing the "NX"
> > bit of PMD when splitting the huge PMD.
> > 
> > Signed-off-by: Huang Ying <[EMAIL PROTECTED]>
> > 
> > ---
> > 
> > Index: linux-2.6.23-rc2-mm2/arch/x86_64/mm/pageattr.c
> > ===
> > --- linux-2.6.23-rc2-mm2.orig/arch/x86_64/mm/pageattr.c 2007-08-17 12:50:25.0 +0800
> > +++ linux-2.6.23-rc2-mm2/arch/x86_64/mm/pageattr.c  2007-08-17 12:50:48.0 +0800
> > @@ -147,6 +147,7 @@
> >          split = split_large_page(address, prot, ref_prot2);
> >          if (!split)
> >              return -ENOMEM;
> > +        pgprot_val(ref_prot2) &= ~_PAGE_NX;
> >          set_pte(kpte, mk_pte(split, ref_prot2));
> >          kpte_page = split;
> >      }
> 
> What happened with this?  Still valid?

I am waiting for it to be reviewed or merged. And I think it is still valid.
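
To illustrate the rule the patch description relies on, here is a minimal sketch. It assumes the NX flag sits in bit 63 of each entry, as on x86_64, and uses hypothetical names (`ENTRY_NX`, `walk_is_executable`) that are not kernel identifiers. It models only the "NX must be clear at all four levels" condition, not a real page-table walk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: bit 63 is the no-execute flag of a page-table entry. */
#define ENTRY_NX (UINT64_C(1) << 63)

/* Hypothetical model: the page is executable only if the NX bit is
 * clear in the entry at every one of the four levels. */
bool walk_is_executable(uint64_t pgd, uint64_t pud, uint64_t pmd, uint64_t pte)
{
    return ((pgd | pud | pmd | pte) & ENTRY_NX) == 0;
}
```

With this model, clearing NX only in the final entry is not enough if the PMD entry created by the split still carries it, which is the case the patch addresses.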

Best Regards,
Huang Ying

Re: Problem charging blackberry 8700c with berry_charge (2.6.22.6)

2007-09-11 Thread Matt LaPlante

On Mon, 10 Sep 2007 23:35:02 -0700
Greg KH <[EMAIL PROTECTED]> wrote:

> 
> > Sep  9 13:49:01 prizm kernel: [  584.407498] drivers/usb/core/inode.c: creating file '003'
> > Sep  9 13:49:01 prizm kernel: [  584.407509] hub 5-0:1.0: state 7 ports 8 chg  evt 0004
> > Sep  9 13:49:01 prizm kernel: [  584.407520] hub 1-0:1.0: state 7 ports 2 chg  evt 0004
> > Sep  9 13:49:03 prizm kernel: [  586.405512] usb 1-2: usb auto-suspend
> > Sep  9 13:49:03 prizm kernel: [  586.421471] hub 5-0:1.0: hub_suspend
> > Sep  9 13:49:03 prizm kernel: [  586.421481] ehci_hcd :00:10.4: suspend root hub
> > Sep  9 13:49:03 prizm kernel: [  586.421496] usb usb5: usb auto-suspend
> > Sep  9 13:49:05 prizm kernel: [  588.421351] hub 1-0:1.0: hub_suspend
> > Sep  9 13:49:05 prizm kernel: [  588.421361] usb usb1: suspend_rh
> > Sep  9 13:49:05 prizm kernel: [  588.421481] usb usb1: usb auto-suspend
> 
> Ah, oh wait, now we just turned the power off.
> 
> Try disabling CONFIG_USB_SUSPEND and see if that fixes this issue.  Or
> you can manually turn the power back on to your blackberry by writing to
> the autosuspend file for the usb device in sysfs, but that can be a
> pain.
> 
> Let me know if just changing that config option works for you.
> 

And now for the dramatic conclusion...

To begin, I have no access to the original machine at the moment, as I'm now 
out of that area for a couple weeks.  I built a similar kernel (same version) 
on another box that I have at my current location.  The new machine is 
different hardware, so some kernel re-configuring was required, but I kept with 
the same USB settings (and similar overall design).  Interestingly, this 
machine didn't reproduce the "magic command failed" error, but it did fail very 
similarly to the original at charging the device.  I disabled 
CONFIG_USB_SUSPEND as suggested, and lo and behold, it now charges the berry.  
Looks like an excellent diagnosis to me, doctor.  

Thanks! :)

> thanks,
> 
> greg k-h

-- 
Matt

Re: [PATCH/RFC] doc: about email clients for Linux kernel patches

2007-09-11 Thread Adrian Bunk

On Wed, Sep 12, 2007 at 01:24:13PM +0800, WANG Cong wrote:
> On Tue, Sep 11, 2007 at 08:29:26PM +0200, Adrian Bunk wrote:
> >On Tue, Sep 11, 2007 at 10:16:44AM -0700, Randy Dunlap wrote:
> >>...
> >> +~~
> >> +Mutt (TUI)
> >> +
> >> +Plenty of Linux developers use mutt, so it must work pretty well.
> >> +
> >> +Are there any special config options that are needed??
> >>...
> >
> >It should work with default settings.
> 
> 
> I can't agree with this.
> 
> It took me a lot of time to configure mutt to work well for me the first
> time. The default settings alone are _not_ enough, especially for us
> non-English speakers. One common setting is the encoding: lkml prefers
> UTF-8, so I have to configure mutt with `set send_charset="us-ascii:utf-8"`.

This makes sense, but it's not really a mutt specific issue and 
problems because mutt prefers iso-8859-1 over UTF-8 by default are
quite rare.

> The mutt manual told me to add "subscribe linux-kernel@vger.kernel.org" if I
> am subscribed to lkml, but in fact we'd better _not_ add this, or it will
> drop me from the cc list.
> 
> Or other things like these.
>...

Whether or not people want to get personal copies of answers to mailing
list posts is a religious issue being second only to the vi<->emacs wars...

But as far as I understand it, this documentation is intended to help 
people to get sending patches right (no line wrap etc.), not as a 
generic documentation for mail clients.

> Regards.

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed

Re: [-mm patch] remove ide_get_error_location()

2007-09-11 Thread Jens Axboe

On Tue, Sep 11 2007, Bartlomiej Zolnierkiewicz wrote:
> On Sunday 09 September 2007, Adrian Bunk wrote:
> > On Fri, Aug 31, 2007 at 09:58:22PM -0700, Andrew Morton wrote:
> > >...
> > > Changes since 2.6.23-rc3-mm1:
> > >...
> > >  git-block.patch
> > >...
> > >  git trees
> > >...
> > 
> > ide_get_error_location() is no longer used.
> > 
> > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]>
> 
> Signed-off-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>
> 
> Since git-block contains the patch which removes the only user of
> ide_get_error_location() I think that this patch should be also merged
> through block tree.  Jens?

Yeah, I'll add it there.

> PS none of the blkdev_issue_flush() users uses *error_sector argument
> so it can be probably removed as well

I had hoped that the existence was enough incentive, but it didn't
happen. I'll make a note to kill that again.

-- 
Jens Axboe

Do not deprecate binary semaphore or do allow mutex in software interrupt contexts

2007-09-11 Thread Matti Linnanvuori

The following code seems to me to be a valid example of a binary semaphore 
(mutex) in a timer:

//timer called 10 times a second
static void status_timer(unsigned long device)
{
    struct etp_device_private *dp = (struct etp_device_private *)device;
    if (unlikely(dp->status_interface == 0))
        dp->status_interface = INTERFACES_PER_DEVICE - 1;
    else
        dp->status_interface--;
    //DBG_PRINT ("%s: In status timer, interface:0x%x.\n",etp_NAME, dp->status_interface);
    idt_los_interrupt_1(dp, dp->status_interface);
    if (likely(!dp->reset))
        // reset the timer:
        mod_timer(&dp->status_timer, jiffies + HZ / 10);
}

static inline void read_idt_register_interrupt(struct etp_device_private *dp,
                                               unsigned reg)
{
    DBG_PRINT("read_idt_register_interrupt to mutex_lock.\n");
    if (unlikely(down_trylock(&dp->semaphore)))
        return; /* Do not read because failed to lock. */
    if (likely(!dp->status
               && !(inl((void *)(dp->ioaddr + REG_E1_CTRL)) & E1_ACCESS_ON))) {
        outl(((reg << E1_REGISTER_SHIFT) & E1_REGISTER_MASK)
             | E1_DIR_READ | E1_ACCESS_ON,
             (void *)(dp->ioaddr + REG_E1_CTRL));
        dp->status = 1;
        DBG_PRINT("read_idt_register_interrupt set status read.\n");
    } else
        DBG_PRINT("read_idt_register_interrupt did not set status %u read.\n",
                  dp->status);
    DBG_PRINT("read_idt_register_interrupt do not wait for result here, read in tasklet.\n");
}


//for getting los information with interrupt:
void idt_los_interrupt_1(struct etp_device_private *dp, unsigned interface)
{
    read_idt_register_if_interrupt(dp, E1_TRNCVR_LINE_STATUS0_REG,
                                   interface);
}


static void e1_access_task(unsigned long data) //called after e1_access_interrupt
{
    struct etp_device_private *dp = (struct etp_device_private *)data;
    struct etp_interface_private *ip;
    unsigned int interface, error;
    bool los;

    //check if los status was read:
    if (unlikely(!dp->status)) {
        DBG_PRINT("e1_access_task wakes up user.\n");
        wake_up(&dp->e1_access_q);
        return;
    }
    error = idt_los_interrupt_2(dp->ioaddr, &interface, &los,
                                dp->pci_dev->device);
    //DBG_PRINT ("%s: In e1 task, error:0x%x, interface:0x%x, los:0x%x.\n",
    // etp_NAME, error, interface, los);
    dp->status = 0;
    up(&dp->semaphore);
    DBG_PRINT("e1_access_task got error %u.\n", error);
    if (unlikely(error))
        return;
    //update los status:
    ip = &dp->interface_privates[interface];
    ip->los = los;
    //update status:
    if ((ip->if_mode == IF_MODE_CLOSED) || //interface closed or
        (ip->los)) { //link down
        set_led(LED_CTRL_OFF, ip);
        if (netif_carrier_ok(ip->ch_priv.this_netdev))
            netif_carrier_off(ip->ch_priv.this_netdev);
    } else { //link up and interface opened
        if (!netif_carrier_ok(ip->ch_priv.this_netdev))
            netif_carrier_on(ip->ch_priv.this_netdev);
        if (ip->if_mode == IF_MODE_HDLC) {
            set_led(LED_CTRL_TRAFFIC, ip);
        } else { //ip->if_mode == IF_MODE_TIMESLOT
            set_led(LED_CTRL_ON, ip);
        }
    }
}

int idt_los_interrupt_2(u8 * ioaddr, unsigned *interface, bool * los,
                        unsigned pci_device_id)
{ //returns 0 in success
    unsigned int value = inl((void *)(ioaddr + REG_E1_CTRL));
    //if access not ended:
    if (value & E1_ACCESS_ON) {
        return 1;
    }
    //if access not to los status register
    if ((value & E1_REGISTER_MASK_NO_IF) !=
        (E1_TRNCVR_LINE_STATUS0_REG << E1_REGISTER_SHIFT)) {
        return 1;
    }
    //get interface
    *interface =
        idt_if_to_if((value & E1_REGISTER_MASK_IF) >>
                     E1_REGISTER_SHIFT_IF, pci_device_id);
    *los = value & 0x1;
    return 0;
}

int write_idt_register_lock(unsigned device, unsigned reg, u32 value)
{
    struct etp_device_private *etp = get_dev_priv(device);
    unsigned ctrl;
    DBG_PRINT("write_idt_register_lock to mutex lock device %u.\n", device);
    down(&etp->semaphore);
    if (unlikely(etp->reset)) {
        up(&etp->semaphore);
        DBG_PRINT("write_idt_register_lock device %u unusable.\n", device);
        return -ENXIO;
    }
    DBG_PRINT("write_idt_register_lock mutex locked device %u.\n", device);
    do {
        DBG_PRINT("write_idt_register_lock to wait_event_timeout device %u.\n",
                  device);
        wait_event_timeout(etp->e1_access_q,
                           !((ctrl =
                              inl((void *)(etp->ioaddr + REG_E1_CTRL)))
                             & E1_ACCESS_ON), HZ / 500);
    } while (ctrl & E1_ACCESS_ON);
    DBG_PRINT("write_idt_register_lock to outl device %u.\n", device);
    outl(((reg << E1_REGISTER_SHIFT) & E1_REGISTER_MASK) |
         E1_DIR_WRITE | E1_ACCESS_ON | (value & E1_DATA_MASK

Re: SYSFS: need a noncaching read

2007-09-11 Thread Robert Schwebel

On Tue, Sep 11, 2007 at 11:43:17AM +0200, Heiko Schocher wrote:
> I have developed a device driver and use the sysFS to export some
> registers to userspace.

Uuuh, ugly. Don't do that. Device drivers are there to abstract things,
not to play around with registers from userspace.

> I opened the sysFS File for one register and did some reads from this
> File, but I always get the same value from the register, which is not
> OK, because the values are changing. So I found out that the sysFS caches
> the reads ... :-(

Yes, it does. What you can do is close() the file handle between
accesses, which makes it work but is slow.
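
A minimal sketch of that workaround, assuming the attribute is a small text file. The function name `read_attr_fresh` is hypothetical, and any sysfs path passed to it would be specific to the driver in question. Each call opens the file, reads once, and closes it again, so no cached buffer survives between reads.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only.
 * Open the file, read it once, close it again, so every call starts
 * from a fresh file handle. Stores a NUL-terminated string in buf.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_attr_fresh(const char *path, char *buf, size_t len)
{
    int fd;
    ssize_t n;

    if (len == 0)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = read(fd, buf, len - 1);     /* leave room for the terminator */
    close(fd);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return n;
}
```

The cost is one open()/close() pair per read, which is the slowness mentioned above.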

> Is there a way to retrigger the reads (in that way, that the sysFS
> rereads the values from the driver), without closing and opening the
> sysFS Files? Or must I better use the ioctl () Driver-interface for
> exporting these registers?

What kind of problem do you want to solve? Userspace is for
applications, and applications usually don't have to know about hardware
details like registers. So if you have to do something every 10 ms from
userspace, your design is probably wrong.

If you absolutely need to do such things from userspace, have a look at
uio. But in most cases the answer is: make a proper abstraction for the
problem you wanna solve and write a proper driver.

Robert
-- 
Pengutronix - Linux Solutions for Science and Industry
Entwicklungszentrum Nord http://www.pengutronix.de

Re: [PATCH/RFC] doc: about email clients for Linux kernel patches

2007-09-11 Thread WANG Cong

On Tue, Sep 11, 2007 at 08:29:26PM +0200, Adrian Bunk wrote:
>On Tue, Sep 11, 2007 at 10:16:44AM -0700, Randy Dunlap wrote:
>>...
>> +~~
>> +Mutt (TUI)
>> +
>> +Plenty of Linux developers use mutt, so it must work pretty well.
>> +
>> +Are there any special config options that are needed??
>>...
>
>It should work with default settings.

I can't agree with this.

It took me a lot of time to configure mutt to work well for me the first
time. The default settings alone are _not_ enough, especially for us
non-English speakers. One common setting is the encoding: lkml prefers
UTF-8, so I have to configure mutt with `set send_charset="us-ascii:utf-8"`.

The mutt manual told me to add "subscribe linux-kernel@vger.kernel.org" if I
am subscribed to lkml, but in fact we'd better _not_ add this, or it will
drop me from the cc list.

Or other things like these.

>
>mutt doesn't come with an editor, so whatever editor you use should be 
>used in a way that there are no automatic linebreaks. Most editors have 
>an "insert file" option that inserts the contents of a file unaltered.
>

Yes, you can `set editor="vi"` or other editors you prefer.

Regards.

-- 
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

Re: Calling PnP bios routines like get device node from x86_64 arch

2007-09-11 Thread mkrameshid


Hi Alan,

To be specific, I want to call functions 0x60, 0x61 and a few more 
specified in the BBS specification (attached). These functions are 
available in 16-bit mode (0xf000 segment).
We can call these functions on i386 using the pnpbios driver 
(bioscalls.c). I must call these functions to change the boot order and 
reboot from Linux x86_64 using my driver.

I have seen in some forum that we can turn off ACPI during the Linux boot.

Can you help on this?

-mkr
Alan Cox wrote:
Actually I want to call the BIOS run time functions as per the 
PNPBIOSSpecification-v1.0a (attached).



We use ACPI for x86_64, which means you need to use the ACPI methods not
the PnPBIOS ones. PnPBIOS isn't valid when ACPI is in use.

Alan

  


Query on Real Time Signalling in Linux 2.6.16 kernel

2007-09-11 Thread sreenath Angadi

Hi,
I would like to use the real-time signalling mechanism. When I try to
use "F_SETAUXFL", compilation fails. While looking through the
archives I found a patch, one-sig-perfd-2.4.4.patch.gz, at
http://www.uwsg.iu.edu/hypermail/linux/kernel/0105.2/0642.html which
needs to be applied for the support.
I am using the Linux 2.6.16 kernel, so I can't apply this patch directly.
Is there a different patch for this kernel, or is the support built in?
If the support is already present, then any pointers on usage (RT
signalling) would be of great help.
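
As a pointer on basic usage only, and independent of the F_SETAUXFL question: a minimal sketch of the standard POSIX real-time signal calls (sigqueue() and sigwaitinfo() with SIGRTMIN). The function name `rt_signal_roundtrip` is hypothetical. It queues one real-time signal carrying an integer payload to the calling process and fetches it back synchronously; it assumes a single-threaded caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper, for illustration only.
 * Queue one real-time signal with an integer payload to ourselves and
 * receive it synchronously. Returns the payload, or -1 on error. */
int rt_signal_roundtrip(int payload)
{
    sigset_t set;
    siginfo_t info;
    union sigval val;
    int signo = SIGRTMIN;           /* first available real-time signal */

    /* Block the signal so it stays queued until we ask for it. */
    sigemptyset(&set);
    sigaddset(&set, signo);
    if (sigprocmask(SIG_BLOCK, &set, NULL) != 0)
        return -1;

    val.sival_int = payload;
    if (sigqueue(getpid(), signo, val) != 0)
        return -1;

    /* Dequeue the pending signal together with its payload. */
    if (sigwaitinfo(&set, &info) != signo)
        return -1;
    return info.si_value.sival_int;
}
```

For delivery tied to a file descriptor, fcntl(2) documents F_SETSIG; whether that covers what the 2.4.4 patch provided is not something this sketch establishes.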

Thanks and Regards,

Sreenath.

Re: [PATCH] powerpc: add new required termio functions

2007-09-11 Thread Geoff Levand

Michael Neuling wrote:
> The "tty: termios locking functions break with new termios type" patch
> (f629307c857c030d5a3dd777fee37c8bb395e171) breaks the powerpc compile.
> This adds the required API to asm-powerpc.
> 
> Signed-off-by: Michael Neuling <[EMAIL PROTECTED]>
> --
> This needs to go up for 2.6.23.
> 
> Should we really put these definitions in asm-generic/termios.h as I'm
> guessing other architectures are broken too?

I think it would be better to do so, as that is where we pick up the defs for
the original kernel_termios_to_user_termios and user_termios_to_kernel_termios.

-Geoff

Re: [PATCH/RFC] doc: about email clients for Linux kernel patches

2007-09-11 Thread WANG Cong

On Tue, Sep 11, 2007 at 08:52:14PM +0200, Peter Zijlstra wrote:
>
>On Tue, 2007-09-11 at 14:38 -0400, Lee Revell wrote:
>> On 9/11/07, Peter Zijlstra <[EMAIL PROTECTED]> wrote:
>> > On Tue, 2007-09-11 at 10:16 -0700, Randy Dunlap wrote:
>> >
>> > > +~~
>> > > +Evolutions (GUI)
>> >
>> > I take it you mean: Evolution
>> >
>> > > +Some people seem to use this successfully for patches.
>> > > +
>> > > +What config options are needed?
>> >
>> > When composing mail select: Preformat
>> >   from Format->Heading->Preformatted (Ctrl-7)
>> >   or the toolbar
>> >
>> > Then use:
>> >   Insert->Text File... (Alt-n x)
>> >
>> > to insert the patch.
>> 
>> You can also diff -Nru old.c new.c | xclip, select Preformat, then
>> paste with the middle button.
>
>Ah, I shall try:
>
>  cat `quilt top` | xclip
>
>next time I have a single patch to send.
>

Oh, great! Thank you for this hint.

-- 
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

[git pull] Input updates for 2.6.23-rc6

2007-09-11 Thread Dmitry Torokhov

Hi Linus,

Please consider pulling from:

git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input.git for-linus
or
master.kernel.org:/pub/scm/linux/kernel/git/dtor/input.git for-linus

to receive updates for input subsystem.

Changelog:
--
Elvis Pranskevichus (1):
  Input: i8042 - add HP Pavilion DV4270ca to the MUX blacklist

Ralf Baechle (1):
  Input: i8042 - fix modpost warning

Samuel Thibault (1):
  Input: add more Braille keycodes

Vladimir Shebordaev (1):
  Input: usbtouchscreen - correctly set 'phys'

Diffstat:
-
 drivers/input/serio/i8042-x86ia64io.h      |   10 ++
 drivers/input/serio/i8042.c                |    2 +-
 drivers/input/touchscreen/usbtouchscreen.c |    2 +-
 include/linux/input.h                      |    2 ++
 include/linux/keyboard.h                   |    4 +++-
 5 files changed, 17 insertions(+), 3 deletions(-)

-- 
Dmitry

Re: r8169: can't send magic packet for Wake-On-Lan

2007-09-11 Thread Xavier Bestel

Le mardi 11 septembre 2007 à 23:30 +0200, Francois Romieu a écrit :
> Xavier Bestel <[EMAIL PROTECTED]> :
> [...]
> > with the r8169 I can't send a magic packet anymore. I'm using ethtool
> > for that, with the previous one (an rtl8139b) it was working very well.
> > ethtool -D apparently says it could send the packet ok.
> 
> I see no "-D" option in the sources from the git repository of ethtool.
> 
> Where did you find it ?

Err sorry, I mixed up everything ... I'm using *etherwake* to make the
WOL magic packet, and ethtool to check the interface options.

Xav


[PATCH] bfin_twi: Remove useless twi_lock mutex

2007-09-11 Thread Bryan Wu


This patch removes this unneeded mutex. Indeed it was used
to serialized access to the hardware, but this is already
done by the i2c-core layer, see 'bus_lock' mutex used by
i2c_transfer().

Signed-off-by: Francis Moreau <[EMAIL PROTECTED]>
Acked-by: Bryan Wu <[EMAIL PROTECTED]>
Acked-by: Sonic Zhang <[EMAIL PROTECTED]>
---
 drivers/i2c/busses/i2c-bfin-twi.c |   16 
 1 files changed, 0 insertions(+), 16 deletions(-)

diff --git a/drivers/i2c/busses/i2c-bfin-twi.c b/drivers/i2c/busses/i2c-bfin-twi.c
index 6311039..67224a4 100644
--- a/drivers/i2c/busses/i2c-bfin-twi.c
+++ b/drivers/i2c/busses/i2c-bfin-twi.c
@@ -44,7 +44,6 @@
 #define TWI_I2C_MODE_COMBINED  0x04
 
 struct bfin_twi_iface {
-   struct mutex    twi_lock;
int irq;
spinlock_t  lock;
char        read_write;
@@ -228,12 +227,8 @@ static int bfin_twi_master_xfer(struct i2c_adapter *adap,
if (!(bfin_read_TWI_CONTROL() & TWI_ENA))
return -ENXIO;
 
-   mutex_lock(&iface->twi_lock);
-
while (bfin_read_TWI_MASTER_STAT() & BUSBUSY) {
-   mutex_unlock(&iface->twi_lock);
yield();
-   mutex_lock(&iface->twi_lock);
}
 
ret = 0;
@@ -310,9 +305,6 @@ static int bfin_twi_master_xfer(struct i2c_adapter *adap,
break;
}
 
-   /* Release mutex */
-   mutex_unlock(&iface->twi_lock);
-
return ret;
 }
 
@@ -330,12 +322,8 @@ int bfin_twi_smbus_xfer(struct i2c_adapter *adap, u16 addr,
if (!(bfin_read_TWI_CONTROL() & TWI_ENA))
return -ENXIO;
 
-   mutex_lock(&iface->twi_lock);
-
while (bfin_read_TWI_MASTER_STAT() & BUSBUSY) {
-   mutex_unlock(&iface->twi_lock);
yield();
-   mutex_lock(&iface->twi_lock);
}
 
iface->writeNum = 0;
@@ -502,9 +490,6 @@ int bfin_twi_smbus_xfer(struct i2c_adapter *adap, u16 addr,
 
rc = (iface->result >= 0) ? 0 : -1;
 
-   /* Release mutex */
-   mutex_unlock(&iface->twi_lock);
-
return rc;
 }
 
@@ -555,7 +540,6 @@ static int i2c_bfin_twi_probe(struct platform_device *dev)
struct i2c_adapter *p_adap;
int rc;
 
-   mutex_init(&(iface->twi_lock));
spin_lock_init(&(iface->lock));
init_completion(&(iface->complete));
iface->irq = IRQ_TWI;

Re: [PATCH -mm] add-a-rounddown_pow_of_two-routine-to-log2h.patch fix

2007-09-11 Thread Andrew Morton

On Sat, 1 Sep 2007 07:55:36 +0200 Mariusz Kozlowski <[EMAIL PROTECTED]> wrote:

> Hello,
> 
>   This patch fixes the unbalanced parenthesis introduced by
> add-a-rounddown_pow_of_two-routine-to-log2h.patch.
> 
> Signed-off-by: Mariusz Kozlowski <[EMAIL PROTECTED]>
> 
>  include/linux/log2.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-2.6.23-rc4-mm1-a/include/linux/log2.h   2007-09-01 07:23:28.0 +0200
> +++ linux-2.6.23-rc4-mm1-b/include/linux/log2.h   2007-09-01 07:29:27.0 +0200
> @@ -186,7 +186,7 @@ unsigned long __rounddown_pow_of_two(uns
>  (\
>   __builtin_constant_p(n) ? ( \
>   (n == 1) ? 0 :  \
> - (1UL << ilog2(n)) : \
> + (1UL << ilog2(n))) :\
>   __rounddown_pow_of_two(n)   \
>   )

umm, could we get some users of this thing, preferably in some code
path which people use?
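
For readers unfamiliar with the operation being patched, a minimal user-space sketch of "round down to a power of two", assuming n is at least 1. The name `rounddown_pow2_sketch` is hypothetical. It is not the kernel macro: it uses a plain shift loop instead of ilog2(), and it returns 1 for n == 1, whereas the constant branch of the quoted macro yields 0 in that case.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only.
 * Return the largest power of two that is <= n, i.e. 1UL << floor(log2(n)).
 * n must not be 0. */
unsigned long rounddown_pow2_sketch(unsigned long n)
{
    unsigned long p = 1;

    assert(n != 0);
    while (n >>= 1)     /* one iteration per bit position below the top set bit */
        p <<= 1;
    return p;
}
```

For example, 1000 rounds down to 512 and 1024 stays 1024.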

[PATCH v2] doc: about email clients for Linux patches

2007-09-11 Thread Randy Dunlap

From: Randy Dunlap <[EMAIL PROTECTED]>

Requested by Jeff Garzik.
v2, updated from lkml comments.

Add info about various email clients and their applicability
in being used to send Linux kernel patches.

Some notes taken from http://mbligh.org/linuxdocs/Email/Clients
Portions used with permission.

Signed-off-by: Randy Dunlap <[EMAIL PROTECTED]>
---
 Documentation/email-clients.txt |  210 
 1 file changed, 210 insertions(+)

--- /dev/null
+++ linux-2.6.23-rc5-git1/Documentation/email-clients.txt
@@ -0,0 +1,210 @@
+Email clients info for Linux
+==
+
+General Preferences
+--
+Patches for the Linux kernel are submitted via email, preferably as
+inline text in the body of the email.  Some maintainers accept
+attachments, but then the attachments should have content-type
+"text/plain".  However, attachments are generally frowned upon because
+it makes quoting portions of the patch more difficult in the patch
+review process.
+
+Email clients that are used for Linux kernel patches should send the
+patch text untouched.  For example, they should not modify or delete tabs
+or spaces, even at the beginning or end of lines.
+
+Don't send patches with "format=flowed".  This can cause unexpected
+and unwanted line breaks.
+
+Don't let your email client do automatic word wrapping for you.
+This can also corrupt your patch.
+
+
+They also should not modify the character set encoding of the text.
+
+Email clients should generate and maintain References: or In-Reply-To:
+headers so that mail threading is not broken.
+
+Copy-and-paste (or cut-and-paste) usually does not work for patches
+because tabs are converted to spaces.  I have seen comments that
+xclipboard, xclip, and/or xcutsel do work, but I cannot confirm this.
+
+Don't use PGP/GPG signatures in mail that contains patches.
+This breaks many scripts that read and apply the patches.
+(This should be fixable. ??)
+
+It's a good idea to send a patch to yourself, save the received message,
+and successfully apply it with 'patch' before sending patches to Linux
+mailing lists.
+
+
+Some email client (MUA) hints
+--
+Legend:
+TUI = text-based user interface
+GUI = graphical user interface
+
+~~
+Alpine (TUI)
+
+Config options:
+In the "Sending Preferences" section:
+
+- "Do Not Send Flowed Text" must be enabled
+- "Strip Whitespace Before Sending" must be disabled
+
+When composing the message, the cursor should be placed where the patch
+should appear, and then pressing CTRL-R lets you specify the patch file
+to insert into the message.
+
+~~
+Evolution (GUI)
+
+Some people use this successfully for patches.
+
+When composing mail select: Preformat
+  from Format->Heading->Preformatted (Ctrl-7)
+  or the toolbar
+
+Then use:
+  Insert->Text File... (Alt-n x)
+to insert the patch.
+
+You can also "diff -Nru old.c new.c | xclip", select Preformat, then
+paste with the middle button.
+
+~~
+Kmail (GUI)
+
+Some people use Kmail successfully for patches.
+
+The default setting of not composing in HTML is appropriate; do not
+enable it.
+
+When composing an email, under options, uncheck "word wrap". The only
+disadvantage is any text you type in the email will not be word-wrapped
+so you will have to manually word wrap text before the patch. The easiest
+way around this is to compose your email with word wrap enabled, then save
+it as a draft. Once you pull it up again from your drafts it is now hard
+word-wrapped and you can uncheck "word wrap" without losing the existing
+wrapping.
+
+At the bottom of your email, put the commonly-used patch delimiter before
+inserting your patch:  three hyphens (---).
+
+Then from the "Message" menu item, select insert file and choose your patch.
+As an added bonus I recommend customising the message creation toolbar menu
+and putting the "insert file" icon there.
+
+You can safely GPG sign attachments, but inlined text is preferred for
+patches so do not GPG sign them.  Signing patches that have been inserted
+as inlined text will make them tricky to extract from their 7-bit encoding.
+
+If you absolutely must send patches as attachments instead of inlining
+them as text, right click on the attachment and select properties, and
+highlight "Suggest automatic display" to make the attachment inlined to
+make it more viewable.
+
+When saving patches that are sent as inlined text, select the email that
+contains the patch from the message list pane, right click and select
+"save as".  You can use the whole email unmodified as a patch if it was
+properly composed.  There is no option currently to save the email when
+you are actually viewing it in its own window -

Re: [PATCH/RFC] doc: about email clients for Linux kernel patches

2007-09-11 Thread Randy Dunlap

On Tue, 11 Sep 2007 19:36:42 +0200 Peter Zijlstra wrote:

> On Tue, 2007-09-11 at 10:16 -0700, Randy Dunlap wrote:
> 
> > +~~
> > +Evolutions (GUI)
> 
> I take it you mean: Evolution

Yep, lousy keyboard.  ;)

I've updated the text file and will resend it shortly.

Thanks for everyone's comments.
(not replying to each one individually)

---
~Randy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] Configurable tap interface MTU

2007-09-11 Thread Herbert Xu

Ed Swierk <[EMAIL PROTECTED]> wrote:
> 
> The patch caps the MTU somewhat arbitrarily at 16000 bytes. This is
> slightly lower than the value used by the e1000 driver, so it seems
> like a safe upper limit.

Please make it 65535 without an Ethernet header and 65521
with an Ethernet header.
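
For reference, the two limits differ by exactly the 14-byte Ethernet
header (65535 - 14 = 65521). A minimal sketch of such a bound check
follows; tap_max_mtu(), tap_mtu_ok() and the DEMO_ constants are
hypothetical names used for illustration only, not code from Ed's patch:

```c
#include <stdbool.h>

/* Hypothetical constants, for illustration only. */
#define DEMO_MAX_FRAME 65535	/* largest length that fits in 16 bits */
#define DEMO_ETH_HLEN  14	/* dst MAC + src MAC + EtherType */

/* Upper MTU bound, depending on whether frames carry an Ethernet header. */
static int tap_max_mtu(bool has_eth_header)
{
	return has_eth_header ? DEMO_MAX_FRAME - DEMO_ETH_HLEN : DEMO_MAX_FRAME;
}

/* Returns true if the requested MTU lies within the suggested bounds. */
static bool tap_mtu_ok(int mtu, bool has_eth_header)
{
	return mtu > 0 && mtu <= tap_max_mtu(has_eth_header);
}
```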

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Re: [PATCH/RFC] doc: about email clients for Linux kernel patches

2007-09-11 Thread Chris Friesen

Jeff Garzik wrote:
> Chris Friesen wrote:
> > Can someone describe the problems with just attaching the patch in
> > Thunderbird?  It's what Martin says he does on the linked document...
>
> Email clients don't like to quote attachments, even text/plain ones,
> which then makes attached patches much more difficult to review and
> comment on (i.e. you greatly reduce the number of reviewers).

Thunderbird, at least, will automatically inline a single text/plain
attachment when replying. (At least with my current settings, it does.)


Chris

Re: [PATCH] mm: fix blkdev size calculation in generic_write_checks

2007-09-11 Thread Andrew Morton

On Wed, 15 Aug 2007 17:52:28 +0400 Dmitry Monakhov <[EMAIL PROTECTED]> wrote:

> Currently the block device size is calculated regardless of its
> bd_block_size. This may result in an attempt to write outside the
> block device if i_size is not aligned to bdev->bd_block_size,
> which results in EIO.
> 
>  TEST_CASE_BEGIN
> # fdisk -l /dev/sdc
> Disk /dev/sdc: 36.7 GB, 36703918080 bytes
> 255 heads, 63 sectors/track, 4462 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
> 
>Device Boot  Start End  Blocks   Id  System
> /dev/sdc1   *   1 254 2040223+  83  Linux
> /dev/sdc2 255 379 1004062+  83  Linux
> 
>  /dev/sdc2 size not aligned to 4K
> 
>  at this time bd_block_size == 512 so generic_write_check
>  performed correctly
> # dd if=/dev/zero of=/dev/sdc2 bs=1k count=7 seek=1004058
> dd: writing `/dev/sdc2': No space left on device
> 5+0 records in
> 4+0 records out
> 
>  this bdev contain ext4fs with blksize = 4K
> # mount /dev/sdc2 /mnt/
>  after we mounted this bdev bd_block_size == fsblksize == 4K
> 
>  the same write operation failed with EIO
> # dd if=/dev/zero of=/dev/sdc2 bs=1k count=7 seek=1004058
> dd: writing `/dev/sdc2': Input/output error
> 3+0 records in
> 2+0 records out
>  Attempt to write whole fsblock result write access outside
>  blkdevice and cause -EIO (returned by blkdev_get_block)
>  TEST_CASE_END
> 
> Signed-off-by: Dmitry Monakhov <[EMAIL PROTECTED]>
> ---
>  mm/filemap.c |4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2c8776b..a23ee8a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1867,9 +1867,11 @@ inline int generic_write_checks(struct file *file, loff_t *pos, size_t *count, i
>   } else {
>  #ifdef CONFIG_BLOCK
>   loff_t isize;
> + unsigned int blksize;
>   if (bdev_read_only(I_BDEV(inode)))
>   return -EPERM;
> - isize = i_size_read(inode);
> + blksize = block_size(I_BDEV(inode));
> + isize = i_size_read(inode) & ~(blksize - 1);
>   if (*pos >= isize) {
>   if (*count || *pos > isize)
>   return -ENOSPC;

Can't say I really like the idea of adding additional overhead in this
hotpath for such an odd case.  Is there a faster way of doing it?  Maybe
adjust i_size, perhaps when the blocksize gets changed?
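
For anyone following along, the masking in the patch rounds the device
size down to a whole number of blocks. The sketch below reproduces the
arithmetic in userspace using the sizes from the test case (125
cylinders * 8225280 bytes = 1028160000 bytes for /dev/sdc2).
usable_size() is a hypothetical helper, not kernel code. Note that the
sketch widens the block size to 64 bits before inverting it: with a
32-bit unsigned blksize, ~(blksize - 1) is zero-extended when combined
with a 64-bit size, which would also clear the upper 32 bits.

```c
#include <stdint.h>

/*
 * Round a device size down to a multiple of the block size.
 * blksize must be a power of two for the mask trick to be valid.
 * The cast makes the mask 64 bits wide before it is inverted.
 */
static int64_t usable_size(int64_t isize, unsigned int blksize)
{
	return isize & ~((int64_t)blksize - 1);
}
```

With a 512-byte block size the partition size is unchanged; with 4K
blocks the last 2560 bytes fall outside the final whole block.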

Re: [BUGFIX] x86_64: NX bit handling in change_page_attr

2007-09-11 Thread Andrew Morton

On Fri, 17 Aug 2007 13:28:38 +0800 "Huang, Ying" <[EMAIL PROTECTED]> wrote:

> This patch fixes a bug of change_page_attr/change_page_attr_addr on
> Intel x86_64 CPU. After changing page attribute to be executable with
> these functions, the page remains un-executable on Intel x86_64
> CPU. Because on Intel x86_64 CPU, only if the "NX" bits of all four
> level page tables are cleared, the corresponding page is executable
> (refer to section 4.13.2 of Intel 64 and IA-32 Architectures Software
> Developer's Manual). So, the bug is fixed through clearing the "NX"
> bit of PMD when splitting the huge PMD.
> 
> Signed-off-by: Huang Ying <[EMAIL PROTECTED]>
> 
> ---
> 
> Index: linux-2.6.23-rc2-mm2/arch/x86_64/mm/pageattr.c
> ===
> --- linux-2.6.23-rc2-mm2.orig/arch/x86_64/mm/pageattr.c   2007-08-17 12:50:25.0 +0800
> +++ linux-2.6.23-rc2-mm2/arch/x86_64/mm/pageattr.c   2007-08-17 12:50:48.0 +0800
> @@ -147,6 +147,7 @@
>   split = split_large_page(address, prot, ref_prot2);
>   if (!split)
>   return -ENOMEM;
> + pgprot_val(ref_prot2) &= ~_PAGE_NX;
>   set_pte(kpte, mk_pte(split, ref_prot2));
>   kpte_page = split;
>   }

What happened with this?  Still valid?
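
As a reminder of the rule the changelog relies on: a page is
executable only if NX is clear in the entries at all four page-table
levels. A minimal userspace sketch of that rule follows;
page_is_executable() and the DEMO_ names are hypothetical, and the
array simply stands in for the four entries of one page-table walk:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_NX (1ULL << 63)	/* NX is bit 63 of an x86_64 entry */
#define DEMO_LEVELS  4			/* PGD, PUD, PMD, PTE */

/*
 * entry[] holds the page-table entries used to translate one address,
 * one per level.  NX set at any level makes the page non-executable.
 */
static bool page_is_executable(const uint64_t entry[DEMO_LEVELS])
{
	for (int i = 0; i < DEMO_LEVELS; i++)
		if (entry[i] & DEMO_PAGE_NX)
			return false;
	return true;
}
```

This is why clearing NX only in the new PTEs is not enough after a
large page is split: the PMD entry pointing at them must have NX clear
as well.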

Re: [PATCH] Moxa: Fix tiny compiler warning when building withoug CONFIG_PCI

2007-09-11 Thread Andrew Morton

On Fri, 17 Aug 2007 00:08:58 +0200 Jesper Juhl <[EMAIL PROTECTED]> wrote:

> 
> Fix this tiny compiler warning in Moxa driver : 
>   drivers/char/mxser.c:386: warning: 'mxser_get_PCI_conf' declared 'static' but never defined
> when building without CONFIG_PCI.
> 
> 
> Signed-off-by: Jesper Juhl <[EMAIL PROTECTED]>
> ---
> 
>  drivers/char/mxser.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/char/mxser.c b/drivers/char/mxser.c
> index 2aee3fe..83b15b5 100644
> --- a/drivers/char/mxser.c
> +++ b/drivers/char/mxser.c
> @@ -383,7 +383,9 @@ static int mxser_init(void);
>  
>  /* static void   mxser_poll(unsigned long); */
>  static int mxser_get_ISA_conf(int, struct mxser_hwconf *);
> +#ifdef CONFIG_PCI
>  static int mxser_get_PCI_conf(int, int, int, struct mxser_hwconf *);
> +#endif
>  static void mxser_do_softint(struct work_struct *);
>  static int mxser_open(struct tty_struct *, struct file *);
>  static void mxser_close(struct tty_struct *, struct file *);
> 

mxser_get_PCI_conf() is defined before it is used anyway.  So that
prototype is a stupid waste of space and just adds problems.

--- a/drivers/char/mxser.c~mxser-fix-compiler-warning-when-building-withoug-config_pci
+++ a/drivers/char/mxser.c
@@ -383,7 +383,6 @@ static int mxser_init(void);
 
 /* static void   mxser_poll(unsigned long); */
 static int mxser_get_ISA_conf(int, struct mxser_hwconf *);
-static int mxser_get_PCI_conf(int, int, int, struct mxser_hwconf *);
 static void mxser_do_softint(struct work_struct *);
 static int mxser_open(struct tty_struct *, struct file *);
 static void mxser_close(struct tty_struct *, struct file *);
_


Re: SYSFS: need a noncaching read

2007-09-11 Thread Michael Ellerman

On Wed, 2007-09-12 at 12:05 +1000, David Gibson wrote:
> On Tue, Sep 11, 2007 at 11:43:17AM +0200, Heiko Schocher wrote:
> > Hello,
> > 
> > I have developed a device driver and use the sysFS to export some
> > registers to userspace. I opened the sysFS File for one register and did
> > some reads from this File, but I alwas becoming the same value from the
> > register, whats not OK, because they are changing. So I found out that
> > the sysFS caches the reads ... :-(
> > 
> > Is there a way to retrigger the reads (in that way, that the sysFS
> > rereads the values from the driver), without closing and opening the
> > sysFS Files? Or must I better use the ioctl () Driver-interface for
> > exporting these registers?
> > 
> > I am asking this, because I must read every 10 ms 2 registers, so
> > doing a open/read/close for reading one registers is a little bit too
> > much overhead.
> > 
> > I made a sysFS seek function, which retriggers the read, and that works
> > fine, but I have again 2 syscalls, whats also is not optimal.
> > 
> > Or can we make a open () with a (new?)Flag, that informs the sysFS to
> > always reread the values from the underlying driver?
> > 
> > Or a new flag in the "struct attribute_group" in include/linux/sysfs.h,
> > which let the sysfs rereading the values?
> 
> This sounds more like sysfs is really not the right interface for
> polling your registers.  You would probably be better off having your
> driver export a character device from which the register values could
> be read.

I thought relay(fs) was the trendy way to do this these days?

Documentation/filesystems/relay.txt

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person



Re: [PATCH 21/23] mm: per device dirty threshold

2007-09-11 Thread John Stoffel


Peter> Scale writeback cache per backing device, proportional to its
Peter> writeout speed.  By decoupling the BDI dirty thresholds a
Peter> number of problems we currently have will go away, namely:

Ah, this clarifies my questions!  Thanks!

Peter>  - mutual interference starvation (for any number of BDIs);
Peter>  - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

Peter> It might be that all dirty pages are for a single BDI while
Peter> other BDIs are idling. By giving each BDI a 'fair' share of the
Peter> dirty limit, each one can have dirty pages outstanding and make
Peter> progress.

Question: can you change (shrink) the limit on a BDI while it has IO
in flight?  And what will that do to the system?  I.e. suppose one
device is doing IO, so that it has a majority of the dirty limit, and
then another, *faster* device starts IO.  How quickly/slowly do the
BDI dirty limits change for both the old and new device?

Peter> A global threshold also creates a deadlock for stacked BDIs;
Peter> when A writes to B, and A generates enough dirty pages to get
Peter> throttled, B will never start writeback until the dirty pages
Peter> go away. Again, by giving each BDI its own 'independent' dirty
Peter> limit, this problem is avoided.

Peter> So the problem is to determine how to distribute the total
Peter> dirty limit across the BDIs fairly and efficiently. A DBI that

You mean BDI here, not DBI.  

Peter> has a large dirty limit but does not have any dirty pages
Peter> outstanding is a waste.

Peter> What is done is to keep a floating proportion between the DBIs
Peter> based on writeback completions. This way faster/more active
Peter> devices get a larger share than slower/idle devices.

Does a slower device get a BDI which is calculated to keep its limit
under a certain number of seconds of outstanding IO?  This way no
device can build up more than say 15 seconds of outstanding IO to
flush at any one time.
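
To make the proportional split described above concrete, here is a
minimal sketch of the arithmetic only. It is not the code from Peter's
patches, which keep a floating (decaying) proportion; bdi_dirty_share()
is a hypothetical helper, and handing the whole limit to a device when
nothing has completed yet is an assumption made just for this example.

```c
#include <stdint.h>

/*
 * Share of the total dirty limit for one device, proportional to its
 * fraction of recent writeback completions.
 *
 * total_limit: global dirty limit, in pages
 * done:        recent completions on this device
 * total_done:  recent completions on all devices
 *
 * Assumes total_limit * done fits in 64 bits.
 */
static uint64_t bdi_dirty_share(uint64_t total_limit, uint64_t done,
				uint64_t total_done)
{
	if (total_done == 0)
		return total_limit;	/* assumption: no history, no split yet */
	return total_limit * done / total_done;
}
```

With a limit of 1000 pages and completions split 3:1 between two
devices, the faster device gets 750 pages and the slower one 250.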

Thanks!
John

Re: [PATCH] powerpc: add new required termio functions

2007-09-11 Thread Tony Breeds

On Tue, Sep 11, 2007 at 07:17:42PM -0700, Linus Torvalds wrote:
 
> Really?
> 
> It shouldn't. The use of kernel_termios_to_user_termios_1() is conditional 
> on the architecture having a define for TCGETS2, and I think they match 
> up. I see:
> 
>   [EMAIL PROTECTED] linux]$ git grep -l kernel_termios_to_user_termios_1 include | wc -l
>   10
>   [EMAIL PROTECTED] linux]$ git grep -l TCGETS2 include | wc -l
>   10
> 
> and in neither case is ppc in that list of architectures.
> 
> So maybe you just read the patch without actually testing whether it 
> actually broke powerpc?
> 
> Or is something subtler going on?

As far as I can see TIOCSLCKTRMIOS and TIOCGLCKTRMIOS aren't protected
by TCGETS2 guards.  Do they need to be ...  Perhaps


From: Tony Breeds <[EMAIL PROTECTED]>

Add Guards around TIOCSLCKTRMIOS and TIOCGLCKTRMIOS.

Signed-off-by: Tony Breeds <[EMAIL PROTECTED]>

---

 drivers/char/tty_ioctl.c |   14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/drivers/char/tty_ioctl.c b/drivers/char/tty_ioctl.c
index 4a8969c..3ee73cf 100644
--- a/drivers/char/tty_ioctl.c
+++ b/drivers/char/tty_ioctl.c
@@ -795,6 +795,19 @@ int n_tty_ioctl(struct tty_struct * tty, struct file * file,
if (L_ICANON(tty))
retval = inq_canon(tty);
return put_user(retval, (unsigned int __user *) arg);
+#ifndef TCGETS2
+   case TIOCGLCKTRMIOS:
+   if (kernel_termios_to_user_termios((struct termios __user *)arg, real_tty->termios_locked))
+   return -EFAULT;
+   return 0;
+
+   case TIOCSLCKTRMIOS:
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+   if (user_termios_to_kernel_termios(real_tty->termios_locked, (struct termios __user *) arg))
+   return -EFAULT;
+   return 0;
+#else
case TIOCGLCKTRMIOS:
if (kernel_termios_to_user_termios_1((struct termios __user *)arg, real_tty->termios_locked))
return -EFAULT;
@@ -806,6 +819,7 @@ int n_tty_ioctl(struct tty_struct * tty, struct file * file,
if (user_termios_to_kernel_termios_1(real_tty->termios_locked, (struct termios __user *) arg))
return -EFAULT;
return 0;
+#endif
 
case TIOCPKT:
{

Yours Tony

  linux.conf.auhttp://linux.conf.au/ || http://lca2008.linux.org.au/
  Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!


Re: [PATCH] powerpc: add new required termio functions

2007-09-11 Thread Michael Neuling

> On Wed, 12 Sep 2007, Michael Neuling wrote:
> >
> > The "tty: termios locking functions break with new termios type" patch
> > (f629307c857c030d5a3dd777fee37c8bb395e171) breaks the powerpc compile.
> 
> Really?
> 
> It shouldn't. The use of kernel_termios_to_user_termios_1() is conditional 
> on the architecture having a define for TCGETS2, and I think they match 
> up. I see:
> 
>   [EMAIL PROTECTED] linux]$ git grep -l kernel_termios_to_user_termios_1 include | wc -l
>   10
>   [EMAIL PROTECTED] linux]$ git grep -l TCGETS2 include | wc -l
>   10
> 
> and in neither case is ppc in that list of architectures.
> 
> So maybe you just read the patch without actually testing whether it 
> actually broke powerpc?

No, I actually compiled it.

> Or is something subtler going on?

Looks like those new calls are not protected by the TCGETS2 define.
Adding those ifdefs seems like the correct fix.  
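
A minimal userspace sketch of that kind of guard, in which the call
site picks the "_1" variant only where it exists. All demo_ names and
the DEMO_TCGETS2 macro are hypothetical stand-ins for the real termios
helpers and TCGETS2; this is an illustration, not the actual fix:

```c
#include <string.h>

/* Hypothetical stand-in for struct termios. */
struct demo_termios {
	unsigned int c_iflag, c_oflag, c_cflag, c_lflag;
};

/* Legacy helper: assumed to exist on every architecture. */
static int demo_copy_termios(struct demo_termios *dst,
			     const struct demo_termios *src)
{
	memcpy(dst, src, sizeof(*dst));
	return 0;
}

#ifdef DEMO_TCGETS2
/* Only architectures that define DEMO_TCGETS2 provide the _1 variant. */
static int demo_copy_termios_1(struct demo_termios *dst,
			       const struct demo_termios *src)
{
	memcpy(dst, src, sizeof(*dst));
	return 0;
}
#define demo_copy_locked(dst, src) demo_copy_termios_1(dst, src)
#else
/* Without DEMO_TCGETS2 the _1 variant does not exist; use the legacy one. */
#define demo_copy_locked(dst, src) demo_copy_termios(dst, src)
#endif
```

Built without DEMO_TCGETS2, the code never references
demo_copy_termios_1(), so an architecture lacking it still compiles.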

Mikey



Re: [PATCH 00/23] per device dirty throttling -v10

2007-09-11 Thread John Stoffel


Peter> Per device dirty throttling patches These patches aim to
Peter> improve balance_dirty_pages() and directly address three
Peter> issues:

Peter>   1) inter device starvation
Peter>   2) stacked device deadlocks
Peter>   3) inter process starvation

Peter> 1 and 2 are a direct result from removing the global dirty
Peter> limit and using per device dirty limits. By giving each device
Peter> its own dirty limit it will no longer starve another device,
Peter> and the cyclic dependency on the dirty limit is broken.

Ye haa!  This should be a big improvement.  

Peter> In order to efficiently distribute the dirty limit across the
Peter> independent devices a floating proportion is used; this will
Peter> allocate a share of the total limit proportional to the
Peter> device's recent activity.

I'm not sure I like or agree with this.  Shouldn't we be limiting
based on the device's capability to sustain traffic?  So if I have a
RAID device which can read/write a total of 100Mb/sec, while at the
same time I've got a CF device which can do 5Mb/sec, shouldn't we be
more strongly limiting the CF device, even if it is the only device
being written to?  

Of course, I haven't read the patches yet, nor am I qualified to
comment on them in any meaningful way, I think.  Hopefully I'm just
missing something key here in the explanation.

Peter> 3 is done by also scaling the dirty limit proportional to the
Peter> current task's recent dirty rate.

Do you mean task or device here?  I'm just wondering how well this
works with a bunch of devices with wildly varying speeds.  

John

Re: [PATCH] powerpc: add new required termio functions

2007-09-11 Thread Linus Torvalds

On Wed, 12 Sep 2007, Michael Neuling wrote:
>
> The "tty: termios locking functions break with new termios type" patch
> (f629307c857c030d5a3dd777fee37c8bb395e171) breaks the powerpc compile.

Really?

It shouldn't. The use of kernel_termios_to_user_termios_1() is conditional 
on the architecture having a define for TCGETS2, and I think they match 
up. I see:

[EMAIL PROTECTED] linux]$ git grep -l kernel_termios_to_user_termios_1 include | wc -l
10
[EMAIL PROTECTED] linux]$ git grep -l TCGETS2 include | wc -l
10

and in neither case is ppc in that list of architectures.

So maybe you just read the patch without actually testing whether it 
actually broke powerpc?

Or is something subtler going on?

Linus

[PATCH -mm] ssb: Make pcmciahost depend on PCMCIA=y

2007-09-11 Thread Paul Mundt

SSB uses a bool (SSB_PCMCIAHOST_POSSIBLE) to determine whether to
build in PCMCIA support or not. As the PCMCIA host code itself is
also only a bool, make SSB_PCMCIAHOST_POSSIBLE depend on PCMCIA=y.

Without this, SSB_PCMCIAHOST_POSSIBLE evaluates to y when PCMCIA
is built as a module, which results in link errors due to the
pcmcia_access_configuration_register() accesses, where the symbol
is only defined in a module.

Signed-off-by: Paul Mundt <[EMAIL PROTECTED]>

--

 drivers/ssb/Kconfig |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.23-rc4-mm1.orig/drivers/ssb/Kconfig   2007-09-11 15:15:52.0 +0900
+++ linux-2.6.23-rc4-mm1/drivers/ssb/Kconfig   2007-09-12 10:51:53.0 +0900
@@ -37,7 +37,7 @@
 
 config SSB_PCMCIAHOST_POSSIBLE
bool
-   depends on SSB && PCMCIA && EXPERIMENTAL
+   depends on SSB && PCMCIA=y && EXPERIMENTAL
default y
 
 config SSB_PCMCIAHOST
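
The behaviour described in the changelog can be modelled in a few
lines. This is only a sketch of the tristate logic as described above
(a bool symbol cannot be m, so a dependency that evaluates to m ends up
as y, while "PCMCIA=y" is true only for an exact y). The enum and all
function names are hypothetical and are not taken from scripts/kconfig:

```c
enum tristate { TRI_N = 0, TRI_M = 1, TRI_Y = 2 };

/* Kconfig "&&" yields the minimum of its operands. */
static enum tristate tri_and(enum tristate a, enum tristate b)
{
	return a < b ? a : b;
}

/* "SYM=y" is y only when SYM is exactly y, otherwise n. */
static enum tristate tri_eq_y(enum tristate a)
{
	return a == TRI_Y ? TRI_Y : TRI_N;
}

/* A bool symbol cannot be m: a dependency value of m is promoted to y. */
static enum tristate bool_value(enum tristate dep)
{
	return dep == TRI_M ? TRI_Y : dep;
}

/* Before the patch: depends on SSB && PCMCIA && EXPERIMENTAL */
static enum tristate possible_old(enum tristate ssb, enum tristate pcmcia,
				  enum tristate experimental)
{
	return bool_value(tri_and(tri_and(ssb, pcmcia), experimental));
}

/* After the patch: depends on SSB && PCMCIA=y && EXPERIMENTAL */
static enum tristate possible_new(enum tristate ssb, enum tristate pcmcia,
				  enum tristate experimental)
{
	return bool_value(tri_and(tri_and(ssb, tri_eq_y(pcmcia)), experimental));
}
```

With SSB=y, EXPERIMENTAL=y and PCMCIA=m, the old expression yields y
(hence the link errors) and the new one yields n.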

Re: [RFC] Union Mount: Readdir approaches

2007-09-11 Thread hooanon05


"Josef 'Jeff' Sipek":
> So, if I understand correctly, you create the entire block as if you were
> going to write to disk? Unionfs keeps the data in a linked list.

Basically yes.
But the dir block in the cache has no holes; it is contiguous memory.


Junjiro Okajima
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH -mm] fs: define file_fsync() even for CONFIG_BLOCK=n

2007-09-11 Thread Paul Mundt

There's nothing that is problematic for file_fsync() with CONFIG_BLOCK=n,
and it's built in unconditionally anyways, so move the prototype out to
reflect that. Without this, the unionfs build bails out.

  CC  fs/unionfs/file.o
fs/unionfs/file.c:148: error: 'file_fsync' undeclared here (not in a function)
make[2]: *** [fs/unionfs/file.o] Error 1
make[2]: *** Waiting for unfinished jobs
make[1]: *** [fs/unionfs] Error 2

Signed-off-by: Paul Mundt <[EMAIL PROTECTED]>

--

 include/linux/buffer_head.h |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux-2.6.23-rc4-mm1.orig/include/linux/buffer_head.h   2007-09-11 15:15:56.0 +0900
+++ linux-2.6.23-rc4-mm1/include/linux/buffer_head.h   2007-09-12 10:18:57.0 +0900
@@ -14,6 +14,8 @@
 #include 
 #include 
 
+int file_fsync(struct file *, struct dentry *, int);
+
 #ifdef CONFIG_BLOCK
 
 enum bh_state_bits {
@@ -225,7 +227,6 @@
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
 int generic_commit_write(struct file *, struct page *, unsigned, unsigned);
 int block_truncate_page(struct address_space *, loff_t, get_block_t *);
-int file_fsync(struct file *, struct dentry *, int);
 int nobh_prepare_write(struct page*, unsigned, unsigned, get_block_t*);
 int nobh_commit_write(struct file *, struct page *, unsigned, unsigned);
 int nobh_truncate_page(struct address_space *, loff_t);

[PATCH] powerpc: add new required termio functions

2007-09-11 Thread Michael Neuling

The "tty: termios locking functions break with new termios type" patch
(f629307c857c030d5a3dd777fee37c8bb395e171) breaks the powerpc compile.
This adds the required API to asm-powerpc.

Signed-off-by: Michael Neuling <[EMAIL PROTECTED]>
--
This needs to go up for 2.6.23.

Should we really put these definitions in asm-generic/termios.h as I'm
guessing other architectures are broken too?

[EMAIL PROTECTED]/ % git grep kernel_termios_to_user_termios_1
asm-arm/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-cris/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-h8300/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-i386/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-ia64/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-m32r/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-m68k/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-mips/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-v850/termios.h:#define kernel_termios_to_user_termios_1(u, k)
asm-x86_64/termios.h:#define kernel_termios_to_user_termios_1(u, k)

 include/asm-powerpc/termios.h |3 +++
 1 file changed, 3 insertions(+)

Index: linux-2.6-ozlabs/include/asm-powerpc/termios.h
===
--- linux-2.6-ozlabs.orig/include/asm-powerpc/termios.h
+++ linux-2.6-ozlabs/include/asm-powerpc/termios.h
@@ -80,6 +80,9 @@ struct termio {
 
 #include 
 
+#define user_termios_to_kernel_termios_1(k, u) copy_from_user(k, u, sizeof(struct termios))
+#define kernel_termios_to_user_termios_1(u, k) copy_to_user(u, k, sizeof(struct termios))
+
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_POWERPC_TERMIOS_H */


Re: SYSFS: need a noncaching read

2007-09-11 Thread David Gibson

On Tue, Sep 11, 2007 at 11:43:17AM +0200, Heiko Schocher wrote:
> Hello,
> 
> I have developed a device driver and use the sysFS to export some
> registers to userspace. I opened the sysFS File for one register and did
> some reads from this File, but I alwas becoming the same value from the
> register, whats not OK, because they are changing. So I found out that
> the sysFS caches the reads ... :-(
> 
> Is there a way to retrigger the reads (in that way, that the sysFS
> rereads the values from the driver), without closing and opening the
> sysFS Files? Or must I better use the ioctl () Driver-interface for
> exporting these registers?
> 
> I am asking this, because I must read every 10 ms 2 registers, so
> doing a open/read/close for reading one registers is a little bit too
> much overhead.
> 
> I made a sysFS seek function, which retriggers the read, and that works
> fine, but I have again 2 syscalls, whats also is not optimal.
> 
> Or can we make a open () with a (new?)Flag, that informs the sysFS to
> always reread the values from the underlying driver?
> 
> Or a new flag in the "struct attribute_group" in include/linux/sysfs.h,
> which let the sysfs rereading the values?

This sounds more like sysfs is really not the right interface for
polling your registers.  You would probably be better off having your
driver export a character device from which the register values could
be read.
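
On the "two syscalls" point (seek plus read): from userspace, pread()
at offset 0 combines both into one call. A minimal sketch follows;
read_from_start() is a hypothetical helper. Whether a read at offset 0
makes sysfs call back into the driver again is an assumption that
should be checked against the kernel in use.

```c
#define _XOPEN_SOURCE 700	/* for pread() under strict compiler modes */
#include <sys/types.h>
#include <unistd.h>

/*
 * Read up to len - 1 bytes starting at offset 0 of an already open
 * file, in a single syscall, and NUL-terminate the result.
 * Returns the number of bytes read, or -1 on error.
 */
static ssize_t read_from_start(int fd, char *buf, size_t len)
{
	ssize_t n;

	if (len == 0)
		return -1;
	n = pread(fd, buf, len - 1, 0);	/* does not move the file offset */
	if (n >= 0)
		buf[n] = '\0';
	return n;
}
```

The file descriptor stays open between polls, so each 10 ms sample
costs one syscall per register instead of open/read/close.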

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

[PATCH 08/10] ia64: Convert cpu_sibling_map to a per_cpu data array (v3)

2007-09-11 Thread travis

Convert cpu_sibling_map to a per_cpu cpumask_t array for the ia64
architecture.  This fixes build errors in block/blktrace.c and
kernel/sched.c when CONFIG_SCHED_SMT is defined.


There was one access to cpu_sibling_map before the per_cpu data
area was created, so that step was moved to after the per_cpu
area is set up.

Tested and verified on an A4700.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/ia64/kernel/setup.c|4 
 arch/ia64/kernel/smpboot.c  |   18 ++
 arch/ia64/mm/contig.c   |6 ++
 include/asm-ia64/smp.h  |2 +-
 include/asm-ia64/topology.h |2 +-
 5 files changed, 18 insertions(+), 14 deletions(-)

--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -528,10 +528,6 @@
 
 #ifdef CONFIG_SMP
cpu_physical_id(0) = hard_smp_processor_id();
-
-   cpu_set(0, cpu_sibling_map[0]);
-   cpu_set(0, cpu_core_map[0]);
-
check_for_logical_procs();
if (smp_num_cpucores > 1)
printk(KERN_INFO
--- a/arch/ia64/kernel/smpboot.c
+++ b/arch/ia64/kernel/smpboot.c
@@ -138,7 +138,9 @@
 EXPORT_SYMBOL(cpu_possible_map);
 
 cpumask_t cpu_core_map[NR_CPUS] __cacheline_aligned;
-cpumask_t cpu_sibling_map[NR_CPUS] __cacheline_aligned;
+DEFINE_PER_CPU_SHARED_ALIGNED(cpumask_t, cpu_sibling_map);
+EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
+
 int smp_num_siblings = 1;
 int smp_num_cpucores = 1;
 
@@ -650,12 +652,12 @@
 {
int i;
 
-   for_each_cpu_mask(i, cpu_sibling_map[cpu])
-   cpu_clear(cpu, cpu_sibling_map[i]);
+   for_each_cpu_mask(i, per_cpu(cpu_sibling_map, cpu))
+   cpu_clear(cpu, per_cpu(cpu_sibling_map, i));
for_each_cpu_mask(i, cpu_core_map[cpu])
cpu_clear(cpu, cpu_core_map[i]);
 
-   cpu_sibling_map[cpu] = cpu_core_map[cpu] = CPU_MASK_NONE;
+   per_cpu(cpu_sibling_map, cpu) = cpu_core_map[cpu] = CPU_MASK_NONE;
 }
 
 static void
@@ -666,7 +668,7 @@
if (cpu_data(cpu)->threads_per_core == 1 &&
cpu_data(cpu)->cores_per_socket == 1) {
cpu_clear(cpu, cpu_core_map[cpu]);
-   cpu_clear(cpu, cpu_sibling_map[cpu]);
+   cpu_clear(cpu, per_cpu(cpu_sibling_map, cpu));
return;
}
 
@@ -807,8 +809,8 @@
cpu_set(i, cpu_core_map[cpu]);
cpu_set(cpu, cpu_core_map[i]);
if (cpu_data(cpu)->core_id == cpu_data(i)->core_id) {
-   cpu_set(i, cpu_sibling_map[cpu]);
-   cpu_set(cpu, cpu_sibling_map[i]);
+   cpu_set(i, per_cpu(cpu_sibling_map, cpu));
+   cpu_set(cpu, per_cpu(cpu_sibling_map, i));
}
}
}
@@ -839,7 +841,7 @@
 
if (cpu_data(cpu)->threads_per_core == 1 &&
cpu_data(cpu)->cores_per_socket == 1) {
-   cpu_set(cpu, cpu_sibling_map[cpu]);
+   cpu_set(cpu, per_cpu(cpu_sibling_map, cpu));
cpu_set(cpu, cpu_core_map[cpu]);
return 0;
}
--- a/include/asm-ia64/smp.h
+++ b/include/asm-ia64/smp.h
@@ -58,7 +58,7 @@
 
 extern cpumask_t cpu_online_map;
 extern cpumask_t cpu_core_map[NR_CPUS];
-extern cpumask_t cpu_sibling_map[NR_CPUS];
+DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 extern int smp_num_siblings;
 extern int smp_num_cpucores;
 extern void __iomem *ipi_base_addr;
--- a/include/asm-ia64/topology.h
+++ b/include/asm-ia64/topology.h
@@ -112,7 +112,7 @@
 #define topology_physical_package_id(cpu)  (cpu_data(cpu)->socket_id)
 #define topology_core_id(cpu)  (cpu_data(cpu)->core_id)
 #define topology_core_siblings(cpu)(cpu_core_map[cpu])
-#define topology_thread_siblings(cpu)  (cpu_sibling_map[cpu])
+#define topology_thread_siblings(cpu)  (per_cpu(cpu_sibling_map, cpu))
 #define smt_capable()  (smp_num_siblings > 1)
 #endif
 
--- a/arch/ia64/mm/contig.c
+++ b/arch/ia64/mm/contig.c
@@ -212,6 +212,12 @@
cpu_data += PERCPU_PAGE_SIZE;
per_cpu(local_per_cpu_offset, cpu) = __per_cpu_offset[cpu];
}
+   /*
+* cpu_sibling_map is now a per_cpu variable - it needs to
+* be accessed after per_cpu_init() sets up the per_cpu area.
+*/
+   cpu_set(0, per_cpu(cpu_sibling_map, 0));
+   cpu_set(0, cpu_core_map[0]);
}
return __per_cpu_start + __per_cpu_offset[smp_processor_id()];
 }
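
For readers unfamiliar with the pattern, here is a userspace sketch of
the accessor style the patch switches to, including the symmetric
marking of two sibling threads seen in the smpboot.c hunks. All demo_
names are hypothetical, and a flat array stands in for the real per-CPU
areas, which are laid out quite differently in the kernel:

```c
#include <stdint.h>

#define DEMO_NR_CPUS 8

/* Userspace stand-in for a per-CPU cpumask_t: one 64-bit mask per CPU. */
static uint64_t demo_sibling_map[DEMO_NR_CPUS];

/* Accessor in the style of per_cpu(cpu_sibling_map, cpu). */
#define demo_per_cpu_sibling_map(cpu) (demo_sibling_map[(cpu)])

/* Set the bit for one CPU in a mask. */
static void demo_cpu_set(int cpu, uint64_t *mask)
{
	*mask |= 1ULL << cpu;
}

/* Record that a and b are threads of the same core, in both directions. */
static void demo_mark_siblings(int a, int b)
{
	demo_cpu_set(b, &demo_per_cpu_sibling_map(a));
	demo_cpu_set(a, &demo_per_cpu_sibling_map(b));
}
```

Call sites only ever go through the accessor, so the storage behind it
can change without touching them.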

-- 

[PATCH 09/10] ppc64: Convert cpu_sibling_map to a per_cpu data array (v3)

2007-09-11 Thread travis

Convert cpu_sibling_map to a per_cpu cpumask_t array for the ppc64
architecture.  This fixes build errors in block/blktrace.c and
kernel/sched.c when CONFIG_SCHED_SMT is defined.

Note: these changes have not been built nor tested.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/powerpc/kernel/setup-common.c|4 ++--
 arch/powerpc/kernel/smp.c |4 ++--
 arch/powerpc/platforms/cell/cbe_cpufreq.c |2 +-
 include/asm-powerpc/smp.h |4 +++-
 include/asm-powerpc/topology.h|2 +-
 5 files changed, 9 insertions(+), 7 deletions(-)

--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -415,9 +415,9 @@
 * Do the sibling map; assume only two threads per processor.
 */
for_each_possible_cpu(cpu) {
-   cpu_set(cpu, cpu_sibling_map[cpu]);
+   cpu_set(cpu, cpu_sibling_map(cpu));
if (cpu_has_feature(CPU_FTR_SMT))
-   cpu_set(cpu ^ 0x1, cpu_sibling_map[cpu]);
+   cpu_set(cpu ^ 0x1, cpu_sibling_map(cpu));
}
 
vdso_data->processorCount = num_present_cpus();
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -61,11 +61,11 @@
 
 cpumask_t cpu_possible_map = CPU_MASK_NONE;
 cpumask_t cpu_online_map = CPU_MASK_NONE;
-cpumask_t cpu_sibling_map[NR_CPUS] = { [0 ... NR_CPUS-1] = CPU_MASK_NONE };
+DEFINE_PER_CPU(cpumask_t, cpu_sibling_map) = CPU_MASK_NONE;
 
 EXPORT_SYMBOL(cpu_online_map);
 EXPORT_SYMBOL(cpu_possible_map);
-EXPORT_SYMBOL(cpu_sibling_map);
+EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
--- a/arch/powerpc/platforms/cell/cbe_cpufreq.c
+++ b/arch/powerpc/platforms/cell/cbe_cpufreq.c
@@ -119,7 +119,7 @@
policy->cur = cbe_freqs[cur_pmode].frequency;
 
 #ifdef CONFIG_SMP
-   policy->cpus = cpu_sibling_map[policy->cpu];
+   policy->cpus = cpu_sibling_map(policy->cpu);
 #endif
 
cpufreq_frequency_table_get_attr(cbe_freqs, policy->cpu);
--- a/include/asm-powerpc/smp.h
+++ b/include/asm-powerpc/smp.h
@@ -25,6 +25,7 @@
 
 #ifdef CONFIG_PPC64
 #include 
+#include 
 #endif
 
 extern int boot_cpuid;
@@ -58,7 +59,8 @@
(smp_hw_index[(cpu)] = (phys))
 #endif
 
-extern cpumask_t cpu_sibling_map[NR_CPUS];
+DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
+#define cpu_sibling_map(cpu) per_cpu(cpu_sibling_map, cpu)
 
 /* Since OpenPIC has only 4 IPIs, we use slightly different message numbers.
  *
--- a/include/asm-powerpc/topology.h
+++ b/include/asm-powerpc/topology.h
@@ -108,7 +108,7 @@
 #ifdef CONFIG_PPC64
 #include 
 
-#define topology_thread_siblings(cpu)  (cpu_sibling_map[cpu])
+#define topology_thread_siblings(cpu)  (cpu_sibling_map(cpu))
 #endif
 #endif
 


[PATCH 07/10] x86: acpi-use-cpu_physical_id (v3)

2007-09-11 Thread travis

This is from an earlier message from Christoph Lameter:

processor_core.c currently tries to determine the apicid by special casing
for IA64 and x86. The desired information is readily available via

cpu_physical_id()

on IA64, i386 and x86_64.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Additionally, boot_cpu_id needed to be exported to fix compile errors in
dma code when !CONFIG_SMP.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/mpparse.c  |2 ++
 drivers/acpi/processor_core.c |8 +---
 2 files changed, 3 insertions(+), 7 deletions(-)

--- a/drivers/acpi/processor_core.c
+++ b/drivers/acpi/processor_core.c
@@ -419,12 +419,6 @@
return 0;
 }
 
-#ifdef CONFIG_IA64
-#define arch_cpu_to_apicid ia64_cpu_to_sapicid
-#else
-#define arch_cpu_to_apicid x86_cpu_to_apicid
-#endif
-
 static int map_madt_entry(u32 acpi_id)
 {
unsigned long madt_end, entry;
@@ -498,7 +492,7 @@
return apic_id;
 
for (i = 0; i < NR_CPUS; ++i) {
-   if (arch_cpu_to_apicid[i] == apic_id)
+   if (cpu_physical_id(i) == apic_id)
return i;
}
return -1;
--- a/arch/x86_64/kernel/mpparse.c
+++ b/arch/x86_64/kernel/mpparse.c
@@ -57,6 +57,8 @@
 
 /* Processor that is doing the boot up */
 unsigned int boot_cpu_id = -1U;
+EXPORT_SYMBOL(boot_cpu_id);
+
 /* Internal processor count */
 unsigned int num_processors __cpuinitdata = 0;
 


[PATCH 10/10] sparc64: Convert cpu_sibling_map to a per_cpu data array (v3)

2007-09-11 Thread travis

Convert cpu_sibling_map to a per_cpu cpumask_t array for the sparc64
architecture.  This fixes build errors in block/blktrace.c and
kernel/sched.c when CONFIG_SCHED_SMT is defined.

Note: these changes have been neither built nor tested.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/sparc64/kernel/smp.c  |   17 -
 include/asm-sparc64/smp.h  |3 ++-
 include/asm-sparc64/topology.h |2 +-
 3 files changed, 11 insertions(+), 11 deletions(-)

--- a/arch/sparc64/kernel/smp.c
+++ b/arch/sparc64/kernel/smp.c
@@ -52,14 +52,13 @@
 
 cpumask_t cpu_possible_map __read_mostly = CPU_MASK_NONE;
 cpumask_t cpu_online_map __read_mostly = CPU_MASK_NONE;
-cpumask_t cpu_sibling_map[NR_CPUS] __read_mostly =
-   { [0 ... NR_CPUS-1] = CPU_MASK_NONE };
+DEFINE_PER_CPU(cpumask_t, cpu_sibling_map) = CPU_MASK_NONE;
 cpumask_t cpu_core_map[NR_CPUS] __read_mostly =
{ [0 ... NR_CPUS-1] = CPU_MASK_NONE };
 
 EXPORT_SYMBOL(cpu_possible_map);
 EXPORT_SYMBOL(cpu_online_map);
-EXPORT_SYMBOL(cpu_sibling_map);
+EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 EXPORT_SYMBOL(cpu_core_map);
 
 static cpumask_t smp_commenced_mask;
@@ -1259,16 +1258,16 @@
for_each_present_cpu(i) {
unsigned int j;
 
-   cpus_clear(cpu_sibling_map[i]);
+   cpus_clear(per_cpu(cpu_sibling_map, i));
if (cpu_data(i).proc_id == -1) {
-   cpu_set(i, cpu_sibling_map[i]);
+   cpu_set(i, per_cpu(cpu_sibling_map, i));
continue;
}
 
for_each_present_cpu(j) {
if (cpu_data(i).proc_id ==
cpu_data(j).proc_id)
-   cpu_set(j, cpu_sibling_map[i]);
+   cpu_set(j, per_cpu(cpu_sibling_map, i));
}
}
 }
@@ -1340,9 +1339,9 @@
cpu_clear(cpu, cpu_core_map[i]);
cpus_clear(cpu_core_map[cpu]);
 
-   for_each_cpu_mask(i, cpu_sibling_map[cpu])
-   cpu_clear(cpu, cpu_sibling_map[i]);
-   cpus_clear(cpu_sibling_map[cpu]);
+   for_each_cpu_mask(i, per_cpu(cpu_sibling_map, cpu))
+   cpu_clear(cpu, per_cpu(cpu_sibling_map, i));
+   cpus_clear(per_cpu(cpu_sibling_map, cpu));
 
c = &cpu_data(cpu);
 
--- a/include/asm-sparc64/smp.h
+++ b/include/asm-sparc64/smp.h
@@ -28,8 +28,9 @@
  
 #include 
 #include 
+#include 
 
-extern cpumask_t cpu_sibling_map[NR_CPUS];
+DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 extern cpumask_t cpu_core_map[NR_CPUS];
 extern int sparc64_multi_core;
 
--- a/include/asm-sparc64/topology.h
+++ b/include/asm-sparc64/topology.h
@@ -5,7 +5,7 @@
 #define topology_physical_package_id(cpu)  (cpu_data(cpu).proc_id)
 #define topology_core_id(cpu)  (cpu_data(cpu).core_id)
 #define topology_core_siblings(cpu)(cpu_core_map[cpu])
-#define topology_thread_siblings(cpu)  (cpu_sibling_map[cpu])
+#define topology_thread_siblings(cpu)  (per_cpu(cpu_sibling_map, cpu))
 #define mc_capable()   (sparc64_multi_core)
 #define smt_capable()  (sparc64_multi_core)
 #endif /* CONFIG_SMP */


[PATCH 04/10] x86: Convert cpu_sibling_map to be a per cpu variable (v3)

2007-09-11 Thread travis

Convert cpu_sibling_map from a static array sized by NR_CPUS to a
per_cpu variable.  This saves sizeof(cpumask_t) for each unused cpu.
Access is mostly from startup and CPU HOTPLUG functions.
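
As a rough illustration of the saving described above, the following
user-space sketch computes how many bytes a static NR_CPUS-sized array of
cpumask_t-like bitmaps spends on cpu slots that are never populated.  The
helper names (cpumask_bytes, wasted_bytes) and the example cpu counts are
assumptions made for this sketch and are not kernel code.  It also assumes
per_cpu areas are only allocated for cpus that can actually appear.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: size in bytes of a cpumask_t-like bitmap holding
 * nr_cpus bits, rounded up to a whole number of unsigned longs.
 */
static size_t cpumask_bytes(size_t nr_cpus)
{
	size_t bits_per_long = 8 * sizeof(unsigned long);

	return ((nr_cpus + bits_per_long - 1) / bits_per_long) *
		sizeof(unsigned long);
}

/*
 * Bytes occupied by array slots for cpus that never appear, when the
 * array is statically sized by nr_cpus but only possible_cpus are used.
 */
static size_t wasted_bytes(size_t nr_cpus, size_t possible_cpus)
{
	assert(possible_cpus <= nr_cpus);
	return cpumask_bytes(nr_cpus) * (nr_cpus - possible_cpus);
}
```

For example, wasted_bytes(256, 8) gives the bytes left unused by a
256-entry array on an 8-cpu machine.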

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/i386/kernel/cpu/cpufreq/p4-clockmod.c   |2 -
 arch/i386/kernel/cpu/cpufreq/speedstep-ich.c |2 -
 arch/i386/kernel/io_apic.c   |4 +--
 arch/i386/kernel/smpboot.c   |   36 +--
 arch/i386/oprofile/op_model_p4.c |2 -
 arch/i386/xen/smp.c  |4 +--
 arch/x86_64/kernel/smpboot.c |   26 +--
 block/blktrace.c |2 -
 include/asm-i386/smp.h   |2 -
 include/asm-i386/topology.h  |2 -
 include/asm-x86_64/smp.h |6 +++-
 include/asm-x86_64/topology.h|2 -
 kernel/sched.c   |8 +++---
 13 files changed, 50 insertions(+), 48 deletions(-)

--- a/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c
+++ b/arch/i386/kernel/cpu/cpufreq/p4-clockmod.c
@@ -200,7 +200,7 @@
unsigned int i;
 
 #ifdef CONFIG_SMP
-   policy->cpus = cpu_sibling_map[policy->cpu];
+   policy->cpus = per_cpu(cpu_sibling_map, policy->cpu);
 #endif
 
/* Errata workaround */
--- a/arch/i386/kernel/cpu/cpufreq/speedstep-ich.c
+++ b/arch/i386/kernel/cpu/cpufreq/speedstep-ich.c
@@ -322,7 +322,7 @@
 
/* only run on CPU to be set, or on its sibling */
 #ifdef CONFIG_SMP
-   policy->cpus = cpu_sibling_map[policy->cpu];
+   policy->cpus = per_cpu(cpu_sibling_map, policy->cpu);
 #endif
 
cpus_allowed = current->cpus_allowed;
--- a/arch/i386/kernel/io_apic.c
+++ b/arch/i386/kernel/io_apic.c
@@ -378,7 +378,7 @@
 
 #define IRQ_ALLOWED(cpu, allowed_mask) cpu_isset(cpu, allowed_mask)
 
-#define CPU_TO_PACKAGEINDEX(i) (first_cpu(cpu_sibling_map[i]))
+#define CPU_TO_PACKAGEINDEX(i) (first_cpu(per_cpu(cpu_sibling_map, i)))
 
 static cpumask_t balance_irq_affinity[NR_IRQS] = {
[0 ... NR_IRQS-1] = CPU_MASK_ALL
@@ -598,7 +598,7 @@
 * (A+B)/2 vs B
 */
load = CPU_IRQ(min_loaded) >> 1;
-   for_each_cpu_mask(j, cpu_sibling_map[min_loaded]) {
+   for_each_cpu_mask(j, per_cpu(cpu_sibling_map, min_loaded)) {
if (load > CPU_IRQ(j)) {
/* This won't change cpu_sibling_map[min_loaded] */
load = CPU_IRQ(j);
--- a/arch/i386/kernel/smpboot.c
+++ b/arch/i386/kernel/smpboot.c
@@ -70,8 +70,8 @@
 int cpu_llc_id[NR_CPUS] __cpuinitdata = {[0 ... NR_CPUS-1] = BAD_APICID};
 
 /* representing HT siblings of each logical CPU */
-cpumask_t cpu_sibling_map[NR_CPUS] __read_mostly;
-EXPORT_SYMBOL(cpu_sibling_map);
+DEFINE_PER_CPU(cpumask_t, cpu_sibling_map);
+EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 
 /* representing HT and core siblings of each logical CPU */
 DEFINE_PER_CPU(cpumask_t, cpu_core_map);
@@ -319,8 +319,8 @@
for_each_cpu_mask(i, cpu_sibling_setup_map) {
if (c[cpu].phys_proc_id == c[i].phys_proc_id &&
c[cpu].cpu_core_id == c[i].cpu_core_id) {
-   cpu_set(i, cpu_sibling_map[cpu]);
-   cpu_set(cpu, cpu_sibling_map[i]);
+   cpu_set(i, per_cpu(cpu_sibling_map, cpu));
+   cpu_set(cpu, per_cpu(cpu_sibling_map, i));
cpu_set(i, per_cpu(cpu_core_map, cpu));
cpu_set(cpu, per_cpu(cpu_core_map, i));
cpu_set(i, c[cpu].llc_shared_map);
@@ -328,13 +328,13 @@
}
}
} else {
-   cpu_set(cpu, cpu_sibling_map[cpu]);
+   cpu_set(cpu, per_cpu(cpu_sibling_map, cpu));
}
 
cpu_set(cpu, c[cpu].llc_shared_map);
 
if (current_cpu_data.x86_max_cores == 1) {
-   per_cpu(cpu_core_map, cpu) = cpu_sibling_map[cpu];
+   per_cpu(cpu_core_map, cpu) = per_cpu(cpu_sibling_map, cpu);
c[cpu].booted_cores = 1;
return;
}
@@ -351,12 +351,12 @@
/*
 *  Does this new cpu bringup a new core?
 */
-   if (cpus_weight(cpu_sibling_map[cpu]) == 1) {
+   if (cpus_weight(per_cpu(cpu_sibling_map, cpu)) == 1) {
/*
 * for each core in package, increment
 * the booted_cores for this new cpu
 */
-   if (first_cpu(cpu_sibling_map[i]) == i)
+   if (first_cpu(per_cpu(cpu_sibling_map, i)) == i)
c[cpu].booted_cores++;

[PATCH 05/10] x86: Convert x86_cpu_to_apicid to be a per cpu variable (v3)

2007-09-11 Thread travis

This patch converts the x86_cpu_to_apicid array to be a per
cpu variable.  This saves sizeof(apicid) for each unused cpu.
Access is mostly from startup and CPU HOTPLUG functions.

MP_processor_info() is one of the functions that require access
to the x86_cpu_to_apicid array before the per_cpu data area is
set up.  For this case, a pointer to the __initdata array is
initialized in setup_arch() and removed in smp_prepare_cpus()
after the per_cpu data area is initialized.
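
The early-boot fallback described above can be sketched in user space as
follows.  This is only a model under stated assumptions: the names
(early_apicid, early_apicid_ptr, percpu_apicid, percpu_areas_ready) and the
small SKETCH_NR_CPUS value are hypothetical stand-ins, and a plain array
plays the role of the per_cpu area.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_CPUS    8     /* assumption: small stand-in for NR_CPUS */
#define SKETCH_BAD_APICID 0xFFu

/* Stand-in for the __initdata array used before per_cpu areas exist. */
static uint8_t early_apicid[SKETCH_NR_CPUS];
/* Non-NULL only during early boot, in the role of x86_cpu_to_apicid_ptr. */
static uint8_t *early_apicid_ptr;
/* Stand-in for the per_cpu copies of the value. */
static uint8_t percpu_apicid[SKETCH_NR_CPUS];

/* Put the model into its "early boot" state. */
static void sketch_reset(void)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
		early_apicid[cpu] = SKETCH_BAD_APICID;
		percpu_apicid[cpu] = SKETCH_BAD_APICID;
	}
	early_apicid_ptr = early_apicid;
}

/* Write through the early array while it exists, else the per_cpu copy. */
static void set_apicid(int cpu, uint8_t id)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	if (early_apicid_ptr)
		early_apicid_ptr[cpu] = id;
	else
		percpu_apicid[cpu] = id;
}

static uint8_t get_apicid(int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	return early_apicid_ptr ? early_apicid_ptr[cpu] : percpu_apicid[cpu];
}

/* Copy the early values into the per_cpu copies and drop the pointer. */
static void percpu_areas_ready(void)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		percpu_apicid[cpu] = early_apicid[cpu];
	early_apicid_ptr = NULL;
}
```

Callers use set_apicid() and get_apicid() throughout; only
percpu_areas_ready() needs to know when the switch happens.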

A second change is included to change the initial array value
of ARCH i386 from 0xff to BAD_APICID to be consistent with
ARCH x86_64.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/i386/kernel/acpi/boot.c  |2 +-
 arch/i386/kernel/smp.c|2 +-
 arch/i386/kernel/smpboot.c|   22 +++---
 arch/x86_64/kernel/genapic.c  |   15 ---
 arch/x86_64/kernel/genapic_flat.c |2 +-
 arch/x86_64/kernel/mpparse.c  |   15 +--
 arch/x86_64/kernel/setup.c|5 +
 arch/x86_64/kernel/smpboot.c  |   23 ++-
 arch/x86_64/mm/numa.c |2 +-
 include/asm-i386/smp.h|6 --
 include/asm-x86_64/ipi.h  |2 +-
 include/asm-x86_64/smp.h  |6 --
 12 files changed, 80 insertions(+), 22 deletions(-)

--- a/arch/i386/kernel/acpi/boot.c
+++ b/arch/i386/kernel/acpi/boot.c
@@ -555,7 +555,7 @@
 
 int acpi_unmap_lsapic(int cpu)
 {
-   x86_cpu_to_apicid[cpu] = -1;
+   per_cpu(x86_cpu_to_apicid, cpu) = -1;
cpu_clear(cpu, cpu_present_map);
num_processors--;
 
--- a/arch/i386/kernel/smp.c
+++ b/arch/i386/kernel/smp.c
@@ -673,7 +673,7 @@
int i;
 
for (i = 0; i < NR_CPUS; i++) {
-   if (x86_cpu_to_apicid[i] == apic_id)
+   if (per_cpu(x86_cpu_to_apicid, i) == apic_id)
return i;
}
return -1;
--- a/arch/i386/kernel/smpboot.c
+++ b/arch/i386/kernel/smpboot.c
@@ -92,9 +92,17 @@
 struct cpuinfo_x86 cpu_data[NR_CPUS] __cacheline_aligned;
 EXPORT_SYMBOL(cpu_data);
 
-u8 x86_cpu_to_apicid[NR_CPUS] __read_mostly =
-   { [0 ... NR_CPUS-1] = 0xff };
-EXPORT_SYMBOL(x86_cpu_to_apicid);
+/*
+ * The following static array is used during kernel startup
+ * and the x86_cpu_to_apicid_ptr contains the address of the
+ * array during this time.  It is zeroed when the per_cpu
+ * data area is removed.
+ */
+u8 x86_cpu_to_apicid_init[NR_CPUS] __initdata =
+   { [0 ... NR_CPUS-1] = BAD_APICID };
+void *x86_cpu_to_apicid_ptr;
+DEFINE_PER_CPU(u8, x86_cpu_to_apicid) = BAD_APICID;
+EXPORT_PER_CPU_SYMBOL(x86_cpu_to_apicid);
 
 u8 apicid_2_node[MAX_APICID];
 
@@ -804,7 +812,7 @@
 
irq_ctx_init(cpu);
 
-   x86_cpu_to_apicid[cpu] = apicid;
+   per_cpu(x86_cpu_to_apicid, cpu) = apicid;
/*
 * This grunge runs the startup process for
 * the targeted processor.
@@ -866,7 +874,7 @@
cpu_clear(cpu, cpu_initialized); /* was set by cpu_init() */
cpucount--;
} else {
-   x86_cpu_to_apicid[cpu] = apicid;
+   per_cpu(x86_cpu_to_apicid, cpu) = apicid;
cpu_set(cpu, cpu_present_map);
}
 
@@ -915,7 +923,7 @@
struct warm_boot_cpu_info info;
int apicid, ret;
 
-   apicid = x86_cpu_to_apicid[cpu];
+   apicid = per_cpu(x86_cpu_to_apicid, cpu);
if (apicid == BAD_APICID) {
ret = -ENODEV;
goto exit;
@@ -965,7 +973,7 @@
 
boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
boot_cpu_logical_apicid = logical_smp_processor_id();
-   x86_cpu_to_apicid[0] = boot_cpu_physical_apicid;
+   per_cpu(x86_cpu_to_apicid, 0) = boot_cpu_physical_apicid;
 
current_thread_info()->cpu = 0;
 
--- a/arch/x86_64/kernel/mpparse.c
+++ b/arch/x86_64/kernel/mpparse.c
@@ -86,7 +86,7 @@
return sum & 0xFF;
 }
 
-static void __cpuinit MP_processor_info (struct mpc_config_processor *m)
+static void __cpuinit MP_processor_info(struct mpc_config_processor *m)
 {
int cpu;
cpumask_t tmp_map;
@@ -123,7 +123,18 @@
cpu = 0;
}
bios_cpu_apicid[cpu] = m->mpc_apicid;
-   x86_cpu_to_apicid[cpu] = m->mpc_apicid;
+   /*
+* We get called early in the start_kernel initialization
+* process when the per_cpu data area is not yet setup, so we
+* use a static array that is removed after the per_cpu data
+* area is created.
+*/
+   if (x86_cpu_to_apicid_ptr) {
+   u8 *x86_cpu_to_apicid = (u8 *)x86_cpu_to_apicid_ptr;
+   x86_cpu_to_apicid[cpu] = m->mpc_apicid;
+   } else {
+   per_cpu(x86_cpu_to_apicid, cpu) = m->mpc_apicid;
+   }
 
cpu_set(cpu, cpu_possible_map);
cpu_set(cpu, cpu_present_map);
--- a/arch/x86_64/kernel/smpboot.c
+++ b/arch/x86_64/kernel/smpboot.c
@@ -701,7 +

[PATCH 06/10] x86: Convert cpu_llc_id to be a per cpu variable (v3)

2007-09-11 Thread travis

Convert cpu_llc_id from a static array sized by NR_CPUS to a
per_cpu variable.  This saves sizeof(cpu_llc_id) for each unused
cpu.  Access is mostly from startup and CPU HOTPLUG functions.

Note there's an additional change of the type of cpu_llc_id
from int to u8 for ARCH i386 to correspond with the same
type in ARCH x86_64.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/i386/kernel/cpu/intel_cacheinfo.c |4 ++--
 arch/i386/kernel/smpboot.c |6 +++---
 arch/x86_64/kernel/smpboot.c   |6 +++---
 include/asm-i386/processor.h   |6 +-
 include/asm-x86_64/smp.h   |9 -
 5 files changed, 17 insertions(+), 14 deletions(-)

--- a/arch/i386/kernel/cpu/intel_cacheinfo.c
+++ b/arch/i386/kernel/cpu/intel_cacheinfo.c
@@ -417,14 +417,14 @@
if (new_l2) {
l2 = new_l2;
 #ifdef CONFIG_X86_HT
-   cpu_llc_id[cpu] = l2_id;
+   per_cpu(cpu_llc_id, cpu) = l2_id;
 #endif
}
 
if (new_l3) {
l3 = new_l3;
 #ifdef CONFIG_X86_HT
-   cpu_llc_id[cpu] = l3_id;
+   per_cpu(cpu_llc_id, cpu) = l3_id;
 #endif
}
 
--- a/arch/i386/kernel/smpboot.c
+++ b/arch/i386/kernel/smpboot.c
@@ -67,7 +67,7 @@
 EXPORT_SYMBOL(smp_num_siblings);
 
 /* Last level cache ID of each logical CPU */
-int cpu_llc_id[NR_CPUS] __cpuinitdata = {[0 ... NR_CPUS-1] = BAD_APICID};
+DEFINE_PER_CPU(u8, cpu_llc_id) = BAD_APICID;
 
 /* representing HT siblings of each logical CPU */
 DEFINE_PER_CPU(cpumask_t, cpu_sibling_map);
@@ -348,8 +348,8 @@
}
 
for_each_cpu_mask(i, cpu_sibling_setup_map) {
-   if (cpu_llc_id[cpu] != BAD_APICID &&
-   cpu_llc_id[cpu] == cpu_llc_id[i]) {
+   if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
+   per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
cpu_set(i, c[cpu].llc_shared_map);
cpu_set(cpu, c[i].llc_shared_map);
}
--- a/arch/x86_64/kernel/smpboot.c
+++ b/arch/x86_64/kernel/smpboot.c
@@ -65,7 +65,7 @@
 EXPORT_SYMBOL(smp_num_siblings);
 
 /* Last level cache ID of each logical CPU */
-u8 cpu_llc_id[NR_CPUS] __cpuinitdata  = {[0 ... NR_CPUS-1] = BAD_APICID};
+DEFINE_PER_CPU(u8, cpu_llc_id) = BAD_APICID;
 
 /* Bitmask of currently online CPUs */
 cpumask_t cpu_online_map __read_mostly;
@@ -285,8 +285,8 @@
}
 
for_each_cpu_mask(i, cpu_sibling_setup_map) {
-   if (cpu_llc_id[cpu] != BAD_APICID &&
-   cpu_llc_id[cpu] == cpu_llc_id[i]) {
+   if (per_cpu(cpu_llc_id, cpu) != BAD_APICID &&
+   per_cpu(cpu_llc_id, cpu) == per_cpu(cpu_llc_id, i)) {
cpu_set(i, c[cpu].llc_shared_map);
cpu_set(cpu, c[i].llc_shared_map);
}
--- a/include/asm-i386/processor.h
+++ b/include/asm-i386/processor.h
@@ -110,7 +110,11 @@
 #define current_cpu_data boot_cpu_data
 #endif
 
-extern int cpu_llc_id[NR_CPUS];
+/*
+ * the following now lives in the per cpu area:
+ * extern  int cpu_llc_id[NR_CPUS];
+ */
+DECLARE_PER_CPU(u8, cpu_llc_id);
 extern char ignore_fpu_irq;
 
 void __init cpu_detect(struct cpuinfo_x86 *c);
--- a/include/asm-x86_64/smp.h
+++ b/include/asm-x86_64/smp.h
@@ -39,16 +39,14 @@
 extern void smp_send_reschedule(int cpu);
 
 /*
- * cpu_sibling_map and cpu_core_map now live
- * in the per cpu area
- *
+ * the following now live in the per cpu area:
  * extern cpumask_t cpu_sibling_map[NR_CPUS];
  * extern cpumask_t cpu_core_map[NR_CPUS];
+ * extern u8 cpu_llc_id[NR_CPUS];
  */
 DECLARE_PER_CPU(cpumask_t, cpu_sibling_map);
 DECLARE_PER_CPU(cpumask_t, cpu_core_map);
-
-extern u8 cpu_llc_id[NR_CPUS];
+DECLARE_PER_CPU(u8, cpu_llc_id);
 
 #define SMP_TRAMPOLINE_BASE 0x6000
 
@@ -120,6 +118,7 @@
 #ifdef CONFIG_SMP
 #define cpu_physical_id(cpu)   per_cpu(x86_cpu_to_apicid, cpu)
 #else
+extern unsigned int boot_cpu_id;
 #define cpu_physical_id(cpu)   boot_cpu_id
 #endif /* !CONFIG_SMP */
 #endif


[PATCH 02/10] x86: fix cpu_to_node references (v3)

2007-09-11 Thread travis

Fix four instances where cpu_to_node is referenced
by array instead of via the cpu_to_node macro.  This
is in preparation for moving it to the per_cpu data area.
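
A minimal sketch of why going through the macro helps: if every caller uses
an accessor that expands to an lvalue, the backing storage can later change
(for example from a static array to per_cpu data) without touching the
callers.  The names below (sketch_cpu_to_node, sketch_cpu_to_node_store,
SKETCH_NR_CPUS) are hypothetical and only model the idea in user space.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4	/* assumption: small stand-in for NR_CPUS */

/* Backing store; callers are not supposed to name it directly. */
static int sketch_cpu_to_node_store[SKETCH_NR_CPUS];

/*
 * Accessor that expands to an lvalue, so it can be used on either side
 * of an assignment.  Only this definition changes if the storage moves.
 */
#define sketch_cpu_to_node(cpu) (sketch_cpu_to_node_store[(cpu)])

static void sketch_set_node(int cpu, int node)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	sketch_cpu_to_node(cpu) = node;
}

static int sketch_get_node(int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	return sketch_cpu_to_node(cpu);
}
```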

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/vsyscall.c |2 +-
 arch/x86_64/mm/numa.c |4 ++--
 arch/x86_64/mm/srat.c |4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

--- a/arch/x86_64/kernel/vsyscall.c
+++ b/arch/x86_64/kernel/vsyscall.c
@@ -291,7 +291,7 @@
unsigned long *d;
unsigned long node = 0;
 #ifdef CONFIG_NUMA
-   node = cpu_to_node[cpu];
+   node = cpu_to_node(cpu);
 #endif
if (cpu_has(&cpu_data[cpu], X86_FEATURE_RDTSCP))
write_rdtscp_aux((node << 12) | cpu);
--- a/arch/x86_64/mm/numa.c
+++ b/arch/x86_64/mm/numa.c
@@ -261,7 +261,7 @@
   We round robin the existing nodes. */
rr = first_node(node_online_map);
for (i = 0; i < NR_CPUS; i++) {
-   if (cpu_to_node[i] != NUMA_NO_NODE)
+   if (cpu_to_node(i) != NUMA_NO_NODE)
continue;
numa_set_node(i, rr);
rr = next_node(rr, node_online_map);
@@ -543,7 +543,7 @@
 void __cpuinit numa_set_node(int cpu, int node)
 {
cpu_pda(cpu)->nodenumber = node;
-   cpu_to_node[cpu] = node;
+   cpu_to_node(cpu) = node;
 }
 
 unsigned long __init numa_free_all_bootmem(void) 
--- a/arch/x86_64/mm/srat.c
+++ b/arch/x86_64/mm/srat.c
@@ -431,9 +431,9 @@
setup_node_bootmem(i, nodes[i].start, nodes[i].end);
 
for (i = 0; i < NR_CPUS; i++) {
-   if (cpu_to_node[i] == NUMA_NO_NODE)
+   if (cpu_to_node(i) == NUMA_NO_NODE)
continue;
-   if (!node_isset(cpu_to_node[i], node_possible_map))
+   if (!node_isset(cpu_to_node(i), node_possible_map))
numa_set_node(i, NUMA_NO_NODE);
}
numa_init_array();


[PATCH 03/10] x86: Convert cpu_core_map to be a per cpu variable (v3)

2007-09-11 Thread travis

This is from an earlier message from 'Christoph Lameter':

cpu_core_map is currently an array defined using NR_CPUS. This means that
we overallocate, since we rarely use the maximum configured number of cpus.

If we put the cpu_core_map into the per cpu area then it will be allocated
for each processor as it comes online.

This means that the core map cannot be accessed until the per cpu area
has been allocated. Xen does a weird thing here looping over all processors
and zeroing the masks that are not yet allocated and that will be zeroed
when they are allocated. I commented the code out.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/i386/kernel/cpu/cpufreq/acpi-cpufreq.c |2 -
 arch/i386/kernel/cpu/cpufreq/powernow-k8.c  |   10 
 arch/i386/kernel/cpu/proc.c |3 +-
 arch/i386/kernel/smpboot.c  |   34 ++--
 arch/i386/xen/smp.c |   14 +--
 arch/x86_64/kernel/mce_amd.c|6 ++--
 arch/x86_64/kernel/setup.c  |3 +-
 arch/x86_64/kernel/smpboot.c|   24 +--
 include/asm-i386/smp.h  |2 -
 include/asm-i386/topology.h |2 -
 include/asm-x86_64/smp.h|8 +-
 include/asm-x86_64/topology.h   |2 -
 12 files changed, 64 insertions(+), 46 deletions(-)

--- a/include/asm-x86_64/smp.h
+++ b/include/asm-x86_64/smp.h
@@ -39,7 +39,13 @@
 extern void smp_send_reschedule(int cpu);
 
 extern cpumask_t cpu_sibling_map[NR_CPUS];
-extern cpumask_t cpu_core_map[NR_CPUS];
+/*
+ * cpu_core_map lives in a per cpu area
+ *
+ * extern cpumask_t cpu_core_map[NR_CPUS];
+ */
+DECLARE_PER_CPU(cpumask_t, cpu_core_map);
+
 extern u8 cpu_llc_id[NR_CPUS];
 
 #define SMP_TRAMPOLINE_BASE 0x6000
--- a/arch/i386/kernel/cpu/cpufreq/acpi-cpufreq.c
+++ b/arch/i386/kernel/cpu/cpufreq/acpi-cpufreq.c
@@ -595,7 +595,7 @@
dmi_check_system(sw_any_bug_dmi_table);
if (bios_with_sw_any_bug && cpus_weight(policy->cpus) == 1) {
policy->shared_type = CPUFREQ_SHARED_TYPE_ALL;
-   policy->cpus = cpu_core_map[cpu];
+   policy->cpus = per_cpu(cpu_core_map, cpu);
}
 #endif
 
--- a/arch/i386/kernel/cpu/cpufreq/powernow-k8.c
+++ b/arch/i386/kernel/cpu/cpufreq/powernow-k8.c
@@ -57,7 +57,7 @@
 static int cpu_family = CPU_OPTERON;
 
 #ifndef CONFIG_SMP
-static cpumask_t cpu_core_map[1];
+DEFINE_PER_CPU(cpumask_t, cpu_core_map);
 #endif
 
 /* Return a frequency in MHz, given an input fid */
@@ -664,7 +664,7 @@
 
dprintk("cfid 0x%x, cvid 0x%x\n", data->currfid, data->currvid);
data->powernow_table = powernow_table;
-   if (first_cpu(cpu_core_map[data->cpu]) == data->cpu)
+   if (first_cpu(per_cpu(cpu_core_map, data->cpu)) == data->cpu)
print_basics(data);
 
for (j = 0; j < data->numps; j++)
@@ -818,7 +818,7 @@
 
/* fill in data */
data->numps = data->acpi_data.state_count;
-   if (first_cpu(cpu_core_map[data->cpu]) == data->cpu)
+   if (first_cpu(per_cpu(cpu_core_map, data->cpu)) == data->cpu)
print_basics(data);
powernow_k8_acpi_pst_values(data, 0);
 
@@ -1212,7 +1212,7 @@
if (cpu_family == CPU_HW_PSTATE)
pol->cpus = cpumask_of_cpu(pol->cpu);
else
-   pol->cpus = cpu_core_map[pol->cpu];
+   pol->cpus = per_cpu(cpu_core_map, pol->cpu);
data->available_cores = &(pol->cpus);
 
/* Take a crude guess here.
@@ -1279,7 +1279,7 @@
cpumask_t oldmask = current->cpus_allowed;
unsigned int khz = 0;
 
-   data = powernow_data[first_cpu(cpu_core_map[cpu])];
+   data = powernow_data[first_cpu(per_cpu(cpu_core_map, cpu))];
 
if (!data)
return -EINVAL;
--- a/arch/i386/kernel/cpu/proc.c
+++ b/arch/i386/kernel/cpu/proc.c
@@ -122,7 +122,8 @@
 #ifdef CONFIG_X86_HT
if (c->x86_max_cores * smp_num_siblings > 1) {
seq_printf(m, "physical id\t: %d\n", c->phys_proc_id);
-   seq_printf(m, "siblings\t: %d\n", cpus_weight(cpu_core_map[n]));
+   seq_printf(m, "siblings\t: %d\n",
+   cpus_weight(per_cpu(cpu_core_map, n)));
seq_printf(m, "core id\t\t: %d\n", c->cpu_core_id);
seq_printf(m, "cpu cores\t: %d\n", c->booted_cores);
}
--- a/arch/i386/kernel/smpboot.c
+++ b/arch/i386/kernel/smpboot.c
@@ -74,8 +74,8 @@
 EXPORT_SYMBOL(cpu_sibling_map);
 
 /* representing HT and core siblings of each logical CPU */
-cpumask_t cpu_core_map[NR_CPUS] __read_mostly;
-EXPORT_SYMBOL(cpu_core_map);
+DEFINE_PER_CPU(cpumask_t, cpu_core_map);
+EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 
 /* bitmap of online cpus */
 cpumask_t cpu_online_map __read_mostly;
@@ -300,7 +300,7 @@
 * And for powe

[PATCH 00/10] x86: Reduce Memory Usage and Inter-Node message traffic (v3)

2007-09-11 Thread travis


Note:

This patch consolidates all the previous patches regarding
the conversion of static arrays sized by NR_CPUS into per_cpu
data arrays and is referenced against 2.6.23-rc6 .


v1 Intro:

In x86_64 and i386 architectures most arrays that are sized
using NR_CPUS lie in local memory on node 0.  Not only will most
(99%?) of the systems not use all the slots in these arrays,
particularly when NR_CPUS is increased to accommodate future
very high cpu count systems, but a number of cache lines are
passed unnecessarily on the system bus when these arrays are
referenced by cpus on other nodes.

Typically, the values in these arrays are referenced by the cpu
accessing its own values, though when passing IPI interrupts,
the cpu does access the data relevant to the targeted cpu/node.
Of course, if the referencing cpu is not on node 0, then the
reference will still require cross node exchanges of cache
lines.  A common use of this is for an interrupt service
routine to pass the interrupt to other cpus local to that node.

Ideally, all the elements in these arrays should be moved to the
per_cpu data area.  In some cases (such as x86_cpu_to_apicid)
the array is referenced before the per_cpu data areas are set up.
In this case, a static array is declared in the __initdata
area and initialized by the booting cpu (BSP).  The values are
then moved to the per_cpu area after it is initialized and the
original static array is freed with the rest of the __initdata.
This patch is referenced against 2.6.23-rc6.
--

Changes for version v2:

> > Note the additional change of the cpu_llc_id type from u8
> > to int for ARCH x86_64 to correspond with ARCH i386.

> At least currently it cannot be more than 8 bit. So why
> waste memory? It would be better to change i386

Done.  (x86_64 type => u8).

> > Fix four instances where cpu_to_node is referenced
> > by array instead of via the cpu_to_node macro.  This
> > is preparation to moving it to the per_cpu data area.

> Shouldn't this patch be logically before the per cpu 
> conversion (which is 3). This way the result would
> be git bisectable.

Done.  (Moved to PATCH 1).

> > processor_core.c currently tries to determine the apicid by special
> > casing for IA64 and x86. The desired information is readily available via
> >
> >   cpu_physical_id()
> >
> > on IA64, i386 and x86_64.
> 
> Have you tried this with a !CONFIG_SMP build? The drivers/dma code was doing
> the same and running into problems because it wasn't defined there.

Fixed. (New export in PATCH 1).
--

Changes for version v3:

cpu_sibling_map has also been converted to a per_cpu data array on
ia64, ppc64 and sparc64, to fix build errors from references in
block/blktrace.c and kernel/sched.c when CONFIG_SCHED_SMT is defined.

Warning: ppc64 and sparc64 have not yet been built nor tested.
--


[PATCH 01/10] x86: remove x86_cpu_to_log_apicid array (v3)

2007-09-11 Thread travis

This is a copy of an older patch that is in rc3-mm1.  It's needed
to allow the remaining patches to integrate correctly.

Signed-off-by: Mike Travis <[EMAIL PROTECTED]>
---
 arch/x86_64/kernel/genapic.c  |2 --
 arch/x86_64/kernel/genapic_flat.c |1 -
 arch/x86_64/kernel/smpboot.c  |1 -
 include/asm-x86_64/smp.h  |1 -
 4 files changed, 5 deletions(-)

--- a/arch/x86_64/kernel/genapic.c
+++ b/arch/x86_64/kernel/genapic.c
@@ -29,8 +29,6 @@
= { [0 ... NR_CPUS-1] = BAD_APICID };
 EXPORT_SYMBOL(x86_cpu_to_apicid);
 
-u8 x86_cpu_to_log_apicid[NR_CPUS]  = { [0 ... NR_CPUS-1] = BAD_APICID };
-
 struct genapic __read_mostly *genapic = &apic_flat;
 
 /*
--- a/arch/x86_64/kernel/genapic_flat.c
+++ b/arch/x86_64/kernel/genapic_flat.c
@@ -52,7 +52,6 @@
 
num = smp_processor_id();
id = 1UL << num;
-   x86_cpu_to_log_apicid[num] = id;
apic_write(APIC_DFR, APIC_DFR_FLAT);
val = apic_read(APIC_LDR) & ~APIC_LDR_MASK;
val |= SET_APIC_LOGICAL_ID(id);
--- a/arch/x86_64/kernel/smpboot.c
+++ b/arch/x86_64/kernel/smpboot.c
@@ -702,7 +702,6 @@
cpu_clear(cpu, cpu_present_map);
cpu_clear(cpu, cpu_possible_map);
x86_cpu_to_apicid[cpu] = BAD_APICID;
-   x86_cpu_to_log_apicid[cpu] = BAD_APICID;
return -EIO;
}
 
--- a/include/asm-x86_64/smp.h
+++ b/include/asm-x86_64/smp.h
@@ -78,7 +78,6 @@
  * the real APIC ID <-> CPU # mapping.
  */
 extern u8 x86_cpu_to_apicid[NR_CPUS];  /* physical ID */
-extern u8 x86_cpu_to_log_apicid[NR_CPUS];
 extern u8 bios_cpu_apicid[];
 
 static inline int cpu_present_to_apicid(int mps_cpu)


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread David Chinner

On Tue, Sep 11, 2007 at 04:00:17PM +1000, Nick Piggin wrote:
> > > OTOH, I'm not sure how much buy-in there was from the filesystems guys.
> > > Particularly Christoph H and XFS (which is strange because they already
> > > do vmapping in places).
> >
> > I think they use vmapping because they have to, not because they want
> > to. They might be a lot happier with fsblock if it used contiguous pages
> > for large blocks whenever possible - I don't know for sure. The metadata
> > accessors they might be unhappy with because it's inconvenient but as
> > Christoph Hellwig pointed out at VM/FS, the filesystems who really care
> > will convert.
> 
> Sure, they would rather not. But there are also a lot of ways you can
> improve vmap more than what XFS does (or probably what darwin does)
> (more persistence for cached objects, and batched invalidates for example).

XFS already has persistence across the object life time (which can be many
tens of seconds for a frequently used buffer) and it also does batched
unmapping of objects as well.

> There are also a lot of trivial things you can do to make a lot of those
> accesses not require vmaps (and less trivial things, but even such things
> as binary searches over multiple pages should be quite possible with a bit
> of logic).

Yes, we already do many of these things (via xfs_buf_offset()), but
that is not good enough for something like a memcpy that spans multiple
pages in a large block (think btree block compaction, splits and recombines).

IOWs, we already play these vmap harm-minimisation games in the places
where we can, but still the overhead is high and something we'd prefer
to be able to avoid.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

[PATCH 6/6] cpuset dirty limits

2007-09-11 Thread Ethan Solomita

Per cpuset dirty ratios

This implements dirty ratios per cpuset. Two new files are added
to the cpuset directories:

background_dirty_ratio  Percentage at which background writeback starts

throttle_dirty_ratio    Percentage at which the application is throttled
                        and we start synchronous writeout.

Both variables are set to -1 by default which means that the global
limits (/proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio)
are used for a cpuset.
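
The fallback rule described above can be sketched in a few lines of C.
This is only an illustration: effective_dirty_ratio() and
CPUSET_RATIO_UNSET are hypothetical names that do not appear in the
patch, and the real lookup may be structured differently.

```c
/* Hypothetical marker for "no per-cpuset value written"; the patch
 * simply stores -1 in the cpuset fields. */
#define CPUSET_RATIO_UNSET (-1)

/*
 * Pick the ratio to apply for one cpuset: its own value when one has
 * been written (0..100), otherwise the global sysctl value.
 */
static int effective_dirty_ratio(int cpuset_ratio, int global_ratio)
{
	if (cpuset_ratio == CPUSET_RATIO_UNSET)
		return global_ratio;
	return cpuset_ratio;
}
```

With a global ratio of 40, a cpuset left at -1 would use 40, while a
cpuset whose file was set to 25 would use 25.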

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 5/include/linux/cpuset.h 7/include/linux/cpuset.h
--- 5/include/linux/cpuset.h   2007-09-11 14:50:48.0 -0700
+++ 7/include/linux/cpuset.h   2007-09-11 14:51:12.0 -0700
@@ -77,6 +77,7 @@ extern void cpuset_track_online_nodes(vo
 
 extern int current_cpuset_is_being_rebound(void);
 
+extern void cpuset_get_current_ratios(int *background, int *ratio);
 /*
  * We need macros since struct address_space is not defined yet
  */
diff -uprN -X 0/Documentation/dontdiff 5/kernel/cpuset.c 7/kernel/cpuset.c
--- 5/kernel/cpuset.c   2007-09-11 14:50:49.0 -0700
+++ 7/kernel/cpuset.c   2007-09-11 14:56:18.0 -0700
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -92,6 +93,9 @@ struct cpuset {
int mems_generation;
 
struct fmeter fmeter;   /* memory_pressure filter */
+
+   int background_dirty_ratio;
+   int throttle_dirty_ratio;
 };
 
 /* Retrieve the cpuset for a container */
@@ -169,6 +173,8 @@ static struct cpuset top_cpuset = {
.flags = ((1 << CS_CPU_EXCLUSIVE) | (1 << CS_MEM_EXCLUSIVE)),
.cpus_allowed = CPU_MASK_ALL,
.mems_allowed = NODE_MASK_ALL,
+   .background_dirty_ratio = -1,
+   .throttle_dirty_ratio = -1,
 };
 
 /*
@@ -785,6 +791,21 @@ static int update_flag(cpuset_flagbits_t
return 0;
 }
 
+static int update_int(int *cs_int, char *buf, int min, int max)
+{
+   char *endp;
+   int val;
+
+   val = simple_strtol(buf, &endp, 10);
+   if (val < min || val > max)
+   return -EINVAL;
+
+   mutex_lock(&callback_mutex);
+   *cs_int = val;
+   mutex_unlock(&callback_mutex);
+   return 0;
+}
+
 /*
  * Frequency meter - How fast is some event occurring?
  *
@@ -933,6 +954,8 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+   FILE_THROTTLE_DIRTY_RATIO,
+   FILE_BACKGROUND_DIRTY_RATIO,
 } cpuset_filetype_t;
 
 static ssize_t cpuset_common_file_write(struct container *cont,
@@ -997,6 +1020,12 @@ static ssize_t cpuset_common_file_write(
retval = update_flag(CS_SPREAD_SLAB, cs, buffer);
cs->mems_generation = cpuset_mems_generation++;
break;
+   case FILE_BACKGROUND_DIRTY_RATIO:
+   retval = update_int(&cs->background_dirty_ratio, buffer, -1, 100);
+   break;
+   case FILE_THROTTLE_DIRTY_RATIO:
+   retval = update_int(&cs->throttle_dirty_ratio, buffer, -1, 100);
+   break;
default:
retval = -EINVAL;
goto out2;
@@ -1090,6 +1119,12 @@ static ssize_t cpuset_common_file_read(s
case FILE_SPREAD_SLAB:
*s++ = is_spread_slab(cs) ? '1' : '0';
break;
+   case FILE_BACKGROUND_DIRTY_RATIO:
+   s += sprintf(s, "%d", cs->background_dirty_ratio);
+   break;
+   case FILE_THROTTLE_DIRTY_RATIO:
+   s += sprintf(s, "%d", cs->throttle_dirty_ratio);
+   break;
default:
retval = -EINVAL;
goto out;
@@ -1173,6 +1208,20 @@ static struct cftype cft_spread_slab = {
.private = FILE_SPREAD_SLAB,
 };
 
+static struct cftype cft_background_dirty_ratio = {
+   .name = "background_dirty_ratio",
+   .read = cpuset_common_file_read,
+   .write = cpuset_common_file_write,
+   .private = FILE_BACKGROUND_DIRTY_RATIO,
+};
+
+static struct cftype cft_throttle_dirty_ratio = {
+   .name = "throttle_dirty_ratio",
+   .read = cpuset_common_file_read,
+   .write = cpuset_common_file_write,
+   .private = FILE_THROTTLE_DIRTY_RATIO,
+};
+
 static int cpuset_populate(struct container_subsys *ss, struct container *cont)
 {
int err;
@@ -1193,6 +1242,10 @@ static int cpuset_populate(struct contai
return err;
if ((err = container_add_file(cont, ss, &cft_spread_slab)) < 0)
return err;
+   if ((err = container_add_file(cont, ss, &cft_background_dirty_ratio)) < 0)
+   return err;
+   if ((err = container_add_file(cont, ss, &cft_throttle_dirty_ratio)) < 0)
+   return err;
/* memory_pressure_enabled is in root cpuset only */
if (err == 0 && !cont->parent)

[PATCH 5/6] cpuset write vm writeout

2007-09-11 Thread Ethan Solomita

Throttle VM writeout in a cpuset aware way

This bases the vm throttling from the reclaim path on the dirty ratio
of the cpuset. Note that a cpuset is only effective if shrink_zone is called
from direct reclaim.

kswapd has a cpuset context that includes the whole machine. VM throttling
will only work during synchronous reclaim and not from kswapd.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 4/include/linux/writeback.h 5/include/linux/writeback.h
--- 4/include/linux/writeback.h 2007-09-11 14:49:47.0 -0700
+++ 5/include/linux/writeback.h 2007-09-11 14:50:52.0 -0700
@@ -94,7 +94,7 @@ static inline void inode_sync_wait(struc
 int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask);
 
 /* These are exported to sysctl. */
 extern int dirty_background_ratio;
diff -uprN -X 0/Documentation/dontdiff 4/mm/page-writeback.c 5/mm/page-writeback.c
--- 4/mm/page-writeback.c   2007-09-11 14:49:47.0 -0700
+++ 5/mm/page-writeback.c   2007-09-11 14:50:52.0 -0700
@@ -386,7 +386,7 @@ void balance_dirty_pages_ratelimited_nr(
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-void throttle_vm_writeout(gfp_t gfp_mask)
+void throttle_vm_writeout(nodemask_t *nodes, gfp_t gfp_mask)
 {
struct dirty_limits dl;
 
@@ -401,7 +401,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
}
 
for ( ; ; ) {
-   get_dirty_limits(&dl, NULL, &node_online_map);
+   get_dirty_limits(&dl, NULL, nodes);
 
/*
 * Boost the allowable dirty threshold a bit for page
diff -uprN -X 0/Documentation/dontdiff 4/mm/vmscan.c 5/mm/vmscan.c
--- 4/mm/vmscan.c   2007-09-11 14:50:41.0 -0700
+++ 5/mm/vmscan.c   2007-09-11 14:50:52.0 -0700
@@ -1185,7 +1185,7 @@ static unsigned long shrink_zone(int pri
}
}
 
-   throttle_vm_writeout(sc->gfp_mask);
+   throttle_vm_writeout(&cpuset_current_mems_allowed, sc->gfp_mask);
 
atomic_dec(&zone->reclaim_in_progress);
return nr_reclaimed;


[PATCH 3/6] cpuset write throttle

2007-09-11 Thread Ethan Solomita

Make page writeback obey cpuset constraints

Currently dirty throttling does not work properly in a cpuset.

If, for example, a cpuset contains only 1/10th of available memory then all of
the memory of a cpuset can be dirtied without any writes being triggered.
If all of the cpuset's memory is dirty then only 10% of total memory is dirty.
The background writeback threshold is usually set at 10% and the synchronous
threshold at 40%. So we are still below the global limits while the dirty
ratio in the cpuset is 100%! Writeback throttling and background writeout
do not work at all in such scenarios.
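
To make the arithmetic above concrete, here is a small sketch in C. The
helper names and the page counts are made up for illustration and are
not taken from the patch; the thresholds are the 10% and 40% values
quoted above.

```c
/* Percentage of `total` pages that are dirty (integer division). */
static unsigned long dirty_percent(unsigned long dirty, unsigned long total)
{
	return total ? dirty * 100 / total : 0;
}

/* Nonzero when the dirty percentage exceeds the threshold `ratio`. */
static int over_threshold(unsigned long dirty, unsigned long total,
			  unsigned long ratio)
{
	return dirty_percent(dirty, total) > ratio;
}
```

Assume a machine with 1,000,000 pages and a cpuset holding 100,000 of
them, all dirty. Measured globally that is 10%, which exceeds neither
the 10% background threshold nor the 40% throttle threshold. Measured
against the cpuset's own 100,000 pages it is 100%, which exceeds both.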

This patch makes dirty writeout cpuset aware. When determining the
dirty limits in get_dirty_limits() we calculate values based on the
nodes that are reachable from the current process (that has been
dirtying the page). Then we can trigger writeout based on the
dirty ratio of the memory in the cpuset.

We trigger writeout in a cpuset specific way. We go through the dirty
inodes and search for inodes that have dirty pages on the nodes of the
active cpuset. If an inode fulfills that requirement then we begin writeout
of the dirty pages of that inode.

Adding up all the counters for each node in a cpuset may seem to be quite
an expensive operation (in particular for large cpusets with hundreds of
nodes) compared to just accessing the global counters if we do not have
a cpuset. However, please remember that the global counters were only
introduced recently. Before 2.6.18 we did add up per processor
counters for each processor on each invocation of get_dirty_limits().
We now add per node information which I think is equal or less effort
since there are fewer nodes than processors.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Signed-off-by: Ethan Solomita <[EMAIL PROTECTED]>

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 2/mm/page-writeback.c 3/mm/page-writeback.c
--- 2/mm/page-writeback.c   2007-09-11 14:39:22.0 -0700
+++ 3/mm/page-writeback.c   2007-09-11 14:49:35.0 -0700
@@ -103,6 +103,14 @@ EXPORT_SYMBOL(laptop_mode);
 
 static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
 
+struct dirty_limits {
+   long thresh_background;
+   long thresh_dirty;
+   unsigned long nr_dirty;
+   unsigned long nr_unstable;
+   unsigned long nr_writeback;
+};
+
 /*
  * Work out the current dirty-memory clamping and background writeout
  * thresholds.
@@ -121,16 +129,20 @@ static void background_writeout(unsigned
  * clamping level.
  */
 
-static unsigned long highmem_dirtyable_memory(unsigned long total)
+static unsigned long highmem_dirtyable_memory(nodemask_t *nodes, unsigned long total)
 {
 #ifdef CONFIG_HIGHMEM
int node;
unsigned long x = 0;
 
+   if (nodes == NULL)
+   nodes = &node_online_map;
for_each_node_state(node, N_HIGH_MEMORY) {
struct zone *z =
&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
+   if (!node_isset(node, *nodes))
+   continue;
x += zone_page_state(z, NR_FREE_PAGES)
+ zone_page_state(z, NR_INACTIVE)
+ zone_page_state(z, NR_ACTIVE);
@@ -154,26 +166,74 @@ static unsigned long determine_dirtyable
x = global_page_state(NR_FREE_PAGES)
+ global_page_state(NR_INACTIVE)
+ global_page_state(NR_ACTIVE);
-   x -= highmem_dirtyable_memory(x);
+   x -= highmem_dirtyable_memory(NULL, x);
return x + 1;   /* Ensure that we never return 0 */
 }
 
-static void
-get_dirty_limits(long *pbackground, long *pdirty,
-   struct address_space *mapping)
+static int
+get_dirty_limits(struct dirty_limits *dl, struct address_space *mapping,
+   nodemask_t *nodes)
 {
int background_ratio;   /* Percentages */
int dirty_ratio;
int unmapped_ratio;
long background;
long dirty;
-   unsigned long available_memory = determine_dirtyable_memory();
+   unsigned long available_memory;
+   unsigned long nr_mapped;
struct task_struct *tsk;
+   int is_subset = 0;
 
-   unmapped_ratio = 100 - ((global_page_state(NR_FILE_MAPPED) +
-   global_page_state(NR_ANON_PAGES)) * 100) /
-   available_memory;
+#ifdef CONFIG_CPUSETS
+   if (unlikely(nodes &&
+   !nodes_subset(node_online_map, *nodes))) {
+   int node;
 
+   /*
+* Calculate the limits relative to the current cpuset.
+*
+* We do not disregard highmem because all nodes (except
+* maybe node 0) have either all memory in HIGHMEM (32 bit) or
+* all memory in non HIGHMEM (64 bit). If we would disregard
+* highmem then cpuset throttl

[PATCH 4/6] cpuset write vmscan

2007-09-11 Thread Ethan Solomita

Direct reclaim: cpuset aware writeout

During direct reclaim we traverse down a zonelist and are carefully
checking each zone if it's a member of the active cpuset. But then we call
pdflush without enforcing the same restrictions. In a larger system this
may have the effect of a massive amount of pages being dirtied and then either

A. No writeout occurs because global dirty limits have not been reached

or

B. Writeout starts randomly for some dirty inode in the system. Pdflush
   may just write out data for nodes in another cpuset and miss doing
   proper dirty handling for the current cpuset.

In both cases dirty pages in the zones of interest may not be affected
and writeout may not occur as necessary.

Fix that by restricting pdflush to the active cpuset. Writeout will occur
from direct reclaim the same way as without a cpuset.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 3/mm/vmscan.c 4/mm/vmscan.c
--- 3/mm/vmscan.c   2007-09-11 14:41:56.0 -0700
+++ 4/mm/vmscan.c   2007-09-11 14:50:41.0 -0700
@@ -1301,7 +1301,8 @@ unsigned long do_try_to_free_pages(struc
 */
if (total_scanned > sc->swap_cluster_max +
sc->swap_cluster_max / 2) {
-   wakeup_pdflush(laptop_mode ? 0 : total_scanned, NULL);
+   wakeup_pdflush(laptop_mode ? 0 : total_scanned,
+   &cpuset_current_mems_allowed);
sc->may_writepage = 1;
}
 


[PATCH 2/6] cpuset write pdflush nodemask

2007-09-11 Thread Ethan Solomita

pdflush: Allow the passing of a nodemask parameter

If we want to support nodeset specific writeout then we need a way
to communicate the set of nodes that an operation should affect.

So add a nodemask_t parameter to the pdflush functions and also
store the nodemask in the pdflush control structure.

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 1/fs/buffer.c 2/fs/buffer.c
--- 1/fs/buffer.c   2007-09-11 14:36:24.0 -0700
+++ 2/fs/buffer.c   2007-09-11 14:39:22.0 -0700
@@ -372,7 +372,7 @@ static void free_more_memory(void)
struct zone **zones;
pg_data_t *pgdat;
 
-   wakeup_pdflush(1024);
+   wakeup_pdflush(1024, NULL);
yield();
 
for_each_online_pgdat(pgdat) {
diff -uprN -X 0/Documentation/dontdiff 1/fs/super.c 2/fs/super.c
--- 1/fs/super.c   2007-09-11 14:36:05.0 -0700
+++ 2/fs/super.c   2007-09-11 14:39:22.0 -0700
@@ -616,7 +616,7 @@ int do_remount_sb(struct super_block *sb
return 0;
 }
 
-static void do_emergency_remount(unsigned long foo)
+static void do_emergency_remount(unsigned long foo, nodemask_t *bar)
 {
struct super_block *sb;
 
@@ -644,7 +644,7 @@ static void do_emergency_remount(unsigne
 
 void emergency_remount(void)
 {
-   pdflush_operation(do_emergency_remount, 0);
+   pdflush_operation(do_emergency_remount, 0, NULL);
 }
 
 /*
diff -uprN -X 0/Documentation/dontdiff 1/fs/sync.c 2/fs/sync.c
--- 1/fs/sync.c 2007-09-11 14:36:05.0 -0700
+++ 2/fs/sync.c 2007-09-11 14:39:22.0 -0700
@@ -21,9 +21,9 @@
  * sync everything.  Start out by waking pdflush, because that writes back
  * all queues in parallel.
  */
-static void do_sync(unsigned long wait)
+static void do_sync(unsigned long wait, nodemask_t *unused)
 {
-   wakeup_pdflush(0);
+   wakeup_pdflush(0, NULL);
sync_inodes(0); /* All mappings, inodes and their blockdevs */
DQUOT_SYNC(NULL);
sync_supers();  /* Write the superblocks */
@@ -38,13 +38,13 @@ static void do_sync(unsigned long wait)
 
 asmlinkage long sys_sync(void)
 {
-   do_sync(1);
+   do_sync(1, NULL);
return 0;
 }
 
 void emergency_sync(void)
 {
-   pdflush_operation(do_sync, 0);
+   pdflush_operation(do_sync, 0, NULL);
 }
 
 /*
diff -uprN -X 0/Documentation/dontdiff 1/include/linux/writeback.h 2/include/linux/writeback.h
--- 1/include/linux/writeback.h 2007-09-11 14:37:46.0 -0700
+++ 2/include/linux/writeback.h 2007-09-11 14:39:22.0 -0700
@@ -91,7 +91,7 @@ static inline void inode_sync_wait(struc
 /*
  * mm/page-writeback.c
  */
-int wakeup_pdflush(long nr_pages);
+int wakeup_pdflush(long nr_pages, nodemask_t *nodes);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(gfp_t gfp_mask);
@@ -122,7 +122,8 @@ balance_dirty_pages_ratelimited(struct a
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
void *data);
 
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
+int pdflush_operation(void (*fn)(unsigned long, nodemask_t *nodes),
+   unsigned long arg0, nodemask_t *nodes);
 int generic_writepages(struct address_space *mapping,
   struct writeback_control *wbc);
 int write_cache_pages(struct address_space *mapping,
diff -uprN -X 0/Documentation/dontdiff 1/mm/page-writeback.c 2/mm/page-writeback.c
--- 1/mm/page-writeback.c   2007-09-11 14:36:24.0 -0700
+++ 2/mm/page-writeback.c   2007-09-11 14:39:22.0 -0700
@@ -101,7 +101,7 @@ EXPORT_SYMBOL(laptop_mode);
 /* End of sysctl-exported parameters */
 
 
-static void background_writeout(unsigned long _min_pages);
+static void background_writeout(unsigned long _min_pages, nodemask_t *nodes);
 
 /*
  * Work out the current dirty-memory clamping and background writeout
@@ -272,7 +272,7 @@ static void balance_dirty_pages(struct a
 */
if ((laptop_mode && pages_written) ||
 (!laptop_mode && (nr_reclaimable > background_thresh)))
-   pdflush_operation(background_writeout, 0);
+   pdflush_operation(background_writeout, 0, NULL);
 }
 
 void set_page_dirty_balance(struct page *page)
@@ -362,7 +362,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
  * writeback at least _min_pages, and keep writing until the amount of dirty
  * memory is less than the background threshold, or until we're all clean.
  */
-static void background_writeout(unsigned long _min_pages)
+static void background_writeout(unsigned long _min_pages, nodemask_t *unused)
 {
long min_pages = _min_pages;
struct writeback_control wbc = {
@@ -402,12 +402,12 @@ static void background_writeout(unsigned
  * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
  * -1 if all pdflush thre

[PATCH 1/6] cpuset write dirty map

2007-09-11 Thread Ethan Solomita


Add a dirty map to struct address_space

In a NUMA system it is helpful to know where the dirty pages of a mapping
are located. That way we will be able to implement writeout for applications
that are constrained to a portion of the memory of the system as required by
cpusets.

This patch implements the management of dirty node maps for an address
space through the following functions:

cpuset_clear_dirty_nodes(mapping)         Clear the map of dirty nodes

cpuset_update_dirty_nodes(mapping, page)  Record a node in the dirty nodes map

cpuset_init_dirty_nodes(mapping)          First time init of the map


The dirty map may be stored either directly in the mapping (for NUMA
systems with at most BITS_PER_LONG nodes) or separately allocated
for systems with a large number of nodes (e.g. IA64 with 1024 nodes).

Updating the dirty map may involve allocating it first for large
configurations. Therefore we protect the allocation and setting
of a node in the map through the tree_lock. The tree_lock is
already taken when a page is dirtied so there is no additional
locking overhead if we insert the updating of the nodemask there.

The dirty map is only cleared (or freed) when the inode is cleared.
At that point no pages are attached to the inode anymore and therefore it can
be done without any locking. The dirty map therefore records all nodes that
have been used for dirty pages by that inode until the inode is no longer
used.
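
The lifetime of the dirty map described above can be sketched in
userspace C for the small configuration, where the map fits in one
unsigned long. The toy_* names are hypothetical stand-ins, not the
kernel functions, and locking (the tree_lock in the patch) is left out.

```c
/* Illustrative stand-in for a mapping on a system whose node count
 * fits in one unsigned long. */
struct toy_mapping {
	unsigned long dirty_nodes;	/* bit n set: node n has held a dirty page */
};

/* First-time init: no node has dirtied anything yet. */
static void toy_init_dirty_nodes(struct toy_mapping *m)
{
	m->dirty_nodes = 0;
}

/* Record that a page on `node` was dirtied. The bit is never dropped
 * here, so the map only grows until the mapping is cleared. The caller
 * must pass a node number below the bit width of unsigned long. */
static void toy_update_dirty_nodes(struct toy_mapping *m, unsigned int node)
{
	m->dirty_nodes |= 1UL << node;
}

/* Called when the inode is cleared: forget all recorded nodes. */
static void toy_clear_dirty_nodes(struct toy_mapping *m)
{
	m->dirty_nodes = 0;
}
```

Dirtying pages on node 3 twice and on node 0 once leaves bits 0 and 3
set; the bits stay set after writeback and only go away on clear.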

Signed-off-by: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ethan Solomita <[EMAIL PROTECTED]>

---

Patch against 2.6.23-rc4-mm1

diff -uprN -X 0/Documentation/dontdiff 0/fs/buffer.c 1/fs/buffer.c
--- 0/fs/buffer.c   2007-09-11 14:35:58.0 -0700
+++ 1/fs/buffer.c   2007-09-11 14:36:24.0 -0700
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
 
@@ -723,6 +724,7 @@ static int __set_page_dirty(struct page 
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
+   cpuset_update_dirty_nodes(mapping, page);
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 
diff -uprN -X 0/Documentation/dontdiff 0/fs/fs-writeback.c 1/fs/fs-writeback.c
--- 0/fs/fs-writeback.c 2007-09-11 14:35:58.0 -0700
+++ 1/fs/fs-writeback.c 2007-09-11 14:36:24.0 -0700
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "internal.h"
 
 int sysctl_inode_debug __read_mostly;
@@ -476,6 +477,12 @@ int generic_sync_sb_inodes(struct super_
continue;   /* blockdev has wrong queue */
}
 
+   if (!cpuset_intersects_dirty_nodes(mapping, wbc->nodes)) {
+   /* No pages on the nodes under writeback */
+   list_move(&inode->i_list, &sb->s_dirty);
+   continue;
+   }
+
/* Was this inode dirtied after sync_sb_inodes was called? */
if (time_after(inode->dirtied_when, start))
break;
diff -uprN -X 0/Documentation/dontdiff 0/fs/inode.c 1/fs/inode.c
--- 0/fs/inode.c   2007-09-11 14:35:58.0 -0700
+++ 1/fs/inode.c   2007-09-11 14:36:24.0 -0700
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * This is needed for the following functions:
@@ -157,6 +158,7 @@ static struct inode *alloc_inode(struct 
mapping_set_gfp_mask(mapping, GFP_HIGHUSER_PAGECACHE);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
+   cpuset_init_dirty_nodes(mapping);
 
/*
 * If the block_device provides a backing_dev_info for client
@@ -264,6 +266,7 @@ void clear_inode(struct inode *inode)
bd_forget(inode);
if (S_ISCHR(inode->i_mode) && inode->i_cdev)
cd_forget(inode);
+   cpuset_clear_dirty_nodes(inode->i_mapping);
inode->i_state = I_CLEAR;
 }
 
diff -uprN -X 0/Documentation/dontdiff 0/include/linux/cpuset.h 1/include/linux/cpuset.h
--- 0/include/linux/cpuset.h   2007-09-11 14:35:58.0 -0700
+++ 1/include/linux/cpuset.h   2007-09-11 14:36:24.0 -0700
@@ -77,6 +77,45 @@ extern void cpuset_track_online_nodes(vo
 
 extern int current_cpuset_is_being_rebound(void);
 
+/*
+ * We need macros since struct address_space is not defined yet
+ */
+#if MAX_NUMNODES <= BITS_PER_LONG
+#define cpuset_update_dirty_nodes(__mapping, __page)   \
+   do {\
+   int node = page_to_nid(__page); \
+   if (!node_isset(node, (__mapping)->dirty_nodes))\
+   node_set(node, (__mapping)->dirty_nodes);   \
+   } while (0)
+
+#define cpuse

Re: [PATCH 0/6] cpuset aware writeback

2007-09-11 Thread Ethan Solomita

Perform writeback and dirty throttling with awareness of cpuset mems_allowed.

The theory of operation has two primary elements:

1. Add a nodemask per mapping which indicates the nodes
   which have set PageDirty on any page of the mappings.

2. Add a nodemask argument to wakeup_pdflush() which is
   propagated down to sync_sb_inodes.

This leaves sync_sb_inodes() with two nodemasks. One is passed to it and
specifies the nodes the caller is interested in syncing, and will either
be null (i.e. all nodes) or will be cpuset_current_mems_allowed in the
caller's context.

The second nodemask is attached to the inode's mapping and shows who has
modified data in the inode. sync_sb_inodes() will then skip syncing of
inodes if the nodemask argument does not intersect with the mapping
nodemask.
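
As a rough sketch of that skip test, using a plain unsigned long in
place of nodemask_t: toy_intersects_dirty_nodes() is a hypothetical
name, and treating a NULL mask as "all nodes" follows the description
above rather than any code shown in this thread.

```c
#include <stddef.h>

/*
 * Return nonzero if the inode should be considered for writeback.
 * `mapping_nodes` is the set of nodes that dirtied the mapping;
 * `wanted` is the caller's nodemask, or NULL for "all nodes".
 */
static int toy_intersects_dirty_nodes(unsigned long mapping_nodes,
				      const unsigned long *wanted)
{
	if (wanted == NULL)
		return 1;
	return (mapping_nodes & *wanted) != 0;
}
```

A caller restricted to nodes 0 and 1 (mask 0x3) would skip an inode
dirtied only from node 2 (mask 0x4), but would sync one dirtied from
nodes 1 and 2 (mask 0x6).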

cpuset_current_mems_allowed will be passed in to pdflush
background_writeout by try_to_free_pages and balance_dirty_pages.
balance_dirty_pages also passes the nodemask in to writeback_inodes
directly when doing active reclaim.

Other callers do not limit inode writeback, passing in a NULL nodemask
pointer.

A final change is to get_dirty_limits. It takes a nodemask argument, and
when it is null there is no change in behavior. If the nodemask is set,
page statistics are accumulated only for specified nodes, and the
background and throttle dirty ratios will be read from a new per-cpuset
ratio feature.

For testing I did a variety of basic tests, verifying individual
features of the test. To verify that it fixes the core problem, I
created a stress test which involved using cpusets and mems_allowed
to split memory so that all daemons had memory set aside for them, and
my memory stress test had a separate set of memory. The stress test was
mmaping 7GB of a very large file on disk. It then scans the entire 7GB
of memory reading and modifying each byte. 7GB is more than the amount
of physical memory made available to the stress test.

Using iostat I can see the initial period of reading from disk, followed
by a period of simultaneous reads and writes as dirty bytes are pushed
to make room for new reads.

In a separate log-in, in the other cpuset, I am running:

while true; do date | tee -a date.txt; sleep 5; done

date.txt resides on the same disk as the large file mentioned above. The
above while-loop serves the dual purpose of providing me visual clues of
progress along with the opportunity for the "tee" command to become
throttled writing to the disk.

The effect of this patchset is straightforward. Without it there are
long hangs between appearances of the date. With it the dates are all 5
(or sometimes 6) seconds apart.

I also added printks to the kernel to verify that, without these
patches, the tee was being throttled (along with lots of other things),
and with the patch only pdflush is being throttled.

These patches are mostly unchanged from Chris Lameter's original
changelist posted previously to linux-mm.


Re: [patch][Intel-IOMMU] Fix for IOMMU early crash

2007-09-11 Thread Keshavamurthy, Anil S

On Wed, Sep 12, 2007 at 05:48:52AM +1000, Paul Mackerras wrote:
> Keshavamurthy, Anil S writes:
> 
> > Subject: Fix IOMMU early crash
> > 
> > This patch avoids copying pci_bus's->sysdata to
> > pci_dev's->sysdata as one can easily obtain
> > the same through pci_dev->bus->sysdata.
> 
> At the moment this will cause ppc64 to crash, since we rely on
> pci_dev->sysdata pointing to some node in the firmware device tree,
> either the device's node or the node for a parent bus.
> 
> We could change the ppc64 code to use pci_dev->bus->sysdata in the
> case when pci_dev->sysdata is NULL, which would fix the problem.  I
> think that change should be incorporated as part of this patch so that
> we don't break git bisection.
Can I expect the ppc64 code changes from you? 
Once I get yours, I will merge it with mine and post it again.
> 
> In other words I don't want to see this patch applied as it stands.
Yup, I agree with you.

-Anil

Re: [announce] CFS-devel, performance improvements

2007-09-11 Thread Rob Hussey

Hi Ingo,

When compiling, I get:
In file included from kernel/sched.c:794:
kernel/sched_fair.c: In function 'task_new_fair':
kernel/sched_fair.c:857: error: 'sysctl_sched_child_runs_first'
undeclared (first use in this function)
kernel/sched_fair.c:857: error: (Each undeclared identifier is
reported only once
kernel/sched_fair.c:857: error: for each function it appears in.)

Presumably because sched_fair.c is being included into sched.c before
sysctl_sched_child_runs_first is defined.
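
For illustration, the general C rule at work is that an identifier must
be declared before its first use in the translation unit. A minimal
sketch, with made-up toy_* names standing in for the real symbols; I am
not claiming this is how the tree should be fixed:

```c
/* A declaration ahead of the first use is enough for the compiler;
 * the definition itself can come later in the same translation unit. */
extern unsigned int toy_child_runs_first;

/* Stands in for code pulled in early via #include. */
static int toy_task_new(void)
{
	return toy_child_runs_first ? 1 : 0;
}

/* The definition, placed after the code that uses it. */
unsigned int toy_child_runs_first = 1;
```

Dropping the extern line reproduces the same class of "undeclared
(first use in this function)" error as above.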

Regards,
Rob

Re: [PATCH] v3 of IBM power meter driver

2007-09-11 Thread Darrick J. Wong

On Tue, Sep 11, 2007 at 09:23:35AM -0400, Mark M. Hoffman wrote:

> I am not an IPMI expert, so I would appreciate getting an Acked-by from
> someone who knows more about that subsystem.
> 
> Anyway, some comments are below.  This is nowhere near a complete review yet.

Thank you for the review!  Comments interspersed below, though for
brevity the one-liners have been fixed.

> > +config SENSORS_IBMPEX
> > +   tristate "IBM PowerExecutive temperature/power sensors"
> > +   depends on IPMI_SI
> 
> Open question: can we use "select" here?  As written, it took some hunting to
> even get this driver to show up as an option in menuconfig.

Changed, since it seems reasonable that someone looking for PEx support
might not necessarily know that it is based upon IPMI.

> > +struct ibmpex_bmc_data {
> > +   struct list_head list;
> > +   struct class_device *class_dev;
> 
> My current stack of patches includes one which requires that this be changed
> to 'struct device *hwmon_dev', as 'struct class_device' is going away soon.
> You may rebase on my testing tree[1], or else I will just follow up with a
> patch to fix this up after I eventually merge yours.
> 
> [1] http://lm-sensors.org/kernel?p=kernel/mhoffman/hwmon-2.6.git;a=shortlog;h=testing

Done.

> > +static ssize_t ibmpex_show_sensor(struct device *dev,
> > + struct device_attribute *devattr,
> > + char *buf)
> > +{
> > +   struct sensor_device_attribute *attr = to_sensor_dev_attr(devattr);
> > +   int iface = PEX_INTERFACE(attr->index);
> > +   int sensor = PEX_SENSOR(attr->index);
> > +   int func = PEX_FUNC(attr->index);
> > +   struct ibmpex_bmc_data *data = get_bmc_data(iface);
> 
> ... especially given how many times you're going to call it.  Is there any
> reason you can't use the driver_data field of struct device *dev for that?

I can (and did) update the code to use dev_get/set_drvdata for the
accessors.  However, the "iface" field exists as a mechanism to map
interface numbers to struct ibmpex_bmc_data/struct device data because
the callback that IPMI uses to notify clients that BMCs are going away
only passes the interface number, not the struct device itself.
Unfortunately, this means that get_bmc_data() must remain, but now it is
only used once at the end of life.

> E.g. i2c based hwmon drivers do this at some point during the probe:
> 
>   i2c_set_clientdata(new_client, data);
> 
> (which becomes)
> 
>   dev_set_drvdata(&new_client->dev, data);
> 
> If you could do that, then you no longer need 'iface' at all in the function
> above... *that* may allow you to use the SENSOR_ATTR_2 mechanism from
> hwmon-sysfs.h - much easier to read than the manual number packing for
> 'sensor' and 'func'.

Doesn't look too hard; I'll have a go at it and see how it does.

> > +   err = ibmpex_query_sensor_count(data);
> > +   if (err < 0)
> > +   return -ENOENT;
> > +   data->num_sensors = err;
> > +
> 
> Did you mean 'if (err <= 0)' ?

Yes.

> > +   /* Create attributes */
> > +   for (j = 0; j < PEX_NUM_SENSOR_FUNCS; j++)
> > +   if (create_sensor(data, sensor_type, sensor_counter,
> > + i, j))
> 
> Why not 'err = create_sensor(...)' and propagate the actual error here?

Rough draft syndrome?  'tis fixed. :)

--D



Re: [PATCH -mm] uvesafb: Don't access VGA registers directly when running on non-x86

2007-09-11 Thread Paul Mundt

On Wed, Sep 12, 2007 at 01:09:59AM +0200, Michal Januszewski wrote:
> The VGA registers are only available at their legacy IO locations on x86.
> Don't try to access them when running on other arches.
> 
> Note that the code accessing them directly is just an optimization (limits
> slow BIOS function calls).  We don't lose any functionality by using
> BIOS calls instead of it on non-x86.
> 
If you do that, then you also have to #ifdef CONFIG_X86 around
video/vga.h, as that drags in asm/vga.h, which does not exist on all
platforms. I have little interest in adding a stub vga.h on my
architectures to support a driver that in practice works on nothing but
x86.
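
A minimal sketch of the preprocessor pattern under discussion, assuming a
made-up macro SKETCH_HAVE_VGA in place of CONFIG_X86. The helper names are
hypothetical; the point is only that the header include and the fast path
live under the same guard, while the fallback block compiles everywhere.

```c
/* Assumption: SKETCH_HAVE_VGA plays the role of CONFIG_X86; leave it
 * undefined to model a non-x86 build. */
#ifdef SKETCH_HAVE_VGA
/* The driver would place its #include of video/vga.h here, so that the
 * header (and asm/vga.h behind it) is never pulled in elsewhere. */
static int sketch_vga_blank(int blank)
{
	(void)blank;
	return 0;	/* 0: handled via direct register access */
}
#endif

static int sketch_bios_blank(int blank)
{
	(void)blank;
	return 1;	/* 1: handled via the slower BIOS call path */
}

static int sketch_blank(int blank, int vga_compat)
{
#ifdef SKETCH_HAVE_VGA
	if (vga_compat)
		return sketch_vga_blank(blank);
	else
#endif
	{
		/* Without the guard this block is the whole function body. */
		(void)vga_compat;
		return sketch_bios_blank(blank);
	}
}
```

Built without SKETCH_HAVE_VGA, sketch_blank() always takes the BIOS path
and no VGA header is needed at all.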
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [announce] CFS-devel, performance improvements

2007-09-11 Thread Roman Zippel

Hi,

Out of curiosity: will I ever get answers to my questions?

bye, Roman

Re: x86 merge - a little feedback

2007-09-11 Thread Paul Mundt

On Tue, Sep 11, 2007 at 10:34:23PM +0100, Andi Kleen wrote:
> > People do not expect code under arch/i386/ to be used by code under
> > arch/x86_64/ and vice versa.
> >
> > That regularly results in people sending patches that don't compile on
> > the other architecture.
> >
> > With one architecture it's much more obvious that the code is shared.
> 
> Will that cause people to compile test both? I have my doubts that 
> will really work.
> 
> e.g. a similar example would be CONFIG_MMU=n. The code 
> is mostly shared and in the same directories, but people still
> break the MMUless architectures all the time. 
> 
As I was the first one to do CONFIG_MMU=y/n in the same arch directory,
since 2.5, I can tell you that that's simply crap. The only reason
CONFIG_MMU=n gets broken all the time is because people don't think about
it in generic code, it's rarely broken in the architecture code, and even
with the most occasional of build tests most of that gets caught in a
hurry.

You do of course have to consider both cases when writing new code, but
those things tend to be pretty obvious. It's a bit more work for the arch
maintainer, but it's certainly far less confusing and problematic than
separating things out.

In fact, going the opposite route is what leads to endless trouble in the
long run. Since you brought up the MMUless example, m68knommu is a good
case in point. Rather than beginning life in arch/m68k, it was forked off,
mostly to deal with the ColdFire CPUs that weren't planned to have MMUs.
Now that the product line has moved along, adding an MMU to it is in the
roadmap, which means that inevitably they're both going to have to be
combined anyways. Simply dealing with the initial trouble of having them
combined initially would have solved a lot of that mess.

You can ignore the added maintenance for as long as possible, but sooner
or later it's going to be a problem. Procrastination is not something
that bodes particularly well for divergent hardware support.

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Christoph Lameter

On Wed, 12 Sep 2007, Andrea Arcangeli wrote:

> On Tue, Sep 11, 2007 at 01:41:08PM -0700, Christoph Lameter wrote:
> > The advantage of this approach over Andrea's is basically that the 4k 
> > filesystems still can be used as is. 4k is useful for binaries and for 
> 
> If you mean that with my approach you can't use a 4k filesystem as is,
> that's not correct. I even run the (admittedly premature but
> promising) benchmarks on my patch on a 4k blocksized
> filesystem... Guess what, you can even still mount a 1k fs on a 2.6
> kernel.

Right, you can use a 4k filesystem. The 4k blocks are then buffers in a 
larger page.

> The main advantage I can see in your patch is that distributions won't
> need to ship a 64k PAGE_SIZE kernel rpm (but your single rpm will be
> slower).

I would think that your approach would be slower, since you always have to 
populate 1 << N ptes when mmapping a file? Plus there is a lot of wasted 
memory because even a file with one character needs an order N page? So 
there are fewer pages available for the same workload.

Then you are breaking the mmap assumptions of applications, because the 
order N kernel will no longer be able to map 4k pages.  You likely need a new 
binary format that has pages correctly aligned. I know that we would need 
one on IA64 if we go beyond the established page sizes.
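
The arithmetic behind the two objections can be sketched as follows. This is
a rough illustration, assuming a 4k base page; the helper names are
hypothetical and nothing here is taken from either patchset.

```c
#include <stddef.h>

#define SKETCH_BASE_PAGE 4096u	/* assumption: 4k base page size */

/* Page cache bytes consumed by a file of 'len' bytes when the smallest
 * allocation unit is one order-N page. */
static size_t sketch_cache_bytes(size_t len, unsigned order)
{
	size_t unit = (size_t)SKETCH_BASE_PAGE << order;
	size_t units = len ? (len + unit - 1) / unit : 0;

	return units * unit;
}

/* Number of base-page ptes covered by one order-N page. */
static size_t sketch_ptes_per_unit(unsigned order)
{
	return (size_t)1 << order;
}
```

For example, with N = 4 (a 64k unit) a one-byte file occupies 65536 bytes
and each unit spans 16 base-page ptes, compared with 4096 bytes and one pte
at order 0.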

Re: time_after - what on earth???

2007-09-11 Thread Adrian McMenamin

On 12/09/2007, Björn Steinbrink <[EMAIL PROTECTED]> wrote:

>
> A fix would likely initialize "when" to jiffies.
>
> Björn
>

Thanks, I'll try that :)

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Christoph Lameter

On Tue, 11 Sep 2007, Nick Piggin wrote:

> > Well it seems that we have different interpretations of what was agreed
> > on. My understanding was that the large blocksize patchset was okay
> > provided that I supply an acceptable mmap implementation and put a
> > warning in.
> 
> Yes. I think we differ on our interpretations of "okay". In my interpretation,
> it is not OK to use this patch as a way to solve VM or FS or IO scalability
> issues, especially not while the alternative approaches that do _not_ have
> these problems have not been adequately compared or argued against.

We never talked about not being able to solve scalability issues with this 
patchset. The alternate approaches were discussed at the VM MiniSummit and 
at the VM/FS meeting. You organized the VM/FS summit. I know you were 
there and were arguing for your approach. That was not sufficient?

> > Well even without slab targeted reclaim: Mel's antifrag will sort the
> > dentries into separate blocks of memory and so isolate the issue.
> 
> So even after all this time you do not understand what the fundamental
> problem is with anti-frag and yet you are happy to waste both our time
> in endless flamewars telling me how wrong I am about it.

We obviously have discussed this before, and the only point of your asking 
this question seems to be to have me repeat the whole line of argument 
again?

> Forgive me if I'm starting to be rude, Christoph. This is really irritating.

Sorry but I have had too much exposure to philosophy. Talk about absolutes 
like guarantees (that do not exist even for order 0 allocs) and unlikely memory 
fragmentation scenarios to show that something does not work seems to 
be getting into some metaphysical realm where there is no data anymore 
to draw any firm conclusions.

Software reliability is inherently probabilistic, otherwise we would not have 
things like CRC sums and SHA1 algorithms. It's just a matter of reducing 
the failure rate sufficiently. The failure rate for lower order 
allocations (0-3) seems to have been significantly reduced in 2.6.23 
through lumpy reclaim.

If antifrag measures are not successful (likely for 2M allocs) then other 
methods (like the large page pools that you skipped when reading my post) 
will need to be used.

Re: + git-net-broke-ixgbe.patch added to -mm tree

2007-09-11 Thread Kok, Auke


[EMAIL PROTECTED] wrote:

> The patch titled
>      git-net-broke-ixgbe
> has been added to the -mm tree.  Its filename is
>      git-net-broke-ixgbe.patch
>
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
>
> See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find
> out what to do about this
>
> --
> Subject: git-net-broke-ixgbe
> From: Andrew Morton <[EMAIL PROTECTED]>
>
> igiveup


relax! do not despair!

I will have a patch for ixgbe to fixup the NAPI API stuff tomorrow! This assumes 
that you have the version that I sent out last week though (v4).


It's running smoke tests right now, should be ready tomorrow.

Cheers,

Auke

Re: time_after - what on earth???

2007-09-11 Thread Björn Steinbrink

On 2007.09.12 00:10:19 +0100, Adrian McMenamin wrote:
> On 12/09/2007, Björn Steinbrink <[EMAIL PROTECTED]> wrote:
> > On 2007.09.12 00:19:09 +0200, Rene Herman wrote:
> > > On 09/12/2007 12:15 AM, Adrian McMenamin wrote:
> > >
> > >> On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote:
> > >>> On 09/12/2007 12:05 AM, Adrian McMenamin wrote:
> > >>>
> > >>>> OK, why does this line occasionally return true:
> >
> > What exactly is "occasionally"?  Does it happen more than once per
> > boot? If not, and it happens after a certain time after booting, it
> > might be wrapping of the jiffie counter (see below).
> >
> > >>>>    if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
> > >>>>
> > >>>> while this one never does (no other changes made):
> > >>>>
> > >>>> if  ((maple_dev->interval > 0) && (time_after(jiffies,
> > >>>> maple_dev->when)))
> > >>> Is maple_dev->when an unsigned long?
> > >>>
> > >> Yes. Does that make a difference?
> > >
> > > If it had been a signed type, it could've wrapped to something you didn't
> > > expect, explaining the difference at least...
> > >
> > > With an unsigned long, the only difference should be that time_after()
> > > deals with jiffie wrapping, which I assume is not an actual problem
> > > here. I'll retreat into the shades again... ;-(
> >
> > If "occasionally" is limited to once per boot, it might be jiffie
> > wrapping. IIRC jiffies are initialized so that they wrap after about 5
> > minutes of uptime to reveal such bugs without forcing you to wait for
> > ages just to have the counter wrap for the first time.
> >
> 
> No, I mean "works properly" - ie occasionally evaluates as true

Ehrm, yeah, I somehow parsed that as if it had a negation in there.

Anyway, I looked up the patches you posted. "when" is initialized to 0
and only changed if the above condition evaluates to true. Now,
time_after and ">" have different points at which "future" and "past"
are separated. time_after splits (about) equally between future and
past, so 0 can be either, depending on the value of jiffies. But for ">"
0 is almost always in the past, except for the rare event of jiffies
being 0.

Now, given that jiffies start out at a huge value to make the counter
wrap around early, 0 happens to be in the "future" for time_after, until
the wrap around occurs. So in your case, you just might have to wait
those 5 minutes to get the working behaviour, instead of the common case
in which it breaks after that time ;-)

A fix would likely initialize "when" to jiffies.
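
The effect can be reproduced in userspace. What follows is a minimal sketch,
not the kernel's code: it assumes 32-bit jiffies, HZ = 1000 and an initial
counter value of -300*HZ, and sketch_time_after() is a hypothetical stand-in
that mirrors the signed-difference comparison time_after() performs.

```c
#include <stdint.h>

/* Assumption: 32-bit jiffies and HZ = 1000; both depend on configuration. */
#define SKETCH_HZ 1000u
/* The counter starts about 5 minutes before it wraps. */
#define SKETCH_INITIAL_JIFFIES ((uint32_t)(0u - 300u * SKETCH_HZ))

/* Stand-in for time_after(a, b): true if a is later than b, judged by the
 * sign of the wrapped difference rather than by magnitude. */
static int sketch_time_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Plain comparison, as in the "jiffies > when" variant. */
static int sketch_plain_after(uint32_t a, uint32_t b)
{
	return a > b;
}
```

With "when" left at 0, sketch_plain_after(SKETCH_INITIAL_JIFFIES, 0) is true
right after boot, while sketch_time_after(SKETCH_INITIAL_JIFFIES, 0) stays
false until the counter has wrapped. Initializing "when" to the current
counter value makes the wrap-safe form behave as intended from the start.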

Björn

Re: [git patches] IDE fixes for 2.6.23-rc6

2007-09-11 Thread Jeff Garzik


Bartlomiej Zolnierkiewicz wrote:

Please pull from:

master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6.git/

to receive the following updates:

 drivers/ata/pata_ali.c         |    7 ++
 drivers/ide/Kconfig            |    4 +-
 drivers/ide/ide-iops.c         |    3 +-
 drivers/ide/pci/alim15x3.c     |    7 ++
 drivers/ide/pci/hpt366.c       |  138 +++-
 drivers/ide/pci/pdc202xx_new.c |    9 ++-
 drivers/ide/pci/via82cxxx.c    |   15 +++-
 drivers/ide/ppc/mpc8xx.c       |    1 -
 drivers/ide/setup-pci.c        |   41 +---
 include/linux/ide.h            |   13 
 10 files changed, 141 insertions(+), 97 deletions(-)


Bartlomiej Zolnierkiewicz (1):
  via82cxxx: add Arima W730-K8 and other rebadgings to short cables list

Daniel Exner (1):
  pata_ali/alim15x3: override 80-wire cable detection for Toshiba S1800-814

Kumar Gala (1):
  mpc8xx: Only build mpc8xx on arch/ppc

Mikael Pettersson (1):
  pdc202xx_new: PLL detection fix

Sergei Shtylyov (5):
  ide: fix PCI refcounting
  pdc202xx_new: fix PCI refcounting
  hpt366: fix PCI clock detection for HPT374 (take 4)
  ide: add ide_dev_is_sata() helper (take 2)
  hpt366: UltraDMA filter for SATA cards (take 2)

Tony Breeds (1):
  pmac: build fix


diff --git a/drivers/ata/pata_ali.c b/drivers/ata/pata_ali.c
index 94e5edc..71bdc3b 100644
--- a/drivers/ata/pata_ali.c
+++ b/drivers/ata/pata_ali.c
@@ -48,6 +48,13 @@ static struct dmi_system_id cable_dmi_table[] = {
DMI_MATCH(DMI_BOARD_VERSION, "OmniBook N32N-736"),
},
},
+   {
+   .ident = "Toshiba Satelite S1800-814",
+   .matches = {
+   DMI_MATCH(DMI_SYS_VENDOR, "TOSHIBA"),
+   DMI_MATCH(DMI_PRODUCT_NAME, "S1800-814"),
+   },
+   },
{ }
 };
 


ACK



irq load balancing

2007-09-11 Thread Venkat Subbiah

Most of the load in my system is triggered by a single ethernet IRQ. 
Essentially the IRQ schedules a tasklet and most of the work is done in that 
tasklet. From what I read, it looks like the tasklet would be executed on the 
same CPU on which it was scheduled. So this means that even in an SMP system 
one processor will be overloaded.

So will using the user space IRQ load balancer really help? What I am doubtful 
about is that the user space load balancer only comes along and changes the 
affinity once in a while. What I really need is for every interrupt to go to a 
different CPU in a round robin fashion.

It looks like the APIC can distribute IRQs dynamically? Is this supported in 
the kernel, and is there any config or proc interface to turn it on/off?


Thx,
Venkat


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Andrea Arcangeli

On Tue, Sep 11, 2007 at 01:41:08PM -0700, Christoph Lameter wrote:
> The advantage of this approach over Andrea's is basically that the 4k 
> filesystems still can be used as is. 4k is useful for binaries and for 

If you mean that with my approach you can't use a 4k filesystem as is,
that's not correct. I even run the (admittedly premature but
promising) benchmarks on my patch on a 4k blocksized
filesystem... Guess what, you can even still mount a 1k fs on a 2.6
kernel.

The main advantage I can see in your patch is that distributions won't
need to ship a 64k PAGE_SIZE kernel rpm (but your single rpm will be
slower).

Re: time_after - what on earth???

2007-09-11 Thread Rene Herman


On 09/12/2007 01:09 AM, Björn Steinbrink wrote:

> On 2007.09.12 00:19:09 +0200, Rene Herman wrote:
>> On 09/12/2007 12:15 AM, Adrian McMenamin wrote:
>>
>>> On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote:
>>>> On 09/12/2007 12:05 AM, Adrian McMenamin wrote:
>>>>
>>>>> OK, why does this line occasionally return true:
>
> What exactly is "occasionally"?  Does it happen more than once per
> boot? If not, and it happens after a certain time after booting, it
> might be wrapping of the jiffie counter (see below).
>
>>>>>   if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
>>>>>
>>>>> while this one never does (no other changes made):
>>>>>
>>>>> if  ((maple_dev->interval > 0) && (time_after(jiffies,
>>>>> maple_dev->when)))
>>>>
>>>> Is maple_dev->when an unsigned long?
>>>
>>> Yes. Does that make a difference?
>>
>> If it had been a signed type, it could've wrapped to something you didn't
>> expect, explaining the difference at least...
>>
>> With an unsigned long, the only difference should be that time_after() deals
>> with jiffie wrapping which I assume is not an actual problem here. I'll
>> retreat into the shades again... ;-(
>
> If "occasionally" is limited to once per boot, it might be jiffie
> wrapping. IIRC jiffies are initialized so that they wrap after about 5
> minutes of uptime to reveal such bugs without forcing you to wait for
> ages just to have the counter wrap for the first time.

Yes, but if jiffie wrapping was the problem, I'd expect the contrary
behaviour with the time_after() one hitting while the > one does not.


Rene.

[PATCH -mm] uvesafb: Don't access VGA registers directly when running on non-x86

2007-09-11 Thread Michal Januszewski

The VGA registers are only available at their legacy IO locations on x86.
Don't try to access them when running on other arches.

Note that the code accessing them directly is just an optimization (limits
slow BIOS function calls).  We don't lose any functionality by using
BIOS calls instead of it on non-x86.

Signed-off-by: Michal Januszewski <[EMAIL PROTECTED]>
---
diff --git a/drivers/video/uvesafb.c b/drivers/video/uvesafb.c
index 853323e..74fa7c7 100644
--- a/drivers/video/uvesafb.c
+++ b/drivers/video/uvesafb.c
@@ -935,6 +935,7 @@ static int uvesafb_setpalette(struct uvesafb_pal_entry *entries, int count,
if (start + count > 256)
return -EINVAL;
 
+#ifdef CONFIG_X86
/* Use VGA registers if mode is VGA-compatible. */
if (i >= 0 && i < par->vbe_modes_cnt &&
par->vbe_modes[i].mode_attr & VBE_MODE_VGACOMPAT) {
@@ -957,8 +958,10 @@ static int uvesafb_setpalette(struct uvesafb_pal_entry *entries, int count,
  "D" (entries),/* EDI */
  "S" (&par->pmi_pal)); /* ESI */
}
-#endif
-   else {
+#endif /* CONFIG_X86_32 */
+   else
+#endif /* CONFIG_X86 */
+   {
task = uvesafb_prep();
if (!task)
return -ENOMEM;
@@ -1102,6 +1105,7 @@ static int uvesafb_blank(int blank, struct fb_info *info)
struct uvesafb_ktask *task;
int err = 1;
 
+#ifdef CONFIG_X86
if (par->vbe_ib.capabilities & VBE_CAP_VGACOMPAT) {
int loop = 1;
u8 seq = 0, crtc17 = 0;
@@ -1124,7 +1128,9 @@ static int uvesafb_blank(int blank, struct fb_info *info)
while (loop--);
vga_wcrt(NULL, 0x17, crtc17);
vga_wseq(NULL, 0x00, 0x03);
-   } else {
+   } else
+#endif /* CONFIG_X86 */
+   {
task = uvesafb_prep();
if (!task)
return -ENOMEM;


Re: time_after - what on earth???

2007-09-11 Thread Adrian McMenamin

On 12/09/2007, Björn Steinbrink <[EMAIL PROTECTED]> wrote:
> On 2007.09.12 00:19:09 +0200, Rene Herman wrote:
> > On 09/12/2007 12:15 AM, Adrian McMenamin wrote:
> >
> >> On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote:
> >>> On 09/12/2007 12:05 AM, Adrian McMenamin wrote:
> >>>
> >>>> OK, why does this line occasionally return true:
>
> What exactly is "occasionally"?  Does it happen more than once per
> boot? If not, and it happens after a certain time after booting, it
> might be wrapping of the jiffie counter (see below).
>
> >>>>    if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
> >>>>
> >>>> while this one never does (no other changes made):
> >>>>
> >>>> if  ((maple_dev->interval > 0) && (time_after(jiffies,
> >>>> maple_dev->when)))
> >>> Is maple_dev->when an unsigned long?
> >>>
> >> Yes. Does that make a difference?
> >
> > If it had been a signed type, it could've wrapped to something you didn't
> > expect, explaining the difference at least...
> >
> > With an unsigned long, the only difference should be that time_after() deals
> > with jiffie wrapping which I assume is not an actual problem here. I'll
> > retreat into the shades again... ;-(
>
> If "occasionally" is limited to once per boot, it might be jiffie
> wrapping. IIRC jiffies are initialized so that they wrap after about 5
> minutes of uptime to reveal such bugs without forcing you to wait for
> ages just to have the counter wrap for the first time.
>

No, I mean "works properly" - ie occasionally evaluates as true

Re: time_after - what on earth???

2007-09-11 Thread Björn Steinbrink

On 2007.09.12 00:19:09 +0200, Rene Herman wrote:
> On 09/12/2007 12:15 AM, Adrian McMenamin wrote:
>
>> On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote:
>>> On 09/12/2007 12:05 AM, Adrian McMenamin wrote:
>>>
>>>> OK, why does this line occasionally return true:

What exactly is "occasionally"?  Does it happen more than once per
boot? If not, and it happens after a certain time after booting, it
might be wrapping of the jiffie counter (see below).

>>>>    if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
>>>>
>>>> while this one never does (no other changes made):
>>>>
>>>> if  ((maple_dev->interval > 0) && (time_after(jiffies,
>>>> maple_dev->when)))
>>> Is maple_dev->when an unsigned long?
>>>
>> Yes. Does that make a difference?
>
> If it had been a signed type, it could've wrapped to something you didn't
> expect, explaining the difference at least...
>
> With an unsigned long, the only difference should be that time_after() deals
> with jiffie wrapping which I assume is not an actual problem here. I'll
> retreat into the shades again... ;-(

If "occasionally" is limited to once per boot, it might be jiffie
wrapping. IIRC jiffies are initialized so that they wrap after about 5
minutes of uptime to reveal such bugs without forcing you to wait for
ages just to have the counter wrap for the first time.

Björn

Re: [PATCH -mm] video: uvesafb: Add X86 dependency.

2007-09-11 Thread Michal Januszewski

On Tue, Sep 11, 2007 at 09:31:59PM +0900, Paul Mundt wrote:

> > Anyway, I think it is up to Michal to decide if we should remove the 
> > kernel support for other archs, or let it stay a bit more while working 
> > on solving the v86d side of things. So I'll just step aside now
> > 
> Once v86d is fixed up to get at the ROM directly and the driver uses MMIO
> directly, I don't see a problem with unrestricting it. For the time being
> however, the build is both broken, and the emulator it uses won't run on
> anything but x86, so I see no reason not to add a Kconfig dependency that
> reflects this until such a time where it's no longer true.
> 
> At least if there's a set of restrictions on something fairly generic,
> they tend to be visible, and they also tend to get fixed up over time. We
> should however not enable something generically which at the moment is
> very much tied to a single platform. Later patches can remove the
> dependency at such a time that that assertion no longer holds true.

Just to clear things up: yes, at the moment v86d supports only
x86 and amd64 (aka x86_64) and yes, supporting other arches is
possible and planned.  The main limiting factors as far as this
is concerned are the small amount of free time I have and the fact
that I don't currently have access to non-x86 hardware.

Please note that the kernel part (i.e. uvesafb) is meant to be
generic (it currently uses VGA IO ports on non-x86, which is a
bug and for which a patch will follow) and support or lack thereof
for a specific arch should be dependent on v86d only.

That being said, I think that having a kernel dependency 
tracking the development status of userspace code is generally
a bad idea.

Best regards,
-- 
Michal Januszewski                JID: [EMAIL PROTECTED]
Gentoo Linux Developer            http://people.gentoo.org/spock


[PATCH] eCryptfs: Use generic_file_splice_read()

2007-09-11 Thread Michael Halcrow

eCryptfs is currently just passing through splice reads to the lower
filesystem. This is obviously incorrect behavior; the decrypted data
is what needs to be read, not the lower encrypted data. I cannot think
of any good reason for eCryptfs to implement splice_read, so this
patch points the eCryptfs fops splice_read to use
generic_file_splice_read.

Signed-off-by: Michael Halcrow <[EMAIL PROTECTED]>

--- linux-2.6.23-rc4-mm1.orig/fs/ecryptfs/file.c
+++ linux-2.6.23-rc4-mm1/fs/ecryptfs/file.c
@@ -338,21 +338,6 @@ static int ecryptfs_fasync(int fd, struc
return rc;
 }
 
-static ssize_t ecryptfs_splice_read(struct file *file, loff_t * ppos,
-   struct pipe_inode_info *pipe, size_t count,
-   unsigned int flags)
-{
-   struct file *lower_file = NULL;
-   int rc = -EINVAL;
-
-   lower_file = ecryptfs_file_to_lower(file);
-   if (lower_file->f_op && lower_file->f_op->splice_read)
-   rc = lower_file->f_op->splice_read(lower_file, ppos, pipe,
-   count, flags);
-
-   return rc;
-}
-
 static int ecryptfs_ioctl(struct inode *inode, struct file *file,
  unsigned int cmd, unsigned long arg);
 
@@ -365,7 +350,7 @@ const struct file_operations ecryptfs_di
.release = ecryptfs_release,
.fsync = ecryptfs_fsync,
.fasync = ecryptfs_fasync,
-   .splice_read = ecryptfs_splice_read,
+   .splice_read = generic_file_splice_read,
 };
 
 const struct file_operations ecryptfs_main_fops = {
@@ -382,7 +367,7 @@ const struct file_operations ecryptfs_ma
.release = ecryptfs_release,
.fsync = ecryptfs_fsync,
.fasync = ecryptfs_fasync,
-   .splice_read = ecryptfs_splice_read,
+   .splice_read = generic_file_splice_read,
 };
 
 static int

Re: libata not working for sis5533

2007-09-11 Thread Andrew Morton

On Sun, 09 Sep 2007 13:35:26 +0200
Patrizio Bassi <[EMAIL PROTECTED]> wrote:

> Patrizio Bassi wrote:
> > Jan Engelhardt wrote:
> >> On Sep 8 2007 11:38, Patrizio Bassi wrote:
> >>  
> >>> Jan Engelhardt wrote:
> >>>
> >>>> I shall give this a spin too, since I happen to have sis5513.
> >>>>
> >>>> Just booted this fresh ata-enabled system (a matter of mkinitrd).
> >>>> It has not exploded yet.
>   
> >>> don't you have the "irq 14" issue?
> >>>
> >> No, does not seem so.
> >>
> >>  
> >>> can you post here your .config?
> >>>
> >> http://rafb.net/p/vfTX0966.html
> >>
> >> Maybe it is solved in 2.6.22.3? (I don't remember what your version
> >> was.)
> >>
> >>
> >> Jan
> >>  
> >
> > For Alan, libata devs...hope can help debug...
> > this is http://www.patriziobassi.it/downloads/libata_issue.jpg

Looks more like a platform irq routing issue than an ata issue.

Perhaps an x86 or an acpi person can help out with this.

Probably nothing will happen, in which case I'll get back to you later
and ask you to raise a bugzilla entry, not that this will get it fixed :(


> > and this is the relative config i'm using
> > http://www.patriziobassi.it/downloads/config
> >
> > Let me know
> >
> > Patrizio
> 
> more debug:
> 
> I tried the irqpoll option as suggested; I just get a faster panic,
> as I don't get the 3 xfermode lines... but it is still impossible to boot...
> 
> Patrizio

Re: [PATCH] timekeeping: Prevent time going backwards on resume

2007-09-11 Thread Marcelo Tosatti


Patch below fixes the problem we were seeing (negative delta calculated 
in tick_do_update_jiffies64).

Thanks again Thomas!

On Wed, Sep 12, 2007 at 12:36:34AM +0200, Thomas Gleixner wrote:
> Timekeeping resume adjusts xtime by adding the slept time in seconds and
> resets the reference value of the clock source (clock->cycle_last).
> clock->cycle_last is used to calculate the delta between the last xtime
> update and the readout of the clock source in __get_nsec_offset(). xtime
> plus the offset is the current time. The resume code ignores the delta
> which had already elapsed between the last xtime update and the actual
> time of suspend. If the suspend time is short, then we can see time
> going backwards on resume.
> 
> Suspend:
> offs_s = clock->read() - clock->cycle_last;
> now = xtime + offs_s;
> timekeeping_suspend_time = read_rtc();
> 
> Resume:
> sleep_time = read_rtc() - timekeeping_suspend_time;
> xtime.tv_sec += sleep_time;
> clock->cycle_last = clock->read();
> offs_r = clock->read() - clock->cycle_last;
> now = xtime + offs_r;
> 
> if sleep_time == 0 and offs_r < offs_s, then time goes
> backwards.
> 
> Fix this by storing the offset from the last xtime update and add it to
> xtime during resume, when we reset clock->cycle_last:
> 
> sleep_time = read_rtc() - timekeeping_suspend_time;
> xtime.tv_sec += sleep_time;
> xtime += offs_s;  /* Fixup xtime offset at suspend time */
> clock->cycle_last = clock->read();
> offs_r = clock->read() - clock->cycle_last;
> now = xtime + offs_r;
> 
> Thanks to Marcelo for tracking this down on the OLPC and providing the
> necessary details to analyze the root cause.
> 
> Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>
> 
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -280,6 +280,8 @@ void __init timekeeping_init(void)
>  static int timekeeping_suspended;
>  /* time in seconds when suspend began */
>  static unsigned long timekeeping_suspend_time;
> +/* xtime offset when we went into suspend */
> +static s64 timekeeping_suspend_offset;
>  
>  /**
>   * timekeeping_resume - Resumes the generic timekeeping subsystem.
> @@ -305,6 +307,8 @@ static int timekeeping_resume(struct sys_device *dev)
>   wall_to_monotonic.tv_sec -= sleep_length;
>   total_sleep_time += sleep_length;
>   }
> + /* Make sure that we have the correct xtime reference */
> + timespec_add_ns(&xtime, timekeeping_suspend_offset);
>   /* re-base the last cycle value */
>   clock->cycle_last = clocksource_read(clock);
>   clock->error = 0;
> @@ -326,6 +330,8 @@ static int timekeeping_suspend(struct sys_device *dev, pm_message_t state)
>   unsigned long flags;
>  
>   write_seqlock_irqsave(&xtime_lock, flags);
>   timekeeping_suspended = 1;
> + /* Get the current xtime offset */
> + timekeeping_suspend_offset = __get_nsec_offset();
>   timekeeping_suspend_time = read_persistent_clock();
>   write_sequnlock_irqrestore(&xtime_lock, flags);
> 
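
The suspend/resume arithmetic from the changelog can be modelled in a few
lines of userspace C. This is a simplified sketch under stated assumptions:
all times are plain nanosecond counts, the clocksource is a bare counter
value passed in by the caller, and every name below is hypothetical rather
than taken from kernel/time/timekeeping.c.

```c
#include <stdint.h>

/* Hypothetical model of the timekeeping state; times in nanoseconds. */
struct sketch_clock {
	int64_t xtime;		/* wall time at the last xtime update */
	int64_t cycle_last;	/* clocksource value at the last update */
};

/* Current time = xtime plus the offset accumulated since the last update. */
static int64_t sketch_now(const struct sketch_clock *c, int64_t counter)
{
	return c->xtime + (counter - c->cycle_last);
}

/* Resume as described in the changelog. sleep_ns is the RTC-derived sleep
 * time, offs_s the offset that was pending at suspend time. */
static void sketch_resume(struct sketch_clock *c, int64_t counter,
			  int64_t sleep_ns, int64_t offs_s, int apply_fix)
{
	c->xtime += sleep_ns;
	if (apply_fix)
		c->xtime += offs_s;	/* fix up the xtime offset from suspend */
	c->cycle_last = counter;	/* re-base the last cycle value */
}
```

With a pending offset of 700000 ns at suspend and a sleep shorter than one
RTC second, resuming without the fix yields a time earlier than the one read
just before suspend; with the fix applied it does not.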

Re: sk98lin for 2.6.23-rc1

2007-09-11 Thread Willy Tarreau

On Tue, Sep 11, 2007 at 05:03:57PM +0200, Adrian Bunk wrote:
> On Tue, Sep 11, 2007 at 10:29:47AM -0400, Bill Davidsen wrote:
> > So if you want people to try a new driver, I think it really has to have 
> > some benefits to the users, in terms of performance, reliability, or 
> > features. "Cleaner design" doesn't motivate, and it does raise the question 
> > of why the old driver wasn't just cleaned up. I've been doing software for 
> > decades, I appreciate why, but users in general just want to use their 
> > system. Which raises the question of why to delete drivers which work for 
> > many or even most users?
> 
> As I already explained, there is a long term advantage for all users if 
> there is only one driver in the kernel.

Not only that. You have to place the switch in its historical context.
Stephen, please correct me if I'm wrong, but sk98lin has been working
erratically for a very long time. It's not 100% the driver's fault, because
it has had to work around a lot of chip bugs. The fact that this driver
supports *all* chips in the family makes it harder to identify whether
problems are caused by the hardware or by the driver because it is
bloated with tons of if/else.

I've personally encountered random data corruption on the receive path
with PCI-E hardware with sk98lin, as well as random TX stops. Sometimes
it would require one terabyte of data, sometimes just a few hundred
megs. On other hardware (skge now), UDP would simply stop being sent
and some TCP traffic was necessary to restart UDP! One guy at Marvell
once asked me for more information, but it was not easy to provide
much more, given the randomness of the problems!

Stephen has done an excellent (and thankless) job at restarting from
scratch, and the idea to separate the two chips was a good one IMHO.
The problem is that he might have thought that most of the bugs were
in the driver, while most of them are in the hardware, and this requires
a lot of workarounds, which do not always work the same for everybody
(I remember having tried to disable flow control with sk98lin because
it helped with sky2).

In parallel, sk98lin has improved on the vendor's site. v8 exhibited
all the problems I explained above, but v10 has fixed a lot of them,
making the new sk98lin more reliable. Meanwhile, sky2 and skge have
gained wider acceptance and testing. The nastiest hardware bugs will
slowly surface, and a good number of driver bugs have been detected too
(and that's expected from any new driver).

It is possible that after 2 or 3 patches, a lot of the remaining
problems will suddenly vanish. But it's also possible that the driver
will still not work for 1% of people for 1 or 2 years because of some
obscure hardware combinations which trigger some obscure hardware bugs.

> Therefore all users should 
> switch away from obsolete drivers to the replacement drivers, and the 
> obsolete driver will be removed at some point in time. The only question 
> is how to do it.

Desktop users generally have no problem experimenting with multiple kernels
or drivers. They can report feedback too, but generally, they're not very
good at downloading alternative drivers and patching their kernel with those.

Server users cannot experiment for a long time. After 2 or 3 losses of
service, they *have* to provide a definitive solution. For some of them
when sky2 fails, it may very well be to switch over to sk98lin. Downloading
from the vendor's site and patching is not a problem for those users, but
it causes them the trouble of updating the kernel for security fixes, so
the old driver must be shipped with the kernel.

However, I remember something which might constitute a solution. In 2.4,
there's a small bug in the kbuild process on alpha. One question is always
asked during make oldconfig. Its saved value is ignored because of the way
it is computed. I don't know if we could do this with 2.6 kbuild. It would
then be nice to always set sk98lin to unset if it was set to "Y" or "M",
so that at each build, the user has to explicitly state he wants it. It's
annoying enough to give the other one a try once in a while, without causing
too much trouble to people who really have no other choice right now.

What we need with this driver is people being fed up with it, not them
being unable to use it as a last resort. Also, given that it has improved
over the last years (probably due to competition pressure from sky2/skge),
users will even less understand why there is such incentive to remove it.

Another trick for obsolete drivers would be to simply remove them from
the usual build system, but keep them available for explicit builds.
E.g.: make modules will not build them, but make obsolete-modules would.

> > Testing a new kernel is no longer a drop in a boot 
> > operation if modprobe.conf must be edited to get the network up, and the 
> > typical user isn't going to write that shell script to try one or the other 
> > driver.
> 
> The typical user will let his distribution handl

[PATCH] timekeeping: Prevent time going backwards on resume

2007-09-11 Thread Thomas Gleixner

Timekeeping resume adjusts xtime by adding the slept time in seconds and
resets the reference value of the clock source (clock->cycle_last).
clock->cycle_last is used to calculate the delta between the last xtime
update and the readout of the clock source in __get_nsec_offset(). xtime
plus the offset is the current time. The resume code ignores the delta
which had already elapsed between the last xtime update and the actual
time of suspend. If the suspend time is short, then we can see time
going backwards on resume.

Suspend:
offs_s = clock->read() - clock->cycle_last;
now = xtime + offs_s;
timekeeping_suspend_time = read_rtc();

Resume:
sleep_time = read_rtc() - timekeeping_suspend_time;
xtime.tv_sec += sleep_time;
clock->cycle_last = clock->read();
offs_r = clock->read() - clock->cycle_last;
now = xtime + offs_r;

If sleep_time == 0 and offs_r < offs_s, then time goes
backwards.

Fix this by storing the offset from the last xtime update and add it to
xtime during resume, when we reset clock->cycle_last:

sleep_time = read_rtc() - timekeeping_suspend_time;
xtime.tv_sec += sleep_time;
xtime += offs_s;  /* Fixup xtime offset at suspend time */
clock->cycle_last = clock->read();
offs_r = clock->read() - clock->cycle_last;
now = xtime + offs_r;
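
As a hedged illustration (not part of the patch), the pseudo-code above can be
replayed in a small user-space C model. It assumes one clock cycle equals one
nanosecond, and every toy_ name is hypothetical rather than a kernel symbol.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the timekeeper; one clock cycle == one nanosecond. */
struct toy_timekeeper {
    int64_t xtime_ns;        /* wall time at the last xtime update */
    uint64_t cycle_last;     /* clock source value at the last xtime update */
    int64_t suspend_offset;  /* offs_s, saved at suspend (used by the fix) */
};

/* now = xtime + (clock->read() - clock->cycle_last) */
static int64_t toy_now(const struct toy_timekeeper *tk, uint64_t cycles)
{
    return tk->xtime_ns + (int64_t)(cycles - tk->cycle_last);
}

/* Suspend: remember how far we were past the last xtime update. */
static void toy_suspend(struct toy_timekeeper *tk, uint64_t cycles)
{
    tk->suspend_offset = (int64_t)(cycles - tk->cycle_last);
}

/* Resume: sleep_sec is the RTC delta; apply_fix selects the patched path. */
static void toy_resume(struct toy_timekeeper *tk, uint64_t cycles,
                       int64_t sleep_sec, int apply_fix)
{
    assert(sleep_sec >= 0);
    tk->xtime_ns += sleep_sec * 1000000000LL;
    if (apply_fix)
        tk->xtime_ns += tk->suspend_offset;  /* xtime += offs_s */
    tk->cycle_last = cycles;                 /* re-base the last cycle value */
}

/* Returns (time read after resume) - (time read before suspend), in ns. */
static int64_t toy_suspend_cycle(int apply_fix)
{
    struct toy_timekeeper tk = { 1000, 100, 0 };
    int64_t before, after;

    before = toy_now(&tk, 600);          /* offs_s = 500 ns */
    toy_suspend(&tk, 600);
    toy_resume(&tk, 650, 0, apply_fix);  /* RTC saw 0 whole seconds */
    after = toy_now(&tk, 660);           /* offs_r = 10 ns */
    return after - before;
}
```

In this model toy_suspend_cycle(0) is negative (time steps back by offs_s minus
offs_r), while toy_suspend_cycle(1) is positive. The sub-second sleep itself is
still not accounted for, since the RTC only has one-second resolution.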

Thanks to Marcelo for tracking this down on the OLPC and providing the
necessary details to analyze the root cause.

Signed-off-by: Thomas Gleixner <[EMAIL PROTECTED]>

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -280,6 +280,8 @@ void __init timekeeping_init(void)
 static int timekeeping_suspended;
 /* time in seconds when suspend began */
 static unsigned long timekeeping_suspend_time;
+/* xtime offset when we went into suspend */
+static s64 timekeeping_suspend_offset;
 
 /**
  * timekeeping_resume - Resumes the generic timekeeping subsystem.
@@ -305,6 +307,8 @@ static int timekeeping_resume(struct sys_device *dev)
 		wall_to_monotonic.tv_sec -= sleep_length;
 		total_sleep_time += sleep_length;
 	}
+	/* Make sure that we have the correct xtime reference */
+	timespec_add_ns(&xtime, timekeeping_suspend_offset);
 	/* re-base the last cycle value */
 	clock->cycle_last = clocksource_read(clock);
 	clock->error = 0;
@@ -326,6 +330,8 @@ static int timekeeping_suspend(struct sys_device *dev, pm_message_t state)
 	unsigned long flags;
 
 	write_seqlock_irqsave(&xtime_lock, flags);
 	timekeeping_suspended = 1;
+	/* Get the current xtime offset */
+	timekeeping_suspend_offset = __get_nsec_offset();
 	timekeeping_suspend_time = read_persistent_clock();
 	write_sequnlock_irqrestore(&xtime_lock, flags);



[git patches] ocfs2 fixes

2007-09-11 Thread Mark Fasheh

This includes a small doc update, which I missed earlier. It doesn't change
any code. The other three patches are real fixes.
--Mark

Please pull from 'upstream-linus' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2.git upstream-linus

to receive the following updates:

 Documentation/filesystems/ocfs2.txt |   13 --
 fs/Kconfig                          |    3 -
 fs/ocfs2/alloc.c                    |    1 
 fs/ocfs2/aops.c                     |    4 +-
 fs/ocfs2/file.c                     |    1 
 fs/ocfs2/super.c                    |   69 +++-
 6 files changed, 50 insertions(+), 41 deletions(-)

Mark Fasheh (2):
  ocfs2: update docs for new features
  ocfs2: Fix calculation of i_blocks during truncate

Tiger Yang (1):
  ocfs2: fix mount option parsing

[EMAIL PROTECTED] (1):
  ocfs2: Fix a wrong cluster calculation.

diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt
index 8ccf0c1..ed55238 100644
--- a/Documentation/filesystems/ocfs2.txt
+++ b/Documentation/filesystems/ocfs2.txt
@@ -28,11 +28,7 @@ Manish Singh  <[EMAIL PROTECTED]>
 Caveats
 ===
 Features which OCFS2 does not support yet:
-	- sparse files
 	- extended attributes
-	- shared writable mmap
-	- loopback is supported, but data written will not
-	  be cluster coherent.
 	- quotas
 	- cluster aware flock
 	- cluster aware lockf
@@ -57,3 +53,12 @@ nointr   Do not allow signals to interrupt cluster
 atime_quantum=60(*)	OCFS2 will not update atime unless this number
 			of seconds has passed since the last update.
 			Set to zero to always update atime.
+data=ordered   (*) All data are forced directly out to the main file
+   system prior to its metadata being committed to the
+   journal.
+data=writeback Data ordering is not preserved, data may be written
+   into the main file system after its metadata has been
+   committed to the journal.
+preferred_slot=0(*)During mount, try to use this filesystem slot first. If
+   it is in use by another node, the first empty one found
+   will be chosen. Invalid values will be ignored.
diff --git a/fs/Kconfig b/fs/Kconfig
index 58a0650..f9eed6d 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -441,9 +441,6 @@ config OCFS2_FS
 
  Note: Features which OCFS2 does not support yet:
  - extended attributes
- - shared writeable mmap
- - loopback is supported, but data written will not
-   be cluster coherent.
  - quotas
  - cluster aware flock
  - Directory change notification (F_NOTIFY)
diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 4f51766..778a850 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -5602,6 +5602,7 @@ static int ocfs2_do_truncate(struct ocfs2_super *osb,
 						      clusters_to_del;
 	spin_unlock(&OCFS2_I(inode)->ip_lock);
 	le32_add_cpu(&fe->i_clusters, -clusters_to_del);
+	inode->i_blocks = ocfs2_inode_sector_count(inode);
 
 	status = ocfs2_trim_tree(inode, path, handle, tc,
 				 clusters_to_del, &delete_blk);
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 460d440..50cd8a2 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -855,6 +855,7 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp,
 				  struct ocfs2_super *osb, loff_t pos,
 				  unsigned len, struct buffer_head *di_bh)
 {
+	u32 cend;
 	struct ocfs2_write_ctxt *wc;
 
 	wc = kzalloc(sizeof(struct ocfs2_write_ctxt), GFP_NOFS);
@@ -862,7 +863,8 @@ static int ocfs2_alloc_write_ctxt(struct ocfs2_write_ctxt **wcp,
 		return -ENOMEM;
 
 	wc->w_cpos = pos >> osb->s_clustersize_bits;
-	wc->w_clen = ocfs2_clusters_for_bytes(osb->sb, len);
+	cend = (pos + len - 1) >> osb->s_clustersize_bits;
+	wc->w_clen = cend - wc->w_cpos + 1;
 	get_bh(di_bh);
 	wc->w_di_bh = di_bh;
 
diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 4ffa715..7e34e66 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -314,7 +314,6 @@ static int ocfs2_orphan_for_truncate(struct ocfs2_super *osb,
 	}
 
 	i_size_write(inode, new_i_size);
-	inode->i_blocks = ocfs2_align_bytes_to_sectors(new_i_size);
 	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
 
 	di = (struct ocfs2_dinode *) fe_bh->b_data;
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index f2fc9a7..c034b51 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -81,8 +81,15 @@ static struct dentry *ocfs2_debugfs_root = NULL;
 MODULE_AUTHOR("Oracle");
 MODULE_LICENSE("GPL");
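
A hedged aside on the fs/ocfs2/aops.c hunk earlier in this message ("Fix a
wrong cluster calculation"): the sketch below compares the two ways of
computing the cluster count for a write. It is a user-space model, not the
ocfs2 code. The helper names are hypothetical, and the old helper is assumed
to round the byte length up to whole clusters without looking at the start
offset.

```c
#include <assert.h>
#include <stdint.h>

/* Old approach (assumed): clusters needed for len bytes, start offset ignored. */
static uint32_t toy_clusters_for_bytes(unsigned int bits, uint64_t len)
{
    uint64_t csize = 1ULL << bits;

    return (uint32_t)((len + csize - 1) >> bits);
}

/* New approach from the hunk: last cluster touched - first cluster + 1. */
static uint32_t toy_clusters_spanned(unsigned int bits, uint64_t pos,
                                     uint64_t len)
{
    uint32_t cpos, cend;

    assert(len > 0);
    cpos = (uint32_t)(pos >> bits);
    cend = (uint32_t)((pos + len - 1) >> bits);
    return cend - cpos + 1;
}

/* A 200-byte write that straddles a 4096-byte cluster boundary. */
static void toy_cluster_demo(void)
{
    const unsigned int bits = 12;  /* 4096-byte clusters */

    /* Bytes 4000..4199 touch clusters 0 and 1. */
    assert(toy_clusters_for_bytes(bits, 200) == 1);      /* too few */
    assert(toy_clusters_spanned(bits, 4000, 200) == 2);  /* correct */

    /* For a cluster-aligned write both calculations agree. */
    assert(toy_clusters_for_bytes(bits, 4096) == 1);
    assert(toy_clusters_spanned(bits, 4096, 4096) == 1);
}
```

The two only differ when the write does not start on a cluster boundary, which
is the case the length-only calculation cannot see.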

Re: sk98lin for 2.6.23-rc1

2007-09-11 Thread James Corey


--- Stephen Hemminger <[EMAIL PROTECTED]> wrote:

> On Sun, 9 Sep 2007 13:13:26 +0200
> Adrian Bunk <[EMAIL PROTECTED]> wrote:
> 
> > On Sat, Sep 08, 2007 at 10:42:20PM -0400, Kyle Rose wrote:
> > > 
> > > > You are a regular reader of linux-kernel, and therefore the sk98lin 
> > > > removal can hardly be a surprise for you. If you prefer whining over 
> > > > helping to improve the kernel that's your choice...
> > > >   
> > > In my case the issue is simply one of practicality: I cannot go to the
> > > data center 5 times per day to reboot my colo box.  Therefore, I run
> > > sk98lin.  It's really that simple.
> > 
> > When did you report this bug the first time?
> > 
> > What we need is that people when testing a new kernel they plan to use 
> > test the new drivers *and report the bugs if they run into any*.
> > 
> > What could we have done so that you reported your bug without removing 
> > the sk98lin driver?
> > 
> > > Kyle
> > 
> > cu
> > Adrian
> 
> 
> There are several different problems in this thread:
> 1. The removal of old sk98lin driver caused some users to be forced to use
>    skge. These users have uncovered issues with the dual port fiber based
>    versions of the board.  
>    Short term: The sk98lin driver should be restored to previous state, 
>       and the PCI table should be used to limit the usage to only fiber
>       systems. If Adrian doesn't do it, I'll do it when I return from
>       Germany.
>    Long term: I have fiber based board (thanks ebay) on the way to resolve
>       skge bug.
> 
> 2. Sky2 driver has it's own fiber based problems. 
>    Solve these after skge fiber.
> 
> 3. Sky2 doesn't have as many workarounds for hardware problems as vendor
>    sk98lin driver.
> -


Hm, hope I didn't trigger a religious debate. When
you get to the point of working on the SKY2 driver
problem with DGE-550SX (Syskonnect SK-9S81) also
known as the "hw csum failure" issue, I'll be 
glad to test a patch or take debug data. Til then,
I'll stay out of the way.

-J


Re: time_after - what on earth???

2007-09-11 Thread Rene Herman


On 09/12/2007 12:15 AM, Adrian McMenamin wrote:

> On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote:
> > On 09/12/2007 12:05 AM, Adrian McMenamin wrote:
> >
> > > OK, why does this line occasionally return true:
> > >
> > >   if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
> > >
> > > while this one never does (no other changes made):
> > >
> > > if  ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when)))
> >
> > Is maple_dev->when an unsigned long?
>
> Yes. Does that make a difference?


If it had been a signed type, it could've wrapped to something you didn't 
expect, explaining the difference at least...

With an unsigned long, the only difference should be that time_after() deals 
with jiffies wrapping, which I assume is not an actual problem here. I'll 
retreat into the shades again... ;-(


Rene.
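
For readers of the archive, a hedged sketch of the one case where the two
tests from the question disagree for unsigned long operands: a counter wrap
between "now" and the deadline. This is not offered as a diagnosis of the
maple code. toy_time_after() is a simplified user-space stand-in for the
kernel macro, and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/*
 * Simplified stand-in for time_after(a, b): true when a is later than b,
 * judged by the signed difference so that a wrap between b and a is
 * tolerated. Hypothetical helper, not the kernel macro itself.
 */
static int toy_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* The plain comparison from the first if-statement in the question. */
static int toy_plain_after(unsigned long a, unsigned long b)
{
    return a > b;
}

/* Walk through one wrap: the deadline lies 100 ticks ahead, past the wrap. */
static void toy_wrap_demo(void)
{
    unsigned long jiffies = ULONG_MAX - 10;  /* counter just before wrapping */
    unsigned long when = jiffies + 100;      /* wraps around to 89 */

    /* Deadline not reached: the plain test fires early, time_after does not. */
    assert(toy_plain_after(jiffies, when) == 1);
    assert(toy_time_after(jiffies, when) == 0);

    /* One tick past the deadline both tests agree. */
    jiffies = when + 1;
    assert(toy_plain_after(jiffies, when) == 1);
    assert(toy_time_after(jiffies, when) == 1);
}
```

So when the deadline has been computed across a wrap, the plain comparison
reports "after" too early while the wrap-aware one does not. Whether that is
what happens in the driver depends on how maple_dev->when is set, which the
thread does not show.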


Re: time_after - what on earth???

2007-09-11 Thread Adrian McMenamin

On 11/09/2007, Rene Herman <[EMAIL PROTECTED]> wrote:
> On 09/12/2007 12:05 AM, Adrian McMenamin wrote:
>
> > OK, why does this line occasionally return true:
> >
> >   if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
> >
> > while this one never does (no other changes made):
> >
> > if  ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when)))
>
> Is maple_dev->when an unsigned long?
>
Yes. Does that make a difference?

Re: time_after - what on earth???

2007-09-11 Thread Rene Herman


On 09/12/2007 12:05 AM, Adrian McMenamin wrote:

> OK, why does this line occasionally return true:
>
> if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))
>
> while this one never does (no other changes made):
>
> if  ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when)))

Is maple_dev->when an unsigned long?

Rene.

[BUG:] forcedeth: MCP55 not allowing DHCP

2007-09-11 Thread Casey Dahlin

I have an Asus Striker Extreme motherboard with two built in MCP55 GigE 
interfaces. When I build with the original Fedora 7 release kernel ( 
ftp://ftp.belnet.be/linux/fedora/linux/releases/7/Fedora/i386/os/Fedora/kernel-2.6.21-1.3194.fc7.i686.rpm 
) everything works fine. However, when I boot with any updated kernels 
or any other kernel (have tried building from several points in the 
linus git tree between 2.6.20 and .23-rc3, and 2.6.21.2 in -stable) I 
cannot get an IP address via dhcp. There is no error in dmesg. The card 
shows a link and otherwise appears to be working, but it is as if the 
dhcp server has been removed from the network.


On a running system there is no indication that this is a kernel bug at 
all, however by varying only the kernel the bug appears and disappears. 
I've run all these tests repeatedly with no intervening updates of any 
other packages.


As I said I attempted to build 2.6.21.2 ( the point of divergence 
between the Fedora kernel in question and -stable ) and still the card 
did not work. I will next attempt to manually build the rpm for the 
release kernel. If this works I will try experimenting with the included 
patches to narrow it down, but at this point I'm at a complete loss.


-Casey Dahlin

Re: [PATCH/RFC] doc: about email clients for Linux kernel patches

2007-09-11 Thread Sami Farin

On Tue, Sep 11, 2007 at 14:38:13 -0400, Lee Revell wrote:
> You can also diff -Nru old.c new.c | xclip, select Preformat, then
> paste with the middle button.

mutt does not come with a text editor, so
I'd like to add a note about vim:

If using xclip, type the command
:set paste
before pressing the middle button or shift-insert, or use
:r filename

...if you want to include the patch inline.
(a)ttach works fine without "set paste".

-- 
Do what you love because life is too short for anything else.


time_after - what on earth???

2007-09-11 Thread Adrian McMenamin

OK, why does this line occasionally return true:

if ((maple_dev->interval > 0) && (jiffies >maple_dev->when))

while this one never does (no other changes made):

if  ((maple_dev->interval > 0) && (time_after(jiffies, maple_dev->when)))


Is this a gcc issue or what?

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Nick Piggin

On Wednesday 12 September 2007 07:48, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > But that's not my place to say, and I'm actually not arguing that high
> > order pagecache does not have uses (especially as a practical,
> > shorter-term solution which is unintrusive to filesystems).
> >
> > So no, I don't think I'm really going against the basics of what we
> > agreed in Cambridge. But it sounds like it's still being billed as
> > first-order support right off the bat here.
>
> Well its seems that we have different interpretations of what was agreed
> on. My understanding was that the large blocksize patchset was okay
> provided that I supply an acceptable mmap implementation and put a
> warning in.

Yes. I think we differ on our interpretations of "okay". In my interpretation,
it is not OK to use this patch as a way to solve VM or FS or IO scalability
issues, especially not while the alternative approaches that do _not_ have
these problems have not been adequately compared or argued against.

> > But even so, you can just hold an open fd in order to pin the dentry you
> > want. My attack would go like this: get the page size and allocation
> > group size for the machine, then get the number of dentries required to
> > fill a slab. Then read in that many dentries and pin one of them. Repeat
> > the process. Even if there is other activity on the system, it seems
> > possible that such a thing will cause some headaches after not too long a
> > time. Some sources of pinned memory are going to be better than others
> > for this of course, so yeah maybe pagetables will be a bit easier (I
> > don't know).
>
> Well even without slab targeted reclaim: Mel's antifrag will sort the
> dentries into separate blocks of memory and so isolate the issue.

So even after all this time you do not understand what the fundamental
problem is with anti-frag and yet you are happy to waste both our time
in endless flamewars telling me how wrong I am about it.

Forgive me if I'm starting to be rude, Christoph. This is really irritating.
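
For readers of the archive, the pinning scenario quoted earlier in this
message can be sketched as a toy user-space model. It is only an illustration
of the argument: the block and object counts are hypothetical, and nothing
here reflects the real slab allocator, reclaim, or the anti-fragmentation
patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLOCKS 64     /* hypothetical number of slab pages */
#define TOY_PER_BLOCK 32  /* hypothetical objects (dentries) per slab page */

static bool toy_used[TOY_BLOCKS][TOY_PER_BLOCK];

/* Fill every block with objects, then release all but one per block. */
static void toy_fill_and_pin(void)
{
    size_t b, o;

    for (b = 0; b < TOY_BLOCKS; b++) {
        for (o = 0; o < TOY_PER_BLOCK; o++)
            toy_used[b][o] = true;   /* "read in that many dentries" */
        for (o = 1; o < TOY_PER_BLOCK; o++)
            toy_used[b][o] = false;  /* keep only the pinned one */
    }
}

/* Number of objects still allocated. */
static size_t toy_objects_in_use(void)
{
    size_t b, o, n = 0;

    for (b = 0; b < TOY_BLOCKS; b++)
        for (o = 0; o < TOY_PER_BLOCK; o++)
            n += toy_used[b][o] ? 1 : 0;
    return n;
}

/* Number of blocks that could be handed back as a whole. */
static size_t toy_fully_free_blocks(void)
{
    size_t b, o, n = 0;

    for (b = 0; b < TOY_BLOCKS; b++) {
        bool any = false;

        for (o = 0; o < TOY_PER_BLOCK; o++)
            any = any || toy_used[b][o];
        n += any ? 0 : 1;
    }
    return n;
}

/* After pinning, 1/32 of the objects remain but no block is free. */
static void toy_fragmentation_demo(void)
{
    toy_fill_and_pin();
    assert(toy_objects_in_use() == TOY_BLOCKS);
    assert(toy_fully_free_blocks() == 0);
}
```

In the model only one object in thirty-two stays allocated, yet not a single
block can be freed as a unit. How closely a real system can be driven into
that state is exactly what the thread is disputing.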

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Mel Gorman

On (11/09/07 14:48), Christoph Lameter didst pronounce:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> 
> > But that's not my place to say, and I'm actually not arguing that high
> > order pagecache does not have uses (especially as a practical,
> > shorter-term solution which is unintrusive to filesystems).
> > 
> > So no, I don't think I'm really going against the basics of what we agreed
> > in Cambridge. But it sounds like it's still being billed as first-order
> > support right off the bat here.
> 
> Well its seems that we have different interpretations of what was agreed 
> on. My understanding was that the large blocksize patchset was okay 
> provided that I supply an acceptable mmap implementation and put a 
> warning in.
> 

Warnings == #2 citizen in my mind with known potential failure cases. That
was the point I thought.

> > But even so, you can just hold an open fd in order to pin the dentry you
> > want. My attack would go like this: get the page size and allocation group
> > size for the machine, then get the number of dentries required to fill a
> > slab. Then read in that many dentries and pin one of them. Repeat the
> > process. Even if there is other activity on the system, it seems possible
> > that such a thing will cause some headaches after not too long a time.
> > Some sources of pinned memory are going to be better than others for
> > this of course, so yeah maybe pagetables will be a bit easier (I don't 
> > know).
> 
> Well even without slab targeted reclaim: Mel's antifrag will sort the 
> dentries into separate blocks of memory and so isolate the issue.

-- 
Mel Gorman
Part-time Phd Student  Linux Technology Center
University of Limerick IBM Dublin Software Lab

Re: [PATCH] drivers/firmware: const-ify DMI API and internals

2007-09-11 Thread Bartlomiej Zolnierkiewicz

On Saturday 01 September 2007, Jeff Garzik wrote:
> 
> commit 457b6eb3bf3341d2e143518a0bb99ffbb8d754c4
> Author: Jeff Garzik <[EMAIL PROTECTED]>
> Date:   Sat Sep 1 10:16:45 2007 -0400
> 
> drivers/firmware: const-ify DMI API and internals
> 
> Three main sets of changes:
> 
> 1) dmi_get_system_info() return value should have been marked const,
>since callers should not be changing that data.
> 
> 2) const-ify DMI internals, since DMI firmware tables should,
>whenever possible, be marked const to ensure we never ever write to
>that data area.
> 
> 3) const-ify DMI API, to enable marking tables const where possible
>in low-level drivers.
> 
> And if we're really lucky, this might enable some additional
> optimizations on the part of the compiler.
> 
> The bulk of the changes are #2 and #3, which are interrelated.  #1 could
> have been a separate patch, but it was so small compared to the others,
> it was easier to roll it into this changeset.
> 
> Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]>

[ a bit late ]

Acked-by: Bartlomiej Zolnierkiewicz <[EMAIL PROTECTED]>

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Christoph Lameter

On Tue, 11 Sep 2007, Nick Piggin wrote:

> > No you have not explained why the theoretical issues continue to exist
> > given even just considering Lumpy Reclaim in .23 nor what effect the
> > antifrag patchset would have.
> 
> So how does lumpy reclaim, your slab patches, or anti-frag have
> much effect on the worst case situation? Or help much against a
> targetted fragmentation attack?

E.g. lumpy reclaim reclaims neighboring pages and thus works against 
fragmentation. So your formulae no longer work.

> > And you have used a 2M pagesize which is 
> > irrelevant to this patchset that deals with blocksizes up to 64k. In my
> > experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
> > safe.
> 
> I used EXACTLY the page sizes that you brought up in your patch
> description (ie. 64K and 2MB).

The patch currently only supports 64k. There is hope that it will support 
2M at some point and as mentioned also a special large page pool facility 
may be required.

Quoting from the post:

I would like to increase the supported blocksize to very large pages in 
the future so that device drives will be capable of providing large 
contiguous mapping. For that purpose I think that we need a mechanism to 
reserve pools of varying large sizes at boot time. Such a mechanism can 
also be used to compensate in situations where one wants to use larger 
buffers but defragmentation support is not (yet?) capable to reliably 
provide pages of the desired sizes.


Re: x86 merge - a little feedback

2007-09-11 Thread Linus Torvalds

On Tue, 11 Sep 2007, Andi Kleen wrote:
> 
> Will that cause people to compile test both? I have my doubts that 
> will really work.

If people don't compile-test both now, then why would they compile-test 
things when merged?

So no, that's not the point.

But at least things like "grep" will work sanely, and people will be 
*aware* that "Oh, this touches a file that may be used by the other 
word-size".

Right now, we have people changing "i386-only" files that turn out to be 
used by x86-64 too - through very subtle Makefile things that the person 
who only looks into the i386 Makefile will never even *see*.

THAT is the problem (well, at least part of it).

Linus

Re: [PATCH] leds: add #include to include/linux/leds.h for rwlock_t

2007-09-11 Thread Richard Purdie

On Tue, 2007-09-11 at 17:48 +0900, Yoichi Yuasa wrote:
> This patch has added #include  to include/linux/leds.h for 
> rwlock_t.
> 
> Signed-off-by: Yoichi Yuasa <[EMAIL PROTECTED]>

Added to the leds tree[1], thanks.

[1] http://git.o-hand.com/?p=linux-rpurdie-leds;a=shortlog;h=for-mm

Richard


Re: x86 merge - a little feedback

2007-09-11 Thread Adrian Bunk

On Tue, Sep 11, 2007 at 10:34:23PM +0100, Andi Kleen wrote:
> 
> >
> > People do not expect code under arch/i386/ to be used by code under
> > arch/x86_64/ and vice versa.
> >
> > That regularly results in people sending patches that don't compile on
> > the other architecture.
> >
> > With one architecture it's much more obvious that the code is shared.
> 
> Will that cause people to compile test both? I have my doubts that 
> will really work.
>...

You will see that it could be shared, and it'll be much easier to see 
all configurations it's used in.

Currently, there are 6 or 7 different ways how a function under 
arch/i386/ could be used by a function under arch/x86_64/ (and vice 
versa) and it's non-trivial to figure out all usages when grep'ing for 
users.

> -Andi

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Christoph Lameter

On Tue, 11 Sep 2007, Nick Piggin wrote:

> But that's not my place to say, and I'm actually not arguing that high
> order pagecache does not have uses (especially as a practical,
> shorter-term solution which is unintrusive to filesystems).
> 
> So no, I don't think I'm really going against the basics of what we agreed
> in Cambridge. But it sounds like it's still being billed as first-order
> support right off the bat here.

Well, it seems that we have different interpretations of what was agreed 
on. My understanding was that the large blocksize patchset was okay 
provided that I supply an acceptable mmap implementation and put a 
warning in.

> But even so, you can just hold an open fd in order to pin the dentry you
> want. My attack would go like this: get the page size and allocation group
> size for the machine, then get the number of dentries required to fill a
> slab. Then read in that many dentries and pin one of them. Repeat the
> process. Even if there is other activity on the system, it seems possible
> that such a thing will cause some headaches after not too long a time.
> Some sources of pinned memory are going to be better than others for
> this of course, so yeah maybe pagetables will be a bit easier (I don't know).

Well even without slab targeted reclaim: Mel's antifrag will sort the 
dentries into separate blocks of memory and so isolate the issue.

Re: clockevents: fix resume logic

2007-09-11 Thread Thomas Gleixner

On Tue, 2007-09-11 at 21:52 +0200, Thomas Gleixner wrote:
> > C1:  type[C1] promotion[C2] demotion[--] latency[001] usage[0010] duration[]
> >*C2:  type[C2] promotion[--] demotion[C1] latency[001] usage[8316] duration[000170717293]
> 
> Ok, here we are. The bad one uses C2 which stops the local apic on the
> VAIO. I suspect we end up in the suspend/resume with going into C2
> without the broadcast active.
> 
> Can you try to get the output of SysRq-Q during the "it needs help from
> keyboard" period ?

Summary of the oddities we are seeing:

1.) disabling local apic timer makes the problem go away
2.) disabling nohz and highres makes the problem go away
3.) adding the cpuidle patches from the acpi tree makes the problem go
away

The obvious conclusion is that in all other cases we run into a state
where the local apic timer is not working.

1) we do not use it
2) it is used in periodic mode
3) the cpu does not enter C2 (which turns the lapic timer off on the
VAIO)

While 1) and 3) are understandable, the reason why 2) is working is a
mystery because the periodic mode is affected by the C state as well.

Andrew, can you please provide the output of /proc/timer_list when you
boot the kernel with "nohz=off highres=off"? Honestly, though, I do not
expect a lot of enlightenment from it.

The Sysrq-Q output from the point where the box is stuck without
keystrokes might give us more information.

tglx


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Nick Piggin

On Wednesday 12 September 2007 07:41, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Nick Piggin wrote:
> > I think I would have as good a shot as any to write a fragmentation
> > exploit, yes. I think I've given you enough info to do the same, so I'd
> > like to hear a reason why it is not a problem.
>
> No you have not explained why the theoretical issues continue to exist
> given even just considering Lumpy Reclaim in .23 nor what effect the
> antifrag patchset would have.

So how does lumpy reclaim, your slab patches, or anti-frag have
much effect on the worst case situation? Or help much against a
targeted fragmentation attack?

> And you have used a 2M pagesize which is 
> irrelevant to this patchset that deals with blocksizes up to 64k. In my
> experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably
> safe.

I used EXACTLY the page sizes that you brought up in your patch
description (ie. 64K and 2MB).
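
The sizes being argued over map to allocation orders as sketched below. The sketch assumes a 4 KiB base page and reads "PAGE_COSTLY_ORDER" as the kernel's PAGE_ALLOC_COSTLY_ORDER (value 3); the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes covered by one allocation of 2^order contiguous pages. */
static size_t order_to_bytes(unsigned int order, size_t page_size)
{
    assert(order < 8 * sizeof(size_t));
    return page_size << order;
}

/* Smallest order whose allocation covers at least `bytes`. */
static unsigned int bytes_to_order(size_t bytes, size_t page_size)
{
    unsigned int order = 0;

    assert(page_size > 0);
    while ((page_size << order) < bytes)
        order++;
    return order;
}
```

Under those assumptions, order 3 is 32 KiB, a 64 KiB block needs order 4, and a 2 MiB block needs order 9, so both sizes named in the patch description sit above the 32k threshold mentioned in the quoted text.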

Re: [PATCH 1/4] build system: section garbage collection for vmlinux

2007-09-11 Thread Daniel Walker

On Tue, 2007-09-11 at 21:07 +0100, Denys Vlasenko wrote:
> This patch is needed for --gc-sections to work, regardless
> of which final form that support will have.
> 
> This patch renames .text.xxx and .data.xxx sections
> into .xxx.text and .xxx.data, respectively.

I think you'll have better luck with this if you focus on a single
architecture (i386 would be best) .. 
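
The rename described in the quoted patch summary can be illustrated with a small string helper. This is a sketch of the naming rule only, not of the patch itself; the function name is hypothetical. One likely motivation (an assumption, not stated in the thread) is that GCC's -ffunction-sections and -fdata-sections emit sections named .text.<symbol> and .data.<symbol>, which would be indistinguishable from hand-named .text.xxx sections.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Rewrite ".text.xxx" to ".xxx.text" and ".data.xxx" to ".xxx.data".
 * Any other name is copied unchanged.
 * Returns 1 if the name was rewritten, 0 otherwise.
 */
static int rename_section(const char *name, char *out, size_t out_len)
{
    static const char *const kinds[] = { "text", "data" };
    size_t i;

    assert(name != NULL && out != NULL && out_len > 0);
    for (i = 0; i < sizeof(kinds) / sizeof(kinds[0]); i++) {
        size_t klen = strlen(kinds[i]);

        /* Match ".<kind>." followed by a non-empty suffix. */
        if (name[0] == '.' &&
            strncmp(name + 1, kinds[i], klen) == 0 &&
            name[1 + klen] == '.' &&
            name[2 + klen] != '\0') {
            snprintf(out, out_len, ".%s.%s", name + 2 + klen, kinds[i]);
            return 1;
        }
    }
    snprintf(out, out_len, "%s", name);
    return 0;
}
```

For example, ".text.head" becomes ".head.text", while plain ".text" is left alone.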

Daniel


Re: [patch][Intel-IOMMU] Fix for IOMMU early crash

2007-09-11 Thread Keshavamurthy, Anil S

On Wed, Sep 12, 2007 at 05:48:52AM +1000, Paul Mackerras wrote:
> Keshavamurthy, Anil S writes:
> 
> > Subject: Fix IOMMU early crash
> > 
> > This patch avoids copying pci_bus's->sysdata to
> > pci_dev's->sysdata as one can easily obtain
> > the same through pci_dev->bus->sysdata.
> 
> At the moment this will cause ppc64 to crash, since we rely on
> pci_dev->sysdata pointing to some node in the firmware device tree,
> either the device's node or the node for a parent bus.
> 
> We could change the ppc64 code to use pci_dev->bus->sysdata in the
> case when pci_dev->sysdata is NULL, which would fix the problem.  I
> think that change should be incorporated as part of this patch so that
> we don't break git bisection.

Why do you want to check whether pci_dev->sysdata is NULL and then use
pci_dev->bus->sysdata, else pci_dev->sysdata? If you change this
to always use pci_dev->bus->sysdata, then you don't have to depend
on my patch and your patch can go in independently of mine.
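
The two lookups under discussion can be compared side by side. The structs below are simplified stand-ins for the kernel's struct pci_dev and struct pci_bus that model only the fields named in this thread, and both helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; only the fields discussed here are modelled. */
struct pci_bus {
    void *sysdata;
};

struct pci_dev {
    struct pci_bus *bus;
    void *sysdata;
};

/* Quoted suggestion: use the device's sysdata, fall back to the bus. */
static void *pci_sysdata_with_fallback(const struct pci_dev *dev)
{
    assert(dev != NULL && dev->bus != NULL);
    return dev->sysdata ? dev->sysdata : dev->bus->sysdata;
}

/* Alternative raised in the reply: always go through the bus. */
static void *pci_sysdata_from_bus(const struct pci_dev *dev)
{
    assert(dev != NULL && dev->bus != NULL);
    return dev->bus->sysdata;
}
```

The two helpers agree whenever the device's sysdata is NULL or equal to the bus's. They can differ when the device's sysdata points at the device's own node, which the quoted text says is one of the cases on ppc64.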

> 
> In other words I don't want to see this patch applied as it stands.

Is it possible to post your patch anytime soon? Or feel free to combine
mine with yours and post it with your signed-off-by.

Thanks,
Anil

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Christoph Lameter

On Tue, 11 Sep 2007, Nick Piggin wrote:

> I think I would have as good a shot as any to write a fragmentation
> exploit, yes. I think I've given you enough info to do the same, so I'd
> like to hear a reason why it is not a problem.

No you have not explained why the theoretical issues continue to exist 
given even just considering Lumpy Reclaim in .23 nor what effect the 
antifrag patchset would have. And you have used a 2M pagesize which is 
irrelevant to this patchset that deals with blocksizes up to 64k. In my 
experience the use of blocksize < PAGE_COSTLY_ORDER (32k) is reasonably 
safe.


Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-11 Thread Nick Piggin

On Wednesday 12 September 2007 06:53, Mel Gorman wrote:
> On (11/09/07 11:44), Nick Piggin didst pronounce:

> However, this discussion belongs more with the non-existent-remove-slab
> patch. Based on what we've seen since the summits, we need a thorough
> analysis with benchmarks before making a final decision (kernbench, ebizzy,
> tbench (netpipe if someone has the time/resources), hackbench and maybe
> sysbench as well as something the filesystem people recommend to get good
> coverage of the subsystems).

True. As an aside, it seems I might have been mistaken in saying Christoph
is proposing to use higher order allocations to fix the SLUB regression.
Anyway, I agree, let's not get sidetracked about this here.

> I'd rather not get side-tracked here. I regret you feel steam-rolled but I
> think grouping pages by mobility is the right thing to do for better usage
> of the TLB by the kernel and for improving hugepage support in userspace
> minimally. We never really did see eye-to-eye but this way, if I'm wrong
> you get to chuck eggs down the line.

No, it's a fair point, and even the hugepage allocations alone are a fair
point. From the discussions, it seems quite probably the right thing to
do pragmatically, which is what Linux is about, and I hope it will
result in a better kernel in the end. So I don't have complaints except
from my little ivory tower ;)

> > Sure. And some people run workloads where fragmentation is likely never
> > going to be a problem, they are shipping this poorly configured hardware
> > now or soon, so they don't have too much interest in doing it right at
> > this point, rather than doing it *now*. OK, that's a valid reason which
> > is why I don't use the argument that we should do it correctly or never
> > at all.
>
> So are we saying the right thing to do is go with fs-block from day 1 once
> we get it to optimistically use high-order pages? I think your concern
> might be that if this goes in then it'll be harder to justify fsblock in
> the future because it'll be solving a theoretical problem that takes months
> to trigger if at all. i.e. The filesystem people will push because
> apparently large block support as it is solves world peace. Is that
> accurate?

Heh. It's hard to say. I think fsblock could take a while to implement,
regardless of high order pages or not. I actually would like to be able
to pass down a mandate to say higher order pagecache will never
get merged, simply so that these talented people would work on
fsblock ;)

But that's not my place to say, and I'm actually not arguing that high
order pagecache does not have uses (especially as a practical,
shorter-term solution which is unintrusive to filesystems).

So no, I don't think I'm really going against the basics of what we agreed
in Cambridge. But it sounds like it's still being billed as first-order
support right off the bat here.

> > OTOH, I'm not sure how much buy-in there was from the filesystems guys.
> > Particularly Christoph H and XFS (which is strange because they already
> > do vmapping in places).
>
> I think they use vmapping because they have to, not because they want
> to. They might be a lot happier with fsblock if it used contiguous pages
> for large blocks whenever possible - I don't know for sure. The metadata
> accessors they might be unhappy with because it's inconvenient but as
> Christoph Hellwig pointed out at VM/FS, the filesystems who really care
> will convert.

Sure, they would rather not. But there are also a lot of ways you can
improve vmap more than what XFS does (or probably what Darwin does)
(more persistence for cached objects, and batched invalidates for example).
There are also a lot of trivial things you can do to make a lot of those
accesses not require vmaps (and less trivial things, but even such things
as binary searches over multiple pages should be quite possible with a bit
of logic).

> > It would be interesting to craft an attack. If you knew roughly the
> > layout and size of your dentry slab for example... maybe you could stat a
> > whole lot of files, then open one and keep it open (maybe post the fd to
> > a unix socket or something crazy!) when you think you have filled up a
> > couple of MB worth of them.
>
> I might regret saying this, but it would be easier to craft an attack
> using pagetable pages. It's woefully difficult to do but it's probably
> doable. I say pagetables because while slub targeted reclaim is on the
> cards and memory compaction exists for page cache pages, pagetables are
> currently pinned with no prototype patch existing to deal with them.

But even so, you can just hold an open fd in order to pin the dentry you
want. My attack would go like this: get the page size and allocation group
size for the machine, then get the number of dentries required to fill a
slab. Then read in that many dentries and pin one of them. Repeat the
process. Even if there is other activity on the system, it seems possible
that such a thing will
