Re: [PATCH 6/7] d_path: Make d_path() use a struct path

2007-11-01 Thread Bharata B Rao
On 10/29/07, Jan Blunck <[EMAIL PROTECTED]> wrote:
>
>

Did you miss the d_path() caller arch/blackfin/kernel/traps.c:printk_address() ?

Regards,
Bharata.
-- 
"Men come and go but mountains remain" -- Ruskin Bond.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 06:19:41PM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 1 Nov 2007, Rik van Riel wrote:
> > 
> > Larry Woodman managed to wedge the VM into a state where, on his
> > 4x dual core system, only 2 cores (on the same CPU) could get the
> > zone->lru_lock overnight.  The other 6 cores on the system were
> > just spinning, without being able to get the lock.

That's quite incredible, considering that the CPUs actually _taking_
the locks also drop the locks and do quite a bit of work before taking
them again (ie. they take them to pull pages off the LRU, but then
do a reasonable amount of work to remove each one from pagecache before
refilling from the LRU).

Possibly actually that is a *more* difficult case for the HW to handle:
once the CPU actually goes away and operates on other cachelines, it 
may get a little more difficult to detect that it is causing starvation
issues.


> .. and this is almost always the result of a locking *bug*, not unfairness 
> per se. IOW, unfairness just ends up showing the bug in the first place.

I'd almost agree, but there are always going to be corner cases where
we get multiple contentions on a spinlock -- the fact that a lock is
needed at all obviously suggests that it can be contended. The LRU locking
could be improved, but you could have eg. scheduler runqueue lock starvation
if the planets lined up just right, and it is a little more difficult to
improve on runqueue locking.

Anyway, I also think this is partially a hardware issue, and as multiple
cores, threads, and sockets get more common, I hope it will improve (it
affects Intel CPUs as well as AMD). So it is possible to have an option
to switch between locks if the hardware is fairer, but I want to get
as much exposure with this locking as possible for now, to see if there
are any funny performance corner cases exposed (which quite possibly will
turn out to be caused by suboptimal locking itself).

Anyway, if this can make its way to the x86 tree, I think it will get
pulled into -mm (?) and get some exposure...
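
For readers unfamiliar with the idea behind the patch title, a FIFO ticket
lock can be sketched in user space roughly as follows. This is only an
illustration using C11 atomics, not the x86 implementation from the patch;
the struct and function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical user-space ticket lock; names are illustrative only. */
struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently allowed to hold the lock */
};

static void ticket_lock_init(struct ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

/* Take a ticket, then spin until it is our turn: strict FIFO order. */
static unsigned int ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);

    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;  /* a real implementation would add a pause/cpu_relax hint */
    return me;
}

/* Pass the lock to the holder of the next ticket. */
static void ticket_lock_release(struct ticket_lock *l)
{
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

Because waiters are served in ticket order, a CPU that keeps re-taking the
lock cannot overtake CPUs that queued before it, which is the starvation
case discussed above.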



Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 06:17:42PM -0700, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Nov 2007, Nick Piggin wrote:
> > 
> > But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
> > uses this for inserting breakpoints (but fortunately, a COW page in a
> > MAP_PRIVATE mapping is a much more natural thing for the VM).
> 
> Yes, I phrased that badly. I meant that I'd be happier if we got rid of 
> VM_MAYSHARE entirely, and just used VM_SHARED. I thought we already made 
> them always be the same (and any VM_MAYSHARE use is historical).

Oh yeah, I think it would probably be clearer to use VM_SHARED == MAP_SHARED,
and test the write permission explicitly. Though there could be something
I missed that makes it not as easy as it sounds... probably something best
left for Hugh ;)



Re: [PATCH] ehea: add kexec support

2007-11-01 Thread Michael Ellerman
On Wed, 2007-10-31 at 20:48 +0100, Christoph Raisch wrote:
> Michael Ellerman <[EMAIL PROTECTED]> wrote on 30.10.2007 23:50:36:
> >
> > On Tue, 2007-10-30 at 09:39 +0100, Christoph Raisch wrote:
> > >
> > > Michael Ellerman <[EMAIL PROTECTED]> wrote on 28.10.2007 23:32:17:
> > > Hope I didn't miss anything here...
> >
> > Perhaps. When we kdump the kernel does not call the reboot notifiers, so
> > the code Jan-Bernd just added won't get called. So the eHEA resources
> > won't be freed. When the kdump kernel tries to load the eHEA driver what
> > will happen?
> >
> Good point.
> 
> If the device driver tries to allocate resources again (in the kdump
> kernel),
> which have been allocated before (in the crashed kernel) the hcalls will
> fail because from the hypervisor view the resources are still in use.
> Currently there's no method to find out the resource handles for these
> HEA resources allocated by the crashed kernel within the hypervisor...

So the hypervisor can't allocate more resources, because they're already
allocated, but it can't free the ones that are allocated because it
doesn't know what they are? I don't think I understand.

If that's really the way it works then eHEA is more or less broken for
kdump I'm afraid.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person




[PATCH 1/4] Blackfin I2C/TWI driver: Add repeat start feature to avoid break of a bundle of i2c master xfer operation.

2007-11-01 Thread Bryan Wu
From: Sonic Zhang <[EMAIL PROTECTED]>

 - Create a new mode TWI_I2C_MODE_REPEAT.
 - No change to smbus operation.

Signed-off-by: Sonic Zhang <[EMAIL PROTECTED]>
Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 drivers/i2c/busses/i2c-bfin-twi.c |  179 +++--
 1 files changed, 111 insertions(+), 68 deletions(-)

diff --git a/drivers/i2c/busses/i2c-bfin-twi.c b/drivers/i2c/busses/i2c-bfin-twi.c
index 67224a4..6535852 100644
--- a/drivers/i2c/busses/i2c-bfin-twi.c
+++ b/drivers/i2c/busses/i2c-bfin-twi.c
@@ -42,6 +42,7 @@
 #define TWI_I2C_MODE_STANDARD  0x01
 #define TWI_I2C_MODE_STANDARDSUB   0x02
 #define TWI_I2C_MODE_COMBINED  0x04
+#define TWI_I2C_MODE_REPEAT    0x08
 
 struct bfin_twi_iface {
int irq;
@@ -58,6 +59,9 @@ struct bfin_twi_iface {
struct timer_list   timeout_timer;
struct i2c_adapter  adap;
struct completion   complete;
+   struct i2c_msg  *pmsg;
+   int msg_num;
+   int cur_msg;
 };
 
 static struct bfin_twi_iface twi_iface;
@@ -76,12 +80,16 @@ static void bfin_twi_handle_interrupt(struct bfin_twi_iface *iface)
/* start receive immediately after complete sending in
 * combine mode.
 */
-   else if (iface->cur_mode == TWI_I2C_MODE_COMBINED) {
+   else if (iface->cur_mode == TWI_I2C_MODE_COMBINED)
bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
| MDIR | RSTART);
-   } else if (iface->manual_stop)
+   else if (iface->manual_stop)
bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
| STOP);
+   else if (iface->cur_mode == TWI_I2C_MODE_REPEAT &&
+   iface->cur_msg+1 < iface->msg_num)
+   bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
+   | RSTART);
SSYNC();
/* Clear status */
bfin_write_TWI_INT_STAT(XMTSERV);
@@ -108,6 +116,11 @@ static void bfin_twi_handle_interrupt(struct bfin_twi_iface *iface)
bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
| STOP);
SSYNC();
+   } else if (iface->cur_mode == TWI_I2C_MODE_REPEAT &&
+   iface->cur_msg+1 < iface->msg_num) {
+   bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
+   | RSTART);
+   SSYNC();
}
/* Clear interrupt source */
bfin_write_TWI_INT_STAT(RCVSERV);
@@ -119,7 +132,7 @@ static void bfin_twi_handle_interrupt(struct bfin_twi_iface *iface)
bfin_write_TWI_MASTER_STAT(0x3e);
bfin_write_TWI_MASTER_CTL(0);
SSYNC();
-   iface->result = -1;
+   iface->result = -EIO;
/* if both err and complete int stats are set, return proper
 * results.
 */
@@ -170,6 +183,42 @@ static void bfin_twi_handle_interrupt(struct bfin_twi_iface *iface)
bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL() |
MEN | MDIR);
SSYNC();
+   } else if (iface->cur_mode == TWI_I2C_MODE_REPEAT &&
+   iface->cur_msg+1 < iface->msg_num) {
+   iface->cur_msg++;
+   iface->transPtr = iface->pmsg[iface->cur_msg].buf;
+   iface->writeNum = iface->readNum =
+   iface->pmsg[iface->cur_msg].len;
+   /* Set Transmit device address */
+   bfin_write_TWI_MASTER_ADDR(
+   iface->pmsg[iface->cur_msg].addr);
+   if (iface->pmsg[iface->cur_msg].flags & I2C_M_RD)
+   iface->read_write = I2C_SMBUS_READ;
+   else {
+   iface->read_write = I2C_SMBUS_WRITE;
+   /* Transmit first data */
+   if (iface->writeNum > 0) {
+   bfin_write_TWI_XMT_DATA8(
+   *(iface->transPtr++));
+   iface->writeNum--;
+   SSYNC();
+   }
+   }
+
+   if (iface->pmsg[iface->cur_msg].len <= 255)
+   bfin_write_TWI_MASTER_CTL(
+   iface->pmsg[iface->cur_msg].len << 6);
+   else if (iface->pmsg[iface->cur_msg].len > 255) {
+

[PATCH 2/4] Blackfin I2C/TWI driver: Add platform_resource interface to support multi-port TWI controllers

2007-11-01 Thread Bryan Wu
 - Dynamically allocate the TWI driver data according to board information
 - TWI register read/write accessor based on dynamic regs_base
 - Support TWI0/TWI1 for BF54x

Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 drivers/i2c/busses/i2c-bfin-twi.c |  269 +++--
 1 files changed, 166 insertions(+), 103 deletions(-)

diff --git a/drivers/i2c/busses/i2c-bfin-twi.c b/drivers/i2c/busses/i2c-bfin-twi.c
index 6535852..e495454 100644
--- a/drivers/i2c/busses/i2c-bfin-twi.c
+++ b/drivers/i2c/busses/i2c-bfin-twi.c
@@ -62,43 +62,66 @@ struct bfin_twi_iface {
struct i2c_msg  *pmsg;
int msg_num;
int cur_msg;
+   void __iomem    *regs_base;
 };
 
-static struct bfin_twi_iface twi_iface;
+
+#define DEFINE_TWI_REG(reg, off) \
+static inline u16 read_##reg(struct bfin_twi_iface *iface) \
+   { return bfin_read16(iface->regs_base + off); } \
+static inline void write_##reg(struct bfin_twi_iface *iface, u16 v) \
+   {bfin_write16(iface->regs_base + off, v); }
+
+DEFINE_TWI_REG(CLKDIV, 0x00)
+DEFINE_TWI_REG(CONTROL, 0x04)
+DEFINE_TWI_REG(SLAVE_CTL, 0x08)
+DEFINE_TWI_REG(SLAVE_STAT, 0x0C)
+DEFINE_TWI_REG(SLAVE_ADDR, 0x10)
+DEFINE_TWI_REG(MASTER_CTL, 0x14)
+DEFINE_TWI_REG(MASTER_STAT, 0x18)
+DEFINE_TWI_REG(MASTER_ADDR, 0x1C)
+DEFINE_TWI_REG(INT_STAT, 0x20)
+DEFINE_TWI_REG(INT_MASK, 0x24)
+DEFINE_TWI_REG(FIFO_CTL, 0x28)
+DEFINE_TWI_REG(FIFO_STAT, 0x2C)
+DEFINE_TWI_REG(XMT_DATA8, 0x80)
+DEFINE_TWI_REG(XMT_DATA16, 0x84)
+DEFINE_TWI_REG(RCV_DATA8, 0x88)
+DEFINE_TWI_REG(RCV_DATA16, 0x8C)
 
 static void bfin_twi_handle_interrupt(struct bfin_twi_iface *iface)
 {
-   unsigned short twi_int_status = bfin_read_TWI_INT_STAT();
-   unsigned short mast_stat = bfin_read_TWI_MASTER_STAT();
+   unsigned short twi_int_status = read_INT_STAT(iface);
+   unsigned short mast_stat = read_MASTER_STAT(iface);
 
if (twi_int_status & XMTSERV) {
/* Transmit next data */
if (iface->writeNum > 0) {
-   bfin_write_TWI_XMT_DATA8(*(iface->transPtr++));
+   write_XMT_DATA8(iface, *(iface->transPtr++));
iface->writeNum--;
}
/* start receive immediately after complete sending in
 * combine mode.
 */
else if (iface->cur_mode == TWI_I2C_MODE_COMBINED)
-   bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
-   | MDIR | RSTART);
+   write_MASTER_CTL(iface,
+   read_MASTER_CTL(iface) | MDIR | RSTART);
else if (iface->manual_stop)
-   bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
-   | STOP);
+   write_MASTER_CTL(iface,
+   read_MASTER_CTL(iface) | STOP);
else if (iface->cur_mode == TWI_I2C_MODE_REPEAT &&
iface->cur_msg+1 < iface->msg_num)
-   bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
-   | RSTART);
+   write_MASTER_CTL(iface,
+   read_MASTER_CTL(iface) | RSTART);
SSYNC();
/* Clear status */
-   bfin_write_TWI_INT_STAT(XMTSERV);
+   write_INT_STAT(iface, XMTSERV);
SSYNC();
}
if (twi_int_status & RCVSERV) {
if (iface->readNum > 0) {
/* Receive next data */
-   *(iface->transPtr) = bfin_read_TWI_RCV_DATA8();
+   *(iface->transPtr) = read_RCV_DATA8(iface);
if (iface->cur_mode == TWI_I2C_MODE_COMBINED) {
/* Change combine mode into sub mode after
 * read first data.
@@ -113,33 +136,33 @@ static void bfin_twi_handle_interrupt(struct bfin_twi_iface *iface)
iface->transPtr++;
iface->readNum--;
} else if (iface->manual_stop) {
-   bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
-   | STOP);
+   write_MASTER_CTL(iface,
+   read_MASTER_CTL(iface) | STOP);
SSYNC();
} else if (iface->cur_mode == TWI_I2C_MODE_REPEAT &&
iface->cur_msg+1 < iface->msg_num) {
-   bfin_write_TWI_MASTER_CTL(bfin_read_TWI_MASTER_CTL()
-   | RSTART);
+   write_MASTER_CTL(iface,
+   read_MASTER_CTL(iface) | RSTART);
SSYNC();
}
/* Clear interrupt source */
-   

[PATCH 4/4] Blackfin I2C/TWI driver: add driver descriptions, versions and some module useful information

2007-11-01 Thread Bryan Wu
Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 drivers/i2c/busses/i2c-bfin-twi.c |   19 ---
 1 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/drivers/i2c/busses/i2c-bfin-twi.c b/drivers/i2c/busses/i2c-bfin-twi.c
index 727625b..ae41c0b 100644
--- a/drivers/i2c/busses/i2c-bfin-twi.c
+++ b/drivers/i2c/busses/i2c-bfin-twi.c
@@ -37,6 +37,15 @@
 #include 
 #include 
 
+#define DRV_NAME   "i2c-bfin-twi"
+#define DRV_AUTHOR "Sonic Zhang, Bryan Wu"
+#define DRV_DESC   "Blackfin BF5xx on-chip I2C TWI Controller Driver"
+#define DRV_VERSION    "1.8"
+
+MODULE_AUTHOR(DRV_AUTHOR);
+MODULE_DESCRIPTION(DRV_DESC);
+MODULE_LICENSE("GPL");
+
 #define POLL_TIMEOUT   (2 * HZ)
 
 /* SMBus mode*/
@@ -701,8 +710,8 @@ static int i2c_bfin_twi_probe(struct platform_device *pdev)
else
platform_set_drvdata(pdev, iface);
 
-   dev_info(&pdev->dev, "Blackfin I2C TWI controller, [EMAIL PROTECTED]",
-   iface->regs_base);
+   dev_info(&pdev->dev, "%s, Version %s, [EMAIL PROTECTED]",
+   DRV_DESC, DRV_VERSION, iface->regs_base);
 
return rc;
 
@@ -735,7 +744,7 @@ static struct platform_driver i2c_bfin_twi_driver = {
.suspend= i2c_bfin_twi_suspend,
.resume = i2c_bfin_twi_resume,
.driver = {
-   .name   = "i2c-bfin-twi",
+   .name   = DRV_NAME,
.owner  = THIS_MODULE,
},
 };
@@ -750,9 +759,5 @@ static void __exit i2c_bfin_twi_exit(void)
platform_driver_unregister(&i2c_bfin_twi_driver);
 }
 
-MODULE_AUTHOR("Sonic Zhang <[EMAIL PROTECTED]>");
-MODULE_DESCRIPTION("I2C-Bus adapter routines for Blackfin TWI");
-MODULE_LICENSE("GPL");
-
 module_init(i2c_bfin_twi_init);
 module_exit(i2c_bfin_twi_exit);
-- 
1.5.3.4


[PATCH 3/4] Blackfin I2C/TWI driver: add missing pin mux operation

2007-11-01 Thread Bryan Wu
The Blackfin TWI controller's hardware pins should be requested from the GPIO
port controller. Before BF54x there was no need to do this, but now that BF54x
and BF52x are supported by this generic driver, the missing pin mux operation
should be added.

Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 drivers/i2c/busses/i2c-bfin-twi.c |   24 
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/drivers/i2c/busses/i2c-bfin-twi.c b/drivers/i2c/busses/i2c-bfin-twi.c
index e495454..727625b 100644
--- a/drivers/i2c/busses/i2c-bfin-twi.c
+++ b/drivers/i2c/busses/i2c-bfin-twi.c
@@ -34,6 +34,7 @@
 #include 
 
 #include 
+#include 
 #include 
 
 #define POLL_TIMEOUT   (2 * HZ)
@@ -63,6 +64,7 @@ struct bfin_twi_iface {
int msg_num;
int cur_msg;
void __iomem    *regs_base;
+   int bus_num;
 };
 
 
@@ -89,6 +91,24 @@ DEFINE_TWI_REG(XMT_DATA16, 0x84)
 DEFINE_TWI_REG(RCV_DATA8, 0x88)
 DEFINE_TWI_REG(RCV_DATA16, 0x8C)
 
+static int setup_pin_mux(int action, struct bfin_twi_iface *iface)
+{
+
+   u16 pin_req[2][3] = {
+   {P_TWI0_SCL, P_TWI0_SDA, 0},
+   {P_TWI1_SCL, P_TWI1_SDA, 0},
+   };
+
+   if (action) {
+   if (peripheral_request_list(pin_req[iface->bus_num], DRV_NAME))
+   return -EFAULT;
+   } else {
+   peripheral_free_list(pin_req[iface->bus_num]);
+   }
+
+   return 0;
+}
+
 static void bfin_twi_handle_interrupt(struct bfin_twi_iface *iface)
 {
unsigned short twi_int_status = read_INT_STAT(iface);
@@ -640,6 +660,7 @@ static int i2c_bfin_twi_probe(struct platform_device *pdev)
goto out_error_no_irq;
}
 
+   iface->bus_num = pdev->id;
init_timer(&(iface->timeout_timer));
iface->timeout_timer.function = bfin_twi_timeout;
iface->timeout_timer.data = (unsigned long)iface;
@@ -652,6 +673,8 @@ static int i2c_bfin_twi_probe(struct platform_device *pdev)
p_adap->class = I2C_CLASS_ALL;
p_adap->dev.parent = &pdev->dev;
 
+   setup_pin_mux(1, iface);
+
rc = request_irq(iface->irq, bfin_twi_interrupt_entry,
IRQF_DISABLED, pdev->name, iface);
if (rc) {
@@ -701,6 +724,7 @@ static int i2c_bfin_twi_remove(struct platform_device *pdev)
 
i2c_del_adapter(&(iface->adap));
free_irq(iface->irq, iface);
+   setup_pin_mux(0, iface);
 
return 0;
 }
-- 
1.5.3.4


[PATCH 0/4] Blackfin I2C/TWI driver updates and bug fixing according to Jean's review

2007-11-01 Thread Bryan Wu
This series still uses the old I2C driver interface. I plan to move to the
new-style I2C API soon.
I also intend to use the new TWI register accessor functions, as Jean suggested,
and then move the static pin_req setting into our Blackfin board files.

The current version is OK, though, and has been tested on a Blackfin board
with I2C devices.
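
For reference, the per-port register accessor pattern from patch 2/4 can be
modeled in user space roughly as below. This is a sketch under stated
assumptions: struct fake_iface, fake_read16(), fake_write16() and
DEFINE_FAKE_REG are hypothetical stand-ins for the driver's struct
bfin_twi_iface, bfin_read16(), bfin_write16() and DEFINE_TWI_REG, and a
plain byte array stands in for the memory-mapped register block.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the driver state: one base pointer per port. */
struct fake_iface {
    uint8_t *regs_base;
};

/* Stand-ins for 16-bit MMIO accessors, operating on ordinary memory. */
static uint16_t fake_read16(const uint8_t *addr)
{
    uint16_t v;

    memcpy(&v, addr, sizeof(v));
    return v;
}

static void fake_write16(uint8_t *addr, uint16_t v)
{
    memcpy(addr, &v, sizeof(v));
}

/* Token pasting generates a read_<reg>/write_<reg> pair per register. */
#define DEFINE_FAKE_REG(reg, off) \
static inline uint16_t read_##reg(struct fake_iface *iface) \
    { return fake_read16(iface->regs_base + (off)); } \
static inline void write_##reg(struct fake_iface *iface, uint16_t v) \
    { fake_write16(iface->regs_base + (off), v); }

DEFINE_FAKE_REG(MASTER_CTL, 0x14)
DEFINE_FAKE_REG(INT_STAT, 0x20)
```

Since every accessor takes the interface as an argument, two ports with
different regs_base values can share the same driver code, which is what
makes multi-port support (TWI0/TWI1) possible.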



kernel 2.6.24-rc1-git10 crash report

2007-11-01 Thread w.landgraf
A detailed crash report is in your mailbox
[EMAIL PROTECTED] .  It kept being returned
because your mailer considered it spam.
w.landgraf www.copaya.yi.org

> > On 1/Nov/2007 21:26 werner wrote ..
> > > On 1/Nov/2007 15:57 werner wrote ..
> > > > Kernel Crash -- Details see below
> > > > globc 2.7  glib2 2.14.2
> > > > W.Landgraf
> > > > www.copaya.yi.org
> > > >
> > > > ========================================================================
> > > > 2.6.24-rc1-git10
> > > > EIP 0600:  EFLAGS 00010212 CPU 0
> > > > EIUP is at xor_sse_2+0x34/0x200
> > > > EAX: 10 EBX fffedb22 ECX c183f000 EDX c183c000 ESS 8005003b EDI c0929614 EBP c183f000
> > > > ESP c1823ef0
> > > > DS 7b ES 7b FS d8 GS 0 SS 68
> > > > Process swapper  pid 1  ti: c182200  task c182 task.ti c=1822000
> > > > Stack:  8x 08x 0   fffedb22  0  c04067b3  10 c0849b62  c1030780  c183f000 c183c000
> > > > call trace
> > > > c0 4067b3 do_xor_speed+0x53/0xd0
> > > >9a9582 calibrate_xor_blocks 0xe2/0x100 (or 1a0 ?)
> > > >   191594  register_filesystem =0X44/0X70
> > > >   991565 kernel_init+0x125/0x2f0
> > > >10420a  ret_from_fork +0x6/0x1c  (or 0xb ...)
> > > >   991440 kernel_init+0x0/0x2f0
> > > >" again
> > > >c0104edf  kernel_thread_helper+0x7/0x18
> > > > code  08 89 74 24 44 0f 20 cf 0f 06 (or 0b) 0f 11 04 24 0f 11 4c 34 10 0f 11 54 24 20 0f 11 5c 24 30 0f 18 82 00
> > > > 01 00 00 0f 18 82 20 01 00 00 <00> 20x 0
> > > > EIP c0407284 xor_sse_2+0x34/0x200 SS ESP 068: c1823ef0
> > > > kernel panic
> > > >
> 
> THE REST OF THIS MESSAGE, SEE IN YOUR MAILBOX 
[EMAIL PROTECTED]  BECAUSE
> OF YOUR STUPID SPAM CONFIGURATION 


[GIT PULL] sh64 updates for 2.6.24-rc2

2007-11-01 Thread Paul Mundt
Please pull from:

master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh64-2.6.git

Which contains:

Adrian Bunk (1):
  sh64: fix dma_cache_sync() compilation

Paul Mundt (1):
  sh64: Update defconfigs.

Robert P. J. Day (1):
  sh64: Move DMA macros from pci.h to scatterlist.h.

 arch/sh64/configs/cayman_defconfig |  140 +---
 arch/sh64/configs/harp_defconfig   |  105 ---
 arch/sh64/configs/sim_defconfig|   68 --
 include/asm-sh64/dma-mapping.h |5 +-
 include/asm-sh64/pci.h |9 ---
 include/asm-sh64/scatterlist.h |9 +++
 6 files changed, 154 insertions(+), 182 deletions(-)


[GIT PULL] sh updates for 2.6.24-rc2

2007-11-01 Thread Paul Mundt
Please pull from:

master.kernel.org:/pub/scm/linux/kernel/git/lethal/sh-2.6.git

Which contains:

Adrian McMenamin (2):
  sh: Clean up Kconfig entry for Dreamcast.
  maple: Fix maple bus compiler warning

Alejandro Martinez Ruiz (1):
  sh: ARRAY_SIZE() cleanup

Kaz Kojima (1):
  sh: Terminate .eh_frame in VDSO with a 4-byte 0.

Magnus Damm (1):
  sh: add support for ax88796 and 93cx6 to highlander boards

Manuel Lauss (1):
  sh: fix zImage build with >=binutils-2.18

Paul Mundt (15):
  sh: Correct pte_page() breakage.
  sh: Fix up early mem cmdline parsing.
  sh: Kill off legacy embedded ramdisk section.
  sh: Use generic SMP_CACHE_BYTES/L1_CACHE_ALIGN.
  sh: Move zero page param defs somewhere sensible.
  sh: linker script tidying.
  sh: Provide a __read_mostly section wrapper.
  sh: Make SH7750 oprofile compile again.
  sh: Kill off dead ipr_irq_demux().
  sh: Clean up SR.RB Kconfig mess.
  sh: Decouple 4k and soft/hardirq stacks.
  sh: Correct SUBARCH matching.
  sh: Fix up r7780rp highlander CF access size.
  sh: mach-type updates.
  sh: Update r7785rp defconfig.

Stuart Menefy (1):
  sh: Fix optimized __copy_user() movca.l usage.

Yoshihiro Shimoda (2):
  sh: Add resource of USBF for SH7722.
  sh: Enable USBF on MS7722SE.

 Makefile   |3 +-
 arch/sh/Kconfig|8 +-
 arch/sh/Kconfig.debug  |8 +
 arch/sh/Makefile   |2 +-
 arch/sh/boards/renesas/r7780rp/setup.c |   71 
 arch/sh/boards/se/7722/setup.c |4 +-
 arch/sh/configs/r7785rp_defconfig  |  299 +++-
 arch/sh/drivers/pci/pci-st40.c |4 +-
 arch/sh/kernel/cpu/irq/ipr.c   |9 -
 arch/sh/kernel/cpu/sh4a/setup-sh7722.c |   27 +++
 arch/sh/kernel/irq.c   |8 +-
 arch/sh/kernel/setup.c |   46 ++
 arch/sh/kernel/vmlinux.lds.S   |  201 +++---
 arch/sh/kernel/vsyscall/vsyscall.lds.S |5 +-
 arch/sh/mm/copy_page.S |4 +
 arch/sh/oprofile/op_model_sh7750.c |   22 +--
 arch/sh/tools/mach-types   |   29 +++-
 drivers/sh/maple/maple.c   |3 +-
 include/asm-sh/cache.h |3 +-
 include/asm-sh/irq.h   |2 +-
 include/asm-sh/page.h  |1 -
 include/asm-sh/pgtable.h   |2 +-
 include/asm-sh/processor.h |2 +-
 include/asm-sh/setup.h |   14 ++
 24 files changed, 390 insertions(+), 387 deletions(-)


Re: Differences in bitops argument types

2007-11-01 Thread Paul Mackerras
Jan Kara writes:

>   I've just found out that operations like constant_test_bit() take pointer
> of different types on different architectures. In particular, x86_64,
> blackfin and frv take void * while i386, s390 and m68k take unsigned long
> *. Is this intended difference? Wouldn't using void * everywhere be more
> appropriate? Thanks for answer in advance.

A bitmap is defined to be an array of unsigned longs.  Using an array
of a smaller or longer type will give different results on big-endian
architectures.  Therefore using unsigned long * is better, because it
finds errors where callers are using an array of some other type.

Paul.
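
The endianness point above can be illustrated with a small user-space
sketch. The helper names (set_bit_ul, test_bit_ul, byte_holding_bit0) are
hypothetical and only mimic the shape of the kernel bitops; they are not
the kernel's implementations.

```c
#include <limits.h>
#include <string.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Set bit nr in a bitmap defined as an array of unsigned long. */
static void set_bit_ul(unsigned int nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

/* Test bit nr in a bitmap defined as an array of unsigned long. */
static int test_bit_ul(unsigned int nr, const unsigned long *map)
{
    return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/*
 * Report which byte of an unsigned long holds bit 0 in memory:
 * index 0 on little-endian machines, the last byte on big-endian ones.
 */
static size_t byte_holding_bit0(void)
{
    unsigned long word = 1UL;
    unsigned char bytes[sizeof(word)];
    size_t i;

    memcpy(bytes, &word, sizeof(word));
    for (i = 0; i < sizeof(word); i++)
        if (bytes[i])
            return i;
    return sizeof(word);  /* not reached */
}
```

On a big-endian machine bit 0 lives in the last byte of the word, so code
that treats the same storage as an array of bytes or ints will look for the
bit in the wrong place. Taking unsigned long * lets the compiler flag such
callers.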


Re: [PATCH] libata ATAPI transfer size cleanups

2007-11-01 Thread Jeff Garzik

Torsten Kaiser wrote:
> On 11/1/07, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> > +   lo = tf->lbam;
> > +   hi = tf->lbam;
> > +   ibyte = (hi << 8) | lo;
> > +
> > +   lo = result_tf->lbam;
> > +   hi = result_tf->lbam;
>
> That doesn't look right.
> I suspect this was intended:
>
> lo = tf->lbam;
> hi = tf->lbah;

Agreed, will correct.

Thanks,

Jeff
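
As a standalone illustration of the correction agreed on above, the 16-bit
count is assembled from two different byte fields, low byte from lbam and
high byte from lbah. The struct tf_bytes type and tf_byte_count() helper
below are hypothetical stand-ins, not libata code.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two taskfile fields discussed above. */
struct tf_bytes {
    uint8_t lbam;  /* low byte of the count */
    uint8_t lbah;  /* high byte of the count */
};

/* Combine the two bytes; reading lbam twice would drop the high byte. */
static unsigned int tf_byte_count(const struct tf_bytes *tf)
{
    unsigned int lo = tf->lbam;
    unsigned int hi = tf->lbah;

    return (hi << 8) | lo;
}
```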





NPTL support

2007-11-01 Thread veerasena reddy
Hi,

I am trying to build a toolchain for a MIPS processor using buildroot.
I am using gcc 3.4.3, binutils-2.15, uclibc-0.9.28 and the
linux-2.6.18.8 kernel.

Basically, I need to enable NPTL support in my toolchain.
Does uclibc-0.9.28 support NPTL?
If not, how can I enable it for the build configuration above?

I see there is a separate "uclibc-nptl" branch in uclibc.
Do I need to use this branch (uclibc-nptl) to meet my requirement?

Could you please suggest the right approach to successfully enable NPTL?

Thanks in advance.

Regards,
Veerasena.




Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread David Miller
From: David Miller <[EMAIL PROTECTED]>
Date: Wed, 31 Oct 2007 00:44:25 -0700 (PDT)

> From: Nick Piggin <[EMAIL PROTECTED]>
> Date: Wed, 31 Oct 2007 08:41:06 +0100
> 
> > You could possibly even do a generic "best effort" kind of thing with
> > regular IPIs, that will timeout and continue if some CPUs don't handle
> > them, and should be pretty easy to get working with existing smp_call_
> > function stuff. Not exactly clean, but it would be better than nothing.
> 
> Without a doubt.

Putting my code where my mouth is, here is an example implementation
of a special SysRQ "g" "dump regs globally" debugging tool for
sparc64.

The only thing that has to happen is the SysRQ trigger.  So if you can
either SysRQ-'g' at the console or "echo 'g' >/proc/sysrq-trigger" you
can get the registers from the cpus in the system.

The only case the remote cpu registers would not be capturable would
be if they were stuck looping in the trap entry, trap exit, or low
level TLB handler code.

This means that even if some cpu is stuck in a spinlock loop with
interrupts disabled, you'd see it with this thing.  The way it works
is that cross cpu vectored interrupts are disabled independently of
the processor interrupt level on sparc64.

This version just records the absolute minimum processor state, it
could be trivially extended to record all of pt_regs etc.

Even on my 64 cpu niagara box, the output is reasonable and fills up
one full screen of my console window.  Full pt_regs dumps are too
much.

diff --git a/arch/sparc64/kernel/process.c b/arch/sparc64/kernel/process.c
index ca7cdfd..67bf91d 100644
--- a/arch/sparc64/kernel/process.c
+++ b/arch/sparc64/kernel/process.c
@@ -1,7 +1,6 @@
-/*  $Id: process.c,v 1.131 2002/02/09 19:49:30 davem Exp $
- *  arch/sparc64/kernel/process.c
+/*  arch/sparc64/kernel/process.c
  *
- *  Copyright (C) 1995, 1996 David S. Miller ([EMAIL PROTECTED])
+ *  Copyright (C) 1995, 1996, 2007 David S. Miller ([EMAIL PROTECTED])
  *  Copyright (C) 1996   Eddie C. Dost   ([EMAIL PROTECTED])
  *  Copyright (C) 1997, 1998 Jakub Jelinek   ([EMAIL PROTECTED])
  */
@@ -31,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* #define VERBOSE_SHOWREGS */
 
@@ -388,6 +389,76 @@ void show_regs32(struct pt_regs32 *regs)
   regs->u_regs[15]);
 }
 
+#ifdef CONFIG_MAGIC_SYSRQ
+struct global_reg_snapshot {
+   unsigned long   tstate;
+   unsigned long   tpc;
+   unsigned long   tnpc;
+   struct thread_info  *thread;
+} global_reg_snapshot[NR_CPUS];
+static DEFINE_SPINLOCK(global_reg_snapshot_lock);
+
+static void sysrq_handle_globreg(int key, struct tty_struct *tty)
+{
+   struct pt_regs *regs = get_irq_regs();
+#ifdef CONFIG_KALLSYMS
+   char buffer[KSYM_SYMBOL_LEN];
+#endif
+   unsigned long flags;
+   int cpu;
+
+   spin_lock_irqsave(&global_reg_snapshot_lock, flags);
+   cpu = raw_smp_processor_id();
+   if (regs) {
+   global_reg_snapshot[cpu].tstate = regs->tstate;
+   global_reg_snapshot[cpu].tpc = regs->tpc;
+   global_reg_snapshot[cpu].tnpc = regs->tnpc;
+   } else {
+   global_reg_snapshot[cpu].tstate = 0;
+   global_reg_snapshot[cpu].tpc = 0;
+   global_reg_snapshot[cpu].tnpc = 0;
+   }
+   global_reg_snapshot[cpu].thread = current_thread_info();
+
+   smp_fetch_global_regs();
+
+   for_each_online_cpu(cpu) {
+   struct global_reg_snapshot *gp = &global_reg_snapshot[cpu];
+   struct thread_info *tp = gp->thread;
+
+   printk("%c CPU[%3d]: TSTATE[%016lx] TPC[%016lx] TNPC[%016lx] TASK[%s:%d]\n",
+  (cpu == raw_smp_processor_id() ? '*' : ' '), cpu,
+  gp->tstate, gp->tpc, gp->tnpc,
+  ((tp  && tp->task) ? tp->task->comm : "NULL"),
+  ((tp  && tp->task) ? tp->task->pid : -1));
+#ifdef CONFIG_KALLSYMS
+   if ((gp->tstate & TSTATE_PRIV) && (gp->tpc != 0UL)) {
+   sprint_symbol(buffer, gp->tpc);
+   printk(" TPC[%s]\n", buffer);
+   }
+#endif
+   }
+
+   memset(global_reg_snapshot, 0, sizeof(global_reg_snapshot));
+
+   spin_unlock_irqrestore(&global_reg_snapshot_lock, flags);
+}
+
+static struct sysrq_key_op sparc_globalreg_op = {
+   .handler= sysrq_handle_globreg,
+   .help_msg   = "Globalregs",
+   .action_msg = "Show Global CPU Regs",
+};
+
+static int __init sparc_globreg_init(void)
+{
+   return register_sysrq_key('g', &sparc_globalreg_op);
+}
+
+core_initcall(sparc_globreg_init);
+
+#endif
+
 unsigned long thread_saved_pc(struct task_struct *tsk)
 {
struct thread_info *ti = task_thread_info(tsk);
diff --git a/arch/sparc64/kernel/smp.c b/arch/sparc64/kernel/smp.c
index c73b7a4..cbedf27 100644
--- a/arch/sparc64/kernel/smp.c

[PATCH] Restore deterministic CPU accounting on powerpc

2007-11-01 Thread Paul Mackerras
Since powerpc started using CONFIG_GENERIC_CLOCKEVENTS, the
deterministic CPU accounting (CONFIG_VIRT_CPU_ACCOUNTING) has been
broken on powerpc, because we end up counting user time twice: once in
timer_interrupt() and once in update_process_times().

This fixes the problem by pulling the code in update_process_times
that updates utime and stime into a separate function called
account_process_tick.  If CONFIG_VIRT_CPU_ACCOUNTING is not defined,
there is a version of account_process_tick in kernel/timer.c that
simply accounts a whole tick to either utime or stime as before.  If
CONFIG_VIRT_CPU_ACCOUNTING is defined, then arch code gets to
implement account_process_tick.

This also lets us simplify the s390 code a bit; it means that the s390
timer interrupt can now call update_process_times even when
CONFIG_VIRT_CPU_ACCOUNTING is turned on, and can just implement a
suitable account_process_tick().
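
To make the split concrete, here is a minimal userspace sketch of the
generic (CONFIG_VIRT_CPU_ACCOUNTING=n) case. It is a model under
assumptions, not the kernel code: struct task_times,
account_process_tick_generic() and update_process_times_model() are
hypothetical names, and time is counted in whole ticks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task time counters. */
struct task_times {
	unsigned long utime;	/* ticks charged to user mode */
	unsigned long stime;	/* ticks charged to kernel mode */
};

/*
 * Model of the generic fallback: with no precise accounting available,
 * the whole tick goes to utime or stime depending on where it landed.
 */
static void account_process_tick_generic(struct task_times *t, int user_tick)
{
	assert(t != NULL);
	if (user_tick)
		t->utime += 1;
	else
		t->stime += 1;
}

/*
 * Model of the common tick path after the split. Accounting is a single
 * call here, so an architecture that already tracks time precisely can
 * supply its own version and avoid counting user time twice.
 */
static void update_process_times_model(struct task_times *t, int user_tick)
{
	account_process_tick_generic(t, user_tick);
	/* timers, RCU and the scheduler tick would follow in the real path */
}
```

In this model an arch-specific replacement would keep the same signature
and simply ignore user_tick, in the same way the powerpc version in this
patch reads its own per-CPU counters instead.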

Signed-off-by: Paul Mackerras <[EMAIL PROTECTED]>
---
I don't know who maintains kernel/timer.c, but I assume it's one of
Ingo, Peter Z. or Thomas.  

In fact the arch bits here don't need to go in at the same time as the
changes to include/linux/sched.h and kernel/timer.c, but could go in
later.  I have included them here so people can see how these changes
help in the VIRT_CPU_ACCOUNTING=y case.  The powerpc changes are
tested, but the s390 changes aren't.

In case it's not obvious, I'd like this (or at least the generic and
powerpc bits) to go in 2.6.24 since it fixes a regression.

 arch/powerpc/kernel/process.c |4 +++-
 arch/powerpc/kernel/time.c|   29 +++--
 arch/s390/kernel/time.c   |4 
 arch/s390/kernel/vtime.c  |9 ++---
 include/asm-powerpc/time.h|6 --
 include/asm-ppc/time.h|2 --
 include/asm-s390/system.h |1 -
 include/linux/sched.h |1 +
 kernel/timer.c|   21 ++---
 9 files changed, 23 insertions(+), 54 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index b9d8837..eba9332 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -349,9 +349,11 @@ struct task_struct *__switch_to(struct task_struct *prev,
 
local_irq_save(flags);
 
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
account_system_vtime(current);
-   account_process_vtime(current);
+   account_process_tick(0);
calculate_steal_time();
+#endif
 
last = _switch(old_thread, new_thread);
 
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 9eb3284..f950336 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -259,31 +259,19 @@ void account_system_vtime(struct task_struct *tsk)
  * user and system time records.
  * Must be called with interrupts disabled.
  */
-void account_process_vtime(struct task_struct *tsk)
+void account_process_tick(int user_tick)
 {
cputime_t utime, utimescaled;
 
utime = get_paca()->user_time;
get_paca()->user_time = 0;
-   account_user_time(tsk, utime);
+   account_user_time(current, utime);
 
/* Estimate the scaled utime by scaling the real utime based
 * on the last spurr to purr ratio */
utimescaled = utime * get_paca()->spurrdelta / get_paca()->purrdelta;
get_paca()->spurrdelta = get_paca()->purrdelta = 0;
-   account_user_time_scaled(tsk, utimescaled);
-}
-
-static void account_process_time(struct pt_regs *regs)
-{
-   int cpu = smp_processor_id();
-
-   account_process_vtime(current);
-   run_local_timers();
-   if (rcu_pending(cpu))
-   rcu_check_callbacks(cpu, user_mode(regs));
-   scheduler_tick();
-   run_posix_cpu_timers(current);
+   account_user_time_scaled(current, utimescaled);
 }
 
 /*
@@ -375,7 +363,6 @@ static void snapshot_purr(void)
 
 #else /* ! CONFIG_VIRT_CPU_ACCOUNTING */
 #define calc_cputime_factors()
-#define account_process_time(regs) update_process_times(user_mode(regs))
 #define calculate_steal_time() do { } while (0)
 #endif
 
@@ -599,16 +586,6 @@ void timer_interrupt(struct pt_regs * regs)
get_lppaca()->int_dword.fields.decr_int = 0;
 #endif
 
-   /*
-* We cannot disable the decrementer, so in the period
-* between this cpu's being marked offline in cpu_online_map
-* and calling stop-self, it is taking timer interrupts.
-* Avoid calling into the scheduler rebalancing code if this
-* is the case.
-*/
-   if (!cpu_is_offline(cpu))
-   account_process_time(regs);
-
if (evt->event_handler)
evt->event_handler(evt);
else
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index 48dae49..6c6be1f 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -145,12 +145,8 @@ void account_ticks(u64 time)
do_timer(ticks);
 #endif
 
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
-   

Re: [2.6.23] Unable to boot kernel, regression?

2007-11-01 Thread pradeep singh
On 10/11/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:
[snip]
> >
> > Did something get changed which I forgot to take care of? Or am I
> > missing something pretty obvious here?
>
> i have successfully built and booted a config derived from your config,
> on similar hardware. The only change is that i switched on some drivers
> by default and turned off some - the modified config is attached. Could
> you try this modified config, does it boot fine for you? (One crucial
> difference is that it uses CONFIG_SATA/PATA instead of CONFIG_IDE, so
> you might have to change /dev/hdax to /dev/sdax in your /etc/grub.conf,
> for the entry of the _new_ kernel only - if your system is not fs-label
> based.)

Sorry for the late reply, Ingo. I did not get a chance to check my mail.
Anyway, I tried your config, and with a little additional tweaking it works great.

Thanks for the help Ingo.

Best Regards
-- 
Pradeep


Re: [PATCH] 2.6.23: Filesystem capabilities 0.17

2007-11-01 Thread Casey Schaufler

--- Olaf Dietsche <[EMAIL PROTECTED]> wrote:

> Jan Kara <[EMAIL PROTECTED]> writes:
> 
> > On Thu 01-11-07 20:49:32, Olaf Dietsche wrote:
> >> Jan Kara <[EMAIL PROTECTED]> writes:
> >> 
> >> >> This patch implements filesystem capabilities. It allows to
> >> >> run privileged executables without the need for suid root.
> >> >   Hmm, is there some "design document" so that one does not have to poke
> >> > through the code and find out what it's actually trying to do?
> >> 
> >> What do you mean with "trying to do"? I thought this is obvious, it
> >> provides executables with filesystem capabilities.
> >   Well, yes, that was obvious but I rather meant "how is it doing it?".
> > So where does it store these bits and such.
> 
> The bits are stored in a sparse file named /.capabilities in the
> directory of the mount point, where the corresponding executable
> lives. The inode number of the file is the index into this file.

The old PlanG approach. It's the way that we did MAC labels in
Trix4. It has the wicked advantage of working across NFS without
anyone being the wiser. It really causes trouble for backup
utilities, however. Trix6 (there wasn't really a Trix5) had xattrs
available and we found the switch well worth the investment.
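
For readers who have not met this layout before, the lookup described in
the quoted text can be sketched as below. This is only an illustration
under assumptions: struct cap_record and both function names are
hypothetical, and the real record format used by the patch is not shown
in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fixed-size record; the patch's real layout may differ. */
struct cap_record {
	uint32_t effective;
	uint32_t permitted;
	uint32_t inheritable;
};

/* The inode number is the index, so the byte offset is ino * record size. */
static off_t cap_record_offset(uint64_t ino)
{
	return (off_t)(ino * sizeof(struct cap_record));
}

/*
 * Read the record for one inode from an already-open table file.
 * Returns 0 on success and -1 on I/O error. Holes in a sparse file read
 * back as zeros, and a read at or past end-of-file returns no data, so
 * an inode that was never written ends up with an all-zero record.
 */
static int cap_lookup(int fd, uint64_t ino, struct cap_record *out)
{
	ssize_t n;

	assert(out != NULL);
	memset(out, 0, sizeof(*out));
	n = pread(fd, out, sizeof(*out), cap_record_offset(ino));
	if (n < 0)
		return -1;
	if ((size_t)n < sizeof(*out))
		memset(out, 0, sizeof(*out));	/* partial record: treat as empty */
	return 0;
}
```

Because the table is keyed by inode number, a file restored onto a
different inode would lose its record unless the tool knows about the
table, which is presumably part of the backup trouble mentioned above.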


Casey Schaufler
[EMAIL PROTECTED]


x86: not exported, but is?

2007-11-01 Thread H. Peter Anvin
Anyone happens to know how come, in the x86 tree (and previously in the 
x86-64 tree),  is not exported to userspace (but uses 
userspace-compatible typenames), whereas  is?


This is particularly puzzling since at least my version of glibc 
contains an  that looks just like the one in the 
kernel, minus a __user and an inclusion of .


-hpa


Re: Strange freezes (seems like SATA related)

2007-11-01 Thread Jeff Garzik

Heikki Orsila wrote:

On Mon, Oct 29, 2007 at 09:54:27AM -0700, Max Krasnyansky wrote:

A couple of HP xw9300 machines (dual Opterons) started freezing up.
We're running 2.6.22.1 on them. The freezes are somewhat weird:
the VGA console is alive (I can switch vts, etc.) but everything else is
dead (network, etc.).


I'm thinking this is not a coincidence. I was running 2.6.22.5 and,
looking at your problems, I had a similar experience on Tuesday.
The network was still fine after the kernel errors, so I was able to
log in with SSH. See:


http://lkml.org/lkml/2007/10/30/193


ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 status 0x1540 next cpb count 0x0 next cpb idx 0x0
ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Descriptor sense data with sense descriptors (in hex):
end_request: I/O error, dev sda, sector 8388695
Buffer I/O error on device sda1, logical block 1048579
lost page write due to I/O error on sda1
sd 0:0:0:0: [sda] Write Protect is off


With ata_piix Intel SATA I got these errors:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:68:6f:3a:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 53248 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting port
ata1.00: revalidation failed (errno=-2)
ata1: failed to recover some devices, retrying in 5 secs
ata1: soft resetting port
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA


These are two 100% different issues.  The only thing they have in
common is that they spit out an error.


Jeff





Re: TCP_DEFER_ACCEPT issues

2007-11-01 Thread David Miller
From: Felix von Leitner <[EMAIL PROTECTED]>
Date: Fri, 2 Nov 2007 02:33:21 +0100

> I am trying to use TCP_DEFER_ACCEPT in my web server.

You aren't going to reach many Linux kernel networking
experts on this mailing list.  Please post your question
instead to [EMAIL PROTECTED], as that's where all
the networking developers hang out.

Thanks.


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread David Miller
From: Christoph Lameter <[EMAIL PROTECTED]>
Date: Thu, 1 Nov 2007 18:06:17 -0700 (PDT)

> A reasonable implementation for 64 bit is likely going to depend on 
> reserving some virtual memory space for the per cpu mappings so that they 
> can be dynamically grown up to what the reserved virtual space allows.
> 
> F.e. If we reserve 256G of virtual space and support a maximum of 16k cpus 
> then there is a limit on the per cpu space available of 16MB.

Now that I understand your implementation better, yes this
sounds just fine.
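
The arithmetic in the quoted example can be checked with a short sketch.
The helper names and the idea of laying the areas out back to back from
a region base are assumptions made for illustration; only the 256G and
16k figures come from the quoted text.

```c
#include <assert.h>
#include <stdint.h>

/* Per-CPU share of a reserved virtual range, in bytes. */
static uint64_t percpu_area_limit(uint64_t reserved_bytes, uint64_t max_cpus)
{
	assert(max_cpus != 0);
	return reserved_bytes / max_cpus;
}

/*
 * Start of one CPU's area, assuming (hypothetically) that the areas are
 * laid out contiguously from region_base in CPU-number order.
 */
static uint64_t percpu_area_base(uint64_t region_base, uint64_t per_cpu_bytes,
				 unsigned int cpu)
{
	return region_base + (uint64_t)cpu * per_cpu_bytes;
}
```

With 256G = 2^38 bytes and 16k = 2^14 cpus, percpu_area_limit() yields
2^24 bytes, i.e. the 16MB limit stated above.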


[PATCH 0/2 -v3] x86_64 EFI boot support

2007-11-01 Thread Huang, Ying
The following set of patches adds EFI/UEFI (Unified Extensible Firmware
Interface) boot support to the x86_64 architecture.

The patches have been tested against 2.6.24-rc1 kernel on Intel
platforms with EFI1.10 and UEFI2.0 firmware. With this set of patches
applied, the 64bit and 32bit x86 kernel can be booted on x86_64
machine with UEFI64 firmware.

Because the bootloader converts the EFI memory map to an E820 map, the
only code still needed to boot the Linux kernel on an x86_64 UEFI
platform is the framebuffer driver.
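
As a rough illustration of what such a conversion involves, the sketch
below maps one EFI memory descriptor to one E820 entry. It is not taken
from any bootloader: the struct and function names are hypothetical, and
treating loader and boot-services memory as usable RAM is an assumption
about a typical policy. The EFI type numbers follow the UEFI
specification and the E820 type values are the conventional ones.

```c
#include <assert.h>
#include <stdint.h>

#define EFI_PAGE_SIZE 4096ULL

/* EFI memory types, numbered as in the UEFI specification. */
enum {
	EFI_RESERVED_TYPE = 0,
	EFI_LOADER_CODE = 1,
	EFI_LOADER_DATA = 2,
	EFI_BOOT_SERVICES_CODE = 3,
	EFI_BOOT_SERVICES_DATA = 4,
	EFI_RUNTIME_SERVICES_CODE = 5,
	EFI_RUNTIME_SERVICES_DATA = 6,
	EFI_CONVENTIONAL_MEMORY = 7,
	EFI_UNUSABLE_MEMORY = 8,
	EFI_ACPI_RECLAIM_MEMORY = 9,
	EFI_ACPI_MEMORY_NVS = 10
};

/* E820 range types. */
enum { E820_RAM = 1, E820_RESERVED = 2, E820_ACPI = 3, E820_NVS = 4 };

/* Hypothetical, simplified descriptor and entry layouts. */
struct efi_desc_model { uint32_t type; uint64_t phys_start; uint64_t num_pages; };
struct e820_entry_model { uint64_t addr; uint64_t size; uint32_t type; };

/* Assumed policy: memory the OS may reuse after boot becomes RAM. */
static uint32_t efi_to_e820_type(uint32_t efi_type)
{
	switch (efi_type) {
	case EFI_LOADER_CODE:
	case EFI_LOADER_DATA:
	case EFI_BOOT_SERVICES_CODE:
	case EFI_BOOT_SERVICES_DATA:
	case EFI_CONVENTIONAL_MEMORY:
		return E820_RAM;
	case EFI_ACPI_RECLAIM_MEMORY:
		return E820_ACPI;
	case EFI_ACPI_MEMORY_NVS:
		return E820_NVS;
	default:
		return E820_RESERVED;	/* runtime services, MMIO, unusable, ... */
	}
}

/* EFI counts 4 KiB pages; E820 wants a byte length. */
static struct e820_entry_model efi_to_e820(const struct efi_desc_model *d)
{
	struct e820_entry_model e;

	assert(d != 0);
	e.addr = d->phys_start;
	e.size = d->num_pages * EFI_PAGE_SIZE;
	e.type = efi_to_e820_type(d->type);
	return e;
}
```

A real converter would also merge adjacent ranges of the same type, which
this sketch leaves out.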

UEFI specification can be found here: http://www.uefi.org

Booting the UEFI-enabled x86_64 kernel requires a machine with EFI/UEFI
firmware and a bootloader that supports it. A detailed usage guide can
be found in Documentation/x86_64/uefi.txt, which is added by the patch
"EFI boot document".


v3:

- The VIDEO_TYPE_EFI is changed to 0x70 to group VIDEO_TYPE_ better.

v2:

- The include files of efifb.c are cleaned up.
- Make CONFIG_FB_EFI not depend on CONFIG_EFI.


Best Regards,
Huang Ying


[PATCH 3/4] Blackfin SPI driver: move hard coded pin_req to board file

2007-11-01 Thread Bryan Wu
Remove some bloated code and get the pin_req arrays built at compile time:

 - move these static tables to the Blackfin board file
 - add pin_req array to struct bfin5xx_spi_master
 - tested on BF537/BF548 with SPI flash

Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 drivers/spi/spi_bfin5xx.c  |   28 +++-
 include/asm-blackfin/bfin5xx_spi.h |1 +
 2 files changed, 8 insertions(+), 21 deletions(-)

diff --git a/drivers/spi/spi_bfin5xx.c b/drivers/spi/spi_bfin5xx.c
index a6f6d5f..f466fa0 100644
--- a/drivers/spi/spi_bfin5xx.c
+++ b/drivers/spi/spi_bfin5xx.c
@@ -80,6 +80,9 @@ struct driver_data {
/* Regs base of SPI controller */
void __iomem *regs_base;
 
+   /* Pin request list */
+   u16 *pin_req;
+
/* BFIN hookup */
struct bfin5xx_spi_master *master_info;
 
@@ -1245,25 +1248,6 @@ static inline int destroy_queue(struct driver_data *drv_data)
return 0;
 }
 
-static int setup_pin_mux(int action, int bus_num)
-{
-
-   u16 pin_req[3][4] = {
-   {P_SPI0_SCK, P_SPI0_MISO, P_SPI0_MOSI, 0},
-   {P_SPI1_SCK, P_SPI1_MISO, P_SPI1_MOSI, 0},
-   {P_SPI2_SCK, P_SPI2_MISO, P_SPI2_MOSI, 0},
-   };
-
-   if (action) {
-   if (peripheral_request_list(pin_req[bus_num], DRV_NAME))
-   return -EFAULT;
-   } else {
-   peripheral_free_list(pin_req[bus_num]);
-   }
-
-   return 0;
-}
-
 static int __init bfin5xx_spi_probe(struct platform_device *pdev)
 {
struct device *dev = &pdev->dev;
@@ -1286,6 +1270,7 @@ static int __init bfin5xx_spi_probe(struct platform_device *pdev)
drv_data->master = master;
drv_data->master_info = platform_info;
drv_data->pdev = pdev;
+   drv_data->pin_req = platform_info->pin_req;
 
master->bus_num = pdev->id;
master->num_chipselect = platform_info->num_chipselect;
@@ -1336,7 +1321,8 @@ static int __init bfin5xx_spi_probe(struct platform_device *pdev)
goto out_error_queue_alloc;
}
 
-   if (setup_pin_mux(1, master->bus_num)) {
+   status = peripheral_request_list(drv_data->pin_req, DRV_NAME);
+   if (status != 0) {
dev_err(&pdev->dev, ": Requesting Peripherals failed\n");
goto out_error;
}
@@ -1384,7 +1370,7 @@ static int __devexit bfin5xx_spi_remove(struct platform_device *pdev)
/* Disconnect from the SPI framework */
spi_unregister_master(drv_data->master);
 
-   setup_pin_mux(0, drv_data->master->bus_num);
+   peripheral_free_list(drv_data->pin_req);
 
/* Prevent double remove */
platform_set_drvdata(pdev, NULL);
diff --git a/include/asm-blackfin/bfin5xx_spi.h b/include/asm-blackfin/bfin5xx_spi.h
index d4485b3..1a0b57f 100644
--- a/include/asm-blackfin/bfin5xx_spi.h
+++ b/include/asm-blackfin/bfin5xx_spi.h
@@ -152,6 +152,7 @@
 struct bfin5xx_spi_master {
u16 num_chipselect;
u8 enable_dma;
+   u16 pin_req[4];
 };
 
 /* spi_board_info.controller_data for SPI slave devices,
-- 
1.5.3.4


[PATCH 4/4] Blackfin SPI driver: reconfigure speed_hz and bits_per_word in each spi transfer

2007-11-01 Thread Bryan Wu
 - reconfigure the SPI baud rate from the speed_hz of each SPI transfer
 - reprogram the control register and set up the correct SPI operation
   handlers according to spi_transfer.bits_per_word
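
The selection logic described above can be sketched in isolation as
follows. This is a simplified userspace model, not the driver code: the
three structs and resolve_transfer() are hypothetical names, and the
conversion from Hz to a baud register value is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-transfer request; 0 means "use the chip default". */
struct xfer_model {
	unsigned int bits_per_word;
	unsigned int speed_hz;
	unsigned int len;		/* length in bytes */
};

/* Hypothetical per-chip defaults set up once at device setup time. */
struct chip_model {
	unsigned int n_bytes;		/* bytes per word: 1 or 2 */
	unsigned int speed_hz;
};

/* What the transfer pump would program for this one transfer. */
struct xfer_setup {
	unsigned int n_bytes;
	unsigned int words;		/* number of words to move */
	unsigned int speed_hz;
};

static struct xfer_setup resolve_transfer(const struct chip_model *chip,
					  const struct xfer_model *x)
{
	struct xfer_setup s;

	assert(chip != NULL && x != NULL);

	/* Word size: an explicit 8 or 16 overrides the chip default. */
	switch (x->bits_per_word) {
	case 8:
		s.n_bytes = 1;
		break;
	case 16:
		s.n_bytes = 2;
		break;
	default:
		s.n_bytes = chip->n_bytes;
		break;
	}

	/* A 16-bit transfer moves half as many words as it has bytes. */
	s.words = (s.n_bytes == 2) ? (x->len >> 1) : x->len;

	/* Speed: a non-zero per-transfer value overrides the chip default. */
	s.speed_hz = x->speed_hz ? x->speed_hz : chip->speed_hz;
	return s;
}
```

In the patch itself the same decision also picks the matching reader,
writer and duplex handlers; the model only tracks the numeric settings.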

Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 drivers/spi/spi_bfin5xx.c |   54 ++--
 1 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/drivers/spi/spi_bfin5xx.c b/drivers/spi/spi_bfin5xx.c
index f466fa0..3e5126e 100644
--- a/drivers/spi/spi_bfin5xx.c
+++ b/drivers/spi/spi_bfin5xx.c
@@ -234,10 +234,8 @@ static int restore_state(struct driver_data *drv_data)
dev_dbg(&drv_data->pdev->dev, "restoring spi ctl state\n");
 
/* Load the registers */
-   write_BAUD(drv_data, chip->baud);
-   chip->ctl_reg &= (~BIT_CTL_TIMOD);
-   chip->ctl_reg |= (chip->width << 8);
write_CTRL(drv_data, chip->ctl_reg);
+   write_BAUD(drv_data, chip->baud);
 
bfin_spi_enable(drv_data);
cs_active(drv_data, chip);
@@ -679,6 +677,7 @@ static void pump_transfers(unsigned long data)
message = drv_data->cur_msg;
transfer = drv_data->cur_transfer;
chip = drv_data->cur_chip;
+
/*
 * if msg is error or done, report it back using complete() callback
 */
@@ -736,23 +735,62 @@ static void pump_transfers(unsigned long data)
drv_data->len_in_bytes = transfer->len;
drv_data->cs_change = transfer->cs_change;
 
-   width = chip->width;
+   /* Bits per word setup */
+   switch (transfer->bits_per_word) {
+   case 8:
+   drv_data->n_bytes = 1;
+   width = CFG_SPI_WORDSIZE8;
+   drv_data->read = chip->cs_change_per_word ?
+   u8_cs_chg_reader : u8_reader;
+   drv_data->write = chip->cs_change_per_word ?
+   u8_cs_chg_writer : u8_writer;
+   drv_data->duplex = chip->cs_change_per_word ?
+   u8_cs_chg_duplex : u8_duplex;
+   break;
+
+   case 16:
+   drv_data->n_bytes = 2;
+   width = CFG_SPI_WORDSIZE16;
+   drv_data->read = chip->cs_change_per_word ?
+   u16_cs_chg_reader : u16_reader;
+   drv_data->write = chip->cs_change_per_word ?
+   u16_cs_chg_writer : u16_writer;
+   drv_data->duplex = chip->cs_change_per_word ?
+   u16_cs_chg_duplex : u16_duplex;
+   break;
+
+   default:
+   /* No change, the same as default setting */
+   drv_data->n_bytes = chip->n_bytes;
+   width = chip->width;
+   drv_data->write = drv_data->tx ? chip->write : null_writer;
+   drv_data->read = drv_data->rx ? chip->read : null_reader;
+   drv_data->duplex = chip->duplex ? chip->duplex : null_writer;
+   break;
+   }
+   cr = (read_CTRL(drv_data) & (~BIT_CTL_TIMOD));
+   cr |= (width << 8);
+   write_CTRL(drv_data, cr);
+
if (width == CFG_SPI_WORDSIZE16) {
drv_data->len = (transfer->len) >> 1;
} else {
drv_data->len = transfer->len;
}
-   drv_data->write = drv_data->tx ? chip->write : null_writer;
-   drv_data->read = drv_data->rx ? chip->read : null_reader;
-   drv_data->duplex = chip->duplex ? chip->duplex : null_writer;
dev_dbg(&drv_data->pdev->dev,
"transfer: drv_data->write is %p, chip->write is %p, null_wr is %p\n",
-   drv_data->write, chip->write, null_writer);
+   drv_data->write, chip->write, null_writer);
 
/* speed and width has been set on per message */
message->state = RUNNING_STATE;
dma_config = 0;
 
+   /* Speed setup (surely valid because already checked) */
+   if (transfer->speed_hz)
+   write_BAUD(drv_data, hz_to_spi_baud(transfer->speed_hz));
+   else
+   write_BAUD(drv_data, chip->baud);
+
write_STAT(drv_data, BIT_STAT_CLR);
cr = (read_CTRL(drv_data) & (~BIT_CTL_TIMOD));
cs_active(drv_data, chip);
-- 
1.5.3.4


[PATCH 1/4] Blackfin SPI driver: use cpu_relax() to replace continue in while busywait

2007-11-01 Thread Bryan Wu
Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 drivers/spi/spi_bfin5xx.c |   78 ++--
 1 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/drivers/spi/spi_bfin5xx.c b/drivers/spi/spi_bfin5xx.c
index 83c866d..fc0c374 100644
--- a/drivers/spi/spi_bfin5xx.c
+++ b/drivers/spi/spi_bfin5xx.c
@@ -186,7 +186,7 @@ static int flush(struct driver_data *drv_data)
 
/* wait for stop and clear stat */
while (!(read_STAT(drv_data) & BIT_STAT_SPIF) && limit--)
-   continue;
+   cpu_relax();
 
write_STAT(drv_data, BIT_STAT_CLR);
 
@@ -262,7 +262,7 @@ static void null_writer(struct driver_data *drv_data)
while (drv_data->tx < drv_data->tx_end) {
write_TDBR(drv_data, 0);
while ((read_STAT(drv_data) & BIT_STAT_TXS))
-   continue;
+   cpu_relax();
drv_data->tx += n_bytes;
}
 }
@@ -274,7 +274,7 @@ static void null_reader(struct driver_data *drv_data)
 
while (drv_data->rx < drv_data->rx_end) {
while (!(read_STAT(drv_data) & BIT_STAT_RXS))
-   continue;
+   cpu_relax();
dummy_read(drv_data);
drv_data->rx += n_bytes;
}
@@ -287,12 +287,12 @@ static void u8_writer(struct driver_data *drv_data)
 
/* poll for SPI completion before start */
while (!(read_STAT(drv_data) & BIT_STAT_SPIF))
-   continue;
+   cpu_relax();
 
while (drv_data->tx < drv_data->tx_end) {
write_TDBR(drv_data, (*(u8 *) (drv_data->tx)));
while (read_STAT(drv_data) & BIT_STAT_TXS)
-   continue;
+   cpu_relax();
++drv_data->tx;
}
 }
@@ -303,14 +303,14 @@ static void u8_cs_chg_writer(struct driver_data *drv_data)
 
/* poll for SPI completion before start */
while (!(read_STAT(drv_data) & BIT_STAT_SPIF))
-   continue;
+   cpu_relax();
 
while (drv_data->tx < drv_data->tx_end) {
cs_active(drv_data, chip);
 
write_TDBR(drv_data, (*(u8 *) (drv_data->tx)));
while (read_STAT(drv_data) & BIT_STAT_TXS)
-   continue;
+   cpu_relax();
 
cs_deactive(drv_data, chip);
 
@@ -325,7 +325,7 @@ static void u8_reader(struct driver_data *drv_data)
 
/* poll for SPI completion before start */
while (!(read_STAT(drv_data) & BIT_STAT_SPIF))
-   continue;
+   cpu_relax();
 
/* clear TDBR buffer before read(else it will be shifted out) */
write_TDBR(drv_data, 0x);
@@ -334,13 +334,13 @@ static void u8_reader(struct driver_data *drv_data)
 
while (drv_data->rx < drv_data->rx_end - 1) {
while (!(read_STAT(drv_data) & BIT_STAT_RXS))
-   continue;
+   cpu_relax();
*(u8 *) (drv_data->rx) = read_RDBR(drv_data);
++drv_data->rx;
}
 
while (!(read_STAT(drv_data) & BIT_STAT_RXS))
-   continue;
+   cpu_relax();
*(u8 *) (drv_data->rx) = read_SHAW(drv_data);
++drv_data->rx;
 }
@@ -351,7 +351,7 @@ static void u8_cs_chg_reader(struct driver_data *drv_data)
 
/* poll for SPI completion before start */
while (!(read_STAT(drv_data) & BIT_STAT_SPIF))
-   continue;
+   cpu_relax();
 
/* clear TDBR buffer before read(else it will be shifted out) */
write_TDBR(drv_data, 0x);
@@ -363,7 +363,7 @@ static void u8_cs_chg_reader(struct driver_data *drv_data)
cs_deactive(drv_data, chip);
 
while (!(read_STAT(drv_data) & BIT_STAT_RXS))
-   continue;
+   cpu_relax();
cs_active(drv_data, chip);
*(u8 *) (drv_data->rx) = read_RDBR(drv_data);
++drv_data->rx;
@@ -371,7 +371,7 @@ static void u8_cs_chg_reader(struct driver_data *drv_data)
cs_deactive(drv_data, chip);
 
while (!(read_STAT(drv_data) & BIT_STAT_RXS))
-   continue;
+   cpu_relax();
*(u8 *) (drv_data->rx) = read_SHAW(drv_data);
++drv_data->rx;
 }
@@ -380,15 +380,15 @@ static void u8_duplex(struct driver_data *drv_data)
 {
/* poll for SPI completion before start */
while (!(read_STAT(drv_data) & BIT_STAT_SPIF))
-   continue;
+   cpu_relax();
 
/* in duplex mode, clk is triggered by writing of TDBR */
while (drv_data->rx < drv_data->rx_end) {
write_TDBR(drv_data, (*(u8 *) (drv_data->tx)));
while (read_STAT(drv_data) & BIT_STAT_TXS)
-   continue;
+   cpu_relax();
while (!(read_STAT(drv_data) & 

[PATCH 0/4] Blackfin SPI driver updates and fixing

2007-11-01 Thread Bryan Wu
Following David's and Andrew's advice, this series updates the Blackfin SPI
patches in the -mm tree.



[PATCH 2/4] Blackfin SPI driver: use void __iomem * for regs_base

2007-11-01 Thread Bryan Wu
Signed-off-by: Bryan Wu <[EMAIL PROTECTED]>
---
 drivers/spi/spi_bfin5xx.c |9 -
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/spi/spi_bfin5xx.c b/drivers/spi/spi_bfin5xx.c
index fc0c374..a6f6d5f 100644
--- a/drivers/spi/spi_bfin5xx.c
+++ b/drivers/spi/spi_bfin5xx.c
@@ -78,7 +78,7 @@ struct driver_data {
struct spi_master *master;
 
/* Regs base of SPI controller */
-   u32 regs_base;
+   void __iomem *regs_base;
 
/* BFIN hookup */
struct bfin5xx_spi_master *master_info;
@@ -1301,9 +1301,8 @@ static int __init bfin5xx_spi_probe(struct platform_device *pdev)
goto out_error_get_res;
}
 
-   drv_data->regs_base = (u32) ioremap(res->start,
-   (res->end - res->start + 1));
-   if (!drv_data->regs_base) {
+   drv_data->regs_base = ioremap(res->start, (res->end - res->start + 1));
+   if (drv_data->regs_base == NULL) {
dev_err(dev, "Cannot map IO\n");
status = -ENXIO;
goto out_error_ioremap;
@@ -1342,7 +1341,7 @@ static int __init bfin5xx_spi_probe(struct platform_device *pdev)
goto out_error;
}
 
-   dev_info(dev, "%s, Version %s, [EMAIL PROTECTED], dma [EMAIL 
PROTECTED]",
+   dev_info(dev, "%s, Version %s, [EMAIL PROTECTED], dma [EMAIL 
PROTECTED]",
DRV_DESC, DRV_VERSION, drv_data->regs_base,
drv_data->dma_channel);
return status;
-- 
1.5.3.4


Re: writeout stalls in current -git

2007-11-01 Thread Fengguang Wu
On Thu, Nov 01, 2007 at 08:00:10PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> > On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > > Thank you. Maybe we can start by the applied debug patch :-)
> >
> > Will apply it and try to recreate this.
> 
> Patch applied, used emerge to install a 2.6.24-rc1 kernel.
> 
> I had no complete stalls, but three times during the move from tmpfs
> to the main xfs the emerge got noticeably slower. There still was
> writeout happening, but as emerge prints out every file it has written
> during the pause not one file was processed.
> 
> vmstat 10:
> procs ---memory-- ---swap-- -io -system-- cpu
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
>  0  1  0 3146424332 61476800   134  1849  438 2515  3  4 91  2
>  0  0  0 3146644332 61478400 2  1628  507  646  0  2 85 13
>  0  0  0 3146868332 61486800 5  2359  527 1076  0  3 97  0
>  1  0  0 3144372332 6161480096  2829  607 2666  2  5 92  0
> -> normal writeout
>  0  0  0 3140560332 61814400   152  2764  633 3308  3  6 91  0
>  0  0  0 3137332332 61990800   114  1801  588 2858  3  4 93  0
>  0  0  0 3136912332 6201360020   827  393 1605  1  2 98  0
> -> first stall

'stall': vmstat's output stalls for some time, or emerge stalls for
the next several vmstat lines?

>  0  0  0 3137088332 62013600 0   557  339 1437  0  1 99  0
>  0  0  0 3137160332 62013600 0   642  310 1400  0  1 99  0
>  0  0  0 3136588332 62017200 6  2972  527 1195  0  3 80 16
>  0  0  0 3136276332 6203480010  2668  558 1195  0  3 96  0
>  0  0  0 3135228332 62042400 8  2712  522 1311  0  4 96  0
>  0  0  0 3131740332 6215240075  2935  559 2457  2  5 93  0
>  0  0  0 3128348332 6229720085  1470  490 2607  3  4 93  0
>  0  0  0 3129292332 62297200 0   527  353 1398  0  1 99  0
> -> second longer stall
>  0  0  0 3128520332 62302800 6   488  249 1390  0  1 99  0
>  0  0  0 3128236332 62302800 0   482  222 1222  0  1 99  0
>  0  0  0 3128408332 62302800 0   585  269 1301  0  0 99  0
>  0  0  0 3128532332 62302800 0   610  262 1278  0  0 99  0
>  0  0  0 3128568332 62302800 0   636  345 1639  0  1 99  0
>  0  0  0 3129032332 62304000 1   664  337 1466  0  1 99  0
>  0  0  0 3129484332 62304000 0   658  300 1508  0  0 100  0
>  0  0  0 3129576332 62304000 0   562  271 1454  0  1 99  0
>  0  0  0 3129736332 62304000 0   627  278 1406  0  1 99  0
>  0  0  0 3129368332 62304000 0   507  274 1301  0  1 99  0
>  0  0  0 3129004332 62304000 0   444  211 1213  0  0 99  0
>  0  1  0 3127260332 62304000 0  1036  305 1242  0  1 95  4
>  0  0  0 3126280332 62312800 7  4241  555 1575  1  5 84 10
>  0  0  0 3124948332 62323200 6  4194  529 1505  1  4 95  0
>  0  0  0 3125228332 6241680058  1966  586 1964  2  4 94  0
> -> emerge resumed to normal speed, without any intervention from my side
>  0  0  0 3120932332 62590400   112  1546  546 2565  3  4 93  0
>  0  0  0 3118012332 62756800   128  1542  612 2705  3  4 93  0

Interesting, the 'bo' never falls to zero.

> 
> From syslog:
> first stall:
> [  575.05] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 47259 global 610 0 0 wc __ tw 1023 sk 0
> [  586.35] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50465 global 6117 0 0 wc _M tw 967 sk 0
> [  586.36] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 50408 global 6117 0 0 wc __ tw 1022 sk 0
> [  599.90] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 53523 global 11141 0 0 wc __ tw 1009 sk 0
> [  635.78] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 59397 global 12757 124 0 wc __ tw 0 sk 0
> [  638.47] mm/page-writeback.c 418 balance_dirty_pages: emerge(6113) 1536 global 11405 51 0 wc __ tw 0 sk 0
> [  638.82] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 58373 global 11276 48 0 wc __ tw -1 sk 0
> [  641.26] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 57348 global 10565 100 0 wc __ tw 0 sk 0
> [  643.98] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 56324 global 9788 103 0 wc __ tw -1 sk 0
> [  646.12] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 55299 global 8912 6 0 wc __ tw 0 sk 0
> 
> second stall:
> [  664.04] mm/page-writeback.c 655 wb_kupdate: pdflush(285) 48117 global 2864 81 0 wc _M tw -13 sk 0
> [  664.40] mm/page-writeback.c 

Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-01 Thread Florin Iucha
On Fri, Nov 02, 2007 at 09:33:21AM +0800, Fengguang Wu wrote:
> > I will try that with a USB disk - I hope that won't make a difference.
> 
> Thank you. I guess a reiserfs on loop file would also be OK.
> 
> > > btw, what's the exact kernel version you are running?
> > 
> > I noticed it with the kernel in the $SUBJECT, as reported by 'git
> > describe'.  I have pulled in new changesets since then.
> 
> And with the following patch applied?
> 
> ---
>  fs/reiserfs/stree.c |3 ---
>  1 file changed, 3 deletions(-)
> 
> --- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
> +++ linux-2.6.24-git17/fs/reiserfs/stree.c
> @@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
>   }
>   bh = next;
>   } while (bh != head);
> - if (PAGE_SIZE == bh->b_size) {
> - cancel_dirty_page(page, PAGE_CACHE_SIZE);
> - }
>   }
>   }
>  }

... and with the above patch applied.

Copying 300 MB from root (ext3) to the new file system did not trigger
the pdflush condition.  But then I did a
   cd $MOUNTPOINT && find . -exec md5sum {} \;
and that brought one cpu to 75% iowait.

I have attached my .config, if it helps.

Cheers,
florin

-- 
Bruce Schneier expects the Spanish Inquisition.
  http://geekz.co.uk/schneierfacts/fact/163
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24-rc1-7-fw2
# Wed Oct 31 07:27:14 2007
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
# CONFIG_TASKSTATS is not set
# CONFIG_USER_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=19
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CPUSETS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_BLK_DEV_BSG=y
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=m
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y

[PATCH 2/2 -v3] x86_64 EFI boot support: EFI boot document

2007-11-01 Thread Huang, Ying
This patch adds documentation for EFI x86_64 boot support. The setup and
operation guide for EFI-based systems is documented in
Documentation/x86_64/uefi.txt.

Signed-off-by: Chandramouli Narayanan <[EMAIL PROTECTED]>
Signed-off-by: Huang Ying <[EMAIL PROTECTED]>

---
 Documentation/x86_64/uefi.txt |   29 +
 1 file changed, 29 insertions(+)

Index: linux-2.6.24-rc1/Documentation/x86_64/uefi.txt
===
--- /dev/null
+++ linux-2.6.24-rc1/Documentation/x86_64/uefi.txt
@@ -0,0 +1,29 @@
+General note on [U]EFI x86_64 support
+-
+
+The nomenclature EFI and UEFI are used interchangeably in this document.
+
+Although the tools below are _not_ needed for building the kernel,
+the needed bootloader support and associated tools for x86_64 platforms
+with EFI firmware and specifications are listed below.
+
+1. UEFI specification:  http://www.uefi.org
+
+2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
+   support. Elilo with x86_64 support can be used.
+
+3. x86_64 platform with EFI/UEFI firmware.
+
+Mechanics:
+-
+- Build the kernel with the following configuration.
+   CONFIG_FB_EFI=y
+   CONFIG_FRAMEBUFFER_CONSOLE=y
+- Create a VFAT partition on the disk
+- Copy the following to the VFAT partition:
+   elilo bootloader with x86_64 support, elilo configuration file,
+   kernel image built in first step and corresponding
+   initrd. Instructions on building elilo  and its dependencies
+   can be found in the elilo sourceforge project.
+- Boot to EFI shell and invoke elilo choosing the kernel image built
+  in first step.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2 -v3] x86_64 EFI boot support: EFI frame buffer driver

2007-11-01 Thread Huang, Ying
This patch adds Graphics Output Protocol support to the kernel.
The UEFI 2.0 spec deprecates the Universal Graphics Adapter (UGA) protocol;
only the Graphics Output Protocol (GOP) is produced. Therefore, the boot
loader needs to query the UEFI firmware with the appropriate output
protocol and pass the video information to the kernel. Because of
GOP, an EFI framebuffer driver is needed for displaying
console messages. This patch adds an EFI framebuffer driver, based
on the Intel Mac framebuffer driver.

The ELILO bootloader takes care of passing the video information as
appropriate for EFI firmware.

The framebuffer driver has been tested with i386 and x86_64
kernels on an EFI platform.

Signed-off-by: Chandramouli Narayanan <[EMAIL PROTECTED]>
Signed-off-by: Huang Ying <[EMAIL PROTECTED]>

---
 drivers/video/Kconfig   |   11 ++
 drivers/video/Makefile  |1 
 drivers/video/efifb.c   |  232 
 include/linux/screen_info.h |2 
 4 files changed, 246 insertions(+)

Index: linux-2.6.24-rc1/include/linux/screen_info.h
===
--- linux-2.6.24-rc1.orig/include/linux/screen_info.h
+++ linux-2.6.24-rc1/include/linux/screen_info.h
@@ -63,6 +63,8 @@ struct screen_info {
 
 #define VIDEO_TYPE_PMAC		0x60	/* PowerMacintosh frame buffer. */
 
+#define VIDEO_TYPE_EFI		0x70	/* EFI graphic mode */
+
 #ifdef __KERNEL__
 extern struct screen_info screen_info;
 
Index: linux-2.6.24-rc1/drivers/video/Kconfig
===
--- linux-2.6.24-rc1.orig/drivers/video/Kconfig
+++ linux-2.6.24-rc1/drivers/video/Kconfig
@@ -641,6 +641,17 @@ config FB_VESA
  You will get a boot time penguin logo at no additional cost. Please
  read . If unsure, say Y.
 
+config FB_EFI
+   bool "EFI-based Framebuffer Support"
+   depends on (FB = y) && X86
+   select FB_CFB_FILLRECT
+   select FB_CFB_COPYAREA
+   select FB_CFB_IMAGEBLIT
+   help
+ This is the EFI frame buffer device driver. If the firmware on
+ your platform is UEFI2.0, select Y to add support for
+ Graphics Output Protocol for early console messages to appear.
+
 config FB_IMAC
bool "Intel-based Macintosh Framebuffer Support"
depends on (FB = y) && X86 && EFI
Index: linux-2.6.24-rc1/drivers/video/Makefile
===
--- linux-2.6.24-rc1.orig/drivers/video/Makefile
+++ linux-2.6.24-rc1/drivers/video/Makefile
@@ -118,6 +118,7 @@ obj-$(CONFIG_FB_OMAP) += oma
 obj-$(CONFIG_FB_UVESA)+= uvesafb.o
 obj-$(CONFIG_FB_VESA) += vesafb.o
 obj-$(CONFIG_FB_IMAC) += imacfb.o
+obj-$(CONFIG_FB_EFI)  += efifb.o
 obj-$(CONFIG_FB_VGA16)+= vga16fb.o
 obj-$(CONFIG_FB_OF)   += offb.o
 obj-$(CONFIG_FB_BF54X_LQ043) += bf54x-lq043fb.o
Index: linux-2.6.24-rc1/drivers/video/efifb.c
===
--- /dev/null
+++ linux-2.6.24-rc1/drivers/video/efifb.c
@@ -0,0 +1,232 @@
+/*
+ * Framebuffer driver for EFI/UEFI based system
+ *
+ * (c) 2006 Edgar Hucek <[EMAIL PROTECTED]>
+ * Original efi driver written by Gerd Knorr <[EMAIL PROTECTED]>
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+static struct fb_var_screeninfo efifb_defined __initdata = {
+   .activate   = FB_ACTIVATE_NOW,
+   .height = -1,
+   .width  = -1,
+   .right_margin   = 32,
+   .upper_margin   = 16,
+   .lower_margin   = 4,
+   .vsync_len  = 4,
+   .vmode  = FB_VMODE_NONINTERLACED,
+};
+
+static struct fb_fix_screeninfo efifb_fix __initdata = {
+   .id = "EFI VGA",
+   .type   = FB_TYPE_PACKED_PIXELS,
+   .accel  = FB_ACCEL_NONE,
+   .visual = FB_VISUAL_TRUECOLOR,
+};
+
+static int efifb_setcolreg(unsigned regno, unsigned red, unsigned green,
+  unsigned blue, unsigned transp,
+  struct fb_info *info)
+{
+   /*
+*  Set a single color register. The values supplied are
+*  already rounded down to the hardware's capabilities
+*  (according to the entries in the `var' structure). Return
+*  != 0 for invalid regno.
+*/
+
+   if (regno >= info->cmap.len)
+   return 1;
+
+   if (regno < 16) {
+   red   >>= 8;
+   green >>= 8;
+   blue  >>= 8;
+   ((u32 *)(info->pseudo_palette))[regno] =
+   (red   << info->var.red.offset)   |
+   (green << info->var.green.offset) |
+ 

Re: [PATCH/RFC] eradicate bashisms in scripts/patch-kernel

2007-11-01 Thread Herbert Xu
On Thu, Nov 01, 2007 at 04:08:57PM -0700, Randy Dunlap wrote:
>
> > - replace non-standard bash string parsing by sed expression
> >   (is the sed syntax ok? correct? strict enough?)
> 
> I think that this is the part that bothers me.  I can't find
> anything at
> http://www.opengroup.org/onlinepubs/95399/utilities/xcu_chap02.html
> that says that this:
>   EXTRAVER=${EXTRAVER%%[[:punct:]]*}
> 
> is invalid or even optional syntax.  OTOH, it does list such syntax,

This is POSIX-compliant and has worked with dash from the very
start.

> > About the missing $ signs:
> > http://www.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_14
> > says:
> > "If the shell variable x contains a value that forms a valid integer
> > constant, then the arithmetic expansions
> >  "$((x))" and "$(($x))" shall return the same value."
> > 
> > Hmm, well, seems dash doesn't... (syntax error).
> > Thus I still needed to add the $ signs despite opengroup.org specifying
> > it differently.
> 
> Herbert?

Using variables without dollar signs in arithmetic expansion was
only added to dash very recently.  So please please talk to your
distribution maker to update their dash packages and it will work
correctly.

This usage is compliant with the most recent revision of POSIX
while earlier ones did not specifically require this (due to
the fact that assignment support was not required either).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Rik van Riel
On Thu, 1 Nov 2007 18:19:41 -0700 (PDT)
Linus Torvalds <[EMAIL PROTECTED]> wrote:
> On Thu, 1 Nov 2007, Rik van Riel wrote:
> > 
> > Larry Woodman managed to wedge the VM into a state where, on his
> > 4x dual core system, only 2 cores (on the same CPU) could get the
> > zone->lru_lock overnight.  The other 6 cores on the system were
> > just spinning, without being able to get the lock.
> 
> .. and this is almost always the result of a locking *bug*, not
> unfairness per se. IOW, unfairness just ends up showing the bug in
> the first place.

No argument there.  If you have the kind of lock contention where
fairness matters, the contention is probably what needs to be fixed,
not the locking mechanism.

Having said that, making bugs like that less likely to totally wedge
a system would be a good thing for everybody who uses Linux in
production.  Exposing bugs is good for development, bad for business.
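
For readers following the thread subject, the FIFO idea behind a ticket
lock can be sketched in portable C11. This is only an illustration of the
general technique, not the x86 implementation from the patch under
discussion; the names ticket_lock, ticket_lock_acquire and
ticket_lock_release are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space ticket lock; not the kernel's spinlock_t. */
struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently allowed to hold the lock */
};

/* Take a ticket, then spin until it is our turn (FIFO order).
 * Returns the ticket number that was served. */
static unsigned ticket_lock_acquire(struct ticket_lock *l)
{
    unsigned me = atomic_fetch_add_explicit(&l->next, 1,
                                            memory_order_relaxed);

    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;   /* a real implementation would add a CPU relax hint here */
    return me;
}

/* Pass the lock on to the holder of the next ticket. */
static void ticket_lock_release(struct ticket_lock *l)
{
    /* owner == next means nobody holds the lock: releasing is a bug */
    assert(atomic_load(&l->owner) != atomic_load(&l->next));
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

Because waiters are served strictly in ticket order, no CPU can be
starved indefinitely the way the report above describes, at the cost of
every waiter spinning on the same owner field.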

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


Re: [PATCH] file capabilities: allow sigcont within session (v2)

2007-11-01 Thread Theodore Tso
On Thu, Nov 01, 2007 at 08:47:01AM -0500, Serge E. Hallyn wrote:
> > > >From 5bff8967f45a35f858b96ca673d9bf98eac53d49 Mon Sep 17 00:00:00 2001
> > > From: Serge E. Hallyn <[EMAIL PROTECTED]>
> > > Date: Wed, 31 Oct 2007 11:22:04 -0500
> > > Subject: [PATCH 1/1] file capabilities: allow sigcont within session (v2)

> New patch on top of previous one is appended.
> 
> From 98741f07ab1bc4a1fc2de7fedfb9023ea30bf988 Mon Sep 17 00:00:00 2001
> From: Serge E. Hallyn <[EMAIL PROTECTED]>
> Date: Thu, 1 Nov 2007 08:20:12 -0500
> Subject: [PATCH 1/1] file capabilities: remove the non-matching uid special 
> case for kill
> 

Tested-by: "Theodore Ts'o" <[EMAIL PROTECTED]>

Thanks, this fixes the issue I reported!

- Ted


writeout stalls in current -git

2007-11-01 Thread Fengguang Wu
On Thu, Nov 01, 2007 at 07:20:51PM +0100, Torsten Kaiser wrote:
> On 11/1/07, Fengguang Wu <[EMAIL PROTECTED]> wrote:
> > On Wed, Oct 31, 2007 at 04:22:10PM +0100, Torsten Kaiser wrote:
> > > Since 2.6.23-mm1 I have also been experiencing strange hangs during
> > > heavy writeouts.
> > > Each time I noticed this I was using emerge (the package utility from the
> > > Gentoo distribution) to install/upgrade a package. The last step,
> > > where this hang occurred, is moving the prepared files from a tmpfs
> > > partition to the main xfs filesystem.
> > > The hangs were not fatal; after a few seconds everything resumed
> > > normally, so I was not able to capture a good image of what was
> > > happening.
> >
> > Thank you for the detailed report.
> >
> > How severe were the hangs? Did only writeouts stall, did all apps stall,
> > or could you not even type and run new commands?
> 
> Only writeout stalled. The emerge that was moving the files hung, but
> everything else worked normally.
> I was able to run new commands, like copying /proc/meminfo.

But you mentioned in the next mail that `watch cat /proc/meminfo`
could also be blocked for some time - I guess at the same time emerge
was stalled?

> [snip]
> > > After this SysRq+W writeback resumed again. Possibly writing the
> > > above into the syslog triggered that.
> >
> > Maybe. Are the log files on another disk/partition?
> 
> No, everything was going to /
> 
> What might be interesting is that doing
> `cat /proc/meminfo >~/stall/meminfo` did not resume the writeback. So
> there might be some threshold that was only crossed with the additional
> write from syslog-ng. Or syslog-ng does some flushing, I don't know.
> (I'm using the
Have you tried explicit `sync`? ;-)

> syslog-ng package from gentoo:
> http://www.balabit.com/products/syslog_ng/ , version 2.0.5)
> 
> > > The source tmpfs is mounted without any special parameters, but the
> > > target xfs filesystem resides on a dm-crypt device that sits on top of
> > > a 3-disk RAID5 md.
> > > During the hang all CPUs were idle.
> >
> > No iowaits? ;-)
> 
> No, I have a KSysGuard in my taskbar that showed no activity at all.
> 
> OK, the subject does not match for my case, but there was also a tmpfs
> involved. And I found no thread with stalls on xfs. :-)

Do you mean it is actually related to tmpfs?

> > > The system is x86_64 with CONFIG_NO_HZ=y, but was still receiving ~330
> > > interrupts per second because of the bttv driver. (But I was not using
> > > that device at this time.)
> > >
> > > I'm willing to test patches or provide more information, but lack
> > > a good testcase to trigger this on demand.
> >
> > Thank you. Maybe we can start with the applied debug patch :-)
> 
> Will apply it and try to recreate this.
> 
> Thanks for looking into it.

Thank you for the rich information, too :-)

Fengguang



[PATCH] sparc64: remove duplicate includes

2007-11-01 Thread lizf

This patch removes duplicate includes in arch/sparc64

Signed-off-by: Li Zefan <[EMAIL PROTECTED]>

---
 arch/sparc64/kernel/ds.c  |1 -
 arch/sparc64/kernel/module.c  |1 -
 arch/sparc64/kernel/sys_sparc32.c |1 -
 arch/sparc64/kernel/sys_sunos32.c |1 -
 arch/sparc64/kernel/time.c|2 --
 5 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/arch/sparc64/kernel/ds.c b/arch/sparc64/kernel/ds.c
index 9f472a7..eeb5a2f 100644
--- a/arch/sparc64/kernel/ds.c
+++ b/arch/sparc64/kernel/ds.c
@@ -6,7 +6,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
diff --git a/arch/sparc64/kernel/module.c b/arch/sparc64/kernel/module.c
index 5798715..158484b 100644
--- a/arch/sparc64/kernel/module.c
+++ b/arch/sparc64/kernel/module.c
@@ -11,7 +11,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include 
diff --git a/arch/sparc64/kernel/sys_sparc32.c b/arch/sparc64/kernel/sys_sparc32.c
index 78caff9..98c4688 100644
--- a/arch/sparc64/kernel/sys_sparc32.c
+++ b/arch/sparc64/kernel/sys_sparc32.c
@@ -51,7 +51,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
diff --git a/arch/sparc64/kernel/sys_sunos32.c b/arch/sparc64/kernel/sys_sunos32.c
index 170d6ca..cfc22d3 100644
--- a/arch/sparc64/kernel/sys_sunos32.c
+++ b/arch/sparc64/kernel/sys_sunos32.c
@@ -57,7 +57,6 @@
 #include 
 
 /* For SOCKET_I */
-#include 
 #include 
 #include 
 
diff --git a/arch/sparc64/kernel/time.c b/arch/sparc64/kernel/time.c
index cd8c740..54bdb88 100644
--- a/arch/sparc64/kernel/time.c
+++ b/arch/sparc64/kernel/time.c
@@ -28,7 +28,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -47,7 +46,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 DEFINE_SPINLOCK(mostek_lock);
-- 
1.5.3.rc7



Re: kernel processes - are they really needed?

2007-11-01 Thread Dmitry Torokhov
On Wednesday 31 October 2007 13:33, Andi Kleen wrote:
> "Dmitry Torokhov" <[EMAIL PROTECTED]> writes:
> 
> > On 10/24/07, Andi Kleen <[EMAIL PROTECTED]> wrote:
> >>
> >> My favourite for a ridiculous thread was and is "kpsmoused"
> >>
> >
> > Mouse querying can take significant amount of time. Do you really want
> > all your other events to be delayed just because kernel tries to get
> > mouse back in order?
> 
> How long?

If a mouse is stubborn and does not want to get enabled it may sleep
up to 1 sec.

> 
> 
> >
> > Although I probably want to kill it if mouse resync is disabled...
> 
> How often does that happen? Can't you just start a thread for this
> as needed? Or if it's a simple algorithm you can just use a state machine
> using timers?
>

The IRQ handler is already too complex, I'd rather not mess with a state
machine. I will see how to kill the thread if resync is disabled.

-- 
Dmitry


Re: [PATCH 2/2] Blackfin I2C/TWI driver: add missing pin mux operation

2007-11-01 Thread Bryan Wu
On 11/2/07, Mike Frysinger <[EMAIL PROTECTED]> wrote:
> On 10/30/07, Bryan Wu <[EMAIL PROTECTED]> wrote:
> > --- a/drivers/i2c/busses/i2c-bfin-twi.c
> > +++ b/drivers/i2c/busses/i2c-bfin-twi.c
> > +static int setup_pin_mux(int action, struct bfin_twi_iface *iface)
> > +{
> > +
> > +   u16 pin_req[2][3] = {
> > +   {P_TWI0_SCL, P_TWI0_SDA, 0},
> > +   {P_TWI1_SCL, P_TWI1_SDA, 0},
> > +   };
>
> might be better to have this in the boards file ... consider the
> scenario on the BF54x where the user wants to use I2C0 and not I2C1 or
> vice versa so that they can use the set of pins for something else ...
> this would prevent such a setup
>

Yes, I plan to use the new-style I2C driver interface as Jean suggested
before. The whole hard-coded pin_req list can then be passed to
i2c-bfin-twi.c dynamically.

> > +   if (action) {
> > +   if (peripheral_request_list(pin_req[iface->bus_num], 
> > DRV_NAME))
> > +   return -EFAULT;
> > +   } else {
> > +   peripheral_free_list(pin_req[iface->bus_num]);
> > +   }
> > +
> > +   return 0;
> > +}
>
> EFAULT is incorrect i think ... want to pass back the actual value
> from peripheral_request_list()
>
It will be removed in the new style interface.

> > --- a/drivers/i2c/busses/i2c-bfin-twi.c
> > +++ b/drivers/i2c/busses/i2c-bfin-twi.c
> > +static int setup_pin_mux(int action, struct bfin_twi_iface *iface)
> > +{
> > +
> > +   u16 pin_req[2][3] = {
> > +   {P_TWI0_SCL, P_TWI0_SDA, 0},
> > +   {P_TWI1_SCL, P_TWI1_SDA, 0},
> > +   };
>
> might be better to have this in the boards file ... consider the
> scenario on the BF54x where the user wants to use I2C0 and not I2C1 or
> vice versa so that they can use the set of pins for something else ...
> this would prevent such a setup
>
>if (action)
>return peripheral_request_list(pin_req[iface->bus_num],
> DRV_NAME);
>else
>peripheral_free_list(pin_req[iface->bus_num]);
>
>return 0;
> -mike


[PATCH] md: Fix misapplied patch in raid5.c

2007-11-01 Thread NeilBrown
commit 4ae3f847e49e3787eca91bced31f8fd328d50496 did not get applied
correctly, presumably due to substantial similarities between
handle_stripe5 and handle_stripe6.

This patch (with lots of context) moves the chunk of new code from
handle_stripe6 (where it isn't needed (yet)) to handle_stripe5.


Signed-off-by: Neil Brown <[EMAIL PROTECTED]>
cc: "Dan Williams" <[EMAIL PROTECTED]>

---
 The patch is correctly applied in -mm.
 The same patch was sent to stable@ but doesn't seem to have made it yet.
 When it does get applied, we should make sure it gets applied properly...

### Diffstat output
 ./drivers/md/raid5.c |   14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff .prev/drivers/md/raid5.c ./drivers/md/raid5.c
--- .prev/drivers/md/raid5.c2007-11-02 12:10:49.0 +1100
+++ ./drivers/md/raid5.c2007-11-02 12:25:31.0 +1100
@@ -2607,40 +2607,47 @@ static void handle_stripe5(struct stripe
 	struct bio *return_bi = NULL;
 	struct stripe_head_state s;
 	struct r5dev *dev;
 	unsigned long pending = 0;
 
 	memset(&s, 0, sizeof(s));
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d "
 		"ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state,
 		atomic_read(&sh->count), sh->pd_idx,
 		sh->ops.pending, sh->ops.ack, sh->ops.complete);
 
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
+	/* clean-up completed biofill operations */
+	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
+		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
+	}
+
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		struct r5dev *dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
 		pr_debug("check %d: state 0x%lx toread %p read %p write %p "
 			"written %p\n", i, dev->flags, dev->toread, dev->read,
 			dev->towrite, dev->written);
 
 		/* maybe we can request a biofill operation
 		 *
 		 * new wantfill requests are only permitted while
 		 * STRIPE_OP_BIOFILL is clear
 		 */
 		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
 			!test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending))
 			set_bit(R5_Wantfill, &dev->flags);
 
 		/* now count some things */
@@ -2880,47 +2887,40 @@ static void handle_stripe6(struct stripe
 	struct stripe_head_state s;
 	struct r6_state r6s;
 	struct r5dev *dev, *pdev, *qdev;
 
 	r6s.qd_idx = raid6_next_disk(pd_idx, disks);
 	pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
 		"pd_idx=%d, qd_idx=%d\n",
 	       (unsigned long long)sh->sector, sh->state,
 	       atomic_read(&sh->count), pd_idx, r6s.qd_idx);
 	memset(&s, 0, sizeof(s));
 
 	spin_lock(&sh->lock);
 	clear_bit(STRIPE_HANDLE, &sh->state);
 	clear_bit(STRIPE_DELAYED, &sh->state);
 
 	s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
 	s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
 	s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
 	/* Now to look around and see what can be done */
 
-	/* clean-up completed biofill operations */
-	if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
-		clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
-	}
-
 	rcu_read_lock();
 	for (i=disks; i--; ) {
 		mdk_rdev_t *rdev;
 		dev = &sh->dev[i];
 		clear_bit(R5_Insync, &dev->flags);
 
 		pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
 			i, dev->flags, dev->toread, dev->towrite, dev->written);
 		/* maybe we can reply to a read */
 		if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) {
 			struct bio *rbi, *rbi2;
 			pr_debug("Return read for disc %d\n", i);
 			spin_lock_irq(&conf->device_lock);
 			rbi = dev->toread;
 			dev->toread = NULL;
 			if (test_and_clear_bit(R5_Overlap, &dev->flags))
 				wake_up(&conf->wait_for_overlap);
 			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) {

Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
Hmmm... On x86_64 we could take 8 terabyte virtual space (bit order 43)

With the worst case scenario of 16k of cpus (bit order 16) we are looking 
at 43-16 = 27 ~ 128MB per cpu. Each percpu can at max be mapped by 64 pmd 
entries. 4k support is actually max for projected hw. So we'd get 
to 512M. 

On IA64 we could take half of the vmemmap area which is 45 bits. So 
we could get up to 512MB (with 16k pages, 64k pages can get us even 
further) assuming we can at some point run 16 processors per node (4k is 
the current max which would put the limit on the per cpu area >1GB).

Let's say you have a system with 64 cpus and an area of 128M of per cpu 
storage. Then we are using 8GB of total memory for per cpu storage. The 
128M allows us to store e.g. 16M word-size counters.

With SLAB and the current allocpercpu you would need the following for 
16M counters:

16M*32*64 (minimum alloc size of SLAB is 32 byte and we alloc via 
kmalloc) for the data.

16M*64*8 for the pointer arrays. 16M allocpercpu areas for 64 processors 
and a pointer size of 8 bytes.

So you would need to use 40G in current systems. The new scheme 
would only need 8GB for the same amount of counters.

So I think it's unreasonable to assume that systems currently exist that 
can use more than 128M of allocpercpu space (assuming 64 cpus).
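
The arithmetic for the 64-cpu case can be checked with a few lines of C.
This is only a sketch of the calculation in this mail; the helper names
are made up, and the 32-byte minimum object size and 8-byte pointer size
are the figures quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (UINT64_C(1) << 20)
#define GiB (UINT64_C(1) << 30)

/* Number of word-size counters that fit into one per-cpu area. */
static uint64_t counters_per_area(uint64_t area, uint64_t word_size)
{
    assert(word_size > 0);
    return area / word_size;
}

/* Current scheme: every counter costs one minimum-size SLAB object
 * per cpu plus one pointer-array slot per cpu. */
static uint64_t slab_percpu_bytes(uint64_t counters, uint64_t cpus,
                                  uint64_t min_obj, uint64_t ptr_size)
{
    assert(cpus > 0);
    return counters * cpus * min_obj + counters * cpus * ptr_size;
}

/* New scheme: one fixed-size per-cpu area replicated on every cpu. */
static uint64_t area_percpu_bytes(uint64_t area, uint64_t cpus)
{
    assert(cpus > 0);
    return area * cpus;
}
```

With 64 cpus, counters_per_area(128 * MiB, 8) gives 16M counters,
slab_percpu_bytes(16 * MiB, 64, 32, 8) gives 32G + 8G = 40G, and
area_percpu_bytes(128 * MiB, 64) gives 8G.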

---
 include/asm-x86/pgtable_64.h |4 
 1 file changed, 4 insertions(+)

Index: linux-2.6/include/asm-x86/pgtable_64.h
===
--- linux-2.6.orig/include/asm-x86/pgtable_64.h 2007-11-01 18:15:52.282577904 -0700
+++ linux-2.6/include/asm-x86/pgtable_64.h  2007-11-01 18:18:02.886979040 -0700
@@ -138,10 +138,14 @@ static inline pte_t ptep_get_and_clear_f
 #define VMALLOC_START    _AC(0xc200, UL)
 #define VMALLOC_END      _AC(0xe1ff, UL)
 #define VMEMMAP_START    _AC(0xe200, UL)
+#define PERCPU_START     _AC(0xf200, UL)
+#define PERCPU_END       _AC(0xfa00, UL)
 #define MODULES_VADDR    _AC(0x8800, UL)
 #define MODULES_END      _AC(0xfff0, UL)
 #define MODULES_LEN   (MODULES_END - MODULES_VADDR)
 
+#define PERCPU_MIN_SHIFT   PMD_SHIFT
+#define PERCPU_BITS        43
+
 #define _PAGE_BIT_PRESENT  0
 #define _PAGE_BIT_RW   1
 #define _PAGE_BIT_USER 2



Re: pdflush stuck in D state with v2.6.24-rc1-192-gef49c32

2007-11-01 Thread Fengguang Wu
On Thu, Nov 01, 2007 at 09:14:14AM -0500, Florin Iucha wrote:
> On Thu, Nov 01, 2007 at 09:03:33PM +0800, Fengguang Wu wrote:
> > Or will the system or fs size/age make any difference? If you happen
> > to have a spare/swap partition, could you make a new reiserfs and
> > mount it and copy several less-than-4KB files into it and wait for 30s
> > and see what happen to pdflush?
> 
> I will try that with a USB disk - I hope that won't make a difference.

Thank you. I guess a reiserfs on loop file would also be OK.

> > btw, what's the exact kernel version you are running?
> 
> I noticed it with the kernel in the $SUBJECT, as reported by 'git
> describe'.  I have pulled in new changesets since then.

And with the following patch applied?

---
 fs/reiserfs/stree.c |3 ---
 1 file changed, 3 deletions(-)

--- linux-2.6.24-git17.orig/fs/reiserfs/stree.c
+++ linux-2.6.24-git17/fs/reiserfs/stree.c
@@ -1458,9 +1458,6 @@ static void unmap_buffers(struct page *p
}
bh = next;
} while (bh != head);
-   if (PAGE_SIZE == bh->b_size) {
-   cancel_dirty_page(page, PAGE_CACHE_SIZE);
-   }
}
}
 }



TCP_DEFER_ACCEPT issues

2007-11-01 Thread Felix von Leitner
I am trying to use TCP_DEFER_ACCEPT in my web server.

There are some operational problems.  First of all: timeout handling.  I
would like to be able to set a timeout in seconds (or better:
milliseconds) for how long the socket is allowed to sit there without
data coming in.  For high load situations, I have been enforcing
timeouts in the range of 15 seconds, otherwise someone can DoS the
server by opening a lot of connections and tying up data structures.

It is still possible, of course, to tie up kernel memory this way, by
not reacting to the FIN or RST packets and running into a timeout there,
too, but that is partially tunable via sysctl.

According to tcp(7) the int argument to TCP_DEFER_ACCEPT is in seconds.
In the kernel code, it's converted to TCP timeout units.  When I ran my
server, and connected without sending any data, nothing happened.  No
timeout.  Minutes later, the connection was still there.  Even worse:
when I killed (!) the server process (thus closing the server socket),
the client did not get a reset.  Only when I type something in the
telnet, I get a reset.  This appears to be very broken.
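
For reference, this is roughly how the option is set and read back. It is
a minimal sketch assuming Linux and glibc headers; the helper names are
made up for the example, and the value read back may be rounded by the
kernel rather than echoing the requested number of seconds.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ask the kernel to wake accept() only once data has arrived.
 * Per tcp(7) the argument is in seconds. Returns 0 or -1. */
static int set_defer_accept(int fd, int seconds)
{
    assert(seconds >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                      &seconds, sizeof(seconds));
}

/* Read the option back; returns the kernel's value, or -1 on error. */
static int get_defer_accept(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &val, &len) < 0)
        return -1;
    return val;
}

/* Create a TCP socket with the option applied; caller binds/listens. */
static int open_deferred_socket(int seconds)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    if (set_defer_accept(fd, seconds) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A server would call open_deferred_socket(15) and then bind() and
listen() on the returned descriptor as usual.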

My suggestion:

  1. make the argument to the setsockopt be in seconds, or milliseconds.
  2. if the server socket is closed, reset all pending connections.

Comments?

Felix


Re: [PATCH 0/2 -v2 resend] x86_64 EFI boot support

2007-11-01 Thread H. Peter Anvin

Huang, Ying wrote:
>> From: H. Peter Anvin [mailto:[EMAIL PROTECTED]
>> The "EFI boot" patchset looks fairly unobtrusive to me.  One objection:
>> the VIDEO_TYPE_ numbers appear split up into groups; and the EFI one
>> probably should be 0x70 instead of 0x24.
>
> Should I change this and resend the patchset? Or you modify it before
> merging?

It'd be easier if you could change and resend.

-hpa


Re: [Lguest] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

2007-11-01 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
> I understood it as reordering was permitted, but no re-ordering across
> another volatile load, store, or asm was permitted.

It doesn't say that, so I wouldn't assume it.  Certainly we had problems
with the pda code; until I added the _proxy_pda dependency variable, the
only fix Andi could find was adding both "volatile" and a memory clobber.

>   And of course, as
> long as input and output constraints are written properly, the
> re-ordering should not be vulnerable to pathological movement causing
> the code to malfunction.
>   

Yes.  I think constraints are the only way to control ordering (even if
it's as heavy-handed as a memory clobber).  It would be nice if gcc had
a constraint which was only used for ordering, and never generated a
reference.  Then you could make up pseudo-variables in order to express
dependencies without having the risk that the compiler would generate
references.
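
A rough sketch of the pseudo-variable idea with today's constraints,
assuming GCC-style extended asm. The names order_token, step_one() and
step_two() are invented for illustration. Note that the "m" operands are
exactly the kind of operand that may make the compiler generate a
reference, which is the risk described above.

```c
/* Hypothetical pseudo-variable: never used for data, only for ordering. */
static unsigned long order_token;

static inline int step_one(int x)
{
	/* Claims to write order_token, so later readers of it depend on this asm. */
	__asm__ ("" : "+r" (x), "=m" (order_token));
	return x;
}

static inline int step_two(int x)
{
	/* Claims to read order_token, so it cannot be hoisted above step_one(). */
	__asm__ ("" : "+r" (x) : "m" (order_token));
	return x;
}
```

Neither asm is volatile and neither has a memory clobber; the ordering
between them comes only from the shared operand.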

> It seems that CPU state side effects which can't be expressed in C need
> special care - FPU is certainly one example.
>   

Not an immediate problem, fortunately.

> Also, memory clobber on a volatile asm should stop invalid movement
> across TLB flushes and other problems areas.

Yes.  Any asm which has global effects on how addresses are interpreted
(like tlbflush, reloading the pagetable base, changing modes, etc) needs
to have a memory clobber.

>   Even memory fences should
> have memory clobber in order to stop movement of loads and stores across
> the fence by the compiler.
>   

Pretty sure they do.  A normal compiler barrier is *just* a memory clobber.
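
As an illustration, a compiler barrier of this kind is an empty asm with
a memory clobber. The publish()/consume() pair around it is a made-up
example, and the barrier only constrains the compiler, not the CPU.

```c
/* Empty asm plus "memory" clobber: the compiler must not move loads or
 * stores across it and must not keep memory values cached in registers. */
#define barrier() __asm__ __volatile__("" : : : "memory")

static int payload;
static int ready;

static void publish(int value)
{
	payload = value;
	barrier();	/* the payload store stays above the ready store */
	ready = 1;
}

static int consume(void)
{
	if (!ready)
		return -1;
	barrier();	/* the payload load stays below the ready load */
	return payload;
}
```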

J


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Linus Torvalds


On Thu, 1 Nov 2007, Rik van Riel wrote:
> 
> Larry Woodman managed to wedge the VM into a state where, on his
> 4x dual core system, only 2 cores (on the same CPU) could get the
> zone->lru_lock overnight.  The other 6 cores on the system were
> just spinning, without being able to get the lock.

.. and this is almost always the result of a locking *bug*, not unfairness 
per se. IOW, unfairness just ends up showing the bug in the first place.

Linus


Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread Linus Torvalds


On Fri, 2 Nov 2007, Nick Piggin wrote:
> 
> But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
> uses this for inserting breakpoints (but fortunately, a COW page in a
> MAP_PRIVATE mapping is a much more natural thing for the VM).

Yes, I phrased that badly. I meant that I'd be happier if we got rid of 
VM_MAYSHARE entirely, and just used VM_SHARED. I thought we already made 
them always be the same (and any VM_MAYSHARE use is historical).

Linus


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Thu, 1 Nov 2007, David Miller wrote:

> You cannot put limits of the amount of alloc_percpu() memory available
> to clients, please let's proceed with that basic understanding in
> mind.  We're wasting a ton of time discussing this fundamental issue.

There is no point in making absolute demands like "no limits". There are 
always limits to everything. 

A new implementation avoids the need to allocate per cpu arrays and also 
avoids the 32 bytes per object times cpus that are mostly wasted for small 
allocations today. So it's going to potentially allow more per cpu objects
than are available today.

A reasonable implementation for 64 bit is likely going to depend on 
reserving some virtual memory space for the per cpu mappings so that they 
can be dynamically grown up to what the reserved virtual space allows.

E.g., if we reserve 256G of virtual space and support a maximum of 16k cpus 
then there is a limit on the per cpu space available of 16MB.
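
The arithmetic behind that figure, as a small sketch. The function name
percpu_area_limit() is invented here; it just divides the reserved
virtual range evenly among the supported cpus.

```c
#include <stdint.h>

/* Bytes of per cpu area each cpu can get out of one reserved range. */
static uint64_t percpu_area_limit(uint64_t reserved_virtual_bytes,
				  uint64_t max_cpus)
{
	return reserved_virtual_bytes / max_cpus;
}
```

With 256G reserved (256ULL << 30) and 16k cpus (16ULL << 10) this gives
16ULL << 20 bytes, i.e. 16MB per cpu.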


Re: [Lguest] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

2007-11-01 Thread Zachary Amsden
On Thu, 2007-11-01 at 10:41 -0700, Jeremy Fitzhardinge wrote:
> Keir Fraser wrote:
> > volatile prevents the asm from being 'moved significantly', according to the
> > gcc manual. I take that to mean that reordering is not allowed.
> >   

I understood it as reordering was permitted, but no re-ordering across
another volatile load, store, or asm was permitted.  And of course, as
long as input and output constraints are written properly, the
re-ordering should not be vulnerable to pathological movement causing
the code to malfunction.

It seems that CPU state side effects which can't be expressed in C need
special care - FPU is certainly one example.

Also, memory clobber on a volatile asm should stop invalid movement
across TLB flushes and other problems areas.  Even memory fences should
have memory clobber in order to stop movement of loads and stores across
the fence by the compiler.

Zach



RE: [PATCH 0/2 -v2 resend] x86_64 EFI boot support

2007-11-01 Thread Huang, Ying
>From: H. Peter Anvin [mailto:[EMAIL PROTECTED]
>The "EFI boot" patchset looks fairly unobtrusive to me.  One objection:
>the VIDEO_TYPE_ numbers appear split up into groups; and the EFI one
>probably should be 0x70 instead of 0x24.

Should I change this and resend the patchset? Or you modify it before
merging?

Best Regards,
Huang Ying


RE: [PATCH 0/2 -v2 resend] x86_64 EFI boot support

2007-11-01 Thread Huang, Ying
>From: Andrew Morton [mailto:[EMAIL PROTECTED]
>On Wed, 31 Oct 2007 09:21:41 +0800
>"Huang, Ying" <[EMAIL PROTECTED]> wrote:
>
>> Can this patchset be merged into mainline kernel? This patchset has been
>> in -mm tree from 2.6.23-rc2-mm2 on. Andrew Morton has suggested it to be
>> merged into 2.6.24 during early merge window of 2.6.24. It was not
>> merged into mainline because the 32-bit boot protocol has not been done.
>>
>> But now, the 32-bit boot protocol has been merged into mainline. So can
>> this patchset be merged into mainline kernel now?
>>
>
>I stopped paying attention, sorry.  Have all the outstanding issues been
>addressed?

That's all right. I think there is no big issue with this patchset, just
a framebuffer driver and the document.

Best Regards,
Huang Ying


Re: [patch 0/7] [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead

2007-11-01 Thread Christoph Lameter
On Fri, 2 Nov 2007, Eric Dumazet wrote:

> > Na. Some reasonable upper limit needs to be set. If we set that to say
> > 32Megabytes and do the virtual mapping then we can just populate the first
> > 2M and only allocate the remainder if we need it. Then we need to rely on
> > Mel's defrag stuff though defrag memory if we need it.
> 
> If a 2MB page is not available, could we revert using 4KB pages ? (like
> vmalloc stuff), paying an extra runtime overhead of course.

Sure. It's going to be like vmemmap. There will be limits imposed, though, 
by the amount of virtual space available. Basically the dynamic per cpu 
area can be at maximum

available_virtual_space / NR_CPUS



Re: [v4l-dvb-maintainer] bttv build error (CONFIG_NET=n)

2007-11-01 Thread Trent Piepho
On Thu, 1 Nov 2007, Mauro Carvalho Chehab wrote:
> Randy,
> > > The only reason the net stuff works, is because CONFIG_NET includes igmp.c,
> > > which can't be compiled as a module.  That means ip_compute_csum() will get
> > > pulled out of the lib.a file for igmp, and thus be present for the net modules
> > > that use it too.  If igmp could be turned off, made a module, or stopped using
> > > ip_compute_csum(), then the users of ip_compute_csum() that do depend on
> > > CONFIG_NET would have the same problem as bttv does.
> >
> > Thanks for the analysis and summary.
> > (I'm still waiting for those lkml.org links to load... timed out)
> >
> > > It seems a shame to create a new ip checksum function in the bttv driver when
> > > a perfectly good one already exists and will already be present in just about
> > > every kernel out there.  Honestly, how common is NET=n and VIDEO_BT848=m
> > > outside of randconfig?
>
> This might happen on embedded devices, like a set top box or a PVR,
> using a bttv hardware.
>
> > so just adding "depends on NET" should be OK then?
>
> Seems very weird to have bttv module dependent on NET, just because a
> checksum calculus function is defined there.

Mauro, read the first message I linked to:
http://lkml.org/lkml/2007/4/3/209 or 
http://article.gmane.org/gmane.linux.kernel/511684

Randy had this exact same problem with the md driver and a different ip
checksum function.

ip_compute_csum() _isn't_ defined under NET.  It's part of the kernel's
arch specific library.  So it should be available for all modules to use as
part of the kernel core.  Except due to a flaw in the build system, symbols
that are part of a library can't be used by modules unless there is at
least one non-module user.  There would be the same problem with strcat()
or tons of other functions, if one were able to compile all users of these
functions as modules.

I wonder if the build system could be modified to take every object that's
part of lib-y and turn it into a .ko file?

The process would be something like this:
build lib-y objects like they are and make lib.a

filter out of lib-y all objects that don't export symbols.  Since the
objects are already compiled, this shouldn't be hard.

obj-m += lib-y

Now all the lib files will be modules, and if any module needs a symbol
from one and it's not in the kernel, modprobe will load it.  Minimum bloat,
since the library code isn't loaded into the kernel until something is in
the kernel that needs it.  And we don't need to create kconfig symbols for
library functions and remember to select them.  Let depmod keep track of
what library functions a module needs.


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Rik van Riel
On Thu, 1 Nov 2007 09:38:22 -0700 (PDT)
Linus Torvalds <[EMAIL PROTECTED]> wrote:

> So "unfair" is obviously always bad. Except when it isn't.

Larry Woodman managed to wedge the VM into a state where, on his
4x dual core system, only 2 cores (on the same CPU) could get the
zone->lru_lock overnight.  The other 6 cores on the system were
just spinning, without being able to get the lock.

On the other hand, spinlock contention in the page replacement
code is just a symptom of the fact that we scan too many pages.
It can probably be fixed in other ways...

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan


Re: Major SATA / EXT3 Issue?

2007-11-01 Thread Norbert Preining
Hi all!

(Please Cc)

I also have to report a very similar incident. Debian/sid, kernel 2.6.22.
The disk was doing some hard work (svn up of two big repositories, some
copying of files, etc.).

Suddenly the PC froze. Nothing worked, so I had to reboot. But then:
- The BIOS didn't detect the disks, or rather, detection took extremely long.
- Booting into Linux gave the timeout messages already mentioned (I
  am away and cannot give you details until Sunday).
- Booting into Windows froze Windows when accessing the second hard disk
  (from which stuff was being copied).
- Resetting the computer didn't help, but physically turning it off and
  on again did the trick, followed by some fsck-ing.
- Booting into Windows needed chkdsk from Windows, and several files
  were destroyed.

Both the disks and the computer are quite new, and are NOT heavily used,
only now and then.

AFAIR it is the nv SATA driver.

I could repeat these problems with big copying actions.

The problem with logging is that the computer freezes hard and nothing
remains in the log files.

Best wishes

Norbert

---
Dr. Norbert Preining <[EMAIL PROTECTED]>Vienna University of Technology
Debian Developer <[EMAIL PROTECTED]> Debian TeX Group
gpg DSA: 0x09C5B094  fp: 14DF 2E6C 0307 BE6D AD76  A9C0 D2BF 4AA3 09C5 B094
---
JARROW (adj.)
An agricultural device which, when towed behind a tractor, enables the
farmer to spread his dung evenly across the width of the road.
--- Douglas Adams, The Meaning of Liff


Re: [patch] PID namespace design bug, workaround

2007-11-01 Thread Ulrich Drepper
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Ingo Molnar wrote:

> but this problem is still present in the code, and it has been recently 
> committed into mainline via:
> 
>   commit 30e49c263e36341b60b735cbef5ca37912549264
>   Author: Pavel Emelyanov <[EMAIL PROTECTED]>
>   Date:   Thu Oct 18 23:40:10 2007 -0700
> 
>   pid namespaces: allow cloning of new namespace
> 
> without these problems having been resolved. A full-scale revert is 
> probably too intrusive, but at minimum we need to turn off user-space 
> access to this feature via this simple patch. Until this issue is 
> resolved properly the new PID namespace code needs to be turned off. 
> Letting this into 2.6.24 would be a disaster.
> 
> Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>

Acked-by: Ulrich Drepper <[EMAIL PROTECTED]>


- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHKm3n2ijCOnn/RHQRAn7dAJ9PhfhLg29mTELwH7qLXwgJcyNi9QCgr7sc
WQa4QBNesktzPKh5vcCulhM=
=cYnF
-END PGP SIGNATURE-


Re: [patch] PID namespace design bug, workaround

2007-11-01 Thread Ulrich Drepper
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Pavel Emelyanov wrote:
> The "fix" I mention is just returning -EINVAL in case user orders
> CLONE_NEWPIDS

That is the "fix" you were referring to?  I was hoping you have a sketch
for a real solution.  If nobody can think of a way to fix this PID
namespaces are IMO not something which should go in at all.
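
For context, the workaround under discussion amounts to a flag check of
roughly the following shape. This is only a sketch: the function name is
invented, and SKETCH_CLONE_NEWPID is a stand-in constant whose value is
an assumption taken from current <linux/sched.h>, not from the patch.

```c
#include <errno.h>

/* Assumed bit for the new-PID-namespace clone flag (illustrative). */
#define SKETCH_CLONE_NEWPID 0x20000000UL

/* Refuse any clone request that asks for a new PID namespace. */
static int check_clone_flags(unsigned long clone_flags)
{
	if (clone_flags & SKETCH_CLONE_NEWPID)
		return -EINVAL;
	return 0;
}
```

This turns the feature off for user space without touching the namespace
code itself, which is why it is called a workaround and not a fix.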

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFHKm2R2ijCOnn/RHQRAgjXAKCkU9lcWC9aTR0nG89x47AZO9pVfwCgiaVC
/Giyp+en+VbtfFyD8D6v4Xk=
=RnIw
-END PGP SIGNATURE-


Re: [build bug, 2.6.24-rc1] CONFIG_VIDEO_DEV=m & CONFIG_VIDEO_SAA7146_VV=y

2007-11-01 Thread Trent Piepho
On Thu, 1 Nov 2007, Ingo Molnar wrote:
> > For some time now I've thought the whole ttpci config/makefile setup
> > sucked.  I've finally gone though and redone it and fixed this problem
> > too.
> >
> > Here is the patch: http://linuxtv.org/hg/v4l-dvb/rev/5320c2571183
>
> the drivers/media/dvb/ttpci/Kconfig bits do not apply:
>
>  $ q push
>  Applying patch patches/dvb-fix-5320c2571183.patch
>  patching file drivers/media/common/Kconfig
>  patching file drivers/media/dvb/ttpci/Kconfig
>  Hunk #1 FAILED at 1.
>  Hunk #2 FAILED at 63.
>  Hunk #4 FAILED at 99.
>  Hunk #5 FAILED at 120.
>  Hunk #6 FAILED at 142.
>  5 out of 6 hunks FAILED -- rejects in file
>  drivers/media/dvb/ttpci/Kconfig
>  patching file drivers/media/dvb/ttpci/Makefile
>  Patch patches/dvb-fix-5320c2571183.patch does not apply (enforce with -f)
>
> got a link to the dependent patch that i'm apparently missing?

These two:
http://linuxtv.org/hg/v4l-dvb/rev/5320c2571183
http://linuxtv.org/hg/v4l-dvb/rev/64935a44e510

Mauro will probably prepare a patch series to send to git soon that has this
stuff in it.


Re: RFC: Reproducible oops with lockdep on count_matching_names()

2007-11-01 Thread Michael Wu
On Thursday 01 November 2007 15:17:16 Luis R. Rodriguez wrote:
> [EMAIL PROTECTED]:~/devel/wireless-2.6$ git-describe
> v2.6.24-rc1-146-g2280253
>
> So I hit segfault with lockdep on count_matching_names() on the
> strcmp() multiple times now. This is reproducible and with different
> wireless drivers.
>
I've found the problem. It appears to be in lockdep. struct lock_class has a 
const char *name field which points to a statically allocated string that 
comes from the code which uses the lock. If that code/string is in a module 
and gets unloaded, the pointer in |name| is no longer valid. Next time this 
field is dereferenced (count_matching_names, in this case), we crash.

The following patch fixes the issue but there's probably a better way.

-Michael Wu

---

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 4c4d236..2aa0d35 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -114,7 +114,7 @@ struct lock_class {
 	 */
 	unsigned long			ops;
 
-	const char			*name;
+	char				name[128];
 	int				name_version;
 
 #ifdef CONFIG_LOCK_STAT
diff --git a/kernel/lockdep.c b/kernel/lockdep.c
index 55fe0c7..63c4d8f 100644
--- a/kernel/lockdep.c
+++ b/kernel/lockdep.c
@@ -768,7 +768,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
 	class = lock_classes + nr_lock_classes++;
 	debug_atomic_inc(&nr_unused_locks);
 	class->key = key;
-	class->name = lock->name;
+	strcpy(class->name, lock->name);
 	class->subclass = subclass;
 	INIT_LIST_HEAD(&class->lock_entry);
 	INIT_LIST_HEAD(&class->locks_before);
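
The idea in the diff above, copying the name so it outlives its owner,
can be sketched in user space as follows. The struct and function names
are invented for illustration, and the sketch uses a bounded snprintf()
where the patch uses strcpy(); that substitution is an assumption about
what a more defensive variant might look like, not part of the patch.

```c
#include <stdio.h>
#include <string.h>

#define LOCK_NAME_MAX 128

/* Stand-in for struct lock_class: owns its name instead of pointing at it. */
struct lock_class_sketch {
	char name[LOCK_NAME_MAX];
};

/* Copy the caller's name, truncating if it does not fit. */
static void set_class_name(struct lock_class_sketch *cls, const char *name)
{
	snprintf(cls->name, sizeof(cls->name), "%s", name);
}
```

Once the copy is made, the original string can go away (as it does when
a module is unloaded) without leaving a dangling pointer behind.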


signature.asc
Description: This is a digitally signed message part.


Re: [patch 1/4] x86: FIFO ticket spinlocks

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 04:01:45PM -0400, Chuck Ebbert wrote:
> On 11/01/2007 10:03 AM, Nick Piggin wrote:
> 
> [edited to show the resulting code]
> 
> > +   __asm__ __volatile__ (
> > +   LOCK_PREFIX "xaddw %w0, %1\n"
> > +   "1:\t"
> > +   "cmpb %h0, %b0\n\t"
> > +   "je 2f\n\t"
> > +   "rep ; nop\n\t"
> > +   "movb %1, %b0\n\t"
> > +   /* don't need lfence here, because loads are in-order */
> > "jmp 1b\n"
> > +   "2:"
> > +   :"+Q" (inc), "+m" (lock->slock)
> > +   :
> > +   :"memory", "cc");
> >  }
> 
> If you really thought you might get long queues, you could figure out
> how far back you are and use that to determine how long to wait before
> testing the lock again. That cmpb could become a subb without adding
> overhead to the fast path -- that would give you the queue length (or
> its complement anyway.)

Indeed. You can use this as a really nice input into a backoff
algorithm (eg. if you're next in line, don't back off, or at least
don't go into exponential backoff; if you've got people in front
of you, start throttling harder).
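
A portable sketch of that idea using C11 atomics in place of the x86 asm
quoted above. All names and the SPINS_PER_WAITER constant are made up for
illustration, and the linear backoff is only one possible policy.

```c
#include <stdatomic.h>

#define SPINS_PER_WAITER 16	/* arbitrary tuning knob for this sketch */

struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently allowed in */
};

/* Stand-in for "rep ; nop": only keeps the compiler from eliding the loop. */
static inline void spin_pause(void)
{
	__asm__ __volatile__("" : : : "memory");
}

static void ticket_lock_acquire(struct ticket_lock *lock)
{
	unsigned int me = atomic_fetch_add_explicit(&lock->next, 1,
						    memory_order_relaxed);

	for (;;) {
		unsigned int cur = atomic_load_explicit(&lock->owner,
							memory_order_acquire);
		unsigned int ahead = me - cur;	/* queue length in front of us */

		if (ahead == 0)
			return;
		/* Wait longer the further back in the queue we are. */
		for (unsigned int i = 0; i < ahead * SPINS_PER_WAITER; i++)
			spin_pause();
	}
}

static void ticket_lock_release(struct ticket_lock *lock)
{
	atomic_fetch_add_explicit(&lock->owner, 1, memory_order_release);
}
```

The subtraction plays the role of the subb suggested in the quoted mail:
it yields both the "is it my turn" test and the distance to the head of
the queue in one step.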

I think I'll leave that to SGI if they come up with a big x86 SSI ;)


Re: [PATCH resend] Make the dev_*() family of macros in device.h complete

2007-11-01 Thread Andrew Morton
On Tue, 30 Oct 2007 08:40:08 -0700
Greg KH <[EMAIL PROTECTED]> wrote:

> On Tue, Oct 30, 2007 at 05:11:24AM -0700, Medve Emilian-EMMEDVE1 wrote:
> > Hi Greg K-H,
> > 
> > 
> > > > +#define dev_info(dev, format, arg...)  \
> > > > +   dev_printk(KERN_INFO, dev, format, ## arg)
> > > > +
> > > >  #ifdef DEBUG
> > > >  #define dev_dbg(dev, format, arg...)   \
> > > > -   dev_printk(KERN_DEBUG , dev , format , ## arg)
> > > > +   dev_printk(KERN_DEBUG, dev, format, ## arg)
> > > 
> > > Those extra spaces are there for a good reason, older versions of gcc
> > > are broken without it.  So please, put them all back...
> > 
> > You mean I should add spaces before commas only where they were
> > initially or to all new code and/or macros?
> 
> Put it back where it was, and do the same for all other macros.
> 
> > I've observed other kernel code and more often there are no spaces
> > before commas. I'm asking because the CodingStyle document is not very
> > explicit about this rule.
> 
> This is a gcc rule, for variable length macros, not a CodingStyle
> guideline.  It just will not work without it :)
> 

The space-before-a-comma requirement was for gcc-2.95, iirc.

It got to the stage where I was the only person testing with gcc-2.95 so I
spent inordinate amounts of time adding spaces before people's newly-added
commas.  Fortunately we abandoned that gcc version so the space-before-a-comma
requirement no longer exists.
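
For illustration, the macro style in question, written without the extra
spaces and assuming a GCC-compatible compiler (named variadic arguments
and the ", ## arg" comma-swallowing form are GNU extensions). The macro
name log_info() is invented; it formats into a buffer so the result can
be inspected.

```c
#include <stdio.h>

/* ", ## arg" drops the preceding comma when no variadic args are given. */
#define log_info(buf, size, format, arg...) \
	snprintf(buf, size, "info: " format, ## arg)
```

Both log_info(buf, sizeof(buf), "probe done") and
log_info(buf, sizeof(buf), "irq %d", 42) expand to valid snprintf()
calls.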



Re: 2.6.23 regression: accessing invalid mmap'ed memory from gdb causes unkillable spinning

2007-11-01 Thread Nick Piggin
On Thu, Nov 01, 2007 at 09:08:45AM -0700, Linus Torvalds wrote:
> 
> 
> On Thu, 1 Nov 2007, Nick Piggin wrote:
> > 
> > Untested patch follows
> 
> Ok, this looks ok.
> 
> Except I would remove the VM_MAYSHARE bit from the test.

But we do want to allow forced COW faults for MAP_PRIVATE mappings. gdb
uses this for inserting breakpoints (but fortunately, a COW page in a
MAP_PRIVATE mapping is a much more natural thing for the VM).


> That whole bit should go, in fact.
> 
> We used to make it something different: iirc, a read-only SHARED mapping 
> was downgraded to a non-shared mapping, because we wanted to avoid some of 
> the costs we used to have with the VM implementation (actually, I think it 
> was various filesystems that don't like shared mappings because they don't 
> have a per-page writeback). But we left the VM_MAYSHARE bit on, to get 
> /proc//mmap things right.
> 
> Or something like that. I forget the details. But I *think* we don't 
> actually need this any more.
> 
> But basically, the "right" way to test for shared mappings is historically 
> to just test the VM_MAYSHARE bit - but not *both* bits. Because VM_SHARE 
> may have been artificially cleared.
 
I think you're right -- VM_MAYSHARE is basically testing for MAP_SHARED.
I just don't know exactly what you're proposing here.
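
To make the distinction concrete, here is a small sketch of the test
described in the quoted text. The helper name is invented and the bit
values are meant to mirror include/linux/mm.h, but treat them as
illustrative.

```c
#define VM_SHARED	0x00000008UL	/* mapping is currently treated as shared */
#define VM_MAYSHARE	0x00000080UL	/* mapping was created with MAP_SHARED */

/*
 * Test only VM_MAYSHARE: per the quoted explanation, VM_SHARED may have
 * been cleared on a read-only shared mapping while VM_MAYSHARE stayed set.
 */
static int mapping_was_map_shared(unsigned long vm_flags)
{
	return (vm_flags & VM_MAYSHARE) != 0;
}
```

A MAP_PRIVATE mapping has neither bit set, so forced COW faults on it
would still pass a check written this way.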

Thanks,
Nick


Re: Strange freezes (seems like SATA related)

2007-11-01 Thread Heikki Orsila
On Mon, Oct 29, 2007 at 09:54:27AM -0700, Max Krasnyansky wrote:
> A couple of HP xw9300 machines (dual Opterons) started freezing up.
> We're running on 2.6.22.1 on them. Freezes a somewhere weird. 
> VGA console is alive
> (I can switch vts, etc) but everything else is dead (network, etc).

I'm thinking this is not a coincidence. I was running 2.6.22.5 and, 
looking at your problems, I had a similar experience just this Tuesday. 
The network was still fine after the kernel errors, so I was able to 
log in with SSH. See:

http://lkml.org/lkml/2007/10/30/193

> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 
> status 0x1540 next cpb count 0x0 next cpb idx 0x0
> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Descriptor sense data with sense descriptors (in hex):
> end_request: I/O error, dev sda, sector 8388695
> Buffer I/O error on device sda1, logical block 1048579
> lost page write due to I/O error on sda1
> sd 0:0:0:0: [sda] Write Protect is off

With ata_piix Intel SATA I got these errors:

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:68:6f:3a:00/00:00:00:00:00/e0 tag 0 cdb 0x0 data 53248 out
 res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1: port is slow to respond, please be patient (Status 0xd0)
ata1: device not ready (errno=-16), forcing hardreset
ata1: soft resetting port
ata1.00: revalidation failed (errno=-2)
ata1: failed to recover some devices, retrying in 5 secs
ata1: soft resetting port
ata1.00: configured for UDMA/133
ata1: EH complete
sd 0:0:0:0: [sda] 488397168 512-byte hardware sectors (250059 MB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support 
DPO or FUA

> Here is how this machine looks like
> 
> 00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
> 00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
> 00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
> 00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
> 00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
> 00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio 
> Controller (rev a2)
> 00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
> 00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
> 00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
> 00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
> 00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
> 00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
> 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> HyperTransport Technology Configuration
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> Address Map
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
> Controller
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> Miscellaneous Control
> 00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> HyperTransport Technology Configuration
> 00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> Address Map
> 00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
> Controller
> 00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
> Miscellaneous Control
> 05:04.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY 
> [Radeon 7000/VE]
> 05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 
> Controller (PHY/Link)
> 0a:00.0 Ethernet controller: Intel Corporation 82572EI Gigabit Ethernet 
> Controller (Copper) (rev 06)
> 40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 
> 12)
> 40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> 40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 
> 12)
> 40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
> 61:04.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
> 61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X 
> Fusion-MPT Dual Ultra320 SCSI (rev 07)
> 61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X 
> Fusion-MPT Dual Ultra320 SCSI (rev 07)
> 61:09.0 PCI bridge: Intel Corporation Unknown device 537c (rev 07)
> 62:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
> 63:09.0 Multimedia controller: BittWare, Inc. Unknown device 0035 (rev 01)
> 80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
> 80:01.0 Memory 

Re: [2.6 patch] let USB_USBNET always select MII

2007-11-01 Thread David Brownell
On Thursday 01 November 2007, Adrian Bunk wrote:
> All this USB_USBNET_MII trickery is simply not worth it considering how 
> few code it saves.

Depends on what systems you're talking about.  Forcing unused
code into the kernel is not free, especially if that's made into
a design policy and applied repeatedly to many subsystems.


> As a side effect, this also fixes the following compile error reported 
> by Toralf Förster:

Why not just fix the thing which changed and broke the build?

Or if reverse dependencies can't be made to work sanely, then
have those Ethernet-adapter minidrivers depend on NET_ETHERNET
and then select MII.  (To make the relationships be simple
enough that current Kconfig can handle them.)

I have a fair number of usbnet devices.  Not one of them needs
MII or NET_ETHERNET.

- Dave



Re: build #337 failed for 2.6.24-rc1-gb1d08ac In function `usbnet_set_settings':

2007-11-01 Thread Adrian Bunk
On Thu, Nov 01, 2007 at 04:32:18PM -0700, David Brownell wrote:
> On Thursday 01 November 2007, Randy Dunlap wrote:
> > The MII functions aren't available unless NET_ETHERNET=y.

The setting of CONFIG_NET_ETHERNET doesn't matter for this bug.

> > Howver, the MII functions aren't always needed...
> > 
> > David, any ideas on this one?
> 
> It's been several years since I looked at this.  It
> used to behave just fine.
> 
> Something must have changed in the not-too-distant
> past to have broken this mechanism...
>...

It seems to be an old bug.

The following combination of options is simply an unusual one:

CONFIG_MII=m
CONFIG_USB_USBNET=y
CONFIG_USB_USBNET_MII=n

> - Dave

cu
Adrian

-- 

   "Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
   "Only a promise," Lao Er said.
   Pearl S. Buck - Dragon Seed



Re: Strange freezes (seems like SATA related)

2007-11-01 Thread Andrew Morton
On Mon, 29 Oct 2007 09:54:27 -0700
Max Krasnyansky <[EMAIL PROTECTED]> wrote:

> A couple of HP xw9300 machines (dual Opterons) started freezing up.
> We're running 2.6.22.1 on them. The freezes are somewhat weird: the VGA
> console is alive
> (I can switch vts, etc) but everything else is dead (network, etc).
> Unfortunately SYSRQ was not enabled and I could not get backtraces and stuff.
> 
> Hooked up serial console and the only error that shows up is this.
> 
> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 
> status 0x1540 next cpb count 0x0 next cpb idx 0x0
> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:08:57:00:80/00:00:00:00:00/e0 tag 0 cdb 0x0 data 4096 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> Descriptor sense data with sense descriptors (in hex):
> end_request: I/O error, dev sda, sector 8388695
> Buffer I/O error on device sda1, logical block 1048579
> lost page write due to I/O error on sda1
> sd 0:0:0:0: [sda] Write Protect is off
> 
> I see a bunch of those and then the box just sits there spewing this
> periodically:
> 
> ata1: EH in ADMA mode, notifier 0x1 notifier_error 0x0 gen_ctl 0x1581000 
> status 0x1540 next cpb count 0x0 next cpb idx 0x0
> ata1: CPB 0: ctl_flags 0xd, resp_flags 0x1
> ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata1.00: cmd ca/00:08:4f:00:f8/00:00:00:00:00/e1 tag 0 cdb 0x0 data 4096 out
>  res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> 
> SMART selftest on the drive passed without errors.
> 
> Here is what this machine looks like
> 
> ...

So this happens on more than one machine?

The kernel shouldn't freeze, so even if both machines have magically
identical hardware faults, there's a kernel bug there somewhere.

I guess it would be useful to test a 2.6.23 kernel if poss.  We've seen a
very large number of reports like this one in recent months (many of which
have not been responded to, btw) and perhaps someone has done something
about them.



Re: build #337 failed for 2.6.24-rc1-gb1d08ac In function `usbnet_set_settings':

2007-11-01 Thread David Brownell
On Thursday 01 November 2007, Randy Dunlap wrote:
> The MII functions aren't available unless NET_ETHERNET=y.
> However, the MII functions aren't always needed...
> 
> David, any ideas on this one?

It's been several years since I looked at this.  It
used to behave just fine.

Something must have changed in the not-too-distant
past to have broken this mechanism...


>  config USB_USBNET
>         tristate "Multi-purpose USB Networking Framework"
> +       depends on NET_ETHERNET if USB_USBNET_MII != n
>         select MII if USB_USBNET_MII != n
> 
> would be handy.  But invalid.
> 
> Hm, wait.  Haven't we seen this before and decided that MII should
> be made more generally available?  I.e., not depend on NET_ETHERNET?

Some of us keep wanting to see "select" work properly,
not omitting dependencies...

Re the interdependencies between MII and NET_ETHERNET, I'll leave
that up to the netdev folk.

- Dave



Re: irq 21: nobody cared 2.6.24-rc1

2007-11-01 Thread Bongani Hlope
On Thursday 01 November 2007 22:32:38 Andrew Morton wrote:
> On Thu, 25 Oct 2007 10:45:36 +0200
>
> Bongani Hlope <[EMAIL PROTECTED]> wrote:
> > Booting with irqpoll works
> >
> > ls /proc/irq/21/ (with irqpoll)
> > ehci_hcd:usb1/  smp_affinity  uhci_hcd:usb2/  uhci_hcd:usb3/ 
> > uhci_hcd:usb4/
> >
> >  Disabling IRQ #21
>
> Was any earlier kernel version OK?  2.6.23?

Yes the 2.6.23 kernel works fine. This seems to happen when I boot the 
2.6.24-rc1 kernel with an iPod attached (hope this helps), but the 2.6.23 
kernel doesn't experience this problem.



Re: 2.6.24-rc1: hangs when logging in to X session

2007-11-01 Thread Andrew Morton
On Mon, 29 Oct 2007 16:13:37 +0100
Marcus Better <[EMAIL PROTECTED]> wrote:

> Marcus Better wrote:
> > NO_HZ. I will try without it...
> 
> Nope, still the same result.
> 

Restoring the (excessively) trimmed context:

> My laptop hangs when I try to log in to X with the current git kernel
> (commit 2a397e82c7db18019e408f953dd58dc1963a328c). It runs fine with
> 2.6.23. At boot time kdm starts normally, but hangs with the caps lock LED
> blinking immediately after I press Enter after typing the password.
>
> I can log in with a virtual console but didn't do much testing otherwise.
> 
> The system is a Thinkpad R60, Intel Core 2 Duo, x86_64 running Debian.
> Kernel config is attached. A notable change in my config is that I enabled
> NO_HZ. I will try without it...

hanging-with-led-blinking means the kernel oopsed.

Please configure netconsole (Documentation/networking/netconsole.txt) and
see if you can capture the oops on another machine on the LAN.

Thanks.


Re: expected behavior of PF_PACKET on NETIF_F_HW_VLAN_RX device?

2007-11-01 Thread Rick Jones

David Miller wrote:
> From: Rick Jones <[EMAIL PROTECTED]>
>
>> I'll try to go pester folks in tcpdump-workers then.
>
> The thing to check is "TP_STATUS_CSUMNOTREADY".
>
> When using mmap(), it will be provided in the descriptor.  When using
> recvmsg() it will be provided via a PACKET_AUXDATA control message
> when enabled via the PACKET_AUXDATA socket option.
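
To make the recvmsg() path concrete, here is a minimal userspace sketch, not taken from tcpdump or libpcap. It assumes Linux headers that provide struct tpacket_auxdata and TP_STATUS_CSUMNOTREADY in <linux/if_packet.h>; the helper names enable_auxdata() and csum_not_ready() are made up for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263	/* fallback for C libraries that do not define it */
#endif

/* Hypothetical helper: ask an open PF_PACKET socket to deliver PACKET_AUXDATA. */
int enable_auxdata(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_PACKET, PACKET_AUXDATA, &one, sizeof(one));
}

/*
 * Hypothetical helper: scan the control messages filled in by recvmsg().
 * Returns 1 if the auxdata says the checksum is not filled in yet,
 * 0 if auxdata is present without that flag, and -1 if there is no
 * PACKET_AUXDATA control message at all.
 */
int csum_not_ready(struct msghdr *msg)
{
	struct cmsghdr *cmsg;

	assert(msg != NULL);
	for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
	     cmsg = CMSG_NXTHDR(msg, cmsg)) {
		struct tpacket_auxdata aux;

		if (cmsg->cmsg_level != SOL_PACKET ||
		    cmsg->cmsg_type != PACKET_AUXDATA)
			continue;
		if (cmsg->cmsg_len < CMSG_LEN(sizeof(aux)))
			continue;
		/* copy out: CMSG_DATA() is not guaranteed to be aligned */
		memcpy(&aux, CMSG_DATA(cmsg), sizeof(aux));
		return (aux.tp_status & TP_STATUS_CSUMNOTREADY) ? 1 : 0;
	}
	return -1;
}
```

A capture tool would presumably call enable_auxdata() once after opening the socket and then skip checksum verification for any packet where csum_not_ready() returns 1.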


Figures... the "dailies" and "weeklies" for tar files of tcpdump and libpcap 
source are fubar... again.  I've sent email to tcpdump-workers about that one.  If 
that isn't resolved quickly I'll learn how to access their CVS (pick an SCM, any 
SCM...)


I did an apt-get of debian lenny's tcpdump and sources:

hpcpc103:~# tcpdump -V
tcpdump version 3.9.8
libpcap version 0.9.8

and that seems to show the false checksum failure and not use the 
TP_STATUS_CSUMNOTREADY - at least that didn't appear in a grepping of the 
sources.  At first I thought it might be, but then I realized that my snaplen 
was too short to get the whole TSO'ed frame so tcpdump wasn't even trying to 
verify.  After disabling TSO on the NIC, leaving CKO on, and making my snaplen > 
1500 I could see it was doing undesirable stuff.


I'll see what top of trunk has at some point and what the folks there think of 
adding-in a change.


rick jones


[PATCH 27/27] keep track of mnt_writer state of struct file

2007-11-01 Thread Dave Hansen

There have been a few oopses caused by 'struct file's with
NULL f_vfsmnts.  There was also a set of potentially missed
mnt_want_write()s from dentry_open() calls.

This patch provides a very simple debugging framework to
catch these kinds of bugs.  It will WARN_ON() them, but
should stop us from having any oopses or mnt_writer
count imbalances.

I'm quite convinced that this is a good thing because it
found bugs in the stuff I was working on as soon as I
wrote it.
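
The state tracking described above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: the two FILE_MNT_WRITE_* values and the f_mnt_write_state field name come from the patch, while struct wfile, struct wmnt and the w_*() helpers are invented stand-ins, and a warning counter takes the place of WARN_ON().

```c
#include <assert.h>

#define FILE_MNT_WRITE_TAKEN	1
#define FILE_MNT_WRITE_RELEASED	2

/* Invented stand-in for the parts of struct file that matter here. */
struct wfile {
	int writable;		/* models f_mode & FMODE_WRITE */
	int f_mnt_write_state;	/* same field name as in the patch */
};

/* Invented stand-in for a mount: writer count plus a WARN_ON() tally. */
struct wmnt {
	long writers;
	int warnings;
};

/* Open a file; took_write models a successful mnt_want_write(). */
void w_open(struct wfile *f, struct wmnt *mnt, int writable, int took_write)
{
	assert(f && mnt);
	f->writable = writable;
	f->f_mnt_write_state = 0;
	if (writable && took_write) {
		mnt->writers++;
		f->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
	}
}

/* Final put: drop the write only if it was really taken, else warn. */
int w_fput(struct wfile *f, struct wmnt *mnt)
{
	assert(f && mnt);
	if (!f->writable)
		return 0;
	if (f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN) {
		mnt->writers--;
		f->f_mnt_write_state |= FILE_MNT_WRITE_RELEASED;
		return 0;
	}
	mnt->warnings++;	/* writeable file with no mnt_want_write() */
	return -1;
}

/* Free: exactly one of the two bits being set indicates a bug. */
void w_file_free(const struct wfile *f, struct wmnt *mnt)
{
	assert(f && mnt);
	if (f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN)
		mnt->warnings++;
	if (f->f_mnt_write_state == FILE_MNT_WRITE_RELEASED)
		mnt->warnings++;
}
```

The point of the model is that a missed mnt_want_write() shows up as a warning while the writer count stays balanced, since the drop is skipped when the TAKEN bit is absent.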

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/file_table.c|   21 +++--
 linux-2.6.git-dave/fs/open.c  |   14 +-
 linux-2.6.git-dave/include/linux/fs.h |4 
 3 files changed, 36 insertions(+), 3 deletions(-)

diff -puN fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file 
fs/file_table.c
--- linux-2.6.git/fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file 
2007-11-01 14:46:22.0 -0700
+++ linux-2.6.git-dave/fs/file_table.c  2007-11-01 14:46:22.0 -0700
@@ -42,6 +42,12 @@ static inline void file_free_rcu(struct 
 static inline void file_free(struct file *f)
 {
percpu_counter_dec(&nr_files);
+   /*
+* At this point, either both or neither of these bits
+* should be set.
+*/
+   WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN);
+   WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_RELEASED);
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
 }
 
@@ -201,6 +207,7 @@ int init_file(struct file *file, struct 
 * that we can do debugging checks at __fput()e
 */
if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
+   file->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
error = mnt_want_write(mnt);
WARN_ON(error);
}
@@ -243,8 +250,18 @@ void fastcall __fput(struct file *file)
fops_put(file->f_op);
if (file->f_mode & FMODE_WRITE) {
put_write_access(inode);
-   if (!special_file(inode->i_mode))
-   mnt_drop_write(mnt);
+   if (!special_file(inode->i_mode)) {
+   if (file->f_mnt_write_state == FILE_MNT_WRITE_TAKEN) {
+   mnt_drop_write(mnt);
+   file->f_mnt_write_state |=
+   FILE_MNT_WRITE_RELEASED;
+   } else {
+   printk(KERN_WARNING "__fput() of writeable "
+   "file with no "
+   "mnt_want_write()\n");
+   WARN_ON(1);
+   }
+   }
}
put_pid(file->f_owner.pid);
file_kill(file);
diff -puN fs/open.c~keep-track-of-mnt_writer-state-of-struct-file fs/open.c
--- linux-2.6.git/fs/open.c~keep-track-of-mnt_writer-state-of-struct-file   
2007-11-01 14:46:22.0 -0700
+++ linux-2.6.git-dave/fs/open.c2007-11-01 14:46:22.0 -0700
@@ -810,6 +810,10 @@ static struct file *__dentry_open(struct
error = __get_file_write_access(inode, mnt);
if (error)
goto cleanup_file;
+   if (!special_file(inode->i_mode)) {
+   WARN_ON(f->f_mnt_write_state != 0);
+   f->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
+   }
}
 
f->f_mapping = inode->i_mapping;
@@ -851,8 +855,16 @@ cleanup_all:
fops_put(f->f_op);
if (f->f_mode & FMODE_WRITE) {
put_write_access(inode);
-   if (!special_file(inode->i_mode))
+   if (!special_file(inode->i_mode)) {
+   /*
+* We don't consider this a real
+* mnt_want/drop_write() pair
+* because it all happened right
+* here, so just reset the state.
+*/
+   f->f_mnt_write_state = 0;
mnt_drop_write(mnt);
+   }
}
file_kill(f);
f->f_path.dentry = NULL;
diff -puN include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file 
include/linux/fs.h
--- 
linux-2.6.git/include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file  
2007-11-01 14:46:22.0 -0700
+++ linux-2.6.git-dave/include/linux/fs.h   2007-11-01 14:46:22.0 
-0700
@@ -774,6 +774,9 @@ static inline int ra_has_index(struct fi
index <  ra->start + ra->size);
 }
 
+#define FILE_MNT_WRITE_TAKEN   1
+#define FILE_MNT_WRITE_RELEASED2
+
 struct file {
/*
 * fu_list becomes invalid after file_free is called and queued via
@@ -808,6 +811,7 @@ struct file {
spinlock_t  f_ep_lock;
 #endif /* #ifdef CONFIG_EPOLL */
struct address_space*f_mapping;
+   

Re: 2.6.24-rc1: OOPS at acpi_battery_update

2007-11-01 Thread Andrew Morton
On Mon, 29 Oct 2007 11:11:04 +0100
Romano Giannetti <[EMAIL PROTECTED]> wrote:

> 
> Hi,
> 
>   sometimes on resuming from s2ram my laptop spews the following oops.
> Config, dmesg etc are at: 
> 
> http://www.dea.icai.upcomillas.es/romano/linux/info/2624rc1_6/
> 
> [3.475386] Oops:  [#1] SMP 
> [3.475602] Process kacpi_notify (pid: 50, ti=c2122000 task=c210c030 
> task.ti=c2122000)
> [3.475608] Stack: c35c5060 c02f2a58 0046  6b6b6b6b c2b20298 
> c2123efc c02f29ac 
> [3.475626]c02166cb f896f3ca  f896ff63 0246  
> c2b20298 a512 
> [3.475644]c2123f04 c02f2a58 c2123f34 f896f3ed c2123f1c c34fbdb8 
> c2123f20 c02f3e22 
> [3.475662] Call Trace:
> [3.475667]  [show_trace_log_lvl+26/48] show_trace_log_lvl+0x1a/0x30
> [3.475678]  [show_stack_log_lvl+177/224] show_stack_log_lvl+0xb1/0xe0
> [3.475695]  [die+282/560] die+0x11a/0x230
> [3.475687]  [show_registers+193/464] show_registers+0xc1/0x1d0
> [3.475703]  [do_page_fault+415/1648] do_page_fault+0x19f/0x670
> [3.475713]  [error_code+114/120] error_code+0x72/0x78
> [3.475722]  [__mutex_unlock_slowpath+172/336] 
> __mutex_unlock_slowpath+0xac/0x150
> [3.475731]  [mutex_unlock+8/16] mutex_unlock+0x8/0x10
> [3.475739]  [] acpi_battery_update+0x1ce/0x23c [battery]
> [3.475753]  [] acpi_battery_notify+0x21/0x78 [battery]
> [3.475764]  [acpi_ev_notify_dispatch+79/90] 
> acpi_ev_notify_dispatch+0x4f/0x5a
> [3.475792]  [worker_thread+157/256] worker_thread+0x9d/0x100
> [3.475774]  [acpi_os_execute_notify+36/47] 
> acpi_os_execute_notify+0x24/0x2f
> [3.475784]  [run_workqueue+288/464] run_workqueue+0x120/0x1d0
> [3.475809]  [kernel_thread_helper+7/16] kernel_thread_helper+0x7/0x10
> [3.475801]  [kthread+66/112] kthread+0x42/0x70
> [3.475821] Code: 8d b4 26 00 00 00 00 55 89 e5 83 ec 18 89 5d f8 89 c3 89 
> 75 fc 0f b6 40 04 89 d6 84 c0 7f 24 8d 43 20 3b 43 20 0f 84 e4 00 00 00 <3b> 
> 76 10 0f 85 93 00 00 00 3b 36 90 74 47 8b 5d f8 8b 75 fc 89 
> [3.475818]  ===
> [3.475909] EIP: [debug_mutex_wake_waiter+36/352] 
> debug_mutex_wake_waiter+0x24/0x160 SS:ESP 0068:c2123ebc
> 

Did any earlier kernels do this?  In other words, do you believe that this
is a bug which we added after 2.6.23 was released?

Thanks.


Re: [PATCH] backlight dimmer

2007-11-01 Thread jack

Pavel Machek wrote:

On Sun 2007-10-28 17:10:53, [EMAIL PROTECTED] wrote:

Hello,
this patch implements a MacBook-like backlight dimmer on 
top of backlight.c.


The dimmer is entirely in kernelspace and is suitable 
for an embedded context in order to avoid the overhead 
of a daemon controlling the backlight. Implementing this 
functionality in userspace has other advantages and is a 
perfectly reasonable alternative, so this patch is

not the definitive solution.

activate dimmer:
echo 1 > /sys/devices/virtual/backlight/*/dimmer_control

other attributes (brightness levels & timeout):
/sys/devices/virtual/backlight/*/dimmer_high_level
/sys/devices/virtual/backlight/*/dimmer_low_level
/sys/devices/virtual/backlight/*/dimmer_timeout


I'd say that userspace makes sense here. I'd want backlight to go down
slowly, for example.

But... maybe undimming should be done in kernel, so it keeps
low latency?

Hmm, maybe existing screen blanking infrastructure can be reused?

--- linux-2.6.23.1/include/linux/backlight.h 2007-10-12 
18:43:44.0 +0200
+++ b/include/linux/backlight.h	2007-10-28 
13:45:21.0 +0100

@@ -11,6 +11,7 @@
#include 
#include 
#include 
+#include 

/* Notes on locking:


Your mail client damages patches?



Hi,
incremental dimming can be easily implemented on top of this patch. 
It's just about posting the timeout repeatedly. I will post something 
in the near future :).


I believe the userspace vs kernelspace argument is very arbitrary.

As I said, the dimmer can be completely coded in userspace. There are 
good reasons for that. Userspace apps can work out better than the 
kernel what the user is doing. If the user is watching a movie for 
instance, the dimmer should be stopped.


However, the patch is just a light addition to backlight.c. If the 
dimmer is stopped/unused, just a few pointers and some code are wasted.


Userspace apps can start, stop and change the timeout value for the 
dimmer on the fly, as if they were doing the dimming themselves. In 
this case, userspace only notifies the kernel to stop the dimmer 
whenever such action is deemed right. Instead, when the dimmer is on, 
which is most of the time, no kernelspace/userspace switch is necessary.


Changing the brightness might occur every 10 seconds or so. IMHO this 
is more of a realtime (and thus kernelspace) requirement rather than 
userspace. This also saves us from having a whole (heavyweight) 
dedicated process managing such a basic functionality.


Finally, implementing the dimmer in kernelspace delivers this 
functionality everywhere provided Linux is used. No additional 
software is necessary. Such software would usually have to be tailored 
to different audiences such as desktop, embedded, whatever.


I don't think Thunderbird is breaking the patch, since I was able to 
apply it from my previous posts. However here is a web location for 
it, just in case: http://www.antonello.org/dimmer/. Please let me know 
if you can't test it!


jacopo



[PATCH 25/27] r-o-bind-mounts-track-number-of-mount-writers-make-lockdep-happy-with-r-o-bind-mounts

2007-11-01 Thread Dave Hansen


With the r/o bind mount patches, we can have as many
spinlocks nested as there are CPUs on the system.
Lockdep freaks out after 8.

So, create a new lockdep class of locks for the
mnt_writer spinlocks, and initialize each of the
cpu locks to be in a different class.

It should shut up warnings like this, while still
allowing some of the lockdep goodness to remain:

=
[ INFO: possible recursive locking detected ]
2.6.23-rc6 #6

---

 linux-2.6.git-dave/fs/namespace.c |2 ++
 1 file changed, 2 insertions(+)

diff -puN 
fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers-make-lockdep-happy-with-r-o-bind-mounts
 fs/namespace.c
--- 
linux-2.6.git/fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers-make-lockdep-happy-with-r-o-bind-mounts
  2007-11-01 14:46:21.0 -0700
+++ linux-2.6.git-dave/fs/namespace.c   2007-11-01 14:46:21.0 -0700
@@ -112,6 +112,7 @@ struct mnt_writer {
 * must be ordered by cpu number.
 */
spinlock_t lock;
+   struct lock_class_key lock_class; /* compiles out with !lockdep */
unsigned long count;
struct vfsmount *mnt;
 } cacheline_aligned_in_smp;
@@ -123,6 +124,7 @@ static int __init init_mnt_writers(void)
for_each_possible_cpu(cpu) {
struct mnt_writer *writer = &per_cpu(mnt_writers, cpu);
spin_lock_init(&writer->lock);
+   lockdep_set_class(&writer->lock, &writer->lock_class);
writer->count = 0;
}
return 0;
_


[PATCH 26/27] r-o-bind-mounts-honor-r-w-changes-at-do_remount-time

2007-11-01 Thread Dave Hansen


Originally from: Herbert Poetzl <[EMAIL PROTECTED]>

This is the core of the read-only bind mount patch set.

Note that this does _not_ add a "ro" option directly to the bind mount
operation.  If you require such a mount, you must first do the bind, then
follow it up with a 'mount -o remount,ro' operation.

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Cc: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namespace.c|   46 ++-
 linux-2.6.git-dave/include/linux/mount.h |1 
 2 files changed, 40 insertions(+), 7 deletions(-)

diff -puN fs/namespace.c~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time 
fs/namespace.c
--- 
linux-2.6.git/fs/namespace.c~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time
   2007-11-01 14:46:22.0 -0700
+++ linux-2.6.git-dave/fs/namespace.c   2007-11-01 14:46:22.0 -0700
@@ -102,7 +102,11 @@ struct vfsmount *alloc_vfsmnt(const char
  */
 int __mnt_is_readonly(struct vfsmount *mnt)
 {
-   return (mnt->mnt_sb->s_flags & MS_RDONLY);
+   if (mnt->mnt_flags & MNT_READONLY)
+   return 1;
+   if (mnt->mnt_sb->s_flags & MS_RDONLY)
+   return 1;
+   return 0;
 }
 EXPORT_SYMBOL_GPL(__mnt_is_readonly);
 
@@ -277,7 +281,7 @@ void mnt_drop_write(struct vfsmount *mnt
 }
 EXPORT_SYMBOL_GPL(mnt_drop_write);
 
-int mnt_make_readonly(struct vfsmount *mnt)
+static int mnt_make_readonly(struct vfsmount *mnt)
 {
int ret = 0;
 
@@ -290,15 +294,21 @@ int mnt_make_readonly(struct vfsmount *m
goto out;
}
/*
-* actually set mount's r/o flag here to make
-* __mnt_is_readonly() true, which keeps anyone
-* from doing a successful mnt_want_write().
+* nobody can do a successful mnt_want_write() with all
+* of the counts in MNT_DENIED_WRITE and the locks held.
 */
+   if (!ret)
+   mnt->mnt_flags |= MNT_READONLY;
 out:
mnt_unlock_cpus();
return ret;
 }
 
+static void __mnt_unmake_readonly(struct vfsmount *mnt)
+{
+   mnt->mnt_flags &= ~MNT_READONLY;
+}
+
 int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
 {
mnt->mnt_sb = sb;
@@ -607,7 +617,7 @@ static int show_vfsmnt(struct seq_file *
seq_putc(m, '.');
mangle(m, mnt->mnt_sb->s_subtype);
}
-   seq_puts(m, mnt->mnt_sb->s_flags & MS_RDONLY ? " ro" : " rw");
+   seq_puts(m, __mnt_is_readonly(mnt) ? " ro" : " rw");
for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
if (mnt->mnt_sb->s_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
@@ -1195,6 +1205,23 @@ out:
return err;
 }
 
+static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+{
+   int error = 0;
+   int readonly_request = 0;
+
+   if (ms_flags & MS_RDONLY)
+   readonly_request = 1;
+   if (readonly_request == __mnt_is_readonly(mnt))
+   return 0;
+
+   if (readonly_request)
+   error = mnt_make_readonly(mnt);
+   else
+   __mnt_unmake_readonly(mnt);
+   return error;
+}
+
 /*
  * change filesystem flags. dir should be a physical root of filesystem.
  * If you've mounted a non-root directory somewhere and want to do remount
@@ -1216,7 +1243,10 @@ static int do_remount(struct nameidata *
return -EINVAL;
 
down_write(&sb->s_umount);
-   err = do_remount_sb(sb, flags, data, 0);
+   if (flags & MS_BIND)
+   err = change_mount_flags(nd->mnt, flags);
+   else
+   err = do_remount_sb(sb, flags, data, 0);
if (!err)
nd->mnt->mnt_flags = mnt_flags;
up_write(&sb->s_umount);
@@ -1660,6 +1690,8 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_NODIRATIME;
if (flags & MS_RELATIME)
mnt_flags |= MNT_RELATIME;
+   if (flags & MS_RDONLY)
+   mnt_flags |= MNT_READONLY;
 
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
diff -puN 
include/linux/mount.h~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time 
include/linux/mount.h
--- 
linux-2.6.git/include/linux/mount.h~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time
2007-11-01 14:46:22.0 -0700
+++ linux-2.6.git-dave/include/linux/mount.h2007-11-01 14:46:22.0 
-0700
@@ -29,6 +29,7 @@ struct mnt_namespace;
 #define MNT_NOATIME0x08
 #define MNT_NODIRATIME 0x10
 #define MNT_RELATIME   0x20
+#define MNT_READONLY   0x40/* does the user want this to be r/o? */
 
 #define MNT_SHRINKABLE 0x100
 
_

Re: [PATCH/RFC] eradicate bashisms in scripts/patch-kernel

2007-11-01 Thread Randy Dunlap
On Thu, 1 Nov 2007 23:16:06 +0100 Andreas Mohr wrote:

[add Herbert]

> Hi,
> 
> On Thu, Nov 01, 2007 at 08:24:57AM -0700, Randy Dunlap wrote:
> > On Thu, 1 Nov 2007 13:11:33 +0100 Andreas Mohr wrote:
> > > I'll think a bit more about these couple changed places (and whether
> > > this still truly works as intended) and mail a patch then.
> > > 
> > > (and a big NOTE: I'm no POSIX vs. non-POSIX shell guru at all, only a
> > > semi-versed shell script writer, thus these changes should be reviewed
> > > quite thoroughly)
> > 
> > Neither am I.  I read those web pages quickly yesterday, so after
> > you read them, we can discuss more and/or review more patches.
> 
> OK, next iteration (v2).
> 
> 
> Make the patch-kernel shell script sufficiently compatible with POSIX shells.
> 
> 
> Changes since v1:
> - don't actually change string quoting
> - prepend ./ at mktemp step already
> - remove added superfluous spaces in arithmetic expression
> - don't remove double braces in STOPSUBLEVEL evaluation,
>   since these are integer values to be compared
> 
> Full ChangeLog:
> - replaced non-standard "==" by standard "="
OK

> - replaced non-standard "source" statement by POSIX "dot" statement
OK

> - POSIX shell local file lookup needs ./ prepended, thus have mktemp
>   use this from the beginning and comment it properly

I would like the patch description to say something like:

Use ./ as a prefix on the TEMP filename so that the current search
PATH will not be searched and thus another file with this same
(pseudo random) name won't be used accidentally.

> - replace non-standard bash string parsing by sed expression
>   (is the sed syntax ok? correct? strict enough?)

I think that this is the part that bothers me.  I can't find
anything at
http://www.opengroup.org/onlinepubs/95399/utilities/xcu_chap02.html
that says that this:
EXTRAVER=${EXTRAVER%%[[:punct:]]*}

is invalid or even optional syntax.  OTOH, it does list such syntax,
so why are we working around this syntax?
Please point out to me what I am missing...

> - added missing $ signs to shell variable names
OK


> About the missing $ signs:
> http://www.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_14
> says:
> "If the shell variable x contains a value that forms a valid integer
> constant, then the arithmetic expansions
>  "$((x))" and "$(($x))" shall return the same value."
> 
> Hmm, well, seems dash doesn't... (syntax error).
> Thus I still needed to add the $ signs despite opengroup.org specifying
> it differently.

Herbert?

> Updated version verified again to now work with both bash and dash
> on Debian stable.
> 
> Patch intended for inclusion in -mm, once it has survived some reviews.
> 
> Thanks.
> 
> Signed-off-by: Andreas Mohr <[EMAIL PROTECTED]>
> 
> --- linux-2.6.23/scripts/patch-kernel.orig2007-11-01 22:51:34.0 
> +0100
> +++ linux-2.6.23/scripts/patch-kernel 2007-11-01 22:10:14.0 +0100
> @@ -65,7 +65,7 @@
>  patchdir=${2-.}
>  stopvers=${3-default}
>  
> -if [ "$1" == -h -o "$1" == --help -o ! -r "$sourcedir/Makefile" ]; then
> +if [ "$1" = -h -o "$1" = --help -o ! -r "$sourcedir/Makefile" ]; then
>  cat << USAGE
>  usage: $PNAME [-h] [ sourcedir [ patchdir [ stopversion ] [ -acxx ] ] ]
>source directory defaults to /usr/src/linux,
> @@ -182,10 +182,12 @@
>  }
>  
>  # set current VERSION, PATCHLEVEL, SUBLEVEL, EXTRAVERSION
> -TMPFILE=`mktemp .tmpver.XX` || { echo "cannot make temp file" ; exit 1; }
> +# sourcing $TMPFILE.1 below needs lookup in local directory in a POSIX shell,
> +# thus prepend ./ to mktemp argument
> +TMPFILE=`mktemp ./.tmpver.XX` || { echo "cannot make temp file" ; exit 
> 1; }
>  grep -E "^(VERSION|PATCHLEVEL|SUBLEVEL|EXTRAVERSION)" $sourcedir/Makefile > 
> $TMPFILE
>  tr -d [:blank:] < $TMPFILE > $TMPFILE.1
> -source $TMPFILE.1
> +. $TMPFILE.1
>  rm -f $TMPFILE*
>  if [ -z "$VERSION" -o -z "$PATCHLEVEL" -o -z "$SUBLEVEL" ]
>  then
> @@ -202,13 +204,7 @@
>  EXTRAVER=
>  if [ x$EXTRAVERSION != "x" ]
>  then
> - if [ ${EXTRAVERSION:0:1} == "." ]; then
> - EXTRAVER=${EXTRAVERSION:1}
> - else
> - EXTRAVER=$EXTRAVERSION
> - fi
> - EXTRAVER=${EXTRAVER%%[[:punct:]]*}
> - #echo "$PNAME: changing EXTRAVERSION from $EXTRAVERSION to $EXTRAVER"
> + EXTRAVER=`echo $EXTRAVERSION|sed -s 's/^[\.]\?\([^[:punct:]]*\).*/\1/'`
>  fi
>  
>  #echo "stopvers=$stopvers"
> @@ -251,16 +247,16 @@
>  do
>  CURRENTFULLVERSION="$VERSION.$PATCHLEVEL.$SUBLEVEL"
>  EXTRAVER=
> -if [ $stopvers == $CURRENTFULLVERSION ]; then
> +if [ $stopvers = $CURRENTFULLVERSION ]; then
>  echo "Stopping at $CURRENTFULLVERSION base as requested."
>  break
>  fi
>  
> -SUBLEVEL=$((SUBLEVEL + 1))
> +SUBLEVEL=$(($SUBLEVEL + 1))
>  FULLVERSION="$VERSION.$PATCHLEVEL.$SUBLEVEL"
>  #echo "#___ trying $FULLVERSION ___"
>  
> -if [ $((SUBLEVEL)) -gt $((STOPSUBLEVEL)) ]; then
> +if [ $(($SUBLEVEL)) -gt 

[PATCH 24/27] r-o-bind-mounts-track-number-of-mount-writers

2007-11-01 Thread Dave Hansen


This is the real meat of the entire series.  It actually implements the
tracking of the number of writers to a mount.  However, it causes scalability
problems because there can be hundreds of cpus doing open()/close() on files
on the same mnt at the same time.  Even an atomic_t in the mnt has massive
scaling problems because the cacheline gets so terribly contended.

This uses a statically-allocated percpu variable.  All operations are local to
a cpu as long as that cpu operates on the same mount, and there are no writer
count imbalances.  Writer count imbalances happen when a write is taken on one
cpu, and released on another, like when an open/close pair is performed on two
different cpus.

Upon a remount,ro request, all of the data from the percpu variables is
collected (expensive, but very rare) and we determine if there are any
outstanding writers to the mount.
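
The scheme can be approximated in userspace like this. It is a single-threaded sketch with invented names (struct pc_mnt, struct pc_writer, the pc_*() functions and PC_NR_CPUS); it has none of the locking of the real patch, and a plain long stands in for the atomic counter in the mount.

```c
#include <assert.h>

#define PC_NR_CPUS 4	/* arbitrary number of cpus for the model */

struct pc_mnt {
	long writers;	/* stands in for the atomic counter in the mount */
	int readonly;
};

struct pc_writer {
	long count;		/* local count, may go negative */
	struct pc_mnt *mnt;	/* mount this cpu is currently counting for */
};

static struct pc_writer pc_writers[PC_NR_CPUS];

/* Fold a cpu's local count into the mount it was counting for. */
static void pc_clear_count(struct pc_writer *w)
{
	if (!w->mnt)
		return;
	w->mnt->writers += w->count;
	w->count = 0;
}

/* Switch a cpu to another mount, flushing the old count first. */
static void pc_use_mnt(struct pc_writer *w, struct pc_mnt *mnt)
{
	if (w->mnt == mnt)
		return;
	pc_clear_count(w);
	w->mnt = mnt;
}

int pc_want_write(int cpu, struct pc_mnt *mnt)
{
	assert(cpu >= 0 && cpu < PC_NR_CPUS);
	if (mnt->readonly)
		return -1;
	pc_use_mnt(&pc_writers[cpu], mnt);
	pc_writers[cpu].count++;
	return 0;
}

void pc_drop_write(int cpu, struct pc_mnt *mnt)
{
	assert(cpu >= 0 && cpu < PC_NR_CPUS);
	pc_use_mnt(&pc_writers[cpu], mnt);
	pc_writers[cpu].count--;	/* imbalance case: taken on another cpu */
}

/* remount,ro: collect every cpu's count, refuse if writers remain. */
int pc_make_readonly(struct pc_mnt *mnt)
{
	int cpu;

	for (cpu = 0; cpu < PC_NR_CPUS; cpu++)
		pc_clear_count(&pc_writers[cpu]);
	if (mnt->writers > 0)
		return -1;
	mnt->readonly = 1;
	return 0;
}
```

In the common case a cpu only touches its own slot; the shared counter is written only when a cpu changes mounts or when a remount collects the counts, which is where the expensive but rare work ends up.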

I've written a little benchmark to sit in a loop for a couple of seconds in
several cpus in parallel doing open/write/close loops.

http://sr71.net/~dave/linux/openbench.c

The code in here is a worst-possible case for this patch.  It does opens on
a _pair_ of files in two different mounts in parallel.  This should cause my
code to lose its "operate on the same mount" optimization completely.  This
worst-case scenario causes a 3% degradation in the benchmark.

I could probably get rid of even this 3%, but it would be more complex than
what I have here, and I think this is getting into acceptable territory.  In
practice, I expect writing more than 3 bytes to a file, as well as disk I/O to
mask any effects that this has.

(To get rid of that 3%, we could have an #defined number of mounts in the
percpu variable.  So, instead of a CPU getting to operate only on percpu data
when it accesses only one mount, it could stay on percpu data when it only
accesses N or fewer mounts.)

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Cc: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namespace.c|  205 ---
 linux-2.6.git-dave/include/linux/mount.h |8 +
 2 files changed, 198 insertions(+), 15 deletions(-)

diff -puN fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers 
fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers  
2007-11-01 14:46:20.0 -0700
+++ linux-2.6.git-dave/fs/namespace.c   2007-11-01 14:46:20.0 -0700
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -52,6 +53,8 @@ static inline unsigned long hash(struct 
return tmp & hash_mask;
 }
 
+#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
+
 struct vfsmount *alloc_vfsmnt(const char *name)
 {
struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -65,6 +68,7 @@ struct vfsmount *alloc_vfsmnt(const char
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+   atomic_set(&mnt->__mnt_writers, 0);
if (name) {
int size = strlen(name) + 1;
char *newname = kmalloc(size, GFP_KERNEL);
@@ -85,6 +89,84 @@ struct vfsmount *alloc_vfsmnt(const char
  * we can determine when writes are able to occur to
  * a filesystem.
  */
+/*
+ * __mnt_is_readonly: check whether a mount is read-only
+ * @mnt: the mount to check for its write status
+ *
+ * This shouldn't be used directly outside of the VFS.
+ * It does not guarantee that the filesystem will stay
+ * r/w, just that it is right *now*.  This can not and
+ * should not be used in place of IS_RDONLY(inode).
+ * mnt_want/drop_write() will _keep_ the filesystem
+ * r/w.
+ */
+int __mnt_is_readonly(struct vfsmount *mnt)
+{
+   return (mnt->mnt_sb->s_flags & MS_RDONLY);
+}
+EXPORT_SYMBOL_GPL(__mnt_is_readonly);
+
+struct mnt_writer {
+   /*
+* If holding multiple instances of this lock, they
+* must be ordered by cpu number.
+*/
+   spinlock_t lock;
+   unsigned long count;
+   struct vfsmount *mnt;
+} cacheline_aligned_in_smp;
+static DEFINE_PER_CPU(struct mnt_writer, mnt_writers);
+
+static int __init init_mnt_writers(void)
+{
+   int cpu;
+   for_each_possible_cpu(cpu) {
+   struct mnt_writer *writer = _cpu(mnt_writers, cpu);
+   spin_lock_init(>lock);
+   writer->count = 0;
+   }
+   return 0;
+}
+fs_initcall(init_mnt_writers);
+
+static void mnt_unlock_cpus(void)
+{
+   int cpu;
+   struct mnt_writer *cpu_writer;
+
+   for_each_possible_cpu(cpu) {
+   cpu_writer = _cpu(mnt_writers, cpu);
+   spin_unlock(_writer->lock);
+   }
+}
+
+static inline void __clear_mnt_count(struct mnt_writer *cpu_writer)
+{
+   if (!cpu_writer->mnt)
+   return;
+   atomic_add(cpu_writer->count, _writer->mnt->__mnt_writers);
+   cpu_writer->count = 0;
+}
+ 

[PATCH 22/27] r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly

2007-11-01 Thread Dave Hansen


If we depend on the inodes for writeability, we will not catch the r/o mounts
when implemented.

This patch uses __mnt_is_readonly().  It does not guarantee that the mount
will stay writeable after the check.  But, this is OK for one of the checks
because it is just for a printk().

The other two are probably unnecessary and duplicate existing checks in the
VFS.  This won't make them better checks than before, but it will make them
detect r/o mounts.

Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/nfs/dir.c  |3 ++-
 linux-2.6.git-dave/fs/nfsd/vfs.c |4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff -puN 
fs/nfs/dir.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly 
fs/nfs/dir.c
--- 
linux-2.6.git/fs/nfs/dir.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly
 2007-11-01 14:46:19.0 -0700
+++ linux-2.6.git-dave/fs/nfs/dir.c 2007-11-01 14:46:19.0 -0700
@@ -949,7 +949,8 @@ static int is_atomic_open(struct inode *
if (nd->flags & LOOKUP_DIRECTORY)
return 0;
/* Are we trying to write to a read only partition? */
-   if (IS_RDONLY(dir) && (nd->intent.open.flags & 
(O_CREAT|O_TRUNC|FMODE_WRITE)))
+   if (__mnt_is_readonly(nd->mnt) &&
+   (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
return 0;
return 1;
 }
diff -puN 
fs/nfsd/vfs.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly 
fs/nfsd/vfs.c
--- 
linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly
2007-11-01 14:46:19.0 -0700
+++ linux-2.6.git-dave/fs/nfsd/vfs.c2007-11-01 14:46:19.0 -0700
@@ -1862,7 +1862,7 @@ nfsd_permission(struct svc_rqst *rqstp, 
inode->i_mode,
IS_IMMUTABLE(inode)?" immut" : "",
IS_APPEND(inode)?   " append" : "",
-   IS_RDONLY(inode)?   " ro" : "");
+   __mnt_is_readonly(exp->ex_mnt)? " ro" : "");
dprintk("  owner %d/%d user %d/%d\n",
inode->i_uid, inode->i_gid, current->fsuid, current->fsgid);
 #endif
@@ -1873,7 +1873,7 @@ nfsd_permission(struct svc_rqst *rqstp, 
 */
if (!(acc & MAY_LOCAL_ACCESS))
if (acc & (MAY_WRITE | MAY_SATTR | MAY_TRUNC)) {
-   if (exp_rdonly(rqstp, exp) || IS_RDONLY(inode))
+   if (exp_rdonly(rqstp, exp) || 
__mnt_is_readonly(exp->ex_mnt))
return nfserr_rofs;
if (/* (acc & MAY_WRITE) && */ IS_IMMUTABLE(inode))
return nfserr_perm;
_


[PATCH 23/27] r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create

2007-11-01 Thread Dave Hansen


This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().

So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch and into a helper function
suggested by Christoph.

This also encapsulates a fix for mknod(S_IFREG) that Miklos
found.

Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namei.c |   43 +-
 linux-2.6.git-dave/fs/nfsd/vfs.c  |4 +++
 linux-2.6.git-dave/net/unix/af_unix.c |4 +++
 3 files changed, 40 insertions(+), 11 deletions(-)

diff -puN 
fs/namei.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create 
fs/namei.c
--- 
linux-2.6.git/fs/namei.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create
   2007-11-01 14:46:20.0 -0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:20.0 -0700
@@ -2022,6 +2022,23 @@ int vfs_mknod(struct inode *dir, struct 
return error;
 }
 
+static int may_mknod(mode_t mode)
+{
+   switch (mode & S_IFMT) {
+   case S_IFREG:
+   case S_IFCHR:
+   case S_IFBLK:
+   case S_IFIFO:
+   case S_IFSOCK:
+   case 0: /* zero mode translates to S_IFREG */
+   return 0;
+   case S_IFDIR:
+   return -EPERM;
+   default:
+   return -EINVAL;
+   }
+}
+
 asmlinkage long sys_mknodat(int dfd, const char __user *filename, int mode,
unsigned dev)
 {
@@ -2040,12 +2057,19 @@ asmlinkage long sys_mknodat(int dfd, con
if (error)
goto out;
dentry = lookup_create(, 0);
-   error = PTR_ERR(dentry);
-
+   if (IS_ERR(dentry)) {
+   error = PTR_ERR(dentry);
+   goto out_unlock;
+   }
if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
-   if (!IS_ERR(dentry)) {
-   switch (mode & S_IFMT) {
+   error = may_mknod(mode);
+   if (error)
+   goto out_dput;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_dput;
+   switch (mode & S_IFMT) {
case 0: case S_IFREG:
error = vfs_create(nd.dentry->d_inode,dentry,mode,);
break;
@@ -2056,14 +2080,11 @@ asmlinkage long sys_mknodat(int dfd, con
case S_IFIFO: case S_IFSOCK:
error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
break;
-   case S_IFDIR:
-   error = -EPERM;
-   break;
-   default:
-   error = -EINVAL;
-   }
-   dput(dentry);
}
+   mnt_drop_write(nd.mnt);
+out_dput:
+   dput(dentry);
+out_unlock:
mutex_unlock(>d_inode->i_mutex);
path_release();
 out:
diff -puN 
fs/nfsd/vfs.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create
 fs/nfsd/vfs.c
--- 
linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create
2007-11-01 14:46:20.0 -0700
+++ linux-2.6.git-dave/fs/nfsd/vfs.c2007-11-01 14:46:20.0 -0700
@@ -1242,7 +1242,11 @@ nfsd_create(struct svc_rqst *rqstp, stru
case S_IFBLK:
case S_IFIFO:
case S_IFSOCK:
+   host_err = mnt_want_write(fhp->fh_export->ex_mnt);
+   if (host_err)
+   break;
host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
+   mnt_drop_write(fhp->fh_export->ex_mnt);
break;
default:
printk("nfsd: bad file type %o in nfsd_create\n", type);
diff -puN 
net/unix/af_unix.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create
 net/unix/af_unix.c
--- 
linux-2.6.git/net/unix/af_unix.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create
   2007-11-01 14:46:20.0 -0700
+++ linux-2.6.git-dave/net/unix/af_unix.c   2007-11-01 14:46:20.0 
-0700
@@ -838,7 +838,11 @@ static int unix_bind(struct socket *sock
 */
mode = S_IFSOCK |
   (SOCK_INODE(sock)->i_mode & ~current->fs->umask);
+   err = mnt_want_write(nd.mnt);
+   if (err)
+   goto out_mknod_dput;
err = vfs_mknod(nd.dentry->d_inode, dentry, mode, 0);
+   mnt_drop_write(nd.mnt);
if (err)
goto out_mknod_dput;
mutex_unlock(>d_inode->i_mutex);
_

[PATCH 19/27] r-o-bind-mounts-elevate-writer-count-for-chown-and-friends

2007-11-01 Thread Dave Hansen


chown, chmod, etc. don't call permission() in the same way that the normal
"open for write" calls do.  They still write to the filesystem, so bump the
write count during these operations.
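
The shape of the change is the same in each caller: take the write count
early, and unwind it through a label that sits just above the older cleanup
label.  A minimal userspace sketch of that shape, loosely modelled on the
fchmod path (the structs and *_model names are invented for illustration,
and -EROFS for a read-only mount is an assumption):

```c
#include <assert.h>
#include <errno.h>

struct mnt_model {
	int readonly;
	int writers;
};

struct file_model {
	struct mnt_model *mnt;
	int immutable;
	int refs;
	unsigned int mode;
};

static int want_write_model(struct mnt_model *mnt)
{
	if (mnt->readonly)
		return -EROFS;    /* assumed error for a read-only mount */
	mnt->writers++;
	return 0;
}

static void drop_write_model(struct mnt_model *mnt)
{
	assert(mnt->writers > 0);
	mnt->writers--;
}

static int fchmod_model(struct file_model *f, unsigned int mode)
{
	int err;

	f->refs++;                         /* models fget() */
	err = want_write_model(f->mnt);
	if (err)
		goto out_putf;                 /* nothing to drop yet */
	err = -EPERM;
	if (f->immutable)
		goto out_drop_write;           /* count is held, must be dropped */
	f->mode = mode;                    /* models notify_change() */
	err = 0;
out_drop_write:
	drop_write_model(f->mnt);
out_putf:
	f->refs--;                         /* models fput() */
	return err;
}
```

Every exit taken after the want succeeds has to pass through the drop label,
which is why the existing 'goto out_putf' lines change to 'goto
out_drop_write' in the diff below.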

This conflicts with the current (~2.6.23-rc7) audit git tree in -mm. 
wiggle'ing the patch merges it.

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/open.c |   39 ++-
 1 file changed, 30 insertions(+), 9 deletions(-)

diff -puN fs/open.c~r-o-bind-mounts-elevate-writer-count-for-chown-and-friends 
fs/open.c
--- 
linux-2.6.git/fs/open.c~r-o-bind-mounts-elevate-writer-count-for-chown-and-friends
  2007-11-01 14:46:17.0 -0700
+++ linux-2.6.git-dave/fs/open.c2007-11-01 14:46:17.0 -0700
@@ -571,12 +571,12 @@ asmlinkage long sys_fchmod(unsigned int 
 
audit_inode(NULL, dentry);
 
-   err = -EROFS;
-   if (IS_RDONLY(inode))
+   err = mnt_want_write(file->f_vfsmnt);
+   if (err)
goto out_putf;
err = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-   goto out_putf;
+   goto out_drop_write;
mutex_lock(>i_mutex);
if (mode == (mode_t) -1)
mode = inode->i_mode;
@@ -585,6 +585,8 @@ asmlinkage long sys_fchmod(unsigned int 
err = notify_change(dentry, );
mutex_unlock(>i_mutex);
 
+out_drop_write:
+   mnt_drop_write(file->f_vfsmnt);
 out_putf:
fput(file);
 out:
@@ -604,13 +606,13 @@ asmlinkage long sys_fchmodat(int dfd, co
goto out;
inode = nd.dentry->d_inode;
 
-   error = -EROFS;
-   if (IS_RDONLY(inode))
+   error = mnt_want_write(nd.mnt);
+   if (error)
goto dput_and_out;
 
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-   goto dput_and_out;
+   goto out_drop_write;
 
mutex_lock(>i_mutex);
if (mode == (mode_t) -1)
@@ -620,6 +622,8 @@ asmlinkage long sys_fchmodat(int dfd, co
error = notify_change(nd.dentry, );
mutex_unlock(>i_mutex);
 
+out_drop_write:
+   mnt_drop_write(nd.mnt);
 dput_and_out:
path_release();
 out:
@@ -642,9 +646,6 @@ static int chown_common(struct dentry * 
printk(KERN_ERR "chown_common: NULL inode\n");
goto out;
}
-   error = -EROFS;
-   if (IS_RDONLY(inode))
-   goto out;
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
goto out;
@@ -675,7 +676,12 @@ asmlinkage long sys_chown(const char __u
error = user_path_walk(filename, );
if (error)
goto out;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_release;
error = chown_common(nd.dentry, user, group);
+   mnt_drop_write(nd.mnt);
+out_release:
path_release();
 out:
return error;
@@ -695,7 +701,12 @@ asmlinkage long sys_fchownat(int dfd, co
error = __user_walk_fd(dfd, filename, follow, );
if (error)
goto out;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_release;
error = chown_common(nd.dentry, user, group);
+   mnt_drop_write(nd.mnt);
+out_release:
path_release();
 out:
return error;
@@ -709,7 +720,12 @@ asmlinkage long sys_lchown(const char __
error = user_path_walk_link(filename, );
if (error)
goto out;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_release;
error = chown_common(nd.dentry, user, group);
+   mnt_drop_write(nd.mnt);
+out_release:
path_release();
 out:
return error;
@@ -726,9 +742,14 @@ asmlinkage long sys_fchown(unsigned int 
if (!file)
goto out;
 
+   error = mnt_want_write(file->f_vfsmnt);
+   if (error)
+   goto out_fput;
dentry = file->f_path.dentry;
audit_inode(NULL, dentry);
error = chown_common(dentry, user, group);
+   mnt_drop_write(file->f_vfsmnt);
+out_fput:
fput(file);
 out:
return error;
_


[PATCH 21/27] r-o-bind-mounts-make-access-use-mnt-check

2007-11-01 Thread Dave Hansen


It is OK to let access() go without using a mnt_want/drop_write() pair because
it doesn't actually do writes to the filesystem, and it is inherently racy
anyway.  This is a rare case when it is OK to use __mnt_is_readonly()
directly.
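
The difference between the two kinds of check can be seen in a small
userspace model.  This is only a sketch: the *_model names are invented, and
having the remount fail with -EBUSY while writers are outstanding is an
assumption made for the example, not something this patch shows.

```c
#include <assert.h>
#include <errno.h>

struct mnt_model {
	int readonly;
	int writers;
};

/* Snapshot only: the answer can be stale as soon as it is returned. */
static int is_readonly_model(const struct mnt_model *mnt)
{
	return mnt->readonly;
}

/* Holding a write count keeps the mount r/w until it is dropped. */
static int want_write_model(struct mnt_model *mnt)
{
	if (mnt->readonly)
		return -EROFS;
	mnt->writers++;
	return 0;
}

static void drop_write_model(struct mnt_model *mnt)
{
	assert(mnt->writers > 0);
	mnt->writers--;
}

/* Assumed behaviour: a remount to r/o is refused while writers exist. */
static int remount_ro_model(struct mnt_model *mnt)
{
	if (mnt->writers > 0)
		return -EBUSY;
	mnt->readonly = 1;
	return 0;
}
```

access() only reports a result and writes nothing, so the snapshot form is
enough there; anything that actually modifies the filesystem wants the
held form.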

Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/open.c |   13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff -puN fs/open.c~r-o-bind-mounts-make-access-use-mnt-check fs/open.c
--- linux-2.6.git/fs/open.c~r-o-bind-mounts-make-access-use-mnt-check   
2007-11-01 14:46:18.0 -0700
+++ linux-2.6.git-dave/fs/open.c2007-11-01 14:46:18.0 -0700
@@ -459,8 +459,17 @@ asmlinkage long sys_faccessat(int dfd, c
if(res || !(mode & S_IWOTH) ||
   special_file(nd.dentry->d_inode->i_mode))
goto out_path_release;
-
-   if(IS_RDONLY(nd.dentry->d_inode))
+   /*
+* This is a rare case where using __mnt_is_readonly()
+* is OK without a mnt_want/drop_write() pair.  Since
+* no actual write to the fs is performed here, we do
+* not need to telegraph that to anyone.
+*
+* By doing this, we accept that this access is
+* inherently racy and know that the fs may change
+* state before we even see this result.
+*/
+   if (__mnt_is_readonly(nd.mnt))
res = -EROFS;
 
 out_path_release:
_


[PATCH 18/27] r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename

2007-11-01 Thread Dave Hansen


This also uses the little helper in the NFS code to make an if() a little bit
less ugly.  We introduced the helper at the beginning of the series.

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namei.c|4 
 linux-2.6.git-dave/fs/nfsd/vfs.c |   15 +++
 2 files changed, 15 insertions(+), 4 deletions(-)

diff -puN 
fs/namei.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename 
fs/namei.c
--- 
linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename
   2007-11-01 14:46:16.0 -0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:16.0 -0700
@@ -2734,8 +2734,12 @@ static int do_rename(int olddfd, const c
if (new_dentry == trap)
goto exit5;
 
+   error = mnt_want_write(oldnd.mnt);
+   if (error)
+   goto exit5;
error = vfs_rename(old_dir->d_inode, old_dentry,
   new_dir->d_inode, new_dentry);
+   mnt_drop_write(oldnd.mnt);
 exit5:
dput(new_dentry);
 exit4:
diff -puN 
fs/nfsd/vfs.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename 
fs/nfsd/vfs.c
--- 
linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename
2007-11-01 14:46:16.0 -0700
+++ linux-2.6.git-dave/fs/nfsd/vfs.c2007-11-01 14:46:16.0 -0700
@@ -1668,13 +1668,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru
if (ndentry == trap)
goto out_dput_new;
 
-#ifdef MSNFS
-   if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
+   if (svc_msnfs(ffhp) &&
((atomic_read(>d_count) > 1)
 || (atomic_read(>d_count) > 1))) {
host_err = -EPERM;
-   } else
-#endif
+   goto out_dput_new;
+   }
+
+   host_err = -EXDEV;
+   if (ffhp->fh_export->ex_mnt != tfhp->fh_export->ex_mnt)
+   goto out_dput_new;
+   host_err = mnt_want_write(ffhp->fh_export->ex_mnt);
+   if (host_err)
+   goto out_dput_new;
+
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err && EX_ISSYNC(tfhp->fh_export)) {
host_err = nfsd_sync_dir(tdentry);
_


[PATCH 20/27] r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate

2007-11-01 Thread Dave Hansen


Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/open.c |   14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff -puN fs/open.c~r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate 
fs/open.c
--- 
linux-2.6.git/fs/open.c~r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate
2007-11-01 14:46:18.0 -0700
+++ linux-2.6.git-dave/fs/open.c2007-11-01 14:46:18.0 -0700
@@ -244,21 +244,21 @@ static long do_sys_truncate(const char _
if (!S_ISREG(inode->i_mode))
goto dput_and_out;
 
-   error = vfs_permission(, MAY_WRITE);
+   error = mnt_want_write(nd.mnt);
if (error)
goto dput_and_out;
 
-   error = -EROFS;
-   if (IS_RDONLY(inode))
-   goto dput_and_out;
+   error = vfs_permission(, MAY_WRITE);
+   if (error)
+   goto mnt_drop_write_and_out;
 
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-   goto dput_and_out;
+   goto mnt_drop_write_and_out;
 
error = get_write_access(inode);
if (error)
-   goto dput_and_out;
+   goto mnt_drop_write_and_out;
 
/*
 * Make sure that there are no leases.  get_write_access() protects
@@ -276,6 +276,8 @@ static long do_sys_truncate(const char _
 
 put_write_and_out:
put_write_access(inode);
+mnt_drop_write_and_out:
+   mnt_drop_write(nd.mnt);
 dput_and_out:
path_release();
 out:
_


[PATCH 16/27] r-o-bind-mounts-elevate-write-count-for-some-ioctls

2007-11-01 Thread Dave Hansen


Some ioctl()s can cause writes to the filesystem.  Take these, and make them
use mnt_want/drop_write() instead.

We need to pass the filp one layer deeper in XFS, but somebody _just_ pulled
it out in February because nobody was using it, so I don't feel guilty for
adding it back.

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/ext2/ioctl.c  |   46 ++
 linux-2.6.git-dave/fs/ext3/ioctl.c  |  100 +++---
 linux-2.6.git-dave/fs/ext4/ioctl.c  |  105 +++-
 linux-2.6.git-dave/fs/fat/file.c|   10 +-
 linux-2.6.git-dave/fs/hfsplus/ioctl.c   |   40 +
 linux-2.6.git-dave/fs/jfs/ioctl.c   |   33 ---
 linux-2.6.git-dave/fs/ocfs2/ioctl.c |   11 +-
 linux-2.6.git-dave/fs/reiserfs/ioctl.c  |   55 
 linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_ioctl.c |   15 ++-
 linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_iops.c  |7 -
 linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_lrw.c   |9 +-
 11 files changed, 274 insertions(+), 157 deletions(-)

diff -puN fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 
fs/ext2/ioctl.c
--- 
linux-2.6.git/fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls
   2007-11-01 14:46:14.0 -0700
+++ linux-2.6.git-dave/fs/ext2/ioctl.c  2007-11-01 14:46:14.0 -0700
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -20,6 +21,7 @@
 int ext2_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
unsigned long arg)
 {
+   int ret;
struct ext2_inode_info *ei = EXT2_I(inode);
unsigned int flags;
unsigned short rsv_window_size;
@@ -34,14 +36,19 @@ int ext2_ioctl (struct inode * inode, st
case EXT2_IOC_SETFLAGS: {
unsigned int oldflags;
 
-   if (IS_RDONLY(inode))
-   return -EROFS;
-
-   if (!is_owner_or_cap(inode))
-   return -EACCES;
+   ret = mnt_want_write(filp->f_vfsmnt);
+   if (ret)
+   return ret;
+
+   if (!is_owner_or_cap(inode)) {
+   ret = -EACCES;
+   goto setflags_out;
+   }
 
-   if (get_user(flags, (int __user *) arg))
-   return -EFAULT;
+   if (get_user(flags, (int __user *) arg)) {
+   ret = -EFAULT;
+   goto setflags_out;
+   }
 
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT2_DIRSYNC_FL;
@@ -58,7 +65,8 @@ int ext2_ioctl (struct inode * inode, st
if ((flags ^ oldflags) & (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(>i_mutex);
-   return -EPERM;
+   ret = -EPERM;
+   goto setflags_out;
}
}
 
@@ -70,20 +78,26 @@ int ext2_ioctl (struct inode * inode, st
ext2_set_inode_flags(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
-   return 0;
+setflags_out:
+   mnt_drop_write(filp->f_vfsmnt);
+   return ret;
}
case EXT2_IOC_GETVERSION:
return put_user(inode->i_generation, (int __user *) arg);
case EXT2_IOC_SETVERSION:
if (!is_owner_or_cap(inode))
return -EPERM;
-   if (IS_RDONLY(inode))
-   return -EROFS;
-   if (get_user(inode->i_generation, (int __user *) arg))
-   return -EFAULT; 
-   inode->i_ctime = CURRENT_TIME_SEC;
-   mark_inode_dirty(inode);
-   return 0;
+   ret = mnt_want_write(filp->f_vfsmnt);
+   if (ret)
+   return ret;
+   if (get_user(inode->i_generation, (int __user *) arg)) {
+   ret = -EFAULT;
+   } else {
+   inode->i_ctime = CURRENT_TIME_SEC;
+   mark_inode_dirty(inode);
+   }
+   mnt_drop_write(filp->f_vfsmnt);
+   return ret;
case EXT2_IOC_GETRSVSZ:
if (test_opt(inode->i_sb, RESERVATION)
&& S_ISREG(inode->i_mode)
diff -puN fs/ext3/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 
fs/ext3/ioctl.c
--- 
linux-2.6.git/fs/ext3/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls
   2007-11-01 14:46:14.0 -0700
+++ linux-2.6.git-dave/fs/ext3/ioctl.c  2007-11-01 14:46:14.0 -0700
@@ -12,6 

[PATCH 17/27] r-o-bind-mounts-elevate-write-count-opend-files

2007-11-01 Thread Dave Hansen

This is an old patch combined with a couple of other ones I've
been working on.  It should fix the issues that Miklos has
spotted.

--

This is the first really tricky patch in the series.  It
elevates the writer count on a mount each time a non-special
file is opened for write.

We used to do this in may_open(), but Miklos pointed out
that __dentry_open() is used as well to create filps.  This
will cover even those cases, while a call in may_open()
would not have.

There is also an elevated count around the vfs_create() call
in open_namei().  See the comments for more details, but we
need this to fix a 'create, remount, fail r/w open()' race.

Some filesystems forego the use of normal vfs calls to create
struct files.   Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/file_table.c |   16 +++-
 linux-2.6.git-dave/fs/namei.c  |   74 -
 linux-2.6.git-dave/fs/open.c   |   36 +-
 linux-2.6.git-dave/ipc/mqueue.c|3 +
 4 files changed, 116 insertions(+), 13 deletions(-)

diff -puN fs/file_table.c~r-o-bind-mounts-elevate-write-count-opend-files 
fs/file_table.c
--- 
linux-2.6.git/fs/file_table.c~r-o-bind-mounts-elevate-write-count-opend-files   
2007-11-01 14:46:16.0 -0700
+++ linux-2.6.git-dave/fs/file_table.c  2007-11-01 14:46:16.0 -0700
@@ -193,6 +193,17 @@ int init_file(struct file *file, struct 
file->f_mapping = dentry->d_inode->i_mapping;
file->f_mode = mode;
file->f_op = fop;
+
+   /*
+* These mounts don't really matter in practice
+* for r/o bind mounts.  They aren't userspace-
+* visible.  We do this for consistency, and so
+* that we can do debugging checks at __fput().
+*/
+   if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
+   error = mnt_want_write(mnt);
+   WARN_ON(error);
+   }
return error;
 }
 EXPORT_SYMBOL(init_file);
@@ -230,8 +241,11 @@ void fastcall __fput(struct file *file)
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
cdev_put(inode->i_cdev);
fops_put(file->f_op);
-   if (file->f_mode & FMODE_WRITE)
+   if (file->f_mode & FMODE_WRITE) {
put_write_access(inode);
+   if (!special_file(inode->i_mode))
+   mnt_drop_write(mnt);
+   }
put_pid(file->f_owner.pid);
file_kill(file);
file->f_path.dentry = NULL;
diff -puN fs/namei.c~r-o-bind-mounts-elevate-write-count-opend-files fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-opend-files
2007-11-01 14:46:16.0 -0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:16.0 -0700
@@ -1620,12 +1620,12 @@ int may_open(struct nameidata *nd, int a
return -EACCES;
 
flag &= ~O_TRUNC;
-   } else if (IS_RDONLY(inode) && (flag & FMODE_WRITE))
-   return -EROFS;
+   }
 
error = vfs_permission(nd, acc_mode);
if (error)
return error;
+
/*
 * An append-only file must be opened in append mode for writing.
 */
@@ -1721,18 +1721,31 @@ static inline int sys_open_flags_to_name
return flag;
 }
 
+static inline int open_will_write_to_fs(int flag, struct inode *inode)
+{
+   /*
+* We'll never write to the fs underlying
+* a device file.
+*/
+   if (special_file(inode->i_mode))
+   return 0;
+   return (flag & O_TRUNC);
+}
+
 /*
  * open_pathname()
  *
  * namei for open - this is in fact almost the whole open-routine.
  *
- * Note that the low bits of "flag" aren't the same as in the open
- * system call.  See sys_open_flags_to_namei_flags().
+ * Note that the low bits of the passed in "sys_open_flag"
+ * are not the same as in the local variable "flag". See
+ * sys_open_flags_to_namei_flags() for more details.
  * SMP-safe
  */
 struct file *open_pathname(int dfd, const char *pathname,
   int sys_open_flag, int mode)
 {
+   struct file *filp;
struct nameidata nd;
int acc_mode, error;
struct path path;
@@ -1792,17 +1805,30 @@ do_last:
}
 
if (IS_ERR(nd.intent.open.file)) {
-   mutex_unlock(>d_inode->i_mutex);
error = PTR_ERR(nd.intent.open.file);
-   goto exit_dput;
+   goto exit_mutex_unlock;
}
 
/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
-   error = __open_namei_create(, , flag, mode);
+   /*
+* This write is needed to ensure that a
+* ro->rw transition does not occur between
+* the time when the file is 

[PATCH 15/27] r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls

2007-11-01 Thread Dave Hansen


Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namei.c |   10 ++
 1 file changed, 10 insertions(+)

diff -puN 
fs/namei.c~r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls 
fs/namei.c
--- 
linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls
 2007-11-01 14:46:14.0 -0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:14.0 -0700
@@ -2349,7 +2349,12 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(dentry))
goto out_unlock;
 
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_dput;
error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO);
+   mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(>d_inode->i_mutex);
@@ -2444,7 +2449,12 @@ asmlinkage long sys_linkat(int olddfd, c
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_unlock;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_dput;
error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
+   mnt_drop_write(nd.mnt);
+out_dput:
dput(new_dentry);
 out_unlock:
mutex_unlock(>d_inode->i_mutex);
_


[PATCH 12/27] r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime

2007-11-01 Thread Dave Hansen


Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/inode.c |   20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff -puN 
fs/inode.c~r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime 
fs/inode.c
--- 
linux-2.6.git/fs/inode.c~r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime
   2007-11-01 14:46:12.0 -0700
+++ linux-2.6.git-dave/fs/inode.c   2007-11-01 14:46:12.0 -0700
@@ -1203,22 +1203,23 @@ void touch_atime(struct vfsmount *mnt, s
struct inode *inode = dentry->d_inode;
struct timespec now;
 
-   if (inode->i_flags & S_NOATIME)
+   if (mnt && mnt_want_write(mnt))
return;
+   if (inode->i_flags & S_NOATIME)
+   goto out;
if (IS_NOATIME(inode))
-   return;
+   goto out;
if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
-   return;
+   goto out;
 
/*
 * We may have a NULL vfsmount when coming from NFSD
 */
if (mnt) {
if (mnt->mnt_flags & MNT_NOATIME)
-   return;
+   goto out;
if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
-   return;
-
+   goto out;
if (mnt->mnt_flags & MNT_RELATIME) {
/*
 * With relative atime, only update atime if the
@@ -1229,16 +1230,19 @@ void touch_atime(struct vfsmount *mnt, s
>i_atime) < 0 &&
timespec_compare(>i_ctime,
>i_atime) < 0)
-   return;
+   goto out;
}
}
 
now = current_fs_time(inode->i_sb);
if (timespec_equal(>i_atime, ))
-   return;
+   goto out;
 
inode->i_atime = now;
mark_inode_dirty_sync(inode);
+out:
+   if (mnt)
+   mnt_drop_write(mnt);
 }
 EXPORT_SYMBOL(touch_atime);
 
_


[PATCH 14/27] r-o-bind-mounts-elevate-write-count-for-file_update_time

2007-11-01 Thread Dave Hansen


Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/inode.c |   13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff -puN fs/inode.c~r-o-bind-mounts-elevate-write-count-for-file_update_time 
fs/inode.c
--- 
linux-2.6.git/fs/inode.c~r-o-bind-mounts-elevate-write-count-for-file_update_time
   2007-11-01 14:46:13.0 -0700
+++ linux-2.6.git-dave/fs/inode.c   2007-11-01 14:46:13.0 -0700
@@ -1263,10 +1263,19 @@ void file_update_time(struct file *file)
struct inode *inode = file->f_path.dentry->d_inode;
struct timespec now;
int sync_it = 0;
+   int err = 0;
 
if (IS_NOCMTIME(inode))
return;
-   if (IS_RDONLY(inode))
+   /*
+* Ideally, we want to guarantee that 'f_vfsmnt'
+* is non-NULL here.  But, NFS exports need to
+* be fixed up before we can do that.  So, check
+* it for now. - Dave Hansen
+*/
+   if (file->f_vfsmnt)
+   err = mnt_want_write(file->f_vfsmnt);
+   if (err)
return;
 
now = current_fs_time(inode->i_sb);
@@ -1282,6 +1291,8 @@ void file_update_time(struct file *file)
 
if (sync_it)
mark_inode_dirty_sync(inode);
+   if (file->f_vfsmnt)
+   mnt_drop_write(file->f_vfsmnt);
 }
 
 EXPORT_SYMBOL(file_update_time);
_


[PATCH 11/27] r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl

2007-11-01 Thread Dave Hansen

fs/ncpfs/ioctl.c: In function 'ncp_ioctl_need_write':
fs/ncpfs/ioctl.c:852: error: label at end of compound statement
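
For reference, the error comes from the C grammar (before C23): a label,
including 'default:', must be followed by a statement, and a comment is not
one.  An empty statement is enough.  A small sketch with made-up command
numbers (the CMD_* values and the function name are hypothetical, not the
NCP ioctl numbers):

```c
#include <assert.h>   /* for the checks in the note below */

enum {
	CMD_GET_INFO_MODEL = 1,   /* hypothetical read-only command */
	CMD_SET_ROOT_MODEL = 2    /* hypothetical command that writes */
};

static int ioctl_need_write_model(unsigned int cmd)
{
	switch (cmd) {
	case CMD_SET_ROOT_MODEL:
		return 1;
	case CMD_GET_INFO_MODEL:
		return 0;
	default:
		/* unknown command, assume write */
		;   /* empty statement: gives the label something to attach to */
	}
	return 1;
}
```

With the lone ';' removed, gcc rejects the function with the message quoted
above.  With it in place, assert(ioctl_need_write_model(99) == 1) holds, so
unknown commands are treated as writes.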

Cc: Dave Hansen <[EMAIL PROTECTED]>
Cc: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/ncpfs/ioctl.c |   57 ++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff -puN 
fs/ncpfs/ioctl.c~r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl 
fs/ncpfs/ioctl.c
--- 
linux-2.6.git/fs/ncpfs/ioctl.c~r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl
  2007-11-01 14:46:11.0 -0700
+++ linux-2.6.git-dave/fs/ncpfs/ioctl.c 2007-11-01 14:46:11.0 -0700
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* serv
 }
 #endif /* CONFIG_NCPFS_NLS */
 
-int ncp_ioctl(struct inode *inode, struct file *filp,
+static int __ncp_ioctl(struct inode *inode, struct file *filp,
  unsigned int cmd, unsigned long arg)
 {
struct ncp_server *server = NCP_SERVER(inode);
@@ -817,11 +818,63 @@ outrel:   
return -EFAULT;
return 0;
}
-
+   default: /* unknown IOCTL command, assume write */
+   ;
}
return -EINVAL;
 }
 
+static int ncp_ioctl_need_write(unsigned int cmd)
+{
+   switch (cmd) {
+   case NCP_IOC_GET_FS_INFO:
+   case NCP_IOC_GET_FS_INFO_V2:
+   case NCP_IOC_NCPREQUEST:
+   case NCP_IOC_SETDENTRYTTL:
+   case NCP_IOC_SIGN_INIT:
+   case NCP_IOC_LOCKUNLOCK:
+   case NCP_IOC_SET_SIGN_WANTED:
+   return 1;
+   case NCP_IOC_GETOBJECTNAME:
+   case NCP_IOC_SETOBJECTNAME:
+   case NCP_IOC_GETPRIVATEDATA:
+   case NCP_IOC_SETPRIVATEDATA:
+   case NCP_IOC_SETCHARSETS:
+   case NCP_IOC_GETCHARSETS:
+   case NCP_IOC_CONN_LOGGED_IN:
+   case NCP_IOC_GETDENTRYTTL:
+   case NCP_IOC_GETMOUNTUID2:
+   case NCP_IOC_SIGN_WANTED:
+   case NCP_IOC_GETROOT:
+   case NCP_IOC_SETROOT:
+   return 0;
+   default: /* unknown IOCTL command, assume write */
+   break;
+   }
+   return 1;
+}
+
+int ncp_ioctl(struct inode *inode, struct file *filp,
+ unsigned int cmd, unsigned long arg)
+{
+   int ret;
+
+   if (ncp_ioctl_need_write(cmd)) {
+   /*
+* inside the ioctl(), any failures which
+* are because of file_permission() are
+* -EACCES, so it seems consistent to keep
+*  that here.
+*/
+   if (mnt_want_write(filp->f_vfsmnt))
+   return -EACCES;
+   }
+   ret = __ncp_ioctl(inode, filp, cmd, arg);
+   if (ncp_ioctl_need_write(cmd))
+   mnt_drop_write(filp->f_vfsmnt);
+   return ret;
+}
+
 #ifdef CONFIG_COMPAT
 long ncp_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
_


[PATCH 07/27] r-o-bind-mounts-do_rmdir-elevate-write-count

2007-11-01 Thread Dave Hansen


Elevate the write count during the vfs_rmdir() call.
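
The label ordering is the part worth a second look: when
mnt_want_write() fails there is nothing to drop, but the dentry
reference from the lookup still has to be released.  A rough userspace
sketch of that unwinding, with made-up stand-in types and a writer
counter that the stub functions in this series do not have yet:

```c
#include <errno.h>

/* Hypothetical stand-ins; these are not the kernel's types. */
struct mnt_sketch { int readonly; int writers; };
struct dentry_sketch { int refs; };

static int want_write(struct mnt_sketch *m)
{
	if (m->readonly)
		return -EROFS;
	m->writers++;	/* assumption: counting is not in the stubs yet */
	return 0;
}

static void drop_write(struct mnt_sketch *m) { m->writers--; }
static void put_dentry(struct dentry_sketch *d) { d->refs--; }

static int rmdir_sketch(struct mnt_sketch *m, struct dentry_sketch *d)
{
	int error;

	d->refs++;		/* the lookup took a reference */
	error = want_write(m);
	if (error)
		goto exit3;	/* write access was never taken */
	error = 0;		/* the real code calls vfs_rmdir() here */
	drop_write(m);
exit3:
	put_dentry(d);		/* always release the lookup reference */
	return error;
}
```

Both counters end up back at zero on either path, which is the
property the want/drop pairs are meant to preserve.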

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namei.c |5 +
 1 file changed, 5 insertions(+)

diff -puN fs/namei.c~r-o-bind-mounts-do_rmdir-elevate-write-count fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-do_rmdir-elevate-write-count   
2007-11-01 14:46:09.0 -0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:09.0 -0700
@@ -2174,7 +2174,12 @@ static long do_rmdir(int dfd, const char
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit2;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto exit3;
error = vfs_rmdir(nd.dentry->d_inode, dentry);
+   mnt_drop_write(nd.mnt);
+exit3:
dput(dentry);
 exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_


[PATCH 13/27] r-o-bind-mounts-elevate-write-count-for-do_utimes

2007-11-01 Thread Dave Hansen

Now includes fix for oops seen by akpm.

"never let a libc developer write your kernel code" - hch

"nor, apparently, a kernel developer" - akpm

Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Cc: Christoph Hellwig <[EMAIL PROTECTED]>
Cc: Valdis Kletnieks <[EMAIL PROTECTED]>
Cc: Balbir Singh <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/utimes.c |   18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff -puN fs/utimes.c~r-o-bind-mounts-elevate-write-count-for-do_utimes 
fs/utimes.c
--- linux-2.6.git/fs/utimes.c~r-o-bind-mounts-elevate-write-count-for-do_utimes 
2007-11-01 14:46:13.0 -0700
+++ linux-2.6.git-dave/fs/utimes.c  2007-11-01 14:46:13.0 -0700
@@ -2,6 +2,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -58,6 +59,7 @@ long do_utimes(int dfd, char __user *fil
struct inode *inode;
struct iattr newattrs;
struct file *f = NULL;
+   struct vfsmount *mnt;
 
error = -EINVAL;
if (times && (!nsec_valid(times[0].tv_nsec) ||
@@ -78,18 +80,20 @@ long do_utimes(int dfd, char __user *fil
if (!f)
goto out;
dentry = f->f_path.dentry;
+   mnt = f->f_path.mnt;
} else {
error = __user_walk_fd(dfd, filename, (flags & AT_SYMLINK_NOFOLLOW) ? 0 : LOOKUP_FOLLOW, &nd);
if (error)
goto out;
 
dentry = nd.dentry;
+   mnt = nd.mnt;
}
 
inode = dentry->d_inode;
 
-   error = -EROFS;
-   if (IS_RDONLY(inode))
+   error = mnt_want_write(mnt);
+   if (error)
goto dput_and_out;
 
/* Don't worry, the checks are done in inode_change_ok() */
@@ -97,7 +101,7 @@ long do_utimes(int dfd, char __user *fil
if (times) {
error = -EPERM;
 if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
-goto dput_and_out;
+   goto mnt_drop_write_and_out;
 
if (times[0].tv_nsec == UTIME_OMIT)
newattrs.ia_valid &= ~ATTR_ATIME;
@@ -117,22 +121,24 @@ long do_utimes(int dfd, char __user *fil
} else {
error = -EACCES;
 if (IS_IMMUTABLE(inode))
-goto dput_and_out;
+   goto mnt_drop_write_and_out;
 
if (!is_owner_or_cap(inode)) {
if (f) {
if (!(f->f_mode & FMODE_WRITE))
-   goto dput_and_out;
+   goto mnt_drop_write_and_out;
} else {
error = vfs_permission(&nd, MAY_WRITE);
if (error)
-   goto dput_and_out;
+   goto mnt_drop_write_and_out;
}
}
}
mutex_lock(&inode->i_mutex);
error = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+mnt_drop_write_and_out:
+   mnt_drop_write(mnt);
 dput_and_out:
if (f)
fput(f);
_


[PATCH 10/27] r-o-bind-mounts-elevate-mount-count-for-extended-attributes

2007-11-01 Thread Dave Hansen

This basically audits the callers of xattr_permission(), which calls
permission() and can perform writes to the filesystem.

Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/nfsd/nfs4proc.c |7 ++-
 linux-2.6.git-dave/fs/xattr.c |   16 ++--
 2 files changed, 20 insertions(+), 3 deletions(-)

diff -puN 
fs/nfsd/nfs4proc.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes 
fs/nfsd/nfs4proc.c
--- 
linux-2.6.git/fs/nfsd/nfs4proc.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes
2007-11-01 14:46:11.0 -0700
+++ linux-2.6.git-dave/fs/nfsd/nfs4proc.c   2007-11-01 14:46:11.0 
-0700
@@ -658,14 +658,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st
return status;
}
}
+   status = mnt_want_write(cstate->current_fh.fh_export->ex_mnt);
+   if (status)
+   return status;
status = nfs_ok;
if (setattr->sa_acl != NULL)
status = nfsd4_set_nfs4_acl(rqstp, &cstate->current_fh,
setattr->sa_acl);
if (status)
-   return status;
+   goto out;
status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr,
0, (time_t)0);
+out:
+   mnt_drop_write(cstate->current_fh.fh_export->ex_mnt);
return status;
 }
 
diff -puN 
fs/xattr.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes 
fs/xattr.c
--- 
linux-2.6.git/fs/xattr.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes
2007-11-01 14:46:11.0 -0700
+++ linux-2.6.git-dave/fs/xattr.c   2007-11-01 14:46:11.0 -0700
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, co
 * filesystem  or on an immutable / append-only inode.
 */
if (mask & MAY_WRITE) {
-   if (IS_RDONLY(inode))
-   return -EROFS;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
return -EPERM;
}
@@ -235,7 +234,11 @@ sys_setxattr(char __user *path, char __u
error = user_path_walk(path, &nd);
if (error)
return error;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   return error;
error = setxattr(nd.dentry, name, value, size, flags);
+   mnt_drop_write(nd.mnt);
path_release(&nd);
return error;
 }
@@ -250,7 +253,11 @@ sys_lsetxattr(char __user *path, char __
error = user_path_walk_link(path, &nd);
if (error)
return error;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   return error;
error = setxattr(nd.dentry, name, value, size, flags);
+   mnt_drop_write(nd.mnt);
path_release(&nd);
return error;
 }
@@ -266,9 +273,14 @@ sys_fsetxattr(int fd, char __user *name,
f = fget(fd);
if (!f)
return error;
+   error = mnt_want_write(f->f_vfsmnt);
+   if (error)
+   goto out_fput;
dentry = f->f_path.dentry;
audit_inode(NULL, dentry);
error = setxattr(dentry, name, value, size, flags);
+   mnt_drop_write(f->f_vfsmnt);
+out_fput:
fput(f);
return error;
 }
_


[PATCH 01/27] do namei_flags calculation inside open_namei()

2007-11-01 Thread Dave Hansen

My end goal here is to make sure all users of may_open()
return filps.  This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().

This patch moves the sys_open flags to namei flags
calculation into fs/namei.c.  We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.
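
For reference, the conversion being moved can be tried out in
userspace.  The sketch below copies the helper from the diff and
relies on the usual Linux values (O_RDONLY 0, O_WRONLY 1, O_RDWR 2,
O_ACCMODE 3); that is an assumption about the build platform, not
something this patch guarantees.

```c
#include <fcntl.h>

/*
 * Adding one to the access mode turns the sys_open() encoding into a
 * bitmask where bit 0 means "read" and bit 1 means "write".  The
 * "special" value 3 would wrap to 0, so it is left untouched.
 */
static int sys_open_flags_to_namei_flags(int flag)
{
	if ((flag + 1) & O_ACCMODE)
		flag++;
	return flag;
}
```

Since the increment only ever touches the low two bits, flags such as
O_CREAT pass through unchanged.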

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namei.c |   43 +-
 linux-2.6.git-dave/fs/open.c  |   22 +
 2 files changed, 36 insertions(+), 29 deletions(-)

diff -puN fs/namei.c~do-namei_flags-calculation-inside-open_namei fs/namei.c
--- linux-2.6.git/fs/namei.c~do-namei_flags-calculation-inside-open_namei   
2007-11-01 14:46:04.0 -0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:04.0 -0700
@@ -1674,7 +1674,12 @@ int may_open(struct nameidata *nd, int a
return 0;
 }
 
-static int open_namei_create(struct nameidata *nd, struct path *path,
+/*
+ * Be careful about ever adding any more callers of this
+ * function.  Its flags must be in the namei format, not
+ * what gets passed to sys_open().
+ */
+static int __open_namei_create(struct nameidata *nd, struct path *path,
int flag, int mode)
 {
int error;
@@ -1693,26 +1698,46 @@ static int open_namei_create(struct name
 }
 
 /*
+ * Note that while the flag value (low two bits) for sys_open means:
+ * 00 - read-only
+ * 01 - write-only
+ * 10 - read-write
+ * 11 - special
+ * it is changed into
+ * 00 - no permissions needed
+ * 01 - read-permission
+ * 10 - write-permission
+ * 11 - read-write
+ * for the internal routines (ie open_namei()/follow_link() etc)
+ * This is more logical, and also allows the 00 "no perm needed"
+ * to be used for symlinks (where the permissions are checked
+ * later).
+ *
+*/
+static inline int sys_open_flags_to_namei_flags(int flag)
+{
+   if ((flag+1) & O_ACCMODE)
+   flag++;
+   return flag;
+}
+
+/*
  * open_namei()
  *
  * namei for open - this is in fact almost the whole open-routine.
  *
  * Note that the low bits of "flag" aren't the same as in the open
- * system call - they are 00 - no permissions needed
- *   01 - read permission needed
- *   10 - write permission needed
- *   11 - read/write permissions needed
- * which is a lot more logical, and also allows the "no perm" needed
- * for symlinks (where the permissions are checked later).
+ * system call.  See sys_open_flags_to_namei_flags().
  * SMP-safe
  */
-int open_namei(int dfd, const char *pathname, int flag,
+int open_namei(int dfd, const char *pathname, int sys_open_flag,
int mode, struct nameidata *nd)
 {
int acc_mode, error;
struct path path;
struct dentry *dir;
int count = 0;
+   int flag = sys_open_flags_to_namei_flags(sys_open_flag);
 
acc_mode = ACC_MODE(flag);
 
@@ -1773,7 +1798,7 @@ do_last:
 
/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
-   error = open_namei_create(nd, &path, flag, mode);
+   error = __open_namei_create(nd, &path, flag, mode);
if (error)
goto exit;
return 0;
diff -puN fs/open.c~do-namei_flags-calculation-inside-open_namei fs/open.c
--- linux-2.6.git/fs/open.c~do-namei_flags-calculation-inside-open_namei
2007-11-01 14:46:04.0 -0700
+++ linux-2.6.git-dave/fs/open.c2007-11-01 14:46:04.0 -0700
@@ -800,31 +800,13 @@ cleanup_file:
return ERR_PTR(error);
 }
 
-/*
- * Note that while the flag value (low two bits) for sys_open means:
- * 00 - read-only
- * 01 - write-only
- * 10 - read-write
- * 11 - special
- * it is changed into
- * 00 - no permissions needed
- * 01 - read-permission
- * 10 - write-permission
- * 11 - read-write
- * for the internal routines (ie open_namei()/follow_link() etc). 00 is
- * used by symlinks.
- */
 static struct file *do_filp_open(int dfd, const char *filename, int flags,
 int mode)
 {
-   int namei_flags, error;
+   int error;
struct nameidata nd;
 
-   namei_flags = flags;
-   if ((namei_flags+1) & O_ACCMODE)
-   namei_flags++;
-
-   error = open_namei(dfd, filename, namei_flags, mode, &nd);
+   error = open_namei(dfd, filename, flags, mode, &nd);
if (!error)
return nameidata_to_filp(&nd, flags);
 
_


[PATCH 00/27] Read-only bind mounts (-mm resend)

2007-11-01 Thread Dave Hansen
This is against 2.6.24-rc1 + recent git.

I've integrated all of the fixes from mm, and included cleanups
in a different order.  This also includes some extra fput-time
checking to ensure that we have balanced mount writer counts.

These replace the patches in -mm mostly because the new fixes
require some cleanups that are functionally independent from
the r/o bind mount code itself.  These fixes precede the rest
of the set and require some hand merging.

If you're going to review one and only one patch, 17/27 (the
one for "opened" files) is the most critical.

---

Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem.
In the process of doing that, it also provides infrastructure for
keeping track of the number of writers to any given mount.

This has a number of uses.  It allows chroots to have parts of
filesystems writable.  It will be useful for containers in the future
because users may have root inside a container, but should not
be allowed to write to some filesystems.  This also replaces
patches that vserver has had out of the tree for several years.

It allows security enhancement by making sure that parts of
your filesystem are read-only (such as when you don't trust your
FTP server), when you don't want to have entire new filesystems
mounted, or when you want atime selectively updated.

I've been using this script:

http://sr71.net/~dave/linux/robind-test.sh

to test that the feature is working as desired.  It takes a
directory and makes a regular bind and a r/o bind mount of it.
It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail,
like creating a file on the r/o mount.

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>


[PATCH 06/27] r-o-bind-mounts-stub-functions

2007-11-01 Thread Dave Hansen


This patch adds two functions, mnt_want_write() and mnt_drop_write().  These are
used like a lock pair around any fs operations that might cause a write to the
filesystem.

Before these can become useful, we must first cover each place in the VFS
where writes are performed with a want/drop pair.  When that is complete, we
can actually introduce code that will safely check the counts before allowing
r/w<->r/o transitions to occur.
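
To make the end state concrete, here is a rough userspace model of
where the pair is headed once counting is added.  Everything in it is
hypothetical: the struct, the model_* names and the -EBUSY return are
assumptions for illustration, and a real implementation would need
locking or per-cpu counters rather than a plain int.

```c
#include <errno.h>

/* Hypothetical model of a mount; not the kernel's struct vfsmount. */
struct mnt_model {
	int readonly;	/* non-zero once the mount is r/o */
	int writers;	/* writes currently in flight */
};

static int model_want_write(struct mnt_model *m)
{
	if (m->readonly)
		return -EROFS;
	m->writers++;
	return 0;
}

static void model_drop_write(struct mnt_model *m)
{
	m->writers--;
}

/* Refuse the r/w -> r/o transition while any write is in flight. */
static int model_make_readonly(struct mnt_model *m)
{
	if (m->writers > 0)
		return -EBUSY;
	m->readonly = 1;
	return 0;
}
```

The transition check is only trustworthy if every write path in the
VFS is bracketed by a want/drop pair, which is why the conversions
have to land first.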

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Cc: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namespace.c|   54 +++
 linux-2.6.git-dave/include/linux/mount.h |3 +
 2 files changed, 57 insertions(+)

diff -puN fs/namespace.c~r-o-bind-mounts-stub-functions fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-stub-functions 2007-11-01 
14:46:08.0 -0700
+++ linux-2.6.git-dave/fs/namespace.c   2007-11-01 14:46:08.0 -0700
@@ -77,6 +77,60 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;
 }
 
+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes are able to occur to
+ * a filesystem.
+ */
+/**
+ * mnt_want_write - get write access to a mount
+ * @mnt: the mount on which to take a write
+ *
+ * This tells the low-level filesystem that a write is
+ * about to be performed to it, and makes sure that
+ * writes are allowed before returning success.  When
+ * the write operation is finished, mnt_drop_write()
+ * must be called.  This is effectively a refcount.
+ */
+int mnt_want_write(struct vfsmount *mnt)
+{
+   if (__mnt_is_readonly(mnt))
+   return -EROFS;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(mnt_want_write);
+
+/**
+ * mnt_drop_write - give up write access to a mount
+ * @mnt: the mount on which to give up write access
+ *
+ * Tells the low-level filesystem that we are done
+ * performing writes to it.  Must be matched with
+ * mnt_want_write() call above.
+ */
+void mnt_drop_write(struct vfsmount *mnt)
+{
+}
+EXPORT_SYMBOL_GPL(mnt_drop_write);
+
+/*
+ * __mnt_is_readonly: check whether a mount is read-only
+ * @mnt: the mount to check for its write status
+ *
+ * This shouldn't be used directly outside of the VFS.
+ * It does not guarantee that the filesystem will stay
+ * r/w, just that it is right *now*.  This can not and
+ * should not be used in place of IS_RDONLY(inode).
+ */
+int __mnt_is_readonly(struct vfsmount *mnt)
+{
+   return (mnt->mnt_sb->s_flags & MS_RDONLY);
+}
+EXPORT_SYMBOL_GPL(__mnt_is_readonly);
+
 int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
 {
mnt->mnt_sb = sb;
diff -puN include/linux/mount.h~r-o-bind-mounts-stub-functions 
include/linux/mount.h
--- linux-2.6.git/include/linux/mount.h~r-o-bind-mounts-stub-functions  
2007-11-01 14:46:08.0 -0700
+++ linux-2.6.git-dave/include/linux/mount.h2007-11-01 14:46:08.0 
-0700
@@ -70,9 +70,12 @@ static inline struct vfsmount *mntget(st
return mnt;
 }
 
+extern int mnt_want_write(struct vfsmount *mnt);
+extern void mnt_drop_write(struct vfsmount *mnt);
 extern void mntput_no_expire(struct vfsmount *mnt);
 extern void mnt_pin(struct vfsmount *mnt);
 extern void mnt_unpin(struct vfsmount *mnt);
+extern int __mnt_is_readonly(struct vfsmount *mnt);
 
 static inline void mntput(struct vfsmount *mnt)
 {
_


[PATCH 09/27] r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers

2007-11-01 Thread Dave Hansen


Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namei.c   |4 
 linux-2.6.git-dave/ipc/mqueue.c |5 -
 2 files changed, 8 insertions(+), 1 deletion(-)

diff -puN fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers 
fs/namei.c
--- 
linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers
 2007-11-01 14:46:10.0 -0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:10.0 -0700
@@ -2264,7 +2264,11 @@ static long do_unlinkat(int dfd, const c
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto exit2;
error = vfs_unlink(nd.dentry->d_inode, dentry);
+   mnt_drop_write(nd.mnt);
exit2:
dput(dentry);
}
diff -puN 
ipc/mqueue.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers 
ipc/mqueue.c
--- 
linux-2.6.git/ipc/mqueue.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers
   2007-11-01 14:46:10.0 -0700
+++ linux-2.6.git-dave/ipc/mqueue.c 2007-11-01 14:46:10.0 -0700
@@ -743,8 +743,11 @@ asmlinkage long sys_mq_unlink(const char
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
-
+   err = mnt_want_write(mqueue_mnt);
+   if (err)
+   goto out_err;
err = vfs_unlink(dentry->d_parent->d_inode, dentry);
+   mnt_drop_write(mqueue_mnt);
 out_err:
dput(dentry);
 
_


[PATCH 05/27] rename open_namei() to open_pathname()

2007-11-01 Thread Dave Hansen

open_namei() no longer touches namei's.  Rename it
to something more appropriate: open_pathname().

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/drivers/usb/gadget/file_storage.c |4 ++--
 linux-2.6.git-dave/fs/exec.c |2 +-
 linux-2.6.git-dave/fs/namei.c|8 
 linux-2.6.git-dave/fs/open.c |6 --
 linux-2.6.git-dave/fs/reiserfs/journal.c |2 +-
 linux-2.6.git-dave/kernel/acct.c |2 +-
 linux-2.6.git-dave/mm/swapfile.c |4 ++--
 linux-2.6.git-dave/sound/sound_firmware.c|2 +-
 8 files changed, 12 insertions(+), 18 deletions(-)

diff -puN drivers/usb/gadget/file_storage.c~rename-open_namei 
drivers/usb/gadget/file_storage.c
--- linux-2.6.git/drivers/usb/gadget/file_storage.c~rename-open_namei   
2007-11-01 14:46:07.0 -0700
+++ linux-2.6.git-dave/drivers/usb/gadget/file_storage.c2007-11-01 
14:46:07.0 -0700
@@ -3473,12 +3473,12 @@ static int open_backing_file(struct lun 
/* R/W if we can, R/O if we must */
ro = curlun->ro;
if (!ro) {
-   filp = open_namei(AT_FDCWD, filename, O_RDWR | mode, 0);
+   filp = open_pathname(AT_FDCWD, filename, O_RDWR | mode, 0);
if (-EROFS == PTR_ERR(filp))
ro = 1;
}
if (ro)
-   filp = open_namei(AT_FDCWD, filename, O_RDONLY | mode, 0);
+   filp = open_pathname(AT_FDCWD, filename, O_RDONLY | mode, 0);
if (IS_ERR(filp)) {
LINFO(curlun, "unable to open backing file: %s\n", filename);
return PTR_ERR(filp);
diff -puN fs/exec.c~rename-open_namei fs/exec.c
--- linux-2.6.git/fs/exec.c~rename-open_namei   2007-11-01 14:46:07.0 
-0700
+++ linux-2.6.git-dave/fs/exec.c2007-11-01 14:46:07.0 -0700
@@ -1763,7 +1763,7 @@ int do_coredump(long signr, int exit_cod
goto fail_unlock;
}
} else
-   file = open_namei(AT_FDCWD, corename,
+   file = open_pathname(AT_FDCWD, corename,
 O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
 0600);
if (IS_ERR(file))
diff -puN fs/namei.c~rename-open_namei fs/namei.c
--- linux-2.6.git/fs/namei.c~rename-open_namei  2007-11-01 14:46:07.0 
-0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:07.0 -0700
@@ -81,7 +81,7 @@
  */
 
 /* [16-Dec-97 Kevin Buhr] For security reasons, we change some symlink
- * semantics.  See the comments in "open_namei" and "do_link" below.
+ * semantics.  See the comments in "open_pathname" and "do_link" below.
  *
  * [10-Sep-98 Alan Modra] Another symlink change.
  */
@@ -571,7 +571,7 @@ out:
if (nd->depth || res || nd->last_type!=LAST_NORM)
return res;
/*
-* If it is an iterative symlinks resolution in open_namei() we
+* If it is an iterative symlinks resolution in open_pathname() we
 * have to copy the last component. And all that crap because of
 * bloody create() on broken symlinks. Furrfu...
 */
@@ -1708,7 +1708,7 @@ static int __open_namei_create(struct na
  * 01 - read-permission
  * 10 - write-permission
  * 11 - read-write
- * for the internal routines (ie open_namei()/follow_link() etc)
+ * for the internal routines (ie open_pathname()/follow_link() etc)
  * This is more logical, and also allows the 00 "no perm needed"
  * to be used for symlinks (where the permissions are checked
  * later).
@@ -1722,7 +1722,7 @@ static inline int sys_open_flags_to_name
 }
 
 /*
- * open_namei()
+ * open_pathname()
  *
  * namei for open - this is in fact almost the whole open-routine.
  *
diff -puN fs/open.c~rename-open_namei fs/open.c
--- linux-2.6.git/fs/open.c~rename-open_namei   2007-11-01 14:46:07.0 
-0700
+++ linux-2.6.git-dave/fs/open.c2007-11-01 14:46:07.0 -0700
@@ -800,12 +800,6 @@ cleanup_file:
return ERR_PTR(error);
 }
 
-struct file *filp_open(const char *filename, int flags, int mode)
-{
-   return open_namei(AT_FDCWD, filename, flags, mode);
-}
-EXPORT_SYMBOL(filp_open);
-
 /**
  * lookup_instantiate_filp - instantiates the open intent filp
  * @nd: pointer to nameidata
diff -puN fs/reiserfs/journal.c~rename-open_namei fs/reiserfs/journal.c
--- linux-2.6.git/fs/reiserfs/journal.c~rename-open_namei   2007-11-01 
14:46:07.0 -0700
+++ linux-2.6.git-dave/fs/reiserfs/journal.c2007-11-01 14:46:07.0 
-0700
@@ -2625,7 +2625,7 @@ static int journal_init_dev(struct super
return 0;
}
 
-   journal->j_dev_file = open_namei(AT_FDCWD, jdev_name, 0, 0);
+   journal->j_dev_file = open_pathname(AT_FDCWD, jdev_name, 0, 0);
if (!IS_ERR(journal->j_dev_file)) {
struct inode 

[PATCH 08/27] r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir

2007-11-01 Thread Dave Hansen


Pretty self-explanatory.  Fits in with the rest of the series.

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
Acked-by: Christoph Hellwig <[EMAIL PROTECTED]>
Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/fs/namei.c|5 +
 linux-2.6.git-dave/fs/nfsd/nfs4recover.c |5 +
 2 files changed, 10 insertions(+)

diff -puN 
fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir 
fs/namei.c
--- 
linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir
   2007-11-01 14:46:09.0 -0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:09.0 -0700
@@ -2067,7 +2067,12 @@ asmlinkage long sys_mkdirat(int dfd, con
 
if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
+   error = mnt_want_write(nd.mnt);
+   if (error)
+   goto out_dput;
error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
+   mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
diff -puN 
fs/nfsd/nfs4recover.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir
 fs/nfsd/nfs4recover.c
--- 
linux-2.6.git/fs/nfsd/nfs4recover.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir
2007-11-01 14:46:09.0 -0700
+++ linux-2.6.git-dave/fs/nfsd/nfs4recover.c2007-11-01 14:46:09.0 
-0700
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -154,7 +155,11 @@ nfsd4_create_clid_dir(struct nfs4_client
dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n");
goto out_put;
}
+   status = mnt_want_write(rec_dir.mnt);
+   if (status)
+   goto out_put;
status = vfs_mkdir(rec_dir.dentry->d_inode, dentry, S_IRWXU);
+   mnt_drop_write(rec_dir.mnt);
 out_put:
dput(dentry);
 out_unlock:
_


[PATCH 04/27] kill filp_open()

2007-11-01 Thread Dave Hansen

Replace all callers with open_namei() directly, and move the
nameidata stack allocation into open_namei().

Signed-off-by: Dave Hansen <[EMAIL PROTECTED]>
---

 linux-2.6.git-dave/drivers/usb/gadget/file_storage.c |5 -
 linux-2.6.git-dave/fs/exec.c |2 
 linux-2.6.git-dave/fs/namei.c|   71 +--
 linux-2.6.git-dave/fs/nfsctl.c   |5 +
 linux-2.6.git-dave/fs/open.c |6 -
 linux-2.6.git-dave/fs/reiserfs/journal.c |2 
 linux-2.6.git-dave/include/linux/fs.h|3 
 linux-2.6.git-dave/kernel/acct.c |2 
 linux-2.6.git-dave/mm/swapfile.c |4 -
 linux-2.6.git-dave/sound/sound_firmware.c|2 
 10 files changed, 53 insertions(+), 49 deletions(-)

diff -puN drivers/usb/gadget/file_storage.c~kill-filp_open 
drivers/usb/gadget/file_storage.c
--- linux-2.6.git/drivers/usb/gadget/file_storage.c~kill-filp_open  
2007-11-01 14:46:06.0 -0700
+++ linux-2.6.git-dave/drivers/usb/gadget/file_storage.c2007-11-01 
14:46:06.0 -0700
@@ -3468,16 +3468,17 @@ static int open_backing_file(struct lun 
struct inode*inode = NULL;
loff_t  size;
loff_t  num_sectors;
+   int mode = O_LARGEFILE;
 
/* R/W if we can, R/O if we must */
ro = curlun->ro;
if (!ro) {
-   filp = filp_open(filename, O_RDWR | O_LARGEFILE, 0);
+   filp = open_namei(AT_FDCWD, filename, O_RDWR | mode, 0);
if (-EROFS == PTR_ERR(filp))
ro = 1;
}
if (ro)
-   filp = filp_open(filename, O_RDONLY | O_LARGEFILE, 0);
+   filp = open_namei(AT_FDCWD, filename, O_RDONLY | mode, 0);
if (IS_ERR(filp)) {
LINFO(curlun, "unable to open backing file: %s\n", filename);
return PTR_ERR(filp);
diff -puN fs/exec.c~kill-filp_open fs/exec.c
--- linux-2.6.git/fs/exec.c~kill-filp_open  2007-11-01 14:46:06.0 
-0700
+++ linux-2.6.git-dave/fs/exec.c2007-11-01 14:46:06.0 -0700
@@ -1763,7 +1763,7 @@ int do_coredump(long signr, int exit_cod
goto fail_unlock;
}
} else
-   file = filp_open(corename,
+   file = open_namei(AT_FDCWD, corename,
 O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
 0600);
if (IS_ERR(file))
diff -puN fs/namei.c~kill-filp_open fs/namei.c
--- linux-2.6.git/fs/namei.c~kill-filp_open 2007-11-01 14:46:06.0 
-0700
+++ linux-2.6.git-dave/fs/namei.c   2007-11-01 14:46:06.0 -0700
@@ -1730,9 +1730,10 @@ static inline int sys_open_flags_to_name
  * system call.  See sys_open_flags_to_namei_flags().
  * SMP-safe
  */
-struct file *open_namei(int dfd, const char *pathname, int sys_open_flag,
-   int mode, struct nameidata *nd)
+struct file *open_pathname(int dfd, const char *pathname,
+  int sys_open_flag, int mode)
 {
+   struct nameidata nd;
int acc_mode, error;
struct path path;
struct dentry *dir;
@@ -1755,7 +1756,7 @@ struct file *open_namei(int dfd, const c
 */
if (!(flag & O_CREAT)) {
error = path_lookup_open(dfd, pathname, lookup_flags(flag),
-nd, flag);
+&nd, flag);
if (error)
return ERR_PTR(error);
goto ok;
@@ -1764,7 +1765,7 @@ struct file *open_namei(int dfd, const c
/*
 * Create - we need to know the parent.
 */
-   error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+   error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,&nd,flag,mode);
if (error)
return ERR_PTR(error);
 
@@ -1774,14 +1775,14 @@ struct file *open_namei(int dfd, const c
 * will not do.
 */
error = -EISDIR;
-   if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
+   if (nd.last_type != LAST_NORM || nd.last.name[nd.last.len])
goto exit;
 
-   dir = nd->dentry;
-   nd->flags &= ~LOOKUP_PARENT;
+   dir = nd.dentry;
+   nd.flags &= ~LOOKUP_PARENT;
mutex_lock(&dir->d_inode->i_mutex);
-   path.dentry = lookup_hash(nd);
-   path.mnt = nd->mnt;
+   path.dentry = lookup_hash(&nd);
+   path.mnt = nd.mnt;
 
 do_last:
error = PTR_ERR(path.dentry);
@@ -1790,18 +1791,18 @@ do_last:
goto exit;
}
 
-   if (IS_ERR(nd->intent.open.file)) {
+   if (IS_ERR(nd.intent.open.file)) {
mutex_unlock(&dir->d_inode->i_mutex);
-   error = PTR_ERR(nd->intent.open.file);
+   
