Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS

2005-08-15 Thread Phil Dier
On Sun, 14 Aug 2005 21:20:35 -0600 (MDT)
Zwane Mwaikambo <[EMAIL PROTECTED]> wrote:

> On Sun, 14 Aug 2005, Robert Love wrote:
> 
> > On Sun, 2005-08-14 at 20:40 -0600, Zwane Mwaikambo wrote:
> > 
> > > I'm new here; if the inode isn't being watched, what's to stop d_delete 
> > > from removing the inode before fsnotify_unlink proceeds to use it?
> > 
> > Nothing.  But check out
> > 
> > http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a91bf7f5c22c8407a9991cbd9ce5bb87caa6b4a
> 
> That git web interface looks rather spiffy.
> 
> > Should solve this problem?
> 
> Seems to fit the bill perfectly.
> 
> Thanks,
>   Zwane
> 


So, for the record, I patched my 2.6.13-rc6 kernel with the patch at this 
location:

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff_plain;h=7a91bf7f5c22c8407a9991cbd9ce5bb87caa6b4a;hp=1963c907b21e140082d081b1c8f8c2154593c7d7

and I will be testing it today.


Thanks to all of you guys.

-- 

Phil Dier (ICGLink.com -- 615 370-1530 x733)

/* vim:set noai nocindent ts=8 sw=8: */
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS

2005-08-14 Thread Zwane Mwaikambo
On Sun, 14 Aug 2005, Robert Love wrote:

> On Sun, 2005-08-14 at 20:40 -0600, Zwane Mwaikambo wrote:
> 
> > I'm new here; if the inode isn't being watched, what's to stop d_delete 
> > from removing the inode before fsnotify_unlink proceeds to use it?
> 
> Nothing.  But check out
> 
> http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a91bf7f5c22c8407a9991cbd9ce5bb87caa6b4a

That git web interface looks rather spiffy.

> Should solve this problem?

Seems to fit the bill perfectly.

Thanks,
Zwane



Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS

2005-08-14 Thread Robert Love
On Sun, 2005-08-14 at 20:40 -0600, Zwane Mwaikambo wrote:

> I'm new here; if the inode isn't being watched, what's to stop d_delete 
> from removing the inode before fsnotify_unlink proceeds to use it?

Nothing.  But check out

http://kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7a91bf7f5c22c8407a9991cbd9ce5bb87caa6b4a

Should solve this problem?

Robert Love




Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS

2005-08-14 Thread Zwane Mwaikambo
On Sun, 14 Aug 2005, Phil Dier wrote:

> I just got this:
> 
> Unable to handle kernel paging request at virtual address eeafefc0
>  printing eip:
> c0188487
> *pde = 00681067
> *pte = 2eafe000
> Oops:  [#1]
> SMP DEBUG_PAGEALLOC
> Modules linked in:
> CPU:1
> EIP:0060:[<c0188487>]Not tainted VLI
> EFLAGS: 00010296   (2.6.13-rc6)
> EIP is at inotify_inode_queue_event+0x17/0x130
> eax: eeafefc0   ebx:    ecx: 0200   edx: eeafee9c
> esi:    edi: ef4cbe9c   ebp: f66e1eac   esp: f66e1e84
> ds: 007b   es: 007b   ss: 0068
> Process nfsd (pid: 6259, threadinfo=f66e task=f6307b00)
> Stack: eeafee9c c0536a34 ee900f6c f66e1eac c0179949 eeafefc0 23644d80 
> ef4cbe9c f66e1ed4 c01713ad eeafee9c 0400  
>eeafee9c ee900f6c f0940f6c f6f0adf8 f66e1f00 c020caa1 ef4cbe9c ee900f6c
> Call Trace:
>  [<c0103e7f>] show_stack+0x7f/0xa0
>  [<c0104030>] show_registers+0x160/0x1d0
>  [<c0104260>] die+0x100/0x180
>  [<c0116199>] do_page_fault+0x369/0x6ed
>  [<c0103aa3>] error_code+0x4f/0x54
>  [<c01713ad>] vfs_unlink+0x17d/0x210
>  [<c020caa1>] nfsd_unlink+0x161/0x240
>  [<c0207c64>] nfsd_proc_remove+0x44/0x90
>  [<c0206747>] nfsd_dispatch+0xd7/0x200
>  [<c0491b13>] svc_process+0x533/0x670
>  [<c02064dd>] nfsd+0x1bd/0x350
>  [<c01011e5>] kernel_thread_helper+0x5/0x10
> Code: ff ff ff 8b 5d f8 8b 75 fc 89 ec 5d c3 8d b4 26 00 00 00 00 55 89 e5 57 
> 56 53 83 ec 1c 8b 45 08 8b 55 08 05 24 01 00 00 89 45 ec <39> 82 24 01 00 00 
> 74 5d f0 ff 8a 2c 01 00 00 0f 88 d1 0b 00 00

int vfs_unlink(struct inode *dir, struct dentry *dentry)
{
	/* snipped */
	/* We don't d_delete() NFS sillyrenamed files--they still exist. */
	if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
		struct inode *inode = dentry->d_inode;
		d_delete(dentry);                                  <==
		fsnotify_unlink(dentry, inode, dir);
	}

	return error;
}

static inline void fsnotify_unlink(struct dentry *dentry, struct inode *inode,
				   struct inode *dir)
{
	/* snipped */
	inotify_inode_queue_event(inode, IN_DELETE_SELF, 0, NULL); <==
	/* snipped */
}

void inotify_inode_queue_event(struct inode *inode, u32 mask, u32 cookie,
			       const char *name)
{
	struct inotify_watch *watch, *next;

	if (!inotify_inode_watched(inode))                         <==
		return;
	/* snipped */
}

static inline int inotify_inode_watched(struct inode *inode)
{
	return !list_empty(&inode->inotify_watches);               <==
}


I'm new here; if the inode isn't being watched, what's to stop d_delete 
from removing the inode before fsnotify_unlink proceeds to use it?

Thanks,
Zwane


Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS

2005-08-14 Thread Phil Dier
I just got this:

Unable to handle kernel paging request at virtual address eeafefc0
 printing eip:
c0188487
*pde = 00681067
*pte = 2eafe000
Oops:  [#1]
SMP DEBUG_PAGEALLOC
Modules linked in:
CPU:1
EIP:0060:[<c0188487>]Not tainted VLI
EFLAGS: 00010296   (2.6.13-rc6)
EIP is at inotify_inode_queue_event+0x17/0x130
eax: eeafefc0   ebx:    ecx: 0200   edx: eeafee9c
esi:    edi: ef4cbe9c   ebp: f66e1eac   esp: f66e1e84
ds: 007b   es: 007b   ss: 0068
Process nfsd (pid: 6259, threadinfo=f66e task=f6307b00)
Stack: eeafee9c c0536a34 ee900f6c f66e1eac c0179949 eeafefc0 23644d80 
    ef4cbe9c f66e1ed4 c01713ad eeafee9c 0400  
   eeafee9c ee900f6c f0940f6c f6f0adf8 f66e1f00 c020caa1 ef4cbe9c ee900f6c
Call Trace:
 [<c0103e7f>] show_stack+0x7f/0xa0
 [<c0104030>] show_registers+0x160/0x1d0
 [<c0104260>] die+0x100/0x180
 [<c0116199>] do_page_fault+0x369/0x6ed
 [<c0103aa3>] error_code+0x4f/0x54
 [<c01713ad>] vfs_unlink+0x17d/0x210
 [<c020caa1>] nfsd_unlink+0x161/0x240
 [<c0207c64>] nfsd_proc_remove+0x44/0x90
 [<c0206747>] nfsd_dispatch+0xd7/0x200
 [<c0491b13>] svc_process+0x533/0x670
 [<c02064dd>] nfsd+0x1bd/0x350
 [<c01011e5>] kernel_thread_helper+0x5/0x10
Code: ff ff ff 8b 5d f8 8b 75 fc 89 ec 5d c3 8d b4 26 00 00 00 00 55 89 e5 57 
56 53 83 ec 1c 8b 45 08 8b 55 08 05 24 01 00 00 89 45 ec <39> 82 24 01 00 00 74 
5d f0 ff 8a 2c 01 00 00 0f 88 d1 0b 00 00



-- 

Phil Dier <[EMAIL PROTECTED]>

/* vim:set ts=8 sw=8 nocindent noai: */


Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS

2005-08-12 Thread Sonny Rao
On Fri, Aug 12, 2005 at 12:35:05PM -0500, Phil Dier wrote:
> On Fri, 12 Aug 2005 12:07:21 +1000
> Neil Brown <[EMAIL PROTECTED]> wrote:
> > You could possibly put something like
> > 
> > struct bio_vec *from;
> > int i;
> > bio_for_each_segment(from, bio, i)
> > BUG_ON(page_zone(from->bv_page)==NULL);
> > 
> > in generic_make_request in drivers/block/ll_rw_blk.c, just before
> > the call to q->make_request_fn.
> > This might trigger the bug early enough to see what is happening.
> 
> 
> I've got tests running with this code in place, but I/O is so slow now
> I don't think it's going to oops (or if it does, it'll be a while).
> 
> Is there any other info I can collect to help track this down?

Well, while we are slowing things down in the name of debugging..
you might try setting the following debug options in your config:

CONFIG_DEBUG_PAGEALLOC
CONFIG_DEBUG_HIGHMEM
CONFIG_DEBUG_SLAB
CONFIG_FRAME_POINTER 

Can anyone think of anything else?

According to the website you don't have these on right now.

Sonny


Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS

2005-08-12 Thread Phil Dier
On Fri, 12 Aug 2005 12:07:21 +1000
Neil Brown <[EMAIL PROTECTED]> wrote:
> You could possibly put something like
> 
>   struct bio_vec *from;
>   int i;
>   bio_for_each_segment(from, bio, i)
>   BUG_ON(page_zone(from->bv_page)==NULL);
> 
> in generic_make_request in drivers/block/ll_rw_blk.c, just before
> the call to q->make_request_fn.
> This might trigger the bug early enough to see what is happening.


I've got tests running with this code in place, but I/O is so slow now
I don't think it's going to oops (or if it does, it'll be a while).

Is there any other info I can collect to help track this down?

-- 

Phil Dier (ICGLink.com -- 615 370-1530 x733)

/* vim:set noai nocindent ts=8 sw=8: */


Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS

2005-08-11 Thread Phil Dier
On Fri, 12 Aug 2005 12:07:21 +1000
Neil Brown <[EMAIL PROTECTED]> wrote:

> On Thursday August 11, [EMAIL PROTECTED] wrote:
> > Hi,
> > 
> > I posted an oops a few days ago from 2.6.12.3 [1].  Here are the results
> > of my tests on 2.6.13-rc6.  The kernel oopses, but the box isn't completely
> > hosed; I can still log in and move around.  It appears that the only things
> > that are locked are the apps that were doing i/o to the test partition.
> > More detailed info about my configuration can be found here:
> > 
> > http://www.icglink.com/debug-2.6.13-rc6.html
> 
> You don't seem to give details on how lvm is used to combine the md
> arrays, though I'm not sure that would help particularly.
> 

FYI:

vgdisplay -v vg1  
Using volume group(s) on command line
Finding volume group "vg1"
  --- Volume group ---
  VG Name   vg1
  System ID 
  Formatlvm2
  Metadata Areas2
  Metadata Sequence No  8
  VG Access read/write
  VG Status resizable
  MAX LV255
  Cur LV1
  Open LV   0
  Max PV255
  Cur PV2
  Act PV2
  VG Size   410.00 GB
  PE Size   128.00 MB
  Total PE  3280
  Alloc PE / Size   1093 / 136.62 GB
  Free  PE / Size   2187 / 273.38 GB
  VG UUID   XuRomW-O6Uw-oQGq-vdwD-YwMT-Dltj-NExFmV
   
  --- Logical volume ---
  LV Name/dev/vg1/home
  VG Namevg1
  LV UUIDK7Gq9l-Vjte-ksFt-s0vn-ejqT-RGYc-5Aibtx
  LV Write Accessread/write
  LV Status  available
  # open 0
  LV Size136.62 GB
  Current LE 1093
  Segments   1
  Allocation inherit
  Read ahead sectors 0
  Block device   253:3
   
  --- Physical volumes ---
  PV Name   /dev/md4 
  PV UUID   VgHU6k-lZmE-j686-dvfX-OSsM-yh28-Jyfidn
  PV Status allocatable
  Total PE / Free PE1093 / 0
   
  PV Name   /dev/md7 
  PV UUID   n4rVmy-rARO-a5mY-Iiqo-GvOx-2nbG-HluaTa
  PV Status allocatable
  Total PE / Free PE2187 / 2187

md7 is in there to test live migration from smaller disks to larger ones.


> 
>   struct bio_vec *from;
>   int i;
>   bio_for_each_segment(from, bio, i)
>   BUG_ON(page_zone(from->bv_page)==NULL);
> 
> in generic_make_request in drivers/block/ll_rw_blk.c, just before
> the call to q->make_request_fn.
> This might trigger the bug early enough to see what is happening.


I'll try this and report the results.


-- 

Phil Dier <[EMAIL PROTECTED]>

/* vim:set ts=8 sw=8 nocindent noai: */


Re: 2.6.13-rc6 Oops with Software RAID, LVM, JFS, NFS

2005-08-11 Thread Neil Brown
On Thursday August 11, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I posted an oops a few days ago from 2.6.12.3 [1].  Here are the results
> of my tests on 2.6.13-rc6.  The kernel oopses, but the box isn't completely
> hosed; I can still log in and move around.  It appears that the only things
> that are locked are the apps that were doing i/o to the test partition.
> More detailed info about my configuration can be found here:
> 
> http://www.icglink.com/debug-2.6.13-rc6.html

You don't seem to give details on how lvm is used to combine the md
arrays, though I'm not sure that would help particularly.


> 
> Here is the oops:
> 
> Oops:  [#1]
> SMP
> Modules linked in:
> CPU:0
> EIP:0060:[<c0116dd0>]Not tainted VLI
> EFLAGS: 00010207   (2.6.13-rc6)
> EIP is at kmap+0x10/0x30
> eax: 0003   ebx: d0977440   ecx: c9efb470   edx: 
> esi: c100   edi:    ebp: ce59d570   esp: f7adde18
> ds: 007b   es: 007b   ss: 0068
> Process md4_raid1 (pid: 6442, threadinfo=f7adc000 task=f70eda20)
> Stack: c014e0bd c9efb470 0001   cb129000 0001 e0c65f00
>f7dcbe18 087641ef  f7dcbe18 c014e146 f7dcbe18 f7addeb0 c3146940
>c033f03f f7dcbe18 f7addeb0 0021d906  003f 0040 
> Call Trace:
>  [<c014e0bd>] __blk_queue_bounce+0x20d/0x260
--snip---
> Code: 00 40 c7 46 0c 90 30 15 c0 c7 46 10 90 31 15 c0 eb b9 90 90 90 90 90 90 
> 90 90 90 8b 4c 24 04 8b 01 c1 e8 1e 8b 14 85 14 f4 63 c0 <8b> 82 0c 04 00 00 
> 05 00 09 00 00 39 c2 74 05 e9 ac 73 03 00 89
> 

The code is Oopsing in a call to kmap in arch/i386/highmem.c
The PageHighMem macro calls is_highmem(page_zone(page)).
page_zone is defined in mm.h:

static inline struct zone *page_zone(struct page *page)
{
	return zone_table[(page->flags >> ZONETABLE_PGSHIFT) &
			  ZONETABLE_MASK];
}

Now at the point of the crash, eax is
(page->flags >> ZONETABLE_PGSHIFT),
which is '3'.  So it seems that this page is in zone 3.
However  zone_table[3] is now in edx, and we can see it is '0'.
There are only 3 zones (normal, dma, highmem), so nothing should ever
be in zone 3.  This page is clearly bad.

However that is as far as I can get.  I don't know whether this is a
bad page pointer passed down from jfs or nfsd, a page pointer that was
corrupted by either lvm or md, or a valid page pointer that has
managed to get a bad zone number encoded in its flags.

You could possibly put something like

struct bio_vec *from;
int i;
bio_for_each_segment(from, bio, i)
BUG_ON(page_zone(from->bv_page)==NULL);

in generic_make_request in drivers/block/ll_rw_blk.c, just before
the call to q->make_request_fn.
This might trigger the bug early enough to see what is happening.


> 
> Thanks for looking..
> 

Thanks for testing.

NeilBrown

