[Lustre-discuss] [bug?] mdc_enter_request() problems

2011-08-08 Thread chas williams - CONTRACTOR
we have seen a few crashes that look like:

[250696.381575] RIP: 0010:[]  [] mdc_exit_request+0x74/0xb0 [mdc]
...
[250696.381575] Call Trace:
[250696.381575]  [] mdc_intent_getattr_async_interpret+0x82/0x500 [mdc]
[250696.381575]  [] ptlrpc_check_set+0x200/0x1690 [ptlrpc]
[250696.381575]  [] ptlrpcd_check+0x110/0x250 [ptlrpc]

as far as i can tell, the problem originates in mdc_enter_request():
it allocates an mdc_cache_waiter on the stack, inserts it into the
wait list, and then returns.

int mdc_enter_request(struct client_obd *cli)
...
struct mdc_cache_waiter mcw;
...
list_add_tail(&mcw.mcw_entry, &cli->cl_cache_waiters);
init_waitqueue_head(&mcw.mcw_waitq);

later, mdc_exit_request() finds this mcw by iterating the list.
since mcw was allocated on the stack, i don't think you can do this:
the stack memory holding mcw might have been reused by the time
mdc_exit_request() gets around to removing it.

void mdc_exit_request(struct client_obd *cli)
...
mcw = list_entry(l, struct mdc_cache_waiter, mcw_entry);
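
to make the list mechanics concrete, here is a minimal userspace sketch. it is not the lustre or kernel code: the list helpers are simplified re-implementations, the waiter struct is reduced to its list entry plus a flag standing in for the wait queue, and wake_first_waiter() is a hypothetical model of the exit path described above.

```c
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel list helpers (assumption:
 * same semantics as the kernel versions, without any locking). */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink the node and leave it pointing at itself. */
static void list_del_init(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n;
    n->prev = n;
}

static int list_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Recover the enclosing struct from a pointer to its embedded list node. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Reduced waiter: the list linkage plus a flag replacing mcw_waitq. */
struct mdc_cache_waiter {
    struct list_head mcw_entry;
    int woken;
};

/* Hypothetical model of the exit path: unlink the first waiter and wake it.
 * Returns the waiter that was woken, or NULL if the list was empty. */
static struct mdc_cache_waiter *wake_first_waiter(struct list_head *waiters)
{
    struct mdc_cache_waiter *mcw;

    if (list_empty(waiters))
        return NULL;
    mcw = list_entry(waiters->next, struct mdc_cache_waiter, mcw_entry);
    list_del_init(&mcw->mcw_entry);
    mcw->woken = 1; /* writes into the waiter's own storage */
    return mcw;
}
```

the list only stores pointers into the waiter's storage, and wake_first_waiter() writes through them. that storage therefore has to stay valid for as long as the entry is on the list, which a stack variable cannot guarantee once its function has returned.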
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] [bug?] mdc_enter_request() problems

2011-08-08 Thread Andreas Dilger
On 2011-08-08, at 10:03 AM, chas williams - CONTRACTOR wrote:
> we have seen a few crashes that look like:
> 
> [250696.381575] RIP: 0010:[]  [] mdc_exit_request+0x74/0xb0 [mdc]
> ...
> [250696.381575] Call Trace:
> [250696.381575]  [] mdc_intent_getattr_async_interpret+0x82/0x500 [mdc]
> [250696.381575]  [] ptlrpc_check_set+0x200/0x1690 [ptlrpc]
> [250696.381575]  [] ptlrpcd_check+0x110/0x250 [ptlrpc]
> 
> and i sort of gather the problem arises from mdc_enter_request().
> it allocates an mdc_cache_waiter on the stack and inserts it into the
> wait list and then returns.
> 
>   int mdc_enter_request(struct client_obd *cli)
>   ...
>   struct mdc_cache_waiter mcw;
>   ...
>   list_add_tail(&mcw.mcw_entry, &cli->cl_cache_waiters);
>   init_waitqueue_head(&mcw.mcw_waitq);
> 
> later mdc_exit_request() finds this mcw by iterating the list.
> seeing as mcw was allocated on the stack, i dont think you can do this.
> mcw might have been reused by the time mdc_exit_request() gets around
> to removing it.

What version of Lustre is this?

Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.





Re: [Lustre-discuss] [bug?] mdc_enter_request() problems

2011-08-08 Thread chas williams - CONTRACTOR
On Mon, 08 Aug 2011 12:03:25 -0400
chas williams - CONTRACTOR  wrote:

> later mdc_exit_request() finds this mcw by iterating the list.
> seeing as mcw was allocated on the stack, i dont think you can do this.
> mcw might have been reused by the time mdc_exit_request() gets around
> to removing it.

never mind. i see this has apparently been fixed in later releases (i
was looking at 1.8.5). if l_wait_event() returns "early" (for example,
because it was interrupted), mdc_enter_request() now does the cleanup
itself.
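
a rough userspace sketch of that cleanup pattern, for illustration only. this is not the actual patch: the struct and function names are hypothetical, locking and the in-flight accounting are left out, only the "no free slot" path is modeled, and the wait callback is a stand-in for l_wait_event().

```c
#include <errno.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel list helpers. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink the node and leave it pointing at itself. */
static void list_del_init(struct list_head *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n;
    n->prev = n;
}

static int list_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Hypothetical reduced types; the real structs carry much more state. */
struct waiter {
    struct list_head entry;
};

struct client {
    struct list_head cache_waiters;
};

/* Stand-in for l_wait_event(): 0 when woken normally, negative on early exit. */
typedef int (*wait_fn)(struct waiter *w);

/* Simulates an interrupted wait: nobody has unlinked the waiter. */
static int wait_interrupted(struct waiter *w)
{
    (void)w;
    return -EINTR;
}

/* Simulates a normal wakeup: the exit path unlinks the waiter first. */
static int wait_woken(struct waiter *w)
{
    list_del_init(&w->entry);
    return 0;
}

/* Queue a stack-allocated waiter, wait, and clean up on early return. */
static int enter_request(struct client *cli, wait_fn wait)
{
    struct waiter w; /* valid only until this function returns */
    int rc;

    list_add_tail(&w.entry, &cli->cache_waiters);
    rc = wait(&w);
    if (rc != 0) {
        /* Nobody removed us, so unlink before the stack frame goes away. */
        list_del_init(&w.entry);
    }
    return rc;
}
```

with the unlink on the error path, the waiter list never holds a pointer into a stack frame that has already been released, whichever way the wait ends.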


Re: [Lustre-discuss] [bug?] mdc_enter_request() problems

2011-08-08 Thread Oleg Drokin
Hello!

   I guess this is some sort of 1.8, judging by the init_waitqueue_head() call.
   The 2.1 code is notably different here after LU-234 landed: it now removes
   mcw_entry from the list on error.
   The patch originates from bug 18213 and is claimed to be a 1.8 port to 2.1,
   but I don't see anything like this in the 1.8 patch.

Bye,
Oleg

--
Oleg Drokin
Senior Software Engineer
Whamcloud, Inc.



Re: [Lustre-discuss] [bug?] mdc_enter_request() problems

2011-08-09 Thread Kevin Van Maren
chas williams - CONTRACTOR wrote:
> On Mon, 08 Aug 2011 12:03:25 -0400
> chas williams - CONTRACTOR  wrote:
>
>   
>> later mdc_exit_request() finds this mcw by iterating the list.
>> seeing as mcw was allocated on the stack, i dont think you can do this.
>> mcw might have been reused by the time mdc_exit_request() gets around
>> to removing it.
>> 
>
> nevermind. i see this has been fixed in later releases apparently (i
> was looking at 1.8.5). if l_wait_event() returns "early" (like
> from being interrupted) mdc_enter_request() does the cleanup itself now.
>   

That code is unchanged in 1.8.6.

Kevin



Re: [Lustre-discuss] [bug?] mdc_enter_request() problems

2011-08-09 Thread chas williams - CONTRACTOR
On Tue, 09 Aug 2011 10:29:43 -0600
Kevin Van Maren  wrote:

> > chas williams - CONTRACTOR wrote:
> > nevermind. i see this has been fixed in later releases apparently (i
> > was looking at 1.8.5). if l_wait_event() returns "early" (like
> > from being interrupted) mdc_enter_request() does the cleanup itself now.
> 
> That code is unchanged in 1.8.6.

it appears to have been fixed in the 2.x releases.  i think this is the
relevant change: http://review.whamcloud.com/#change,506


Re: [Lustre-discuss] [bug?] mdc_enter_request() problems

2011-08-09 Thread Johann Lombardi
On Tue, Aug 09, 2011 at 10:29:43AM -0600, Kevin Van Maren wrote:
> That code is unchanged in 1.8.6.

The two relevant patches for 1.8 are the following:
http://review.whamcloud.com/#change,457
http://review.whamcloud.com/#change,506

Both patches are included in 1.8.6-wc1 and are waiting for landing approval
on Oracle's side (see bugzilla 24508).

Cheers,
Johann
-- 
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com