[Lustre-discuss] Too many client evictions

2011-05-03 Thread DEGREMONT Aurelien
Hello

We often see some of our Lustre clients being evicted spuriously (the
clients seem healthy).
The pattern is always the same:

All of this is on Lustre 2.0, with adaptive timeouts (AT) enabled.

1 - A server complains about a client:
### lock callback timer expired... after 25315s...
(nothing on client)

(few seconds later)

2 - The client receives -107 (ENOTCONN) in reply to an obd_ping for this
target (the server says "@@@processing error 107")

3 - The client realizes its connection was lost, notices it was evicted,
and reconnects.

(Just to be sure:) when a client is evicted, all in-flight I/O is lost
and no recovery will be done for it?

We are thinking of increasing the timeout to give clients more time to
answer the LDLM lock revocation (maybe they are just too loaded).
- Is ldlm_timeout enough to do so?
- Do we also need to change obd_timeout accordingly? Is there a risk of
triggering new timeouts (cascading timeouts) if we only change ldlm_timeout?
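
For reference, this is how we would check and raise these tunables. Only a
sketch: "testfs" is a placeholder for the real fsname, the values are
examples, and the exact parameter names may differ on 2.0.

  # current values (on a server)
  lctl get_param timeout ldlm_timeout

  # raise them persistently for the whole filesystem, from the MGS
  lctl conf_param testfs.sys.timeout=300
  lctl conf_param testfs.sys.ldlm_timeout=60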

Any feedback in this area is welcome.

Thank you

Aurélien Degrémont


Re: [Lustre-discuss] Too many client evictions

2011-05-03 Thread Andreas Dilger
I don't think ldlm_timeout and obd_timeout have much effect when AT is
enabled. I believe LLNL has some adjusted tunables for AT that might help
you (increased at_min, etc).
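
Something like this should show (and, if needed, raise) the AT floor. A
sketch only: the value is an example and "testfs" is a placeholder fsname.

  # current AT tunables (on clients and servers)
  lctl get_param at_min at_max at_history

  # raise the AT floor on one node (example value)
  lctl set_param at_min=40

  # or persistently for the whole filesystem, from the MGS
  lctl conf_param testfs.sys.at_min=40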

Hopefully Chris or someone at LLNL can comment. I think they were also 
documented in bugzilla, though I don't know the bug number. 

Cheers, Andreas



Re: [Lustre-discuss] Too many client evictions

2011-05-03 Thread DEGREMONT Aurelien
Correct me if I'm wrong, but looking at the Lustre manual, it says that
the client adapts its timeouts, but not the server. I understood that
server->client RPCs still use the old mechanism, especially in our case
where the server seems to be revoking a client lock (is ldlm_timeout used
for that?) and the client did not respond.

I forgot to say that we also have LNET routers involved in some cases.

Thank you

Aurélien



Re: [Lustre-discuss] Too many client evictions

2011-05-03 Thread Nathan Rutman

On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:

> Correct me if I'm wrong, but looking at the Lustre manual, it says that 
> the client adapts its timeouts, but not the server. I understood that 
> server->client RPCs still use the old mechanism, especially in our case 
> where the server seems to be revoking a client lock (is ldlm_timeout 
> used for that?) and the client did not respond.

Server and client cooperate together for the adaptive timeouts.  I don't 
remember which bug the ORNL settings were in, maybe 14071, bugzilla's not 
responding at the moment.  But a big question here is why 25315 seconds for a 
callback - that's well beyond anything at_max should allow...
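
If it helps in the meantime, you can dump the current AT estimates to see
what the timers are actually doing. A sketch; the exact proc names may
vary between versions.

  # client side: per-target network latency and service estimates
  lctl get_param -n osc.*.timeouts
  lctl get_param -n mdc.*.timeouts

  # server side: RPC service time history used by AT (OSS example)
  lctl get_param -n ost.*.ost_io.timeouts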
 



Re: [Lustre-discuss] Too many client evictions

2011-05-03 Thread Andreas Dilger
On May 3, 2011, at 13:41, Nathan Rutman wrote:
> Server and client cooperate together for the adaptive timeouts.  I don't 
> remember which bug the ORNL settings were in, maybe 14071, bugzilla's not 
> responding at the moment.  But a big question here is why 25315 seconds for a 
> callback - that's well beyond anything at_max should allow...

I assume that the 25315s is from a bug (fixed in 1.8.5 I think, not sure if it 
was ported to 2.x) that calculated the wrong time when printing this error 
message for LDLM lock timeouts.

>> I forgot to say that we also have LNET routers involved in some cases.

If there are routers they can cause dropped RPCs from the server to the client, 
and the client will be evicted for unresponsiveness even though it is not at 
fault.  At one time Johann was working on a patch (or at least investigating) 
the ability to have servers resend RPCs before evicting clients.  The tricky 
part is that you don't want to send 2 RPCs each with 1/2 the timeout interval, 
since that may reduce stability instead of increasing it.

I think the bugzilla bug was called "limited server-side resend" or similar, 
filed by me several years ago.



Cheers, Andreas
--
Andreas Dilger 
Principal Engineer
Whamcloud, Inc.





Re: [Lustre-discuss] Too many client evictions

2011-05-04 Thread DEGREMONT Aurelien
Nathan Rutman wrote:
> On May 3, 2011, at 10:09 AM, DEGREMONT Aurelien wrote:
>
> Server and client cooperate together for the adaptive timeouts.
OK, they cooperate and the client adapts its timeouts through this 
cooperation, but does the server do the same?
If yes, are obd_timeout and ldlm_timeout very rarely used now?

>   I don't remember which bug the ORNL settings were in, maybe 14071, 
> bugzilla's not responding at the moment.  
Bugzilla is OK now; 14071 is the AT feature bug. The ORNL settings are 
not there.



Aurélien


Re: [Lustre-discuss] Too many client evictions

2011-05-04 Thread DEGREMONT Aurelien
Hello

Andreas Dilger wrote:
> On May 3, 2011, at 13:41, Nathan Rutman wrote:
>> Server and client cooperate together for the adaptive timeouts.  I don't 
>> remember which bug the ORNL settings were in, maybe 14071, bugzilla's not 
>> responding at the moment.  But a big question here is why 25315 seconds for 
>> a callback - that's well beyond anything at_max should allow...
>> 
>
> I assume that the 25315s is from a bug (fixed in 1.8.5 I think, not sure if 
> it was ported to 2.x) that calculated the wrong time when printing this error 
> message for LDLM lock timeouts.
>   
I did not find the bug for that.
>>> I forgot to say that we also have LNET routers involved in some cases.
>>>   
> If there are routers they can cause dropped RPCs from the server to the 
> client, and the client will be evicted for unresponsiveness even though it is 
> not at fault.  At one time Johann was working on a patch (or at least 
> investigating) the ability to have servers resend RPCs before evicting 
> clients.  The tricky part is that you don't want to send 2 RPCs each with 1/2 
> the timeout interval, since that may reduce stability instead of increasing 
> it.
>   
How can I track those dropped RPCs on routers?
Is this an expected behaviour? How could I protect my filesystem from 
that? If I increase the timeout, this won't change anything if the 
client/server do not re-send their RPCs.

> I think the bugzilla bug was called "limited server-side resend" or similar, 
> filed by me several years ago.
>   
Did not find it either :)

Aurélien


Re: [Lustre-discuss] Too many client evictions

2011-05-04 Thread Johann Lombardi
On Wed, May 04, 2011 at 01:37:14PM +0200, DEGREMONT Aurelien wrote:
> > I assume that the 25315s is from a bug

BTW, do you see this problem with both extent & inodebits locks?

> > (fixed in 1.8.5 I think, not sure if it was ported to 2.x) that calculated 
> > the wrong time when printing this error message for LDLM lock timeouts.
> >
> I did not find the bug for that.

I think Andreas was referring to bug 17887. However, you should already 
have the patch applied, since it landed in 2.0.0.

> > If there are routers they can cause dropped RPCs from the server to the 
> > client, and the client will be evicted for unresponsiveness even though it 
> > is not at fault.  At one time Johann was working on a patch (or at least 
> > investigating) the ability to have servers resend RPCs before evicting 
> > clients.  The tricky part is that you don't want to send 2 RPCs each with 
> > 1/2 the timeout interval, since that may reduce stability instead of 
> > increasing it.
> >
> How can I track those dropped RPCs on routers?

I don't think routers can drop RPCs w/o a good reason. It is just that a router 
failure can lead to packet loss and given that servers don't resend local 
callbacks, this can result in client evictions.
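
If you want to rule the routers out, a few quick checks from a client or
server may help. A sketch; the NID below is a placeholder.

  # configured routes and whether each router is seen as up
  cat /proc/sys/lnet/routes

  # global LNET message counters (drops are counted here)
  cat /proc/sys/lnet/stats

  # end-to-end connectivity through the routers (placeholder NID)
  lctl ping 10.2.0.1@o2ib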

> Is this an expected behaviour?

Well, let's call this a known problem we would like to address at some point.

> How could I protect my filesystem from that? If I increase the timeout,
> this won't change anything

Right, tweaking timeouts cannot help here.

> if the client/server do not re-send their RPCs.

To be clear, clients go through a disconnect/reconnect cycle and eventually 
resend RPCs.

> > I think the bugzilla bug was called "limited server-side resend" or 
> > similar, filed by me several years ago.
> >
> Did not find either :)

That's bug 3622. Fanyong also used to work on a patch, see 
http://review.whamcloud.com/#change,125.

HTH

Cheers,
Johann


Re: [Lustre-discuss] Too many client evictions

2011-05-04 Thread DEGREMONT Aurelien
Johann Lombardi wrote:
> On Wed, May 04, 2011 at 01:37:14PM +0200, DEGREMONT Aurelien wrote:
>   
>>> I assume that the 25315s is from a bug
>>>   
> BTW, do you see this problem with both extent & inodebits locks?
>   
Yes, both. But more often on the MDS.
>> How can I track those dropped RPCs on routers?
>> 
>
> I don't think routers can drop RPCs w/o a good reason. It is just that a 
> router failure can lead to packet loss and given that servers don't resend 
> local callbacks, this can result in client evictions.
>   
Currently I do not see any issue with the routers.
The logs are very quiet and the load is very low. Nothing looks like a 
router failure.
If LNET decided to drop packets for some buggy reason, I would expect it 
to at least say something in the kernel log ("omg I've dropped 2 packets, 
please expect evictions :)").

>> if the client/server do not re-send their RPCs.
>> 
> To be clear, clients go through a disconnect/reconnect cycle and eventually 
> resend RPCs.
>   
I'm not sure I understand clearly what happens there.
If the client does not respond to the server's AST, it will be evicted by 
the server. The server does not seem to send a message to tell it (why 
bother, as it seems unresponsive or dead anyway?).
The client realizes at the next obd_ping that the connection does not 
exist anymore (rc=-107, ENOTCONN).
Then it tries to reconnect, and at that time the server tells it that it 
was really evicted. The client says "in progress operation will fail". 
AFAIK, this means dropping all locks and all dirty pages. Async I/O is 
lost. The connection status becomes EVICTED. I/O during this window will 
receive -108, ESHUTDOWN (the kernel log says @@@ IMP_INVALID, see 
ptlrpc_import_delay_req()).
Then the client reconnects, but some I/O was lost and user programs could 
have experienced errors from I/O syscalls.
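
For the record, this window is visible from the client: the import proc
files show the per-target connection state. Something like the following,
assuming the import files on 2.0 behave as they do on our nodes:

  # connection state per target (look for "state: EVICTED")
  lctl get_param osc.*.import mdc.*.import | grep state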

This is not the same as a connection timeout, where the client will try a 
failover and do a disconnect/recovery cycle, and everything is OK.

Is this correct?

> That's bug 3622. Fanyong also used to work on a patch, see 
> http://review.whamcloud.com/#change,125.
>   
This looks very interesting, as it seems to match our issue. But 
unfortunately, there has been no news for 2 months.



Aurélien



Re: [Lustre-discuss] Too many client evictions

2011-05-04 Thread Johann Lombardi
On Wed, May 04, 2011 at 04:05:56PM +0200, DEGREMONT Aurelien wrote:
> >> if the client/server do not re-send their RPCs.
> >> 
> > To be clear, clients go through a disconnect/reconnect cycle and eventually 
> > resend RPCs.
> >   
> I'm not sure I understand clearly what happens there.

I just meant that after a router failure, RPCs sent by clients should be resent 
at some point whereas lock callbacks (sent by servers) won't be.

> This is not the same as a connection timeout, where the client will try a 
> failover and do a disconnect/recovery cycle, and everything is OK.
> 
> Is this correct?

Right.

Johann


Re: [Lustre-discuss] Too many client evictions

2011-05-07 Thread Andreas Dilger
Aurelien, now that I think about it, it may be that the LNET errors are 
turned off by default. You should check if the "neterr" debug flag is on. 
Otherwise LNET errors are not printed to the console by default.
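
A quick way to check and enable it (I believe the "+" mask syntax works
here, but please verify):

  # show the current debug mask
  lctl get_param debug

  # add neterr to the mask without clobbering the rest
  lctl set_param debug=+neterr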

Cheers, Andreas



Re: [Lustre-discuss] Too many client evictions

2011-05-08 Thread Aurélien Degrémont
Thanks for the notice, but I had already checked that, and it was OK.
I will see if my latest tunings change anything.

Aurélien

On 08/05/2011 at 05:29, Andreas Dilger wrote:
> Aurelien, now that I think about it, it may be that the LNET errors are 
> turned off by default. You should check if the "neterr" debug flag is on. 
> Otherwise LNET errors are not printed to the console by default.
>
> Cheers, Andreas

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss