https://github.com/ceph/ceph/pull/3797

The assert failure is hard to reproduce for existing test suites
because it's only happen when lossless_peer_reuse policy and it's hard
to simulate the reproduce steps from upper layer.

But we can easily reproduce this when do stress tests for Messenger, I
added a inject-error stress test for lossless_peer_reuse policy, it
can reproduce it easily

On Wed, Feb 25, 2015 at 2:27 AM, Gregory Farnum <gfar...@redhat.com> wrote:
>
>> On Feb 24, 2015, at 7:18 AM, Haomai Wang <haomaiw...@gmail.com> wrote:
>>
>> On Tue, Feb 24, 2015 at 12:04 AM, Greg Farnum <gfar...@redhat.com> wrote:
>>> On Feb 12, 2015, at 9:17 PM, Haomai Wang <haomaiw...@gmail.com> wrote:
>>>>
>>>> On Fri, Feb 13, 2015 at 1:26 AM, Greg Farnum <gfar...@redhat.com> wrote:
>>>>> Sorry for the delayed response.
>>>>>
>>>>>> On Feb 11, 2015, at 3:48 AM, Haomai Wang <haomaiw...@gmail.com> wrote:
>>>>>>
>>>>>> Hmm, I got it.
>>>>>>
>>>>>> There exists another problem I'm not sure whether captured by upper 
>>>>>> layer:
>>>>>>
>>>>>> two monitor node(A,B) connected with lossless_peer_reuse policy,
>>>>>> 1. lots of messages has been transmitted....
>>>>>> 2. markdown A
>>>>>
>>>>> I don’t think monitors ever mark each other down?
>>>>>
>>>>>> 3. restart A and call send_message(message will be in out_q)
>>>>>
>>>>> oh, maybe you just mean rebooting it, not an interface thing, okay...
>>>>>
>>>>>> 4. network error injected and A failed to build a *session* with B
>>>>>> 5. because of policy and in_queue() == true, we will reconnect in 
>>>>>> "writer()"
>>>>>> 6. connect_seq++ and try to reconnect
>>>>>
>>>>> I think you’re wrong about this step. The messenger won’t increment 
>>>>> connect_seq directly in writer() because it will be in STATE_CONNECTING, 
>>>>> so it just calls connect() directly.
>>>>> connect() doesn’t increment the connect_seq unless it successfully 
>>>>> finishes a session negotiation.
>>>>
>>>> Hmm, sorry. I checked log again. Actually A doesn't have any message
>>>> in queue. So it will enter standby state and increase connect_seq. It
>>>> will not be *STATE_CONNECTING*.
>>>>
>>>
>>> Urgh, that case does seem broken, yes. I take it this is something you’ve 
>>> actually run across?
>>>
>>> It looks like that connect_seq++ was added by 
>>> https://github.com/ceph/ceph/commit/0fc47e267b6f8dcd4511d887d5ad37d460374c25.
>>>  Which makes me think we might just be able to increment the connect_seq 
>>> appropriately in the connect() function if we need to do so (on 
>>> replacement, I assume). Would you like to look at that and how this change 
>>> might impact the peer with regards to the referenced assert failure?
>>>
>>> -A very slow-to-reply and apologetic Greg
>>
>> Thanks to Greg!
>>
>> I looked at 
>> commit(https://github.com/ceph/ceph/commit/0fc47e267b6f8dcd4511d887d5ad37d460374c25#diff-500cebf3665775f2e77db1ff88255b9bR1553)
>> and think it's useless now.
>>
>> When accepting and "connect.connect_seq == existing->connect_seq",
>> existing->state maybe STATE_OPEN, STATE_STANDBY or STANDY_CONNECTING.
>> This commit only fix partial problem and want to assert
>> "(existing->state == STATE_CONNECTING)". So later we added codes to
>> catch "(existing->state == STATE_OPEN || existing->state ==
>> STATE_STANDBY)" before asserting.
>>
>> So from my point, this commit is unnecessary now and we can drop this commit.
>>
>> @sage, what do you think? Any other corner point this commit considered?
>
> Yeah, that looks right to me. Are you seeing this error frequently enough to 
> validate it in testing? We can run it through our suite as well of course, 
> but the race is very narrow and I don’t think our current tests capture it 
> anyway. :/
> Maybe even a PR to enable testing on this kind of connection network failure 
> race? :)
> -Greg
>



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to