[jira] [Comment Edited] (TS-1661) Cluster Communications fails without retries ..

Dow Buzzell (JIRA) Fri, 22 Mar 2013 00:05:19 -0700

    [ 
https://issues.apache.org/jira/browse/TS-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610004#comment-13610004
 ]


Dow Buzzell edited comment on TS-1661 at 3/22/13 7:04 AM:
----------------------------------------------------------

Thank you for looking into this.  We just moved a 3-node VM cluster to a more 
stable network, and we have not lost a node out of the cluster in over 24 
hours. On the very noisy network where we captured the screenshots, we would 
lose a node out of the cluster in about 5 minutes, so we have further 
confirmation of what we captured in Wireshark on the noisy VLAN.  We still had 
to pin the VMs to a single blade, however.

We really are trying to keep these on VMs so that we can begin to think about 
running these with elastic nodes, and we plan on working on a solution to get 
our nodes off the same blade so we do not take an outage if that blade fails 
and they move to another ESX host.

Our next step is to build out a private VLAN between blades for cluster 
communication.  I will update the thread on our progress. In our other 
environment we run physical machines with 2 NICs and never see the issue. That 
environment also has many ESX VMs, and we never see the issue there either, as 
the network is not nearly as subject to packet loss.

Again, we really appreciate your help.  The team was very happy to know that 
this issue is being looked into, as there is quite a lot of support for getting 
to an elastic VM solution using ATS.  Cheers.
                
> Cluster Communications fails without retries ..
> -----------------------------------------------
>
>                 Key: TS-1661
>                 URL: https://issues.apache.org/jira/browse/TS-1661
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Clustering
>            Reporter: Dow Buzzell
>            Assignee: Bin Chen
>             Fix For: sometime
>
>         Attachments: dropped-packet.png, dropped-packet.png, 
> refuse-reconnect.png
>
>
> It has been observed that once a packet is lost during the TCP cache calls 
> that fetch an object from a member of the cache cluster, the connection is 
> FIN'ed and reconnect attempts are refused with resets. Is this the expected 
> behavior? A private cache network with perfect delivery solves this problem; 
> however, it seems it would help if the member accepted reconnect 
> attempts.

