[ https://issues.apache.org/jira/browse/TS-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13610004#comment-13610004 ]
Dow Buzzell edited comment on TS-1661 at 3/22/13 7:04 AM:
----------------------------------------------------------

Thank you for looking into this.

We just moved a 3-node VM cluster to a more stable network, and we have not lost a node out of the cluster in over 24 hours. On the very noisy network where we captured the screenshots, we would lose a node out of the cluster in about 5 minutes. So we have further confirmation of what we captured in Wireshark on the noisy VLAN. We still had to pin the VMs to a single blade, however.

We really are trying to keep these on VMs so that we can begin to think about running them as elastic nodes, and we plan on working on a solution to get our nodes off the same blade so that we do not take an outage if that blade fails and they move to another ESX host. Our next step is to build out a private VLAN between blades for cluster communication. I will update the thread on our progress.

In our other environment we run physical machines with 2 NICs and never see the issue. That environment also has many ESX VMs, and we never see the issue there either, as the network is not nearly as subject to packet loss.

Again, we really appreciate your help. The team was very happy to know that this issue is being looked into, as there is quite a lot of support for getting to an elastic VM solution using ATS.

Cheers.

was (Author: dowbuzz):

Thank you for looking into this.

We just moved a 3-node VM cluster to a more stable network, and we have not lost a node out of the cluster in over 24 hours. On the very noisy network where we captured the screenshots, we would lose a node out of the cluster in about 5 minutes. So we have further confirmation of what we captured in Wireshark on the noisy VLAN. We still had to pin the VMs to a single blade, however.
We really are trying to keep these on VMs so that we can begin to think about running them as elastic nodes, and we plan on working on a solution to get our nodes off the same blade so that we do not take an outage if that blade fails and they move to another ESX host. Our next step is to build out a private VLAN between blades for cluster communication. I will update the thread on our progress.

In our other environment we run physical machines with 2 NICs and never see the issue.

> Cluster Communications fails without retries ..
> -----------------------------------------------
>
>                 Key: TS-1661
>                 URL: https://issues.apache.org/jira/browse/TS-1661
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Clustering
>            Reporter: Dow Buzzell
>            Assignee: Bin Chen
>             Fix For: sometime
>
>         Attachments: dropped-packet.png, dropped-packet.png, refuse-reconnect.png
>
>
> It has been observed that once a packet is lost during the TCP cache calls
> that fetch an object from a member of the cache cluster, the connection is
> closed with a FIN and subsequent reconnect attempts are refused with resets.
> Is this the expected behavior? A private cache network with perfect delivery
> solves this problem; however, it seems it would help if the member accepted
> reconnect attempts.