Re: Cascading cluster failure

2014-12-29 Thread Kris Davey
ons? >>>>>>> >>>>>>> >>>>>>> On 24 December 2014 at 22:29, Abhishek wrote: >>>>>>> >>>>>>>> Thanks for reading vineeth. That was my initial thought but I >>>>>>>> cou

Re: Cascading cluster failure

2014-12-25 Thread Mark Walkom
hek wrote: >>>>>> >>>>>>> Thanks for reading vineeth. That was my initial thought but I >>>>>>> couldn't find any old gc during the outage. Each es node has 32 gigs. >>>>>>> Each >>>>>>> bo

Re: Cascading cluster failure

2014-12-25 Thread vineeth mohan
ge. Each es node has 32 gigs. >>>>>> Each >>>>>> box has 128gigs split between 2 es nodes(32G each) and file system cache >>>>>> (64G). >>>>>> >>>>>> On Wed, Dec 24, 2014 at 4:49 PM, vineeth mohan >>>>>

Re: Cascading cluster failure

2014-12-25 Thread Abhishek
uring the outage. Each es node has 32 gigs. Each box >>>>>> has 128gigs split between 2 es nodes(32G each) and file system cache >>>>>> (64G). >>>>>> >>>>>>> On Wed, Dec 24, 2014 at 4:49 PM, vineeth mohan >>>>

Re: Cascading cluster failure

2014-12-25 Thread vineeth mohan
> What is the memory for each of these machines? >>>>>> Also see if there is any correlation between garbage collection and >>>>>> the time this anomaly happens. >>>>>> Chances are that the stop the world time might block the ping for >>>

Re: Cascading cluster failure

2014-12-24 Thread Abhishek Andhavarapu
h mohan >>>> wrote: >>>> >>>>> Hi , >>>>> >>>>> What is the memory for each of these machines ? >>>>> Also see if there is any correlation between garbage collection and >>>>> the time this anomaly happens

Re: Cascading cluster failure

2014-12-24 Thread Mark Walkom
>>>> time this anomaly happens. >>>> Chances are that the stop the world time might block the ping for >>>> sometime and the cluster might feel some nodes are gone. >>>> >>>> Thanks >>>> Vineeth >>>> >>>

Re: Cascading cluster failure

2014-12-24 Thread Pat Wright
t;>> Chances are that the stop the world time might block the ping for >>> sometime and the cluster might feel some nodes are gone. >>> >>> Thanks >>> Vineeth >>> >>> On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu < >

Re: Cascading cluster failure

2014-12-24 Thread Nikolas Everett
>> time this anomaly happens. >>> Chances are that the stop the world time might block the ping for >>> sometime and the cluster might feel some nodes are gone. >>> >>> Thanks >>> Vineeth >>> >>> On Wed, Dec 24, 2014 at

Re: Cascading cluster failure

2014-12-24 Thread Mark Walkom
for >> sometime and the cluster might feel some nodes are gone. >> >> Thanks >> Vineeth >> >> On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu < >> abhishek...@gmail.com> wrote: >> >>> Hi all, >>> >>> We recently

Re: Cascading cluster failure

2014-12-24 Thread Abhishek
> Thanks > Vineeth > > On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu < > abhishek...@gmail.com> wrote: > >> Hi all, >> >> We recently had a cascading cluster failure. From 16:35 to 16:42 the >> cluster went red and recovered itself. I c

Re: Cascading cluster failure

2014-12-24 Thread vineeth mohan
Vineeth On Wed, Dec 24, 2014 at 4:23 PM, Abhishek Andhavarapu wrote: > Hi all, > > We recently had a cascading cluster failure. From 16:35 to 16:42 the > cluster went red and recovered itself. I can't seem to find any obvious > logs around this time. > > The cluster has

Cascading cluster failure

2014-12-24 Thread Abhishek Andhavarapu
Hi all, We recently had a cascading cluster failure. From 16:35 to 16:42 the cluster went red and recovered itself. I can't seem to find any obvious logs around this time. The cluster has about 19 nodes. 9 physical boxes running two instances of Elasticsearch. And one VM as balance

cluster failure

2014-05-16 Thread Jilles van Gurp
I just had an incident where my entire cluster (all nodes) ended up using 100% CPU on each node at the same time and became completely unresponsive, even to /_cluster/health. This happened while I was using Kibana, which was working fine up to that point. I was running a few simple queries (nothin

Re: Complete cluster failure

2014-03-24 Thread Ivan Brusic
Just another update since there have been others that had issues with multicast in the past and switched to unicast. My issue appears to be with the multicast group. The default in Elasticsearch is 224.2.2.4, which according to the RFC is within the SDP/SAP Block. Our internal application uses mDN
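The point about the multicast group can be checked with a small sketch. It assumes the SDP/SAP block is 224.2.0.0/16, as listed in the IANA IPv4 multicast assignments (RFC 5771); the function name is hypothetical and not part of Elasticsearch.

```python
import ipaddress

# Assumption: the SDP/SAP block is 224.2.0.0/16 per the IANA
# IPv4 multicast address assignments (RFC 5771).
SDP_SAP_BLOCK = ipaddress.ip_network("224.2.0.0/16")


def in_sdp_sap_block(group: str) -> bool:
    """Return True if `group` is a multicast address inside the SDP/SAP block."""
    addr = ipaddress.ip_address(group)
    return addr.is_multicast and addr in SDP_SAP_BLOCK


# Usage: in_sdp_sap_block("224.2.2.4") checks the default group
# mentioned in the message above.
```

A check like this only shows which block a group belongs to; it does not tell you whether another application on the network is actually using that group.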

Re: Complete cluster failure

2014-03-20 Thread Ivan Brusic
Don't bother digging deeper, since I suspect the network. I tried many different configurations while trying to pinpoint the problem, so I did not write down the various states, just the successes/failures. Using the described methods, IPv4 was indeed working, but multicast was still not coopera

Re: Complete cluster failure

2014-03-20 Thread Zachary Tong
Nice post-mortem, thanks for the writeup. Hopefully someone will stumble on this in the future and avoid the same headache you had :) How would you force IPV4? I tried using preferIPv4Stack and setting > network.host to _eth0:ipv4_, but it still did not work. Even switched off > iptables at a

Re: Complete cluster failure

2014-03-19 Thread Ivan Brusic
Responses inline. On Wed, Mar 19, 2014 at 7:25 PM, Zachary Tong wrote: > Yeah, in case anyone reads this thread in the future, this log output is a > good indicator of multicast problems. You can see that the nodes are > pinging and talking to each other on this log line: > > --> target [[se

Re: Complete cluster failure

2014-03-19 Thread Zachary Tong
Yeah, in case anyone reads this thread in the future, this log output is a good indicator of multicast problems. You can see that the nodes are pinging and talking to each other on this log line: --> target [[search6][T3tINFmqREK9W6oqZV0r7A][inet[/192.168.50.106:9300]]], master [null] Th

Re: Complete cluster failure

2014-03-18 Thread Ivan Brusic
My mind was not clear since I was debugging this issue for a few hours. Once I realized it was a multicast issue, I switched to unicast and the cluster was back up and running. So it was multicast after all. I should have been more careful when I received an email on Friday that said " will have to

Re: Complete cluster failure

2014-03-18 Thread 熊贻青
How many NICs are there on each of your nodes? We had some issues on boxes with 4 NICs: some addresses were not reachable due to a Linux kernel setting. I'd suggest you test the full connection matrix via some shell script, so as to rule out this cause. My 2 cents -- You received this message because yo
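The connection-matrix test suggested above could be sketched as follows, in Python rather than shell. This is a minimal sketch under stated assumptions: the function name and the peer list are hypothetical, the check is TCP only (it says nothing about multicast), and it would have to be run on every node in turn to cover the full matrix.

```python
import socket


def check_peers(peers, timeout=2.0):
    """Try a TCP connection to each (host, port) pair and report reachability.

    `peers` is a hypothetical list of (host, port) tuples, for example the
    transport addresses of the other nodes in the cluster.
    """
    results = {}
    for host, port in peers:
        try:
            # create_connection raises OSError on refusal, timeout, or no route
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            results[(host, port)] = False
    return results


# Usage (run on each node, listing every other node's address):
# for peer, ok in check_peers([("192.168.50.106", 9300)]).items():
#     print(peer, "reachable" if ok else "UNREACHABLE")
```

If the boxes have several NICs, each address of each node would go in the list as a separate entry so that an unreachable interface shows up on its own line.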

Re: Complete cluster failure

2014-03-18 Thread Ivan Brusic
No matter in what order I restart the servers, the same 4-node clusters get created. I suspect the network, especially since there was some work done this past Friday on the underlying VM host. Would Elasticsearch cache multicast information? The servers have not been restarted in at least a week. Iva

Complete cluster failure

2014-03-18 Thread Ivan Brusic
I have been running Elasticsearch for years and I have never encountered a collapse such as the one I am experiencing. Even when experiencing split brain clusters, I still had it running and accepting search requests. 8 node development cluster running 0.90.2 using multicast. Last time the cluster