I think the biggest difference between Marvel and bigdesk and its relatives is that they lack history, which is why Marvel stores data - so you can always go back and find out what went wrong during the night.
If you don't mind me chasing this further (I do want to know what went wrong :) ) - in your production cluster, how many nodes and indices do you have? I'm asking to get a grip on your 37GB of data (if you prefer to share it privately, you should be able to do so via the groups interface; otherwise I'm bleskes on freenode in #elasticsearch, where I'm online for most European waking hours).

Cheers,
Boaz

On Mon, Apr 21, 2014 at 9:45 AM, T Vinod Gupta <tvi...@readypulse.com> wrote:
> Thanks Boaz for the reply. I was using the latest Marvel 1.1, by the way.
> Looks like you need Marvel for Marvel!
> Actually, my Marvel cluster got so messed up that no matter what I did it
> would show shard failures in the dashboard and nothing was functional. I
> actually had a 2-node cluster for Marvel monitoring, and after a restart
> the nodes never got out of red state.
> So I just gave up on my experimentation with Marvel and abandoned it
> fully.
>
> I'll probably go back to bigdesk. Are there any other good alternatives?
>
> Thanks
>
> PS - my feedback to the Marvel team would be to provide Marvel as a
> service - that would be huge! I noticed that the size of the data dir on my
> Marvel node was 37G from just a few days of monitoring. That's heavy.
>
> On Sat, Apr 19, 2014 at 1:05 AM, Boaz Leskes <b.les...@gmail.com> wrote:
>> Hi,
>>
>> Regarding monitoring node sizing - you have to go through pretty much the
>> same procedure as with your main cluster. See how much data it generates
>> per day and monitor the memory usage of the node while using Marvel on a
>> single day's index. That is the basis for your calculation. Based on
>> that and the number of days of data you want to retain, you can decide how
>> many nodes you need and how much memory each should get. BTW - make sure
>> you use the latest version of Marvel (1.1) - it has a much smaller data
>> footprint.
>>
>> Regarding the errors on your main production cluster:
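The sizing procedure quoted above boils down to back-of-the-envelope arithmetic. A minimal sketch, assuming the 37G figure accumulated over roughly three days (the thread only says "a few days") and a hypothetical 7-day retention window - the function name and all numbers are illustrative, not from the thread:

```python
def retention_disk_gb(observed_gb, observed_days, retention_days):
    """Scale the observed daily data rate to the desired retention window."""
    per_day_gb = observed_gb / observed_days
    return per_day_gb * retention_days

# Assumed figures: 37 GB accumulated over ~3 days, retained for 7 days.
estimate = retention_disk_gb(37, 3, 7)
print(f"~{estimate:.0f} GB of Marvel data on disk")
```

The same per-day rate, multiplied out, is what you would then weigh against per-node disk and heap to pick a node count.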
>> I'm a bit puzzled by the log output, as the events are pretty far apart.
>> It starts with a timeout of the Marvel agent; 6 hours later it failed to
>> connect (in between, everything seems fine). Almost 13 hours later the
>> node had an OOM (after which you restarted it, right? It has a different
>> name). Then 40m later the log shows that another node (10.183.42.216) is
>> under pressure and rejecting searches.
>>
>> I'm not sure the first part is related to the second. Can you share
>> your Marvel chart of JVM memory for the Darkoth node? It seems your
>> main cluster is also under memory pressure.
>>
>> Cheers,
>> Boaz
>>
>> On Thursday, April 17, 2014 10:08:04 PM UTC+2, T Vinod Gupta wrote:
>>>
>>> Hi,
>>> In my setup, the Marvel node is separate from the production cluster.
>>> The production nodes send data to the Marvel node, and the Marvel node
>>> had an OOM exception. This brings me to the question: how much heap does
>>> it need? I ran with the default config.
>>>
>>> In my prod cluster, I have a load balancer which is a no-data node. It
>>> runs with just 2GB heap. Due to the Marvel failure, this node was getting
>>> timeouts and for some strange reason went down.
>>>
>>> What are the best practices here? How can I avoid this in the future?
>>>
>>> Marvel node -
>>> [2014-04-17 09:13:33,715][WARN ][index.engine.internal ] [Gorilla-Man] [.marvel-2014.04.17][0] failed engine
>>> java.lang.OutOfMemoryError: Java heap space
>>> [2014-04-17 09:13:46,890][ERROR][index.engine.internal ] [Gorilla-Man] [.marvel-2014.04.17][0] failed to acquire searcher, source search_factory
>>> org.apache.lucene.store.AlreadyClosedException: this ReferenceManager is closed
>>>     at org.apache.lucene.search.ReferenceManager.acquire(ReferenceManager.java:98)
>>> ...
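On the "how much heap does it need?" question above: the OOM in the Marvel-node log suggests the default heap was too small for the data it held. In Elasticsearch 1.x the usual knob is the ES_HEAP_SIZE environment variable, which the bin/elasticsearch startup script uses to set both -Xms and -Xmx; the 4g value here is purely illustrative, not a recommendation from the thread:

```shell
# Illustrative only: give the monitoring node a larger heap before starting it.
# ES_HEAP_SIZE sets both -Xms and -Xmx in the 1.x bin/elasticsearch script.
export ES_HEAP_SIZE=4g
echo "heap set to $ES_HEAP_SIZE"
```

The right value has to come from the sizing exercise Boaz describes, observed against a single day's index.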
>>>
>>> ES LB node -
>>> [2014-04-17 00:01:00,567][ERROR][marvel.agent.exporter ] [Darkoth] create failure (index:[.marvel-2014.04.16] type: [node_stats]): UnavailableShardsException[[.marvel-2014.04.16][0] [2] shardIt, [0] active : Timeout waiting for [1m], request: org.elasticsearch.action.bulk.BulkShardRequest@5d9be928]
>>> [2014-04-17 06:41:46,975][ERROR][marvel.agent.exporter ] [Darkoth] error connecting to [ip-10-68-145-124.ec2.internal:9200]
>>> java.net.SocketTimeoutException: connect timed out
>>> [2014-04-17 18:53:09,969][DEBUG][action.admin.cluster.node.info] [Darkoth] failed to execute on node [L1f57myxQLK1SSRHRFcvFQ]
>>> java.lang.OutOfMemoryError: Java heap space
>>> [2014-04-17 19:35:05,805][DEBUG][action.search.type ] [Witchfire] [twitter_072013][0], node[5GNeFfbPTGi-1EccVvR7Nw], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@2f94d571] lastShard [true]
>>> org.elasticsearch.transport.RemoteTransportException: [Mauvais][inet[/10.183.42.216:9300]][search/phase/query]
>>> Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@4c75d754
>>>     at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/614c3f0e-6aa4-4848-9f47-1a9b93e536f5%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
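A side note on the 37G of Marvel data mentioned above: since Marvel writes one .marvel-YYYY.MM.DD index per day, disk usage can be capped by deleting daily indices older than the chosen retention window (curator or a plain DELETE request does the actual deletion). A hypothetical sketch of selecting which indices to prune - the names, function, and 3-day window are illustrative, not from the thread:

```python
from datetime import date, timedelta

def indices_to_prune(index_names, today, keep_days):
    """Return the .marvel-YYYY.MM.DD index names older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for name in index_names:
        # ".marvel-2014.04.17" -> date(2014, 4, 17)
        year, month, day = map(int, name.rsplit("-", 1)[1].split("."))
        if date(year, month, day) < cutoff:
            stale.append(name)
    return stale

names = [".marvel-2014.04.10", ".marvel-2014.04.16", ".marvel-2014.04.17"]
print(indices_to_prune(names, today=date(2014, 4, 17), keep_days=3))
```

Dropping a whole daily index is far cheaper than deleting individual documents, which is precisely why Marvel uses time-based indices.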