Those commits look good to me! I'd be super curious what you see in the logs, especially anything coming from this:

    logger.warn("Searcher was released twice", new ElasticSearchIllegalStateException("Double release"));

Does the bulk index contain a delete-by-query or anything like that?

simon
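For readers skimming the thread: that warning fires when a searcher reference is handed back more than once. As a rough illustration only (this is not the Elasticsearch implementation, just the general shape of such a guard, using java.util.logging in place of ES's logger):

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.logging.Logger;

    // Illustrative sketch of a "double release" guard; not the actual ES code.
    class GuardedSearcher {
        private static final Logger logger = Logger.getLogger(GuardedSearcher.class.getName());
        private final AtomicBoolean released = new AtomicBoolean(false);

        void release() {
            // compareAndSet fails if release() was already called once before.
            if (!released.compareAndSet(false, true)) {
                logger.warning("Searcher was released twice");
                return;
            }
            // ... decrement the underlying reader's refCount exactly once here ...
        }
    }

Seeing that warning would suggest an unbalanced acquire/release somewhere, which is the kind of refCount imbalance discussed further down the thread.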
On Tuesday, February 11, 2014 1:53:02 PM UTC+1, Jay Modi wrote:

Simon, prior to this post I decided to try a custom build from the 0.90.11 tag with the following commits cherry-picked onto my branch:

ad1097f1ba109d6cb235ba541251ba63abb27c16
b4ec18814b3eeb35d948c01abec3e04745f57458
93e9d2146e77f6c0523875b93c768ab7f81cfe04
6056d2cb5553e6277b023e6860739847fbd95bb7

I tried to minimize changes from the latest released version to the subset that deals with managing references. I'll see how this works.

Also, we were able to get some log information (client side) to find out more details. The problems started occurring around 4:16am EST and continued until 4:45am EST. We noticed the issues started with a bulk index request (not sure if that helps at all).

On Tuesday, February 11, 2014 4:21:12 AM UTC-5, simonw wrote:

Jay, first of all, good that this prevents the server from going into an infinite loop! Can you maybe build elasticsearch from the 0.90 branch (mvn clean -DskipTests package) and deploy the artifact that is in target/release/? We added some changes to prevent this; can you give it a go and see if that fixes your problem?

simon

On Monday, February 10, 2014 4:47:51 PM UTC+1, Jay Modi wrote:

I took the latest patch from the Lucene issue, patched the 4.6.0 tag, and started using that jar in an instance (running 0.90.8). It looks like that instance hit the condition that previously would have caused the infinite loop. Our logs filled up with exception messages about failing to acquire a searcher. I have created a gist of a portion of the log file with all of the failed-to-acquire messages: https://gist.github.com/jaymode/a73a9695e31d7ec53f3f

This appears to have gone on for about 15 to 30 minutes; then it looks as though a flush occurred and the errors went away, see the following gist: https://gist.github.com/jaymode/421919a3471f8358d5d4

Doing some log analysis, I got these counts of log messages:

Count      Message
272,190    [Collector 3] [elements][2] failed to acquire searcher, source search
2          [Collector 3] [elements][2] failed to acquire searcher, source search_factory
1          [Collector 3] [elements][2] failed to prepare/warm
1          [Collector 3] [elements][2] Failed to close current SearcherManager

It is interesting that it failed to prepare/warm, because we do not explicitly call any index warmer.

Is there something to be done here so the logs do not fill up with these messages? We keep about 550 MB of logs and we churned through that much in about 15 minutes when this happened, so I do not have log messages from the beginning of this event.

On Thursday, February 6, 2014 3:18:14 PM UTC-5, simonw wrote:

Thanks so much for the infos here! That is really helpful! Just to keep you updated, I committed a fix for the infinite loop to Lucene five minutes ago (https://issues.apache.org/jira/browse/LUCENE-5436) and we are working on porting the fixes to ES as we speak. We are adding additional checks to ES core to prevent this as much as possible. We will keep you updated. Nevertheless, I guess upgrading to the latest version 0.90.11 would be good for you; there are lots of sneaky things that are fixed!

simon
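For anyone following along who hasn't used the Lucene API in question: the acquire/release pattern this thread keeps coming back to looks roughly like this. A minimal sketch against the Lucene 4.x SearcherManager API; the index path, field, and value are placeholders, and it assumes an index already exists at that path:

    import java.io.File;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class AcquireReleaseExample {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/tmp/example-index")); // placeholder path
            SearcherManager manager = new SearcherManager(dir, null);         // null = default SearcherFactory

            IndexSearcher searcher = manager.acquire(); // increments the reader's refCount
            try {
                TopDocs hits = searcher.search(new TermQuery(new Term("field", "value")), 10);
                System.out.println("hits: " + hits.totalHits);
            } finally {
                manager.release(searcher);              // every acquire must be balanced by exactly one release
            }

            manager.close();
            dir.close();
        }
    }

The failure mode in this thread is what happens when that balancing goes wrong: release one reference too many and the refCount can reach 0 while acquire is still spinning, as described further down.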
On Thursday, February 6, 2014 8:59:52 PM UTC+1, JBartak wrote:

In our case we were running on CentOS 6.4.

* We aren't deleting by query.
* We are running parent/child queries (although not a lot of them, and maybe not at the time this happened). We do regularly run nested queries.
* We didn't see any exceptions in the logs.
* Not sure what you mean by embedding ElasticSearch, but we were running a 3-machine ElasticSearch cluster in Amazon and using no-data nodes to connect to it using the Java API.
* We have not reproduced the problem.
* It happened while running 0.90.7, but we haven't upgraded past that yet.
* We are not calling flush or updating engine settings.

On Thursday, February 6, 2014 11:47:38 AM UTC-8, simonw wrote:

Hey folks,

we are trying to get to the bottom of this and I'd like to have some more info here. Can you provide us with more insight into what you are doing with ES? What I'd be interested in is:

* are you using delete by query?
* are you running parent/child queries?
* are you seeing any exceptions in the logs?
* are you embedding elasticsearch?
* can you reproduce the problem, and if so, what are you doing to do so?
* can you tell me the latest version you have seen this happening on? I could see some problems in 0.90.7 that are fixed in 0.90.8 though.
* are you forcefully flushing, i.e. calling flush via the API, or updating engine settings of any kind?

simon

On Thursday, February 6, 2014 3:03:17 PM UTC+1, Jay Modi wrote:

Additionally, I took a heap dump from the ES 0.90.7 instance and I believe I see why the code was stuck in acquire.

I looked at all of the SearcherManager instances and their index readers, since the reader is what acquire tries to increment a reference count on. There were 24 readers in total associated with SearcherManager instances. Every reader except one had a refCount of 1; one had a refCount of 0. If there is a race where the refCount drops to 0 (the reader closes) while the acquire method is being called, the loop will run forever, because tryIncRef on an IndexReader always returns false when the refCount is not greater than 0.

Hope that makes sense. FWIW my understanding of the code could be wrong and something else could be going on; if clarification or more information from the heap dump/jstack is needed, please let me know.

On Thursday, February 6, 2014 8:50:22 AM UTC-5, Jay Modi wrote:

No, we are not running any plugins. Our application uses the Java API to query elasticsearch.

We also had this happen on another instance running ES 0.20.2 with JDK 6. Restarting "fixes" it, but we only run a single node, so when this happens everything in our application stops working.
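Since both posters mention connecting over the Java API (JBartak via no-data nodes), here is roughly what that wiring looked like on the 0.90 line. This is a sketch under assumptions, not taken from either application: the cluster name is a placeholder, and the index name "elements" is simply borrowed from the log lines above:

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.node.Node;
    import org.elasticsearch.node.NodeBuilder;

    public class ClientNodeExample {
        public static void main(String[] args) {
            // Join the cluster as a client ("no-data") node: it routes requests but holds no shards.
            Node node = NodeBuilder.nodeBuilder()
                    .clusterName("my-cluster") // placeholder
                    .client(true)              // shorthand for a node that holds no data
                    .node();
            Client client = node.client();
            try {
                SearchResponse response = client.prepareSearch("elements")
                        .setQuery(QueryBuilders.matchAllQuery())
                        .setSize(10)
                        .execute()
                        .actionGet();
                System.out.println("hits: " + response.getHits().getTotalHits());
            } finally {
                node.close();
            }
        }
    }

Nothing about this pattern acquires searchers directly; all acquire/release happens inside the data nodes, which is why the discussion focuses on the engine rather than on client code.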
On Thursday, February 6, 2014 7:28:34 AM UTC-5, Adrien Grand wrote:

On Fri, Jan 24, 2014 at 5:17 PM, Jay Modi <jay...@gmail.com> wrote:
> Coincidentally, we hit a similar issue on Windows yesterday with 0.90.8 and JDK 7u45. I took a jstack dump and noticed that all of my search threads were in ReferenceManager.acquire(). Looking at the Lucene code for acquire, there is a "busy wait" while it tries to increment a reference. The system has stayed in this state, and subsequent jstack dumps show the same behavior with all search threads in acquire. Prior to the lockup an index operation had completed, and the lockup occurred while a few searches and count queries were running.

This looks bad. Are you using any particular plugin that may acquire/release searchers?

--
Adrien Grand
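To make the failure mode concrete, here is a stripped-down sketch of the reference-counting pattern Jay and Adrien describe. It is deliberately simplified and is not the Lucene source; the method names merely mirror IndexReader.tryIncRef() and ReferenceManager.acquire():

    import java.util.concurrent.atomic.AtomicInteger;

    // Simplified model of the refCount/acquire interplay; not the actual Lucene code.
    class RefCounted {
        private final AtomicInteger refCount = new AtomicInteger(1);

        // Succeeds only while the count is still positive; once it hits 0 the
        // reference is considered closed and this method returns false forever.
        boolean tryIncRef() {
            int count;
            while ((count = refCount.get()) > 0) {
                if (refCount.compareAndSet(count, count + 1)) {
                    return true;
                }
            }
            return false;
        }

        void decRef() {
            refCount.decrementAndGet();
        }
    }

    class Manager {
        private volatile RefCounted current = new RefCounted();

        // The "busy wait" from the jstack dumps: if 'current' is never swapped
        // for a live reference and its refCount has already reached 0,
        // tryIncRef() keeps returning false and this loop never terminates.
        RefCounted acquire() {
            RefCounted ref;
            do {
                ref = current;
            } while (!ref.tryIncRef());
            return ref;
        }
    }

That matches Jay's heap-dump observation: one reader with a refCount of 0 still referenced by a SearcherManager is exactly the state in which every call to acquire spins forever.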