Those commits look good to me! I'd be super curious what you see in the logs, especially anything coming from this:

    logger.warn("Searcher was released twice", new ElasticSearchIllegalStateException("Double release"));

Does the bulk index contain a delete-by-query or anything like that?

simon
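For readers skimming the thread: that warning fires when a searcher reference is handed back more than once. As a rough illustration only (this is not the Elasticsearch implementation, just the general shape of such a guard, using java.util.logging in place of ES's logger):

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.logging.Logger;

    // Illustrative sketch of a "double release" guard; not the actual ES code.
    class GuardedSearcher {
        private static final Logger logger = Logger.getLogger(GuardedSearcher.class.getName());
        private final AtomicBoolean released = new AtomicBoolean(false);

        void release() {
            // compareAndSet fails if release() was already called once before.
            if (!released.compareAndSet(false, true)) {
                logger.warning("Searcher was released twice");
                return;
            }
            // ... decrement the underlying reader's refCount exactly once here ...
        }
    }

Seeing that warning would suggest an unbalanced acquire/release somewhere, which is the kind of refCount imbalance discussed further down the thread.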
On Tuesday, February 11, 2014 1:53:02 PM UTC+1, Jay Modi wrote:

Simon, prior to this post I decided to try a custom build from the 0.90.11 tag with the following commits cherry-picked onto my branch:

ad1097f1ba109d6cb235ba541251ba63abb27c16
b4ec18814b3eeb35d948c01abec3e04745f57458
93e9d2146e77f6c0523875b93c768ab7f81cfe04
6056d2cb5553e6277b023e6860739847fbd95bb7

I tried to minimize changes from the latest released version to the subset that deals with managing references. I'll see how this works.

Also, we were able to get some log information (client side) to find out more details. The problems started occurring around 4:16am EST and continued until 4:45am EST. We noticed the issues started with a bulk index request (not sure if that helps at all).

On Tuesday, February 11, 2014 4:21:12 AM UTC-5, simonw wrote:

Jay, first of all, good that this prevents the server from going into an infinite loop! Can you maybe build elasticsearch from the 0.90 branch (mvn clean -DskipTests package) and deploy the artifact that is in target/release/? We added some changes to prevent this; can you give it a go and see if that fixes your problem?

simon

On Monday, February 10, 2014 4:47:51 PM UTC+1, Jay Modi wrote:

I took the latest patch from the Lucene issue, patched the 4.6.0 tag, and started using that jar in an instance (running 0.90.8). It looks like that instance hit the condition that previously would have caused the infinite loop. Our logs filled up with exception messages about failing to acquire a searcher. I have created a gist of a portion of the log file with all of the failed-to-acquire messages: https://gist.github.com/jaymode/a73a9695e31d7ec53f3f

This appears to have gone on for about 15 to 30 minutes; then it looks as though a flush occurred and the errors went away, see the following gist: https://gist.github.com/jaymode/421919a3471f8358d5d4

Doing some log analysis, I got these counts of log messages:

Count      Message
272,190    [Collector 3] [elements][2] failed to acquire searcher, source search
2          [Collector 3] [elements][2] failed to acquire searcher, source search_factory
1          [Collector 3] [elements][2] failed to prepare/warm
1          [Collector 3] [elements][2] Failed to close current SearcherManager

It is interesting that it failed to prepare/warm, because we do not explicitly call any index warmer.

Is there something to be done here so the logs do not fill up with these messages? We keep about 550 MB of logs and we churned through that much in about 15 minutes when this happened, so I do not have log messages from the beginning of this event.

On Thursday, February 6, 2014 3:18:14 PM UTC-5, simonw wrote:

Thanks so much for the infos here! That is really helpful! Just to keep you updated, I committed a fix for the infinite loop to Lucene five minutes ago (https://issues.apache.org/jira/browse/LUCENE-5436) and we are working on porting the fixes to ES as we speak. We are adding additional checks to ES core to prevent this as much as possible. We will keep you updated. Nevertheless, I guess upgrading to the latest version 0.90.11 would be good for you; there are lots of sneaky things that are fixed!

simon
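For anyone following along who hasn't used the Lucene API in question: the acquire/release pattern this thread keeps coming back to looks roughly like this. A minimal sketch against the Lucene 4.x SearcherManager API; the index path, field, and value are placeholders, and it assumes an index already exists at that path:

    import java.io.File;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class AcquireReleaseExample {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/tmp/example-index")); // placeholder path
            SearcherManager manager = new SearcherManager(dir, null);         // null = default SearcherFactory

            IndexSearcher searcher = manager.acquire(); // increments the reader's refCount
            try {
                TopDocs hits = searcher.search(new TermQuery(new Term("field", "value")), 10);
                System.out.println("hits: " + hits.totalHits);
            } finally {
                manager.release(searcher);              // every acquire must be balanced by exactly one release
            }

            manager.close();
            dir.close();
        }
    }

The failure mode in this thread is what happens when that balancing goes wrong: release one reference too many and the refCount can reach 0 while acquire is still spinning, as described further down.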
On Thursday, February 6, 2014 8:59:52 PM UTC+1, JBartak wrote:

In our case we were running on CentOS 6.4.

* We aren't deleting by query.
* We are running parent/child queries (although not a lot of them, and maybe not at the time this happened). We do regularly run nested queries.
* We didn't see any exceptions in the logs.
* Not sure what you mean by embedding ElasticSearch, but we were running a 3-machine ElasticSearch cluster in Amazon and using no-data nodes to connect to it using the Java API.
* We have not reproduced the problem.
* It happened while running 0.90.7, but we haven't upgraded past that yet.
* We are not calling flush or updating engine settings.

On Thursday, February 6, 2014 11:47:38 AM UTC-8, simonw wrote:

Hey folks,

we are trying to get to the bottom of this and I'd like to have some more info here. Can you provide us with more insight into what you are doing with ES? What I'd be interested in is:

* are you using delete by query?
* are you running parent/child queries?
* are you seeing any exceptions in the logs?
* are you embedding elasticsearch?
* can you reproduce the problem, and if so, what are you doing to do so?
* can you tell me the latest version you have seen this happening on? I could see some problems in 0.90.7 that are fixed in 0.90.8 though.
* are you forcefully flushing, i.e. calling flush via the API, or updating engine settings of any kind?

simon

On Thursday, February 6, 2014 3:03:17 PM UTC+1, Jay Modi wrote:

Additionally, I took a heap dump from the ES 0.90.7 instance and I believe I see why the code was stuck in acquire.

I looked at all of the SearcherManager instances and their index readers, since the reader is what acquire tries to increment a reference count on. There were 24 readers in total associated with SearcherManager instances. Every reader except one had a refCount of 1; one had a refCount of 0. If there is a race where the refCount drops to 0 (the reader closes) while the acquire method is being called, the loop will run forever, because tryIncRef on an IndexReader always returns false when the refCount is not greater than 0.

Hope that makes sense. FWIW my understanding of the code could be wrong and something else could be going on; if clarification or more information from the heap dump/jstack is needed, please let me know.

On Thursday, February 6, 2014 8:50:22 AM UTC-5, Jay Modi wrote:

No, we are not running any plugins. Our application uses the Java API to query elasticsearch.

We also had this happen on another instance running ES 0.20.2 with JDK 6. Restarting "fixes" it, but we only run a single node, so when this happens everything in our application stops working.
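Since both posters mention connecting over the Java API (JBartak via no-data nodes), here is roughly what that wiring looked like on the 0.90 line. This is a sketch under assumptions, not taken from either application: the cluster name is a placeholder, and the index name "elements" is simply borrowed from the log lines above:

    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.node.Node;
    import org.elasticsearch.node.NodeBuilder;

    public class ClientNodeExample {
        public static void main(String[] args) {
            // Join the cluster as a client ("no-data") node: it routes requests but holds no shards.
            Node node = NodeBuilder.nodeBuilder()
                    .clusterName("my-cluster") // placeholder
                    .client(true)              // shorthand for a node that holds no data
                    .node();
            Client client = node.client();
            try {
                SearchResponse response = client.prepareSearch("elements")
                        .setQuery(QueryBuilders.matchAllQuery())
                        .setSize(10)
                        .execute()
                        .actionGet();
                System.out.println("hits: " + response.getHits().getTotalHits());
            } finally {
                node.close();
            }
        }
    }

Nothing about this pattern acquires searchers directly; all acquire/release happens inside the data nodes, which is why the discussion focuses on the engine rather than on client code.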
On Thursday, February 6, 2014 7:28:34 AM UTC-5, Adrien Grand wrote:

On Fri, Jan 24, 2014 at 5:17 PM, Jay Modi <jay...@gmail.com> wrote:
> Coincidentally, we hit a similar issue on Windows yesterday with 0.90.8 and JDK 7u45. I took a jstack dump and noticed that all of my search threads were in ReferenceManager.acquire(). Looking at the Lucene code for acquire, there is a "busy wait" while it tries to increment a reference. The system has stayed in this state, and subsequent jstack dumps show the same behavior with all search threads in acquire. Prior to the lockup an index operation had completed, and the lockup occurred while a few searches and count queries were running.

This looks bad. Are you using any particular plugin that may acquire/release searchers?

--
Adrien Grand
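To make the failure mode concrete, here is a stripped-down sketch of the reference-counting pattern Jay and Adrien describe. It is deliberately simplified and is not the Lucene source; the method names merely mirror IndexReader.tryIncRef() and ReferenceManager.acquire():

    import java.util.concurrent.atomic.AtomicInteger;

    // Simplified model of the refCount/acquire interplay; not the actual Lucene code.
    class RefCounted {
        private final AtomicInteger refCount = new AtomicInteger(1);

        // Succeeds only while the count is still positive; once it hits 0 the
        // reference is considered closed and this method returns false forever.
        boolean tryIncRef() {
            int count;
            while ((count = refCount.get()) > 0) {
                if (refCount.compareAndSet(count, count + 1)) {
                    return true;
                }
            }
            return false;
        }

        void decRef() {
            refCount.decrementAndGet();
        }
    }

    class Manager {
        private volatile RefCounted current = new RefCounted();

        // The "busy wait" from the jstack dumps: if 'current' is never swapped
        // for a live reference and its refCount has already reached 0,
        // tryIncRef() keeps returning false and this loop never terminates.
        RefCounted acquire() {
            RefCounted ref;
            do {
                ref = current;
            } while (!ref.tryIncRef());
            return ref;
        }
    }

That matches Jay's heap-dump observation: one reader with a refCount of 0 still referenced by a SearcherManager is exactly the state in which every call to acquire spins forever.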