Sure, I'm happy to bundle up all the logs and ship them to you guys. I'm 
assuming a zip attached to an email is fine?

We think the OOM caused the corruption, which *later* led to write.lock 
file handles being left open on the ES process when it hit EOF errors 
(which seemed like a bug, but I'm not very versed in ES failure scenarios), 
so IndexLock was a bit of a red herring until we found the underlying 
corruption. In fact, the cluster would often accept writes/queries for a 
while, I assume until it tried to read the broken segment and fell over 
(perhaps while trying to promote a different, also broken, shard to 
primary).

In hindsight, simply fixing the Lucene segments and restarting the entire 
cluster (to clear file handles) would have done the trick, but since this 
was production we wanted to do it one node at a time.
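
In case it's useful to someone else, the node-at-a-time pass is roughly the 
sketch below. The allocation-disable step, the service commands, and the 
lib/data paths are assumptions based on a stock package install rather than 
a transcript of our exact commands; the CheckIndex -fix part is what 
actually repaired the segments (at the cost of the docs in the unreadable 
ones).

# keep shards from being shuffled around while the node is down
# (this is the 0.90-era setting name; it differs in newer versions)
curl -XPUT 'localhost:9200/_cluster/settings' -d '
{"transient": {"cluster.routing.allocation.disable_allocation": true}}'

sudo service elasticsearch stop

# run Lucene's CheckIndex with -fix against the broken shard's index dir;
# it drops segments it can't read, so those docs are gone afterwards
java -cp /usr/share/elasticsearch/lib/lucene-core-*.jar \
  org.apache.lucene.index.CheckIndex \
  "/var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index" \
  -fix

sudo service elasticsearch start

# once the node is back, let allocation resume
curl -XPUT 'localhost:9200/_cluster/settings' -d '
{"transient": {"cluster.routing.allocation.disable_allocation": false}}'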


On Wednesday, December 18, 2013 1:03:13 AM UTC-8, Alexander Reelsen wrote:
>
> Hey,
>
> great, you got it running again. The replica corruption thing makes sense, 
> btw.
> Do you still have a stack trace of the OOM exception you found first? 
> Would like to see what caused it and maybe what one can do about it in the 
> future, if there is more information.
>
>
> --Alex
>
>
> On Wed, Dec 18, 2013 at 9:12 AM, Bryan Helmig <[email protected]> wrote:
>
>> Okay, a combination of CheckIndex -fix, careful manual allocation of 
>> shard 0, and restarts to clear the lock files has resulted in a green 
>> cluster.
>>
>>
>> On Tuesday, December 17, 2013 11:51:04 PM UTC-8, Bryan Helmig wrote:
>>>
>>> So, a little more digging and it looks like it was holding onto a 
>>> write.lock that was gone.
>>>
>>> sudo lsof -uelasticsearch | grep 'legacy/0'
>>> java    27517 elasticsearch 1042uW  REG  202,1  0  525279
>>>   /var/data/elasticsearch/Rage Against the Machine/nodes/0/indices/zapier_legacy/0/index/write.lock (deleted)
>>>
>>> We did delete some leftover lock files after the nodes powered down, but 
>>> that seems like it shouldn't have caused this (unless we made a mistake 
>>> and nuked one on a live instance). Somehow that plus the OOM corruption 
>>> led to a pretty crazy situation. We're almost back from it after some 
>>> restarts, and we should be able to put up a blog post on the situation 
>>> afterwards. I'll follow up with results and a link ASAP.
>>>
>>>
>>> On Tuesday, December 17, 2013 8:13:39 PM UTC-8, Bryan Helmig wrote:
>>>>
>>>> We're also fine with losing a few docs as we can reindex them from 
>>>> another source, so dropping the documents works for us.
>>>>
>>>>
>>>> On Tuesday, December 17, 2013 7:47:21 PM UTC-8, Bryan Helmig wrote:
>>>>>
>>>>> All replicas have the same corruption, it seems. We can't get a 
>>>>> primary up for shard 0, so the replica never comes up. Does that 
>>>>> make sense?
>>>>>
>>>>>
>>>>> On Tuesday, December 17, 2013 6:33:19 PM UTC-8, Jörg Prante wrote:
>>>>>>
>>>>>> Hm, just wanted to clarify that I'm not familiar with the effects of 
>>>>>> "index.shard.check_on_startup: fix" in the latest ES on Lucene 4.
>>>>>>
>>>>>> Even if I can test it, there is no guarantee that it works for you. 
>>>>>> Different systems, different index, different corruptions... who knows.
>>>>>>
>>>>>> I'm quite puzzled: you don't have a replica shard? The "CheckIndex" 
>>>>>> is really a last resort if there are no replicas, and it is not the 
>>>>>> preferred method to ensure data integrity in ES...
>>>>>>
>>>>>> Jörg
>>>>>>
>
>
