Hi Michael,
   Did you get a chance to look at the hot_threads and iostat output?

   I also tried an EBS provisioned-IOPS SSD volume with 4000 IOPS, and with
that I was able to ingest only around 30K docs per second before
EsRejectedExecutionExceptions appeared. There were 4 elasticsearch instances
of type c3.2xlarge. CPU utilization was around 650% (out of 800%). The
iostat output on the instances looks like this:

avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.66   0.00     0.14     0.15    0.04  98.01

Device:     tps  Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
xvdep1     7.86       36.95      266.05    392378   2825424
xvdf       0.03        0.20        0.00      2146         8
xvdg       0.03        0.21        0.07      2178       736
*xvdj     52.53        0.33     2693.62      3506  28605624*

   On an instance store SSD I can go up to 48K per second, with occasional
occurrences of EsRejectedExecutionException. Do you think I should try
storage-optimized instances like i2.xlarge or i2.2xlarge to handle this
kind of load?

Regards,
Srinath.
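(A rough sanity check, using the ~300-byte average document size mentioned
further down the thread: 30K docs/sec x 300 bytes is only about 9 MB/sec of
raw document data across the cluster, and the highlighted xvdj line shows
only ~53 tps against 4000 provisioned IOPS. Note, though, that iostat's
first report shows averages since boot; to see behavior under load, run it
with an interval, e.g. "iostat -x 5", and watch %iowait and the per-device
numbers while the rejections are occurring.)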
On Wed, Jul 16, 2014 at 5:57 PM, Srinath C <srinat...@gmail.com> wrote:

> Hi Michael,
>    You were right. It's the IO that was the bottleneck. The data was
> being written to a standard EBS device - no provisioned IOPS.
>
> After redirecting data to the local instance store SSD, I was able to
> get to a rate of around 50-55K without any EsRejectedExecutionExceptions.
> The CPU utilization too is not very high - around 200-400%. I have
> attached the hot_threads output with this email. After running for around
> 1.5 hrs I could see a lot of EsRejectedExecutionExceptions for certain
> periods of time.
>
> std_ebs_all_fine.txt - when using standard EBS. Around 25K docs per
> second. No EsRejectedExecutionExceptions.
> std_ebs_bulk_rejects.txt - when using standard EBS. Around 28K docs per
> second. Bulk rejections (EsRejectedExecutionExceptions) were seen.
>
> instance_ssd_40K.txt - when using instance store SSD. Around 40K docs
> per second. No EsRejectedExecutionExceptions.
> instance_ssd_60K_few_rejects.txt - when using instance store SSD. Around
> 60K docs per second. Some EsRejectedExecutionExceptions were seen.
> instance_ssd_60K_lot_of_rejects.txt - when using instance store SSD.
> Around 60K docs per second. A lot of EsRejectedExecutionExceptions were
> seen.
>
> Also attaching the iostat output for these instances.
>
> Regards,
> Srinath.
>
>
> On Wed, Jul 16, 2014 at 3:34 PM, joergpra...@gmail.com <
> joergpra...@gmail.com> wrote:
>
>> Adding to these recommendations, I would suggest running the iostat
>> tool to monitor for any suspicious "%iowait" states while the
>> EsRejectedExecutionExceptions arise.
>>
>> Jörg
>>
>>
>> On Wed, Jul 16, 2014 at 11:53 AM, Michael McCandless <
>> m...@elasticsearch.com> wrote:
>>
>>> Where is the index stored on your EC2 instances? Is it just
>>> EBS-attached storage (magnetic or SSD? Provisioned IOPS or the
>>> default)?
>>>
>>> Maybe try putting the index on the SSD instance storage instead? I
>>> realize this is not a long-term solution (limited storage, and it's
>>> cleared on reboot), but it would be a simple test of whether the IO
>>> limitations of EBS are the bottleneck here.
>>>
>>> Can you capture the hot threads output when you're at 200% CPU after
>>> indexing for a while?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
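For anyone reproducing this test, a minimal sketch of one way to capture
the hot threads output Mike asks for, assuming an ES 1.x node on
localhost:9200 (the _nodes/hot_threads endpoint; the file name and poll
interval here are arbitrary choices, not from the thread):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class HotThreadsPoller {
        public static void main(String[] args) throws Exception {
            // _nodes/hot_threads returns a plain-text thread sample.
            URL url = new URL("http://localhost:9200/_nodes/hot_threads");
            while (true) {
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                try (InputStream in = conn.getInputStream()) {
                    // Append each snapshot so spikes during a long run are kept.
                    Files.write(Paths.get("hot_threads.txt"), in.readAllBytes(),
                            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                } finally {
                    conn.disconnect();
                }
                Thread.sleep(10_000L); // one capture every 10 seconds
            }
        }
    }

Running this alongside the load generator makes it easy to line up reject
spikes with whatever the bulk and merge threads were doing at the time.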
>>> On Wed, Jul 16, 2014 at 3:03 AM, Srinath C <srinat...@gmail.com> wrote:
>>>
>>>> Hi Joe/Michael,
>>>>    I tried all your suggestions and found a remarkable difference in
>>>> the way elasticsearch is able to handle the bulk indexing.
>>>> Right now, I'm able to ingest at the rate of 25K per second with the
>>>> same setup, but occasionally there are still some
>>>> EsRejectedExecutionExceptions being raised. The CPU utilization on the
>>>> elasticsearch nodes is so low (around 200% on an 8-core system) that
>>>> it seems something else is wrong. I have also tried increasing
>>>> queue_size, but that just delays the EsRejectedExecutionExceptions.
>>>>
>>>> Any more suggestions on how to handle this?
>>>>
>>>> *Current setup*: 4 c3.2xlarge instances of ES 1.2.2.
>>>> *Current configurations*:
>>>> index.codec.bloom.load: false
>>>> index.compound_format: false
>>>> index.compound_on_flush: false
>>>> index.merge.policy.max_merge_at_once: 4
>>>> index.merge.policy.max_merge_at_once_explicit: 4
>>>> index.merge.policy.max_merged_segment: 1gb
>>>> index.merge.policy.segments_per_tier: 4
>>>> index.merge.policy.type: tiered
>>>> index.merge.scheduler.max_thread_count: 4
>>>> index.merge.scheduler.type: concurrent
>>>> index.refresh_interval: 10s
>>>> index.translog.flush_threshold_ops: 50000
>>>> index.translog.interval: 10s
>>>> index.warmer.enabled: false
>>>> indices.memory.index_buffer_size: 50%
>>>> indices.store.throttle.type: none
>>>>
>>>> On Tue, Jul 15, 2014 at 6:24 PM, Srinath C <srinat...@gmail.com> wrote:
>>>>
>>>>> Thanks Joe, Michael and all. Really appreciate your help.
>>>>> I'll try out your suggestions and run the tests. Will post back on
>>>>> my progress.
>>>>>
>>>>> On Tue, Jul 15, 2014 at 3:17 PM, Michael McCandless <
>>>>> m...@elasticsearch.com> wrote:
>>>>>
>>>>>> First off, upgrade ES to the latest (1.2.2) release; there have been
>>>>>> a number of bulk indexing improvements since 1.1.
>>>>>>
>>>>>> Second, disable merge IO throttling.
>>>>>>
>>>>>> Third, use the default settings, but increase index.refresh_interval
>>>>>> to perhaps 5s, and set index.translog.flush_threshold_ops to maybe
>>>>>> 50000: this decreases the frequency of Lucene-level commits
>>>>>> (= filesystem fsyncs).
>>>>>>
>>>>>> If possible, use SSDs: they are much faster for merging.
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>> On Mon, Jul 14, 2014 at 11:03 PM, Srinath C <srinat...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Each document is around 300 bytes on average, so that brings the
>>>>>>> data rate to around 17 MB per sec.
>>>>>>> This is running on ES version 1.1.1. I have been trying out
>>>>>>> different values for these configurations. queue_size was increased
>>>>>>> when I got EsRejectedExecutionExceptions due to the queue filling
>>>>>>> up (default size of 50). segments_per_tier was picked up from some
>>>>>>> articles on scaling. What would be a reasonable value based on my
>>>>>>> data rate?
>>>>>>>
>>>>>>> If 60K seems to be too high, are there any benchmarks available for
>>>>>>> Elasticsearch?
>>>>>>>
>>>>>>> Thanks all for your replies.
>>>>>>>
>>>>>>> On Monday, 14 July 2014 15:25:13 UTC+5:30, Jörg Prante wrote:
>>>>>>>
>>>>>>>> index.merge.policy.segments_per_tier: 100 and
>>>>>>>> threadpool.bulk.queue_size: 500 are extreme settings that should
>>>>>>>> be avoided, as they allocate a lot of resources. The
>>>>>>>> UnavailableShardsException / "No Nodes" errors you see are
>>>>>>>> congestion caused by such extreme values.
>>>>>>>>
>>>>>>>> What ES version is this? Why don't you use the default settings?
>>>>>>>>
>>>>>>>> Jörg
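Of the settings Mike suggests above, index.refresh_interval and
index.translog.flush_threshold_ops are dynamic index settings, and the
merge IO throttle is a dynamic node-level setting, so all of them can be
applied to a live cluster. A minimal sketch, assuming the ES 1.x Java
client ("events" is a placeholder index name):

    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.settings.ImmutableSettings;

    public final class IndexTuning {
        static void applyBulkIndexingSettings(Client client) {
            // Disable merge IO throttling (dynamic node-level setting).
            client.admin().cluster().prepareUpdateSettings()
                    .setTransientSettings(ImmutableSettings.settingsBuilder()
                            .put("indices.store.throttle.type", "none")
                            .build())
                    .execute().actionGet();

            // Refresh less often and flush the translog less frequently.
            client.admin().indices().prepareUpdateSettings("events")
                    .setSettings(ImmutableSettings.settingsBuilder()
                            .put("index.refresh_interval", "5s")
                            .put("index.translog.flush_threshold_ops", 50000)
                            .build())
                    .execute().actionGet();
        }
    }

The transient cluster setting reverts on a full cluster restart, which
suits a load-test experiment like this one.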
>>>>>>>> On Mon, Jul 14, 2014 at 4:46 AM, Srinath C <srin...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>    I'm having a tough time keeping elasticsearch running
>>>>>>>>> healthily for even 20-30 mins in my setup. At an indexing rate of
>>>>>>>>> 28-36K docs per second, the CPU utilization soon drops to 100%
>>>>>>>>> and never recovers. All client requests fail with
>>>>>>>>> UnavailableShardsException or "No Nodes" exceptions. The logs
>>>>>>>>> show warnings from "monitor.jvm" saying that GC did not free up
>>>>>>>>> much memory.
>>>>>>>>>
>>>>>>>>> The ultimate requirement is to import data into the ES cluster at
>>>>>>>>> around 60K docs per second on the setup described below. The only
>>>>>>>>> operation being performed is bulk import of documents. Soon the
>>>>>>>>> ES nodes become unresponsive and the CPU utilization drops to
>>>>>>>>> 100% (from 400-500%). They don't seem to recover even after the
>>>>>>>>> bulk import operations have ceased.
>>>>>>>>>
>>>>>>>>> Any suggestions on how to tune the GC for my requirements? What
>>>>>>>>> other information would be needed to look into this?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Srinath.
>>>>>>>>>
>>>>>>>>> The setup:
>>>>>>>>>   - Cluster: a 4-node cluster of c3.2xlarge instances on aws-ec2.
>>>>>>>>>   - Load: the only operation during this test is bulk import of
>>>>>>>>>     data. The documents are small, around ~200-500 bytes, and are
>>>>>>>>>     bulk imported into the cluster using storm.
>>>>>>>>>   - Bulk import: a total of 7-9 storm workers, each using a
>>>>>>>>>     single BulkProcessor to import data into the ES cluster. As
>>>>>>>>>     seen from the logs, each worker imports around 4K docs per
>>>>>>>>>     second, i.e. around 28-36K docs per second in total.
>>>>>>>>>   - JVM args: around 8G of heap; tried with the CMS collector as
>>>>>>>>>     well as the G1 collector.
>>>>>>>>>   - ES configuration:
>>>>>>>>>     - "mlockall": true
>>>>>>>>>     - "threadpool.bulk.size": 20
>>>>>>>>>     - "threadpool.bulk.queue_size": 500
>>>>>>>>>     - "indices.memory.index_buffer_size": "50%"
>>>>>>>>>     - "index.refresh_interval": "30s"
>>>>>>>>>     - "index.merge.policy.segments_per_tier": 100
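For reference, a minimal sketch of the kind of per-worker BulkProcessor
described above, assuming the ES 1.x Java client; the flush sizes are
illustrative, not taken from the thread. The listener counts item-level
rejections so the storm topology can throttle its spouts instead of
enlarging queue_size:

    import java.util.concurrent.atomic.AtomicLong;

    import org.elasticsearch.action.bulk.BulkItemResponse;
    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;

    public final class BulkImport {
        static final AtomicLong rejected = new AtomicLong();

        static BulkProcessor build(Client client) {
            return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                @Override
                public void beforeBulk(long id, BulkRequest request) {
                    // nothing to do before a bulk is sent
                }

                @Override
                public void afterBulk(long id, BulkRequest request,
                                      BulkResponse response) {
                    if (response.hasFailures()) {
                        // Item-level failures include EsRejectedExecutionException
                        // when a node's bulk thread pool queue is full.
                        for (BulkItemResponse item : response.getItems()) {
                            if (item.isFailed()) {
                                rejected.incrementAndGet();
                            }
                        }
                    }
                }

                @Override
                public void afterBulk(long id, BulkRequest request,
                                      Throwable failure) {
                    // Whole-request failure, e.g. NoNodeAvailableException.
                    failure.printStackTrace();
                }
            })
            .setBulkActions(5000)     // flush after 5000 docs (illustrative)
            .setConcurrentRequests(2) // at most 2 bulks in flight per worker
            .setFlushInterval(TimeValue.timeValueSeconds(5))
            .build();
        }
    }

Backing off when the rejected counter climbs keeps the cluster out of the
congestion Jörg describes, whereas a larger queue_size only postpones it.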