Re: Bulk Indexing Problems

joergpra...@gmail.com Tue, 09 Sep 2014 10:56:31 -0700

You mentioned problems around 200,000 docs. What are these problems, and how
do you think you can fix them? What does your bulk indexing procedure look
like?


By fine-tuning I mean slimming down all ES settings to the absolute minimum
to slow down indexing and allocate fewer resources. But in your case, unless
you are tied to 512 MB, you really don't need to think about that.

Jörg

On Tue, Sep 9, 2014 at 7:28 PM, Joshua P <jpetersen...@gmail.com> wrote:

> Hi Jörg,
>
> Can you elaborate on what you mean by "more fine tuning"?
>
> I've upped the heap size to 4g (in both places I mentioned before because
> it's not clear to me which one ES actually uses). I haven't tried to index
> again yet.
> Other than throttling my indexing, what are some other things I need to be
> thinking about?
>
> On Tuesday, September 9, 2014 12:53:35 PM UTC-4, Jörg Prante wrote:
>>
>> Set ES_HEAP_SIZE to at least 1 GB. For smaller heaps like 512m and
>> indexing around 1 million docs, you need some more fine-tuning, which is
>> complicated. Your machine is fine with a 4 GB heap, which is 50% of its
>> 8 GB RAM.
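
The arithmetic behind that recommendation, as a small Python sketch. The 50% share and the 1 GB floor come from the advice above; the function names and the rounding down to whole gigabytes are assumptions made for illustration only.

```python
def suggested_heap_gb(ram_gb, share=0.5):
    """Return a heap size in whole GB as a share of total RAM."""
    return int(ram_gb * share)


def heap_setting(ram_gb):
    """Format the value the way ES_HEAP_SIZE expects it, e.g. '4g'."""
    heap = suggested_heap_gb(ram_gb)
    if heap < 1:
        # Below 1 GB the thread warns that extra tuning is needed.
        raise ValueError("less than 1 GB of heap; expect to need extra tuning")
    return "%dg" % heap


if __name__ == "__main__":
    # 8 GB RAM, as on the VM described in this thread.
    print("ES_HEAP_SIZE=" + heap_setting(8))
```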
>>
>> Jörg
>>
>> On Tue, Sep 9, 2014 at 5:39 PM, Joshua P <jpeter...@gmail.com> wrote:
>>
>>> Here is /etc/default/elasticsearch
>>>
>>> # Run Elasticsearch as this user ID and group ID
>>> #ES_USER=elasticsearch
>>> #ES_GROUP=elasticsearch
>>>
>>> # Heap Size (defaults to 256m min, 1g max)
>>> ES_HEAP_SIZE=512m
>>>
>>> # Heap new generation
>>> #ES_HEAP_NEWSIZE=
>>>
>>> # max direct memory
>>> #ES_DIRECT_SIZE=
>>>
>>> # Maximum number of open files, defaults to 65535.
>>> MAX_OPEN_FILES=65535
>>>
>>> # Maximum locked memory size. Set to "unlimited" if you use the
>>> # bootstrap.mlockall option in elasticsearch.yml. You must also set
>>> # ES_HEAP_SIZE.
>>> MAX_LOCKED_MEMORY=unlimited
>>>
>>> # Maximum number of VMA (Virtual Memory Areas) a process can own
>>> #MAX_MAP_COUNT=262144
>>>
>>> # Elasticsearch log directory
>>> #LOG_DIR=/var/log/elasticsearch
>>>
>>> # Elasticsearch data directory
>>> #DATA_DIR=/var/lib/elasticsearch
>>>
>>> # Elasticsearch work directory
>>> #WORK_DIR=/tmp/elasticsearch
>>>
>>> # Elasticsearch configuration directory
>>> #CONF_DIR=/etc/elasticsearch
>>>
>>> # Elasticsearch configuration file (elasticsearch.yml)
>>> #CONF_FILE=/etc/elasticsearch/elasticsearch.yml
>>>
>>> # Additional Java OPTS
>>> #ES_JAVA_OPTS=
>>>
>>> # Configure restart on package upgrade (true, every other setting will
>>> # lead to not restarting)
>>> #RESTART_ON_UPGRADE=true
>>>
>>> I also see the same setting in /etc/init.d/elasticsearch. Do you know
>>> which file takes priority? And what a good size would be?
>>>
>>> On Tuesday, September 9, 2014 11:32:19 AM UTC-4, vineeth mohan wrote:
>>>>
>>>> Hello Joshua,
>>>>
>>>> I am not sure which variable you are referring to in the memory
>>>> settings in the config file; please paste the comment and config.
>>>> I usually change the config in the init.d script.
>>>>
>>>> The best approach would be to bulk index, say, 10,000 feeds in sync mode,
>>>> wait until everything is indexed, and then proceed to the next batch.
>>>> I am not sure about the Java API, but a long time ago I used to curl the
>>>> stats API and see how many requests were rejected.
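
A minimal sketch of the batching idea above, written in Python rather than with the Java client discussed in this thread. It splits records into batches of 10,000 and builds the newline-delimited body that the `_bulk` endpoint expects. The index name, type name, and record fields are made-up placeholders, and actually sending each body and checking the response before moving on is left out.

```python
import json
from itertools import islice


def batches(records, size=10000):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(chunk, index="transactions", doc_type="record"):
    """Build a newline-delimited _bulk request body for one batch.

    `index` and `doc_type` are hypothetical names, not taken from the thread.
    """
    lines = []
    for doc in chunk:
        # Action line first, then the document source on its own line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    records = ({"id": i} for i in range(25000))
    for n, chunk in enumerate(batches(records), start=1):
        body = bulk_body(chunk)
        # Send `body` to /_bulk here and wait for the response
        # before building the next batch.
        print(n, len(chunk), len(body.splitlines()))
```

Waiting for each response before sending the next batch is what keeps the client from outrunning the server's queue.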
>>>>
>>>> Thanks
>>>>           Vineeth
>>>>
>>>> On Tue, Sep 9, 2014 at 8:58 PM, Joshua P <jpeter...@gmail.com> wrote:
>>>>
>>>>> You also said you wouldn't recommend indexing that much information at
>>>>> once. How would you suggest breaking it up, and what status should I look
>>>>> for before doing another batch? I have to come up with some process that
>>>>> is repeatable and mostly automated.
>>>>>
>>>>> On Tuesday, September 9, 2014 11:12:59 AM UTC-4, Joshua P wrote:
>>>>>>
>>>>>> Thanks for the reply, Vineeth!
>>>>>>
>>>>>> What's a practical heap size? I've seen some people saying they set
>>>>>> it to 30gb but this confuses me because in the /etc/default/elasticsearch
>>>>>> file, the comment suggests the max is only 1gb?
>>>>>>
>>>>>> I'll look into the threadpool issue. Is there a Java API for
>>>>>> monitoring Cluster Node health? Can you point me at an example or give
>>>>>> me a link to that?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Tuesday, September 9, 2014 10:52:35 AM UTC-4, vineeth mohan wrote:
>>>>>>>
>>>>>>> Hello Joshua,
>>>>>>>
>>>>>>> I have a feeling this has something to do with the threadpool.
>>>>>>> There is a limit on the number of feeds that can be queued for indexing.
>>>>>>>
>>>>>>> Try increasing the queue size of the index and bulk threadpools to a
>>>>>>> large number.
>>>>>>> Also, through the cluster nodes stats API for the threadpool, you can
>>>>>>> see if any request has failed.
>>>>>>> Monitor this API for any requests that fail due to large volume.
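
To make the monitoring suggestion concrete, here is a rough Python sketch that adds up the `rejected` counters per thread pool from a node stats response, for example the JSON returned by `curl 'http://localhost:9200/_nodes/stats/thread_pool'`. The sample payload below is invented for illustration and trimmed to the fields the sketch reads; a real response carries many more.

```python
import json


def rejected_counts(stats, pools=("index", "bulk")):
    """Sum the `rejected` counter per thread pool across all nodes."""
    totals = {pool: 0 for pool in pools}
    for node in stats.get("nodes", {}).values():
        thread_pool = node.get("thread_pool", {})
        for pool in pools:
            totals[pool] += thread_pool.get(pool, {}).get("rejected", 0)
    return totals


# Invented sample, trimmed to the fields this sketch reads.
SAMPLE = json.loads("""
{
  "cluster_name": "property_transaction_data",
  "nodes": {
    "KlFkO_qgSOKmV_jjj5xeVw": {
      "thread_pool": {
        "index": {"queue": 0, "rejected": 0},
        "bulk": {"queue": 50, "rejected": 1200}
      }
    }
  }
}
""")

if __name__ == "__main__":
    print(rejected_counts(SAMPLE))
```

A counter that keeps growing between batches would suggest the queue is overflowing and the client should slow down.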
>>>>>>>
>>>>>>> Threadpool - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-threadpool.html
>>>>>>> Threadpool stats - http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html
>>>>>>>
>>>>>>> Having said that, I wouldn't recommend bulk indexing that much
>>>>>>> information at a time, and 512 MB is not going to help much.
>>>>>>>
>>>>>>> Thanks
>>>>>>>           Vineeth
>>>>>>>
>>>>>>> On Tue, Sep 9, 2014 at 7:48 PM, Joshua P <jpeter...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi there!
>>>>>>>>
>>>>>>>> I'm trying to do a one-time index of about 800,000 records into an
>>>>>>>> instance of Elasticsearch, but I'm having a bit of trouble. It
>>>>>>>> continually fails around 200,000 records. Looking at it in the
>>>>>>>> Elasticsearch Head plugin, my index goes offline and becomes
>>>>>>>> unrecoverable.
>>>>>>>>
>>>>>>>> For now, I have it running on a VM on my personal machine.
>>>>>>>>
>>>>>>>> VM Config:
>>>>>>>> Ubuntu Server 14.04 64-Bit
>>>>>>>> 8 GB RAM
>>>>>>>> 2 Processors
>>>>>>>> 32 GB SSD
>>>>>>>>
>>>>>>>> Java
>>>>>>>> java version "1.7.0_65"
>>>>>>>> OpenJDK Runtime Environment (IcedTea 2.5.1) (7u65-2.5.1-4ubuntu1~0.14.04.2)
>>>>>>>> OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
>>>>>>>>
>>>>>>>> Elasticsearch is using mostly the defaults. This is the output of:
>>>>>>>> curl http://localhost:9200/_nodes/process?pretty
>>>>>>>> {
>>>>>>>>   "cluster_name" : "property_transaction_data",
>>>>>>>>   "nodes" : {
>>>>>>>>     "KlFkO_qgSOKmV_jjj5xeVw" : {
>>>>>>>>       "name" : "Marvin Flumm",
>>>>>>>>       "transport_address" : "inet[/192.168.133.131:9300]",
>>>>>>>>       "host" : "ubuntu-es",
>>>>>>>>       "ip" : "127.0.1.1",
>>>>>>>>       "version" : "1.3.2",
>>>>>>>>       "build" : "dee175d",
>>>>>>>>       "http_address" : "inet[/192.168.133.131:9200]",
>>>>>>>>       "process" : {
>>>>>>>>         "refresh_interval_in_millis" : 1000,
>>>>>>>>         "id" : 1092,
>>>>>>>>         "max_file_descriptors" : 65535,
>>>>>>>>         "mlockall" : true
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> I adjusted ES_HEAP_SIZE to 512mb.
>>>>>>>>
>>>>>>>> I'm using the following code to pull data from SQL Server and index
>>>>>>>> it.
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "elasticsearch" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to elasticsearc...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/elasticsearch/f94f96d4-8c3f-462f-bdcf-df717cbc6269%40googlegroups.com.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>

