Thank you, Nick. I tried that, but I didn't see a noticeable performance 
improvement.

Also, I tried setting the number of replicas to "0", loading the data, and 
then putting it back to "5", but this is causing problems with our health 
check scripts: because the index is very large, the shards seem to stay in 
"INITIALIZING" status forever.

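A minimal Python sketch of how a health check could tell "replicas still 
recovering" apart from a real outage. It is an illustration under 
assumptions, not the script referred to above: the host and index name are 
placeholders and the function names are made up. It relies only on the 
cluster health API and on the status, initializing_shards and 
unassigned_shards fields of its response.

```python
import json
from urllib.request import urlopen


def health_url(host, index, wait_for="green", timeout="60s"):
    """Build a cluster health URL scoped to one index (host/index are placeholders)."""
    return (f"http://{host}/_cluster/health/{index}"
            f"?wait_for_status={wait_for}&timeout={timeout}")


def summarize(health):
    """Reduce a parsed cluster health response to what a health check needs."""
    pending = health["initializing_shards"] + health["unassigned_shards"]
    return {
        "status": health["status"],
        "pending_shards": pending,
        # green or yellow means every primary shard is allocated;
        # yellow only says that some replicas are not active yet.
        "primaries_ok": health["status"] in ("green", "yellow"),
    }


def check(host, index):
    """Fetch and summarize health. Needs a reachable cluster."""
    with urlopen(health_url(host, index)) as resp:
        return summarize(json.load(resp))
```

While replicas go from 0 back to 5 on a large index, the status would be 
expected to stay yellow until recovery finishes, so a check that only 
accepts green will keep failing during that window; checking primaries_ok 
and watching pending_shards is one possible alternative.
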
Regards.



On Wednesday, November 19, 2014 7:47:10 AM UTC-8, Nick Canzoneri wrote:
>
> On the index settings side, you can dynamically turn off the index 
> refresh_interval and also reduce the number of shard replicas for the 
> duration of the bulk import.
>
> Described here: 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk
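
As a rough illustration of the two settings changes described at that link, 
sketched in Python with the standard library only. The host, index name and 
function names are placeholders introduced here; the request is the index 
update-settings API (PUT /<index>/_settings), and "1s" is the default 
refresh interval.

```python
import json
from urllib.request import Request, urlopen


def bulk_load_settings():
    """Settings to apply before the import: refresh disabled, no replicas."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(replicas, refresh_interval="1s"):
    """Settings to apply once the import has finished."""
    return {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}


def put_settings(host, index, settings):
    """Send the settings to PUT /<index>/_settings. Needs a reachable cluster."""
    req = Request(f"http://{host}/{index}/_settings",
                  data=json.dumps(settings).encode("utf-8"),
                  headers={"Content-Type": "application/json"},
                  method="PUT")
    with urlopen(req) as resp:
        return json.load(resp)
```

A load would then be bracketed by put_settings(host, index, 
bulk_load_settings()) before the job and put_settings(host, index, 
restore_settings(5)) after it.
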
>
> On Wed, Nov 19, 2014 at 2:53 AM, <xaviertr...@gmail.com> wrote:
>
>> Hello,
>>
>> I'm trying to do a bulk load of ~10M JSON docs (12.8 GB) with some 
>> geographical information into an Elasticsearch index. With our current 
>> params, the load takes around 20-25 minutes to run, but we think it 
>> should be faster. Are these numbers similar to what other users are 
>> getting? Do you have any hints on how to get better performance? Any help 
>> will be appreciated. Please find the details below.
>>
>> Our ES cluster is version 1.1.1 with 11 nodes, and we are using 
>> Elasticsearch-MapReduce libraries 2.0.2 to do the bulk load, setting the 
>> number of reducers to 11. Other params we use are:
>>
>> es.input.json=true
>> es.mapping.id=id
>> es.batch.size.bytes=10M
>> es.batch.size.entries=10000
>>
>> The average doc size is 1.3 KB, and each doc contains a "bbox" field with 
>> the shape definition like this:
>>
>> "bbox": {
>>     "type": "envelope",
>>     "coordinates": [
>>         [
>>             -77.08488844489459,
>>             38.9502995339637
>>         ],
>>         [
>>             -77.0844224567727,
>>             38.9502305534064
>>         ]
>>     ]
>> }
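
A small sketch of how such an envelope can be built from plain bounds. The 
helper name is hypothetical; what it encodes is that a geo_shape envelope 
lists the upper-left corner first and the lower-right corner second, each 
in [longitude, latitude] order, which matches the sample above.

```python
def envelope(min_lon, min_lat, max_lon, max_lat):
    """Return a geo_shape envelope as [upper-left, lower-right], in [lon, lat] order."""
    if min_lon > max_lon or min_lat > max_lat:
        raise ValueError("min bounds must not exceed max bounds")
    return {
        "type": "envelope",
        "coordinates": [[min_lon, max_lat], [max_lon, min_lat]],
    }


# Bounds taken from the sample document quoted in this thread.
bbox = envelope(-77.08488844489459, 38.9502305534064,
                -77.0844224567727, 38.9502995339637)
```
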
>>
>> We are using the following mapping for this index, because these are the 
>> 3 fields of our docs we are most interested in:
>>
>> {
>>     "properties": {
>>         "bbox": {
>>             "precision": "10m",
>>             "tree": "quadtree",
>>             "type": "geo_shape"
>>         },
>>         "id": {
>>           "type": "string",
>>           "index": "not_analyzed"
>>         },
>>         "streets": {
>>           "type": "string"
>>         }
>>     }
>> }
>>
>> This is a typical output of the MapReduce job:
>>
>> 14/11/17 09:05:44 INFO mapred.JobClient:   Elasticsearch Hadoop Counters
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Bulk Retries=0
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Bulk Retries Total Time(ms)=0
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Bulk Total=1375
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Bulk Total Time(ms)=11714959
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Bytes Accepted=14351811146
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Bytes Received=5498829
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Bytes Retried=0
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Bytes Sent=14351811146
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Documents Accepted=10129699
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Documents Received=0
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Documents Retried=0
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Documents Sent=10129699
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Network Retries=0
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Network Total Time(ms)=11732552
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Node Retries=0
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Scroll Total=0
>> 14/11/17 09:05:44 INFO mapred.JobClient:     Scroll Total Time(ms)=0
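
The counters above can be turned into per-request averages with a few lines 
of Python. This is only arithmetic on the numbers in the log; the dictionary 
keys and the function name are invented for the example.

```python
# Values copied from the job counters quoted above.
counters = {
    "bulk_total": 1375,
    "bulk_total_time_ms": 11714959,
    "bytes_sent": 14351811146,
    "docs_sent": 10129699,
}


def per_bulk_averages(c):
    """Average size and latency of one bulk request, plus average doc size."""
    n = c["bulk_total"]
    return {
        "docs_per_bulk": c["docs_sent"] / n,
        "bytes_per_bulk": c["bytes_sent"] / n,
        "ms_per_bulk": c["bulk_total_time_ms"] / n,
        "bytes_per_doc": c["bytes_sent"] / c["docs_sent"],
    }


averages = per_bulk_averages(counters)
```

This works out to roughly 7,367 docs and 10.4 million bytes per bulk 
request, about 8.5 seconds per request, and about 1,417 bytes per doc. The 
average request is well under 10000 entries but close to 10M bytes, which 
suggests es.batch.size.bytes rather than es.batch.size.entries is what 
triggers each flush. Assuming the bulk time counter is summed over the 11 
reducers, it corresponds to about 17.7 minutes per reducer, so most of the 
20-25 minute run appears to be spent waiting on bulk requests.
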
>>
>> Thanks,
>> Xavier.
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/70956234-78d0-4ee2-9536-398ac529b76a%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
> Nick Canzoneri
> Developer, Wildbit <http://wildbit.com/>
> Beanstalk <http://beanstalkapp.com/>, Postmark <http://postmarkapp.com/>, 
> dploy.io
>  

