Re: Bulk load performance

2014-11-19 Thread xaviertrujillo111
Thank you Nick, I tried that, but I didn't see a noticeable performance 
improvement.

Also, I tried setting the number of replicas to "0", loading the data, and 
then setting it back to "5", but this causes problems with our health check 
scripts: because the index is very large, the shards seem to stay in 
"INITIALIZING" status forever.

Regards.
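
A minimal sketch of one way a health check script could wait for the replicas to finish initializing, assuming the cluster health API's wait_for_status and timeout parameters are used. The host, the index name "places" and the 10m timeout are hypothetical placeholders, not values from this thread.

```python
from urllib.parse import urlencode


def health_wait_url(base_url, index, status="green", timeout="10m"):
    """Build a cluster health URL that blocks until `index` reaches `status`
    or the timeout expires (the response then reports "timed_out": true)."""
    query = urlencode({"wait_for_status": status, "timeout": timeout})
    return f"{base_url}/_cluster/health/{index}?{query}"


# Hypothetical host and index name, for illustration only.
url = health_wait_url("http://localhost:9200", "places")
print("GET", url)
```

A script could issue this request after raising the replica count again and treat a timed-out response as "still recovering" rather than as a failure.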




-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/4d5bfe04-50a6-497a-8370-642fa0ed56ff%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Bulk load performance

2014-11-19 Thread Nick Canzoneri
On the index settings side, you can dynamically turn off the index
refresh_interval and also reduce the number of shard replicas for the
duration of the bulk import.

Described here:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk
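
As a rough sketch of the two settings updates described above, the following builds the request bodies for the index update settings API: one to apply before the bulk import and one to restore afterwards. The index name "places" is hypothetical, the restored refresh interval of "1s" is the Elasticsearch default, and the replica count of 5 is the value mentioned in the follow-up reply; adjust them to the real index.

```python
import json


def bulk_load_settings(replicas_after, refresh_after="1s"):
    """Return (before, after) bodies for PUT /<index>/_settings."""
    # Before the import: disable refresh and drop the replicas.
    before = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
    # After the import: restore refresh and the original replica count.
    after = {
        "index": {
            "refresh_interval": refresh_after,
            "number_of_replicas": replicas_after,
        }
    }
    return before, after


before, after = bulk_load_settings(replicas_after=5)
# "places" is a hypothetical index name.
print("PUT /places/_settings", json.dumps(before))
print("PUT /places/_settings", json.dumps(after))
```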




-- 
Nick Canzoneri
Developer, Wildbit
Beanstalk, Postmark, dploy.io



Bulk load performance

2014-11-18 Thread xaviertrujillo111
Hello,

I'm trying to bulk load ~10M JSON docs (12.8 GB) with some geographical 
information into an Elasticsearch index. With our current params, the load 
takes around 20-25 minutes to run, but we think it should be faster. Are 
these numbers similar to what other users are getting? Do you have any hints 
on how to get better performance? Any help will be appreciated. Please find 
the details below.

Our ES cluster is version 1.1.1 with 11 nodes, and we are using the 
Elasticsearch-MapReduce libraries 2.0.2 to do the bulk load, setting the 
number of reducers to 11. Other params we use are:

es.input.json=true
es.mapping.id=id
es.batch.size.bytes=10M
es.batch.size.entries=1

The average doc size is 1.3 KB, and each doc contains a "bbox" field with 
a shape definition like this:

"bbox": {
  "type": "envelope",
  "coordinates": [
    [
      -77.08488844489459,
      38.9502995339637
    ],
    [
      -77.0844224567727,
      38.9502305534064
    ]
  ]
}
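
To get a feel for how large such an envelope is relative to the "10m" precision in the mapping, here is a small sketch that estimates its width and height in meters. It assumes the coordinates are [longitude, latitude] pairs given as upper-left then lower-right corners, and it uses a simple spherical (equirectangular) approximation, so the result is only an estimate.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def envelope_size_m(coordinates):
    """Approximate (width, height) in meters of an envelope given as
    [[west, north], [east, south]] in degrees."""
    (west, north), (east, south) = coordinates
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180.0
    mid_lat = math.radians((north + south) / 2.0)
    # Longitude degrees shrink with the cosine of the latitude.
    width = (east - west) * meters_per_degree * math.cos(mid_lat)
    height = (north - south) * meters_per_degree
    return width, height


bbox = [
    [-77.08488844489459, 38.9502995339637],
    [-77.0844224567727, 38.9502305534064],
]
width_m, height_m = envelope_size_m(bbox)
print(f"width ~{width_m:.1f} m, height ~{height_m:.1f} m")
```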

We are using the following mapping for this index, because these are the 3 
fields of our docs we are most interested in:

{
  "properties": {
    "bbox": {
      "precision": "10m",
      "tree": "quadtree",
      "type": "geo_shape"
    },
    "id": {
      "type": "string",
      "index": "not_analyzed"
    },
    "streets": {
      "type": "string"
    }
  }
}

This is a typical output of the MapReduce job:

14/11/17 09:05:44 INFO mapred.JobClient:   Elasticsearch Hadoop Counters
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Retries=0
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Retries Total Time(ms)=0
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Total=1375
14/11/17 09:05:44 INFO mapred.JobClient: Bulk Total Time(ms)=11714959
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Accepted=14351811146
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Received=5498829
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Retried=0
14/11/17 09:05:44 INFO mapred.JobClient: Bytes Sent=14351811146
14/11/17 09:05:44 INFO mapred.JobClient: Documents Accepted=10129699
14/11/17 09:05:44 INFO mapred.JobClient: Documents Received=0
14/11/17 09:05:44 INFO mapred.JobClient: Documents Retried=0
14/11/17 09:05:44 INFO mapred.JobClient: Documents Sent=10129699
14/11/17 09:05:44 INFO mapred.JobClient: Network Retries=0
14/11/17 09:05:44 INFO mapred.JobClient: Network Total Time(ms)=11732552
14/11/17 09:05:44 INFO mapred.JobClient: Node Retries=0
14/11/17 09:05:44 INFO mapred.JobClient: Scroll Total=0
14/11/17 09:05:44 INFO mapred.JobClient: Scroll Total Time(ms)=0
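
The counters above can be turned into a few per-request averages. This is only a back-of-the-envelope sketch: the values are copied from the job output, and the per-reducer figure assumes that "Bulk Total Time(ms)" is summed across the 11 reducers, which the output itself does not confirm.

```python
# Values copied from the Elasticsearch Hadoop counters above.
bulk_total = 1375
bulk_total_time_ms = 11714959
bytes_sent = 14351811146
documents_sent = 10129699
reducers = 11  # from the job configuration described earlier

docs_per_bulk = documents_sent / bulk_total
bytes_per_bulk = bytes_sent / bulk_total
bytes_per_doc = bytes_sent / documents_sent
avg_bulk_ms = bulk_total_time_ms / bulk_total
# Assumption: the bulk time is the sum over all reducers.
minutes_per_reducer = bulk_total_time_ms / reducers / 60000

print(f"docs per bulk request:  {docs_per_bulk:.0f}")
print(f"bytes per bulk request: {bytes_per_bulk:.0f}")
print(f"bytes per doc:          {bytes_per_doc:.0f}")
print(f"avg time per bulk (ms): {avg_bulk_ms:.0f}")
print(f"bulk minutes/reducer:   {minutes_per_reducer:.1f}")
```

The bytes per bulk request come out close to the configured es.batch.size.bytes=10M, which suggests the byte limit, rather than the entry limit, is what triggers each flush.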

Thanks,
Xavier.
