So an update on this. I set aside 7 dedicated master-eligible nodes and ensured the data-upload connection pool round-robins only over the data nodes. I also dropped the batch size to 5k. With this, an upload of about 6 billion documents (5 TB of data) went through in about 12 hours.
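For anyone trying to reproduce this, here is a minimal sketch of what the 5k batching could look like when building bodies for the bulk API. The index name, document shape, and helper name are all made up for illustration; the actual HTTP client and node round-robin are left out.

```python
import json

def bulk_batches(docs, index, batch_size=5000):
    """Yield NDJSON bodies for the Elasticsearch _bulk API,
    at most batch_size documents per body."""
    batch = []
    for doc in docs:
        # Each document needs an action line followed by the source line.
        batch.append(json.dumps({"index": {"_index": index}}))
        batch.append(json.dumps(doc))
        if len(batch) == 2 * batch_size:
            yield "\n".join(batch) + "\n"  # _bulk bodies must end in a newline
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"

# Example: 12,500 fake docs split into batches of 5000, 5000, and 2500.
docs = ({"value": i} for i in range(12500))
bodies = list(bulk_batches(docs, "metrics"))
```

Each yielded body would then be POSTed to `/_bulk` on one of the data nodes in turn.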
Additionally, on the Windows platform there is an issue with setting the heap size via ES_HEAP_SIZE in PowerShell. I noticed frequent OOMs on queries, and it turns out the setting somehow hadn't taken effect during installation via my PowerShell scripts. I went and fixed this on each node, and the cluster now looks much more stable.

Aggregation times are still a little disappointing (around 20s for our scenarios), so we don't yet meet our internal target of getting below the 5s mark. I'm going to experiment with turning some nodes off to see how the node count affects aggregation times.

On Monday, January 12, 2015 at 3:33:10 PM UTC+5:30, Darshat wrote:
> Hi Mark,
> Thanks for the reply. I need to prototype and demonstrate at this scale
> first to ensure feasibility. Once we've proven ES works for this use case,
> then it's quite possible that we'd engage with support for production.
>
> Regarding your questions:
>
> > What version of ES, java?
> [Darshat] ES 1.4.1, JVM 1.8.0 (latest I found from Oracle for Win64)
>
> > What are you using to monitor your cluster?
> Not much, really. I tried installing Marvel after we ran into issues, but
> mostly I'm looking at the cat APIs and indices APIs.
>
> > How many GB is that index?
> About 1 GB for every million entries. We don't have string fields, but many
> numeric fields on which aggregations are needed. With 4.5 billion docs, the
> total index size was about 4.5 TB spread over the 98 nodes.
>
> > Is it in one massive index?
> Yes. We need aggregations over this data across multiple fields.
>
> > How many GB is your data in total?
> It can be on the order of 30 TB.
>
> > Why do you have 2 replicas?
> Data loss is generally considered not OK. However, for our next run, as you
> suggest, we will start with 0 replicas and update the count after the bulk
> load is done.
>
> > Are you searching while indexing, or just indexing the data?
> > If it's the latter, then you might want to try disabling replicas, setting
> > the index refresh rate to -1 for the index, inserting your data, and then
> > turning refresh back on and letting the data index. That's best practice
> > for large amounts of indexing.
>
> Just indexing to get historical data up there. We did set the refresh to -1
> before we began the upload.
>
> > Also, consider dropping your bulk size down to 5K; that's generally
> > considered the upper limit for bulk API batches.
>
> We are going to attempt another upload with these changes. I also set the
> index_buffer_size to 40% in case it helps.
>
> --
> View this message in context:
> http://elasticsearch-users.115913.n3.nabble.com/corruption-when-indexing-large-number-of-documents-4-billion-tp4068743p4068787.html
> Sent from the ElasticSearch Users mailing list archive at Nabble.com.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/956440a4-00f1-4027-a9cb-11f2d0a5de47%40googlegroups.com.
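The disable-replicas-and-refresh sequence discussed above can be sketched as two settings payloads for the index settings update API. The index name "metrics" is made up, and the restored values (2 replicas, 1s refresh) mirror the thread's setup but are assumptions; adjust for your own cluster.

```python
import json

# Before a large bulk load: no replicas, refresh disabled.
BULK_LOAD_SETTINGS = {"index": {"number_of_replicas": 0,
                                "refresh_interval": "-1"}}

# After the load: restore replicas and a normal refresh interval.
# The replica count of 2 matches the original setup in this thread.
RESTORE_SETTINGS = {"index": {"number_of_replicas": 2,
                              "refresh_interval": "1s"}}

def settings_request(index, settings):
    """Return (method, path, body) for a PUT to the index settings API."""
    return ("PUT", f"/{index}/_settings", json.dumps(settings))

method, path, body = settings_request("metrics", BULK_LOAD_SETTINGS)
```

Sending the first payload before the upload and the second after it (followed by a forced refresh if needed) avoids paying replication and refresh costs during the bulk insert.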