Hi,

We have a 98-node Elasticsearch cluster, each node with 32 GB RAM; 16 GB is reserved for the ES heap via the config file. The index has 98 shards with 2 replicas.
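(For reference, we reserve the heap with the standard ES 1.x heap-size setting; the snippet below is illustrative rather than a copy of our exact config file.)

```shell
# Reserve half of each node's 32 GB RAM for the Elasticsearch heap (ES 1.x).
# On Windows the equivalent is: SET ES_HEAP_SIZE=16g
export ES_HEAP_SIZE=16g
```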
On this cluster we are loading a large number of documents (about 10 billion when done). In this use case about 40 million documents are generated per hour, and we are pre-loading several days' worth of documents to prototype how ES will scale and what its query performance looks like.

Right now we are blocked on getting the data loaded. Indexing is turned off. We use the NEST client with a batch size of 10k. To speed up the data load, we distribute the hourly data across the 98 nodes and insert in parallel. This worked fine for a few hours, until we reached about 4.5 billion documents in the cluster. After that the cluster state went red.

The pending-tasks CAT API shows errors like the ones below. CPU, disk, and memory all look fine on the nodes. Why are we getting these errors? Any help is greatly appreciated, since this blocks prototyping ES for our use case.

Thanks,
Darshat

Sample errors:

source: shard-failed ([agora_v1][24], node[00ihc1ToRiqMDJ1lou1Sig], [R], s[INITIALIZING]), reason [Failed to start shard, message [RecoveryFailedException[[agora_v1][24]: Recovery failed from [Shingen Harada][RDAwqX9yRgud9f7YtZAJPg][CH1SCH060051438][inet[/10.46.153.84:9300]] into [Elfqueen][00ihc1ToRiqMDJ1lou1Sig][CH1SCH050053435][inet[/10.46.182.106:9300]]]; nested: RemoteTransportException[[Shingen Harada][inet[/10.46.153.84:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[agora_v1][24] Phase[1] Execution failed]; nested: RecoverFilesRecoveryException[[agora_v1][24] Failed to transfer [0] files with total size of [0b]]; nested: NoSuchFileException[D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6r]; ]]

AND

source: shard-failed ([agora_v1][95], node[PUsHFCStRaecPA6MuvJV9g], [P], s[INITIALIZING]), reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[agora_v1][95] failed to fetch index version after copying it over]; nested: CorruptIndexException[[agora_v1][95] Preexisting corrupted index [corrupted_1wegvS7BSKSbOYQkX9zJSw] caused by: CorruptIndexException[Read past EOF while reading segment infos] EOFException[read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\95\index\segments_11j")]

org.apache.lucene.index.CorruptIndexException: Read past EOF while reading segment infos
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:127)
    at org.elasticsearch.index.store.Store.access$400(Store.java:80)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:575)
    ---snip more stack trace-----

--
View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Index-corruption-when-upload-large-number-of-documents-4billion-tp4068742.html
Sent from the ElasticSearch Users mailing list archive at Nabble.com.
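For context, each 10k batch that NEST sends corresponds to a _bulk request body like the following. This is an illustrative Python sketch of the wire format only, not our actual loader; the type name `event` and the field names are made up.

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build the newline-delimited body of an Elasticsearch _bulk request.

    Each document contributes two lines: an action line naming the target
    index/type, followed by the document source itself.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

# A toy 3-document batch; in our loader each batch holds 10,000 documents.
batch = [{"event_id": i, "payload": "sample"} for i in range(3)]
body = build_bulk_body("agora_v1", "event", batch)
print(body)
```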