Hi, here is the full stack trace (taken from the pending/outstanding tasks API). We are using ES 1.4.1.
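For anyone wanting to pull the same output themselves: the entries below are the flat `field : value` dump that the cluster pending-tasks endpoint returns. A rough, offline sketch of splitting that dump back into per-task records (plain Python, no client library; `parse_pending_tasks` is just an illustrative helper name, not an API):

```python
import re

def parse_pending_tasks(text):
    """Split a flat 'field : value' pending-tasks dump (like the one
    pasted below) into one dict per task entry. Each entry begins at
    an 'insert_order' field."""
    tasks = []
    for chunk in re.split(r"(?=insert_order\s*:)", text):
        chunk = chunk.strip()
        if not chunk:
            continue
        # Simple scalar fields: one token after the colon.
        fields = dict(re.findall(
            r"(insert_order|priority|executing|time_in_queue_millis|time_in_queue)\s*:\s*(\S+)",
            chunk))
        # 'source' can span many words (it embeds the whole failure reason),
        # so capture everything up to the following 'executing :' field.
        m = re.search(r"source\s*:\s*(.*?)(?=\s*executing\s*:)", chunk, re.S)
        if m:
            fields["source"] = m.group(1).strip()
        if fields:
            tasks.append(fields)
    return tasks
```

This only restructures the text for readability; it does not interpret the failure reasons.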
insert_order : 69862
priority : HIGH
source : shard-failed ([agora_v1][24], node[SEIBtFznTtGpLFPgCLgW4w], [R], s[INITIALIZING]), reason [Failed to start shard, message [CorruptIndexException[[agora_v1][24] Preexisting corrupted index [corrupted_LrKHKRF7Q2KuL15TT_hPvw] caused by: CorruptIndexException[Read past EOF while reading segment infos] EOFException[read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6w")]
org.apache.lucene.index.CorruptIndexException: Read past EOF while reading segment infos
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:127)
    at org.elasticsearch.index.store.Store.access$400(Store.java:80)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:575)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:568)
    at org.elasticsearch.index.store.Store.getMetadata(Store.java:186)
    at org.elasticsearch.index.store.Store.getMetadataOrEmpty(Store.java:150)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:152)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:138)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:59)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:278)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:269)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6w")
    at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:81)
    at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:85)
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:124)
    ...
14 more ]]]
executing : True
time_in_queue_millis : 52865
time_in_queue : 52.8s

insert_order : 69863
priority : HIGH
source : shard-failed ([agora_v1][24], node[SEIBtFznTtGpLFPgCLgW4w], [R], s[INITIALIZING]), reason [engine failure, message [corrupted preexisting index][CorruptIndexException[[agora_v1][24] Preexisting corrupted index [corrupted_LrKHKRF7Q2KuL15TT_hPvw] caused by: CorruptIndexException[Read past EOF while reading segment infos] EOFException[read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6w")]
org.apache.lucene.index.CorruptIndexException: Read past EOF while reading segment infos
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:127)
    at org.elasticsearch.index.store.Store.access$400(Store.java:80)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:575)
    at org.elasticsearch.index.store.Store$MetadataSnapshot.<init>(Store.java:568)
    at org.elasticsearch.index.store.Store.getMetadata(Store.java:186)
    at org.elasticsearch.index.store.Store.getMetadataOrEmpty(Store.java:150)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.listStoreMetaData(TransportNodesListShardStoreMetaData.java:152)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:138)
    at org.elasticsearch.indices.store.TransportNodesListShardStoreMetaData.nodeOperation(TransportNodesListShardStoreMetaData.java:59)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:278)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:269)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6w")
    at org.apache.lucene.store.ByteBufferIndexInput.readByte(ByteBufferIndexInput.java:81)
    at org.apache.lucene.store.BufferedChecksumIndexInput.readByte(BufferedChecksumIndexInput.java:41)
    at org.apache.lucene.store.DataInput.readInt(DataInput.java:98)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:454)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:906)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:752)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:450)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:85)
    at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:124)
    ...
14 more ]]]
executing : False
time_in_queue_millis : 52862
time_in_queue : 52.8s

insert_order : 69865
priority : HIGH
source : shard-failed ([kibana-int][88], node[adjp-WHHSP6kWEiPd3HkeQ], [R], s[INITIALIZING]), reason [Failed to start shard, message [RecoveryFailedException[[kibana-int][88]: Recovery failed from [Quasimodo][spfLOfnjTeiGwrYPMIiRjg][CH1SCH060021734][inet[/10.46.208.169:9300]] into [Hyperion][adjp-WHHSP6kWEiPd3HkeQ][CH1SCH050051642][inet[/10.46.216.169:9300]]]; nested: RemoteTransportException[[Quasimodo][inet[/10.46.208.169:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[kibana-int][88] Phase[1] Execution failed]; nested: RecoverFilesRecoveryException[[kibana-int][88] Failed to transfer [0] files with total size of [0b]]; nested: NoSuchFileException[D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\kibana-int\88\index\segments_2]; ]]
executing : False
time_in_queue_millis : 52860
time_in_queue : 52.8s

On Friday, January 9, 2015 at 5:50:44 PM UTC+5:30, Robert Muir wrote:
>
> Why did you snip the stack trace? Can you provide all the information?
>
> On Thu, Jan 8, 2015 at 10:37 PM, Darshat <dar...@outlook.com> wrote:
> > Hi,
> > We have a 98-node ES cluster where each node has 32GB RAM; 16GB is reserved for ES via the config file. The index has 98 shards with 2 replicas.
> >
> > On this cluster we are loading a large number of documents (about 10 billion when done). In this use case about 40 million documents are generated per hour, and we are pre-loading several days' worth of documents to prototype how ES will scale and what its query performance will be.
> >
> > Right now we are facing problems getting the data loaded. Indexing is turned off. We use the NEST client with a batch size of 10k. To speed up the data load, we distribute the hourly data to each of the 98 nodes to insert in parallel.
> > This worked OK for a few hours, until we reached 4.5B documents in the cluster.
> >
> > After that the cluster state went red. The outstanding tasks CAT API shows errors like the ones below. CPU/disk/memory all seem fine on the nodes.
> >
> > Why are we getting these errors? Any help is greatly appreciated, since this blocks prototyping ES for our use case.
> >
> > thanks
> > Darshat
> >
> > Sample errors:
> >
> > source : shard-failed ([agora_v1][24], node[00ihc1ToRiqMDJ1lou1Sig], [R], s[INITIALIZING]), reason [Failed to start shard, message [RecoveryFailedException[[agora_v1][24]: Recovery failed from [Shingen Harada][RDAwqX9yRgud9f7YtZAJPg][CH1SCH060051438][inet[/10.46.153.84:9300]] into [Elfqueen][00ihc1ToRiqMDJ1lou1Sig][CH1SCH050053435][inet[/10.46.182.106:9300]]]; nested: RemoteTransportException[[Shingen Harada][inet[/10.46.153.84:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[agora_v1][24] Phase[1] Execution failed]; nested: RecoverFilesRecoveryException[[agora_v1][24] Failed to transfer [0] files with total size of [0b]]; nested: NoSuchFileException[D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\24\index\segments_6r]; ]]
> >
> > AND
> >
> > source : shard-failed ([agora_v1][95], node[PUsHFCStRaecPA6MuvJV9g], [P], s[INITIALIZING]), reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[agora_v1][95] failed to fetch index version after copying it over]; nested: CorruptIndexException[[agora_v1][95] Preexisting corrupted index [corrupted_1wegvS7BSKSbOYQkX9zJSw] caused by: CorruptIndexException[Read past EOF while reading segment infos] EOFException[read past EOF: MMapIndexInput(path="D:\app\ES.ElasticSearch_v010\elasticsearch-1.4.1\data\AP-elasticsearch\nodes\0\indices\agora_v1\95\index\segments_11j")]
> > org.apache.lucene.index.CorruptIndexException: Read past EOF while reading segment infos
> >     at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:127)
> >     at org.elasticsearch.index.store.Store.access$400(Store.java:80)
> >     at org.elasticsearch.index.store.Store$MetadataSnapshot.buildMetadata(Store.java:575)
> > ---snip more stack trace-----
> >
> > --
> > View this message in context: http://elasticsearch-users.115913.n3.nabble.com/Index-corruption-when-upload-large-number-of-documents-4billion-tp4068742.html
> > Sent from the ElasticSearch Users mailing list archive at Nabble.com.
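For reference, the load pattern described in the quoted message (NEST client, 10k-document batches fanned out across the 98 nodes) comes down to posting newline-delimited `_bulk` bodies. A minimal Python sketch of building such bodies (the `agora_v1`/`doc` names here are illustrative; this only constructs payloads and does not send them):

```python
import json

def bulk_batches(docs, index, doc_type, batch_size=10000):
    """Yield newline-delimited JSON bodies for the _bulk API, one body per
    batch_size documents. Each document contributes two lines: an action
    line ({"index": ...}) followed by the document source."""
    batch = []
    for doc in docs:
        batch.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        batch.append(json.dumps(doc))
        if len(batch) >= 2 * batch_size:
            yield "\n".join(batch) + "\n"  # _bulk bodies must end with a newline
            batch = []
    if batch:
        yield "\n".join(batch) + "\n"
```

Smaller batch sizes (and throttling the fan-out) are often the first knob to turn when a heavy parallel load starts destabilizing a cluster; the sketch keeps the batch size as a parameter for that reason.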