Your original error was around a YARN container being destroyed. My guess right now would be that this is due to memory pressure in Hadoop. I would look to increase the heap size and/or the number of reducers in Hadoop.
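A container is typically killed when it exceeds its YARN memory limit, so a quick first test is to raise the container size and the JVM heap from within the Pig script. A minimal sketch, assuming Hadoop 2.x property names; the values here are illustrative starting points, not tuned for your cluster:

```
-- Give each mapper/reducer more room before YARN kills the container.
-- The container size (memory.mb) must exceed the JVM heap (-Xmx)
-- by some headroom for off-heap usage.
SET mapreduce.map.memory.mb 4096;
SET mapreduce.map.java.opts '-Xmx3276m';
SET mapreduce.reduce.memory.mb 4096;
SET mapreduce.reduce.java.opts '-Xmx3276m';
```

These can also be set cluster-wide in mapred-site.xml, but setting them per script lets you experiment without touching the cluster config.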
MapReduce is not known for being the fastest thing on the planet, to be honest; there is a lot of overhead. It works nicely in batch mode over a large dataset where you want distributed compute, but over a smallish dataset it can seem laggy.

Allan

On 23 May 2015 at 20:28, Sudhir Rao <ysud...@gmail.com> wrote:

> Here is the indexing performance I see: it takes 10 minutes 29 seconds
> to finish indexing 626K records using mapreduce (Pig). Is this the
> expected performance for a 4-node Elasticsearch cluster?
>
> Output(s):
>
> Successfully stored 626283 records in: "index1/raw_data"
>
> Counters:
>
> Total records written : 626283
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled : 0
> Total records proactively spilled : 0
>
> On Saturday, May 23, 2015 at 12:24:50 PM UTC-7, Sudhir Rao wrote:
>>
>> I see the following in the Elasticsearch logs:
>>
>> stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
>>
>> The indexing, however, runs for a few million records before all the
>> mappers fail - please see the attached error screenshot.
>>
>> On Thursday, May 21, 2015 at 12:46:29 AM UTC-7, Allan Mitchell wrote:
>>>
>>> Hi
>>>
>>> The error is a Grunt error, which suggests Pig is throwing it, not ES.
>>> What do the Pig logs say? What makes you think ES is the issue?
>>>
>>> I know it works with smaller data, but that also means Pig works with
>>> smaller data, not just ES.
>>>
>>> Allan
>>>
>>> On 21 May 2015 at 01:34, Sudhir Rao <ysu...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a 4-node ES cluster running:
>>>>
>>>> Elasticsearch : 1.5.2
>>>> OS : RHEL 6.x
>>>> Java : 1.7
>>>> CPU : 16 cores
>>>> 2 machines : 60 GB RAM, 10 TB disk
>>>> 2 machines : 120 GB RAM, 5 TB disk
>>>>
>>>> I also have a 500-node Hadoop cluster and am trying to index data
>>>> from Hadoop which is in Avro format:
>>>>
>>>> Daily size : 1.2 TB
>>>> Hourly size : 40-60 GB
>>>>
>>>> elasticsearch.yml config
>>>> ==================
>>>>
>>>> cluster.name: zebra
>>>> index.mapping.ignore_malformed: true
>>>> index.merge.scheduler.max_thread_count: 1
>>>> index.store.throttle.type: none
>>>> index.refresh_interval: -1
>>>> index.translog.flush_threshold_size: 1024000000
>>>> discovery.zen.ping.unicast.hosts: ["node1","node2","node3","node4"]
>>>> path.data: /hadoop01/es,/hadoop02/es,/hadoop03/es,/hadoop04/es,/hadoop05/es,/hadoop06/es,/hadoop07/es,/hadoop08/es,/hadoop09/es,/hadoop10/es,/hadoop11/es,/hadoop12/es
>>>> bootstrap.mlockall: true
>>>> indices.memory.index_buffer_size: 30%
>>>> index.translog.flush_threshold_ops: 50000
>>>> index.store.type: mmapfs
>>>>
>>>> Cluster Settings
>>>> ============
>>>>
>>>> $ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
>>>> {
>>>>   "cluster_name" : "zebra",
>>>>   "status" : "green",
>>>>   "timed_out" : false,
>>>>   "number_of_nodes" : 4,
>>>>   "number_of_data_nodes" : 4,
>>>>   "active_primary_shards" : 21,
>>>>   "active_shards" : 22,
>>>>   "relocating_shards" : 0,
>>>>   "initializing_shards" : 0,
>>>>   "unassigned_shards" : 0,
>>>>   "number_of_pending_tasks" : 0
>>>> }
>>>>
>>>> Pig Script:
>>>> ========
>>>>
>>>> avro_data = LOAD '$INPUT_PATH' USING AvroStorage();
>>>>
>>>> temp_projection = FOREACH avro_data GENERATE
>>>>     our.own.udf.ToJsonString(headers, data) AS data;
>>>>
>>>> STORE temp_projection INTO 'fpti/raw_data' USING
>>>>     org.elasticsearch.hadoop.pig.EsStorage(
>>>>         'es.resource = fpti/raw_data',
>>>>         'es.input.json = true',
>>>>         'es.nodes = node1,node2,node3,node4',
>>>>         'mapreduce.map.speculative = false',
>>>>         'mapreduce.reduce.speculative = false',
>>>>         'es.batch.size.bytes = 512mb',
>>>>         'es.batch.size.entries = 1');
>>>>
>>>> When I run the above, there are around 300 mappers; none of them
>>>> complete, and every time the job fails with the below error. Some
>>>> documents do get indexed, though.
>>>>
>>>> *Error:*
>>>>
>>>> *2015-05-20 15:40:20,618 [main] ERROR
>>>> org.apache.pig.tools.grunt.GruntParser - ERROR 2999: Unexpected internal
>>>> error. Could not write all entries [1/8448] (maybe ES was overloaded?).
>>>> Bailing out...*
>>>>
>>>> The job, however, finishes when the data size is a few thousand records.
>>>>
>>>> Please let me know what else I can do to increase my indexing throughput.
>>>>
>>>> regards
>>>>
>>>> #sudhir
>>>>
>>>> --
>>>> Please update your bookmarks! We have moved to
>>>> https://discuss.elastic.co/
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to elasticsearc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/6312a8b6-bde7-40d6-bbf0-8b3fccf7cd12%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
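For reference, the "Could not write all entries" failure usually means the connector's bulk-write retries were exhausted because ES kept rejecting requests. With ~300 mappers writing concurrently into 4 nodes, gentler bulk settings and more patient retries are worth trying. A sketch using standard elasticsearch-hadoop settings; the index name matches the script above, but the specific values are illustrative, not tuned:

```
STORE temp_projection INTO 'fpti/raw_data' USING
    org.elasticsearch.hadoop.pig.EsStorage(
        'es.resource = fpti/raw_data',
        'es.input.json = true',
        'es.nodes = node1,node2,node3,node4',
        -- moderate bulk requests: flush at ~1000 docs or 1mb,
        -- whichever comes first (per writer task)
        'es.batch.size.entries = 1000',
        'es.batch.size.bytes = 1mb',
        -- retry rejected bulk documents instead of bailing out quickly
        'es.batch.write.retry.count = 6',
        'es.batch.write.retry.wait = 30s');
```

Note that the batch settings apply per task, so total load on the cluster scales with the number of concurrent mappers; capping the number of map tasks writing to ES at once can matter as much as the batch sizes themselves.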