Your original error was around a YARN container being destroyed. My guess right now would be that this is due to memory pressure in Hadoop. I would look to increase the heap size and/or the number of reducers in Hadoop.
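A container is typically killed when it exceeds its YARN memory limit, so a quick first test is to raise the container size and the JVM heap from within the Pig script. A minimal sketch, assuming Hadoop 2.x property names; the values here are illustrative starting points, not tuned for your cluster:

```
-- Give each mapper/reducer more room before YARN kills the container.
-- The container size (memory.mb) must exceed the JVM heap (-Xmx)
-- by some headroom for off-heap usage.
SET mapreduce.map.memory.mb 4096;
SET mapreduce.map.java.opts '-Xmx3276m';
SET mapreduce.reduce.memory.mb 4096;
SET mapreduce.reduce.java.opts '-Xmx3276m';
```

These can also be set cluster-wide in mapred-site.xml, but setting them per script lets you experiment without touching the cluster config.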
MapReduce is not known for being the fastest thing on the planet, to be honest; there is a lot of overhead. It works nicely in batch mode over a large dataset where you want distributed compute, but over a smallish dataset it can seem laggy.

Allan

On 23 May 2015 at 20:28, Sudhir Rao <ysud...@gmail.com> wrote:

> Here is the indexing performance I see: it takes 10 minutes 29 seconds
> to finish indexing 626K records using mapreduce (Pig). Is this the
> expected performance for a 4-node Elasticsearch cluster?
>
> Output(s):
>
> Successfully stored 626283 records in: "index1/raw_data"
>
> Counters:
>
> Total records written : 626283
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled : 0
> Total records proactively spilled : 0
>
> On Saturday, May 23, 2015 at 12:24:50 PM UTC-7, Sudhir Rao wrote:
>>
>> I see the following in the Elasticsearch logs:
>>
>> stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
>>
>> The indexing, however, runs for a few million records before all the
>> mappers fail - please see the attached error screenshot.
>>
>> On Thursday, May 21, 2015 at 12:46:29 AM UTC-7, Allan Mitchell wrote:
>>>
>>> Hi
>>>
>>> The error is a Grunt error, which suggests Pig is throwing it, not ES.
>>> What do the Pig logs say? What makes you think ES is the issue?
>>>
>>> I know it works with smaller data, but that also means Pig works with
>>> smaller data, not just ES.
>>>
>>> Allan
>>>
>>> On 21 May 2015 at 01:34, Sudhir Rao <ysu...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a 4-node ES cluster running:
>>>>
>>>> Elasticsearch : 1.5.2
>>>> OS : RHEL 6.x
>>>> Java : 1.7
>>>> CPU : 16 cores
>>>> 2 machines : 60 GB RAM, 10 TB disk
>>>> 2 machines : 120 GB RAM, 5 TB disk
>>>>
>>>> I also have a 500-node Hadoop cluster and am trying to index data
>>>> from Hadoop which is in Avro format:
>>>>
>>>> Daily size : 1.2 TB
>>>> Hourly size : 40-60 GB
>>>>
>>>> elasticsearch.yml config
>>>> ==================
>>>>
>>>> cluster.name: zebra
>>>> index.mapping.ignore_malformed: true
>>>> index.merge.scheduler.max_thread_count: 1
>>>> index.store.throttle.type: none
>>>> index.refresh_interval: -1
>>>> index.translog.flush_threshold_size: 1024000000
>>>> discovery.zen.ping.unicast.hosts: ["node1","node2","node3","node4"]
>>>> path.data: /hadoop01/es,/hadoop02/es,/hadoop03/es,/hadoop04/es,/hadoop05/es,/hadoop06/es,/hadoop07/es,/hadoop08/es,/hadoop09/es,/hadoop10/es,/hadoop11/es,/hadoop12/es
>>>> bootstrap.mlockall: true
>>>> indices.memory.index_buffer_size: 30%
>>>> index.translog.flush_threshold_ops: 50000
>>>> index.store.type: mmapfs
>>>>
>>>> Cluster Settings
>>>> ============
>>>>
>>>> $ curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
>>>> {
>>>>   "cluster_name" : "zebra",
>>>>   "status" : "green",
>>>>   "timed_out" : false,
>>>>   "number_of_nodes" : 4,
>>>>   "number_of_data_nodes" : 4,
>>>>   "active_primary_shards" : 21,
>>>>   "active_shards" : 22,
>>>>   "relocating_shards" : 0,
>>>>   "initializing_shards" : 0,
>>>>   "unassigned_shards" : 0,
>>>>   "number_of_pending_tasks" : 0
>>>> }
>>>>
>>>> Pig Script:
>>>> ========
>>>>
>>>> avro_data = LOAD '$INPUT_PATH' USING AvroStorage();
>>>>
>>>> temp_projection = FOREACH avro_data GENERATE
>>>>     our.own.udf.ToJsonString(headers, data) AS data;
>>>>
>>>> STORE temp_projection INTO 'fpti/raw_data' USING
>>>>     org.elasticsearch.hadoop.pig.EsStorage(
>>>>         'es.resource = fpti/raw_data',
>>>>         'es.input.json = true',
>>>>         'es.nodes = node1,node2,node3,node4',
>>>>         'mapreduce.map.speculative = false',
>>>>         'mapreduce.reduce.speculative = false',
>>>>         'es.batch.size.bytes = 512mb',
>>>>         'es.batch.size.entries = 1');
>>>>
>>>> When I run the above, there are around 300 mappers; none of them
>>>> complete, and every time the job fails with the below error. Some
>>>> documents do get indexed, though.
>>>>
>>>> *Error:*
>>>>
>>>> *2015-05-20 15:40:20,618 [main] ERROR
>>>> org.apache.pig.tools.grunt.GruntParser - ERROR 2999: Unexpected internal
>>>> error. Could not write all entries [1/8448] (maybe ES was overloaded?).
>>>> Bailing out...*
>>>>
>>>> The job, however, finishes when the data size is a few thousand records.
>>>>
>>>> Please let me know what else I can do to increase my indexing throughput.
>>>>
>>>> regards
>>>>
>>>> #sudhir
>>>>
>>>> --
>>>> Please update your bookmarks! We have moved to
>>>> https://discuss.elastic.co/
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to elasticsearc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/6312a8b6-bde7-40d6-bbf0-8b3fccf7cd12%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
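For reference, the "Could not write all entries" failure usually means the connector's bulk-write retries were exhausted because ES kept rejecting requests. With ~300 mappers writing concurrently into 4 nodes, gentler bulk settings and more patient retries are worth trying. A sketch using standard elasticsearch-hadoop settings; the index name matches the script above, but the specific values are illustrative, not tuned:

```
STORE temp_projection INTO 'fpti/raw_data' USING
    org.elasticsearch.hadoop.pig.EsStorage(
        'es.resource = fpti/raw_data',
        'es.input.json = true',
        'es.nodes = node1,node2,node3,node4',
        -- moderate bulk requests: flush at ~1000 docs or 1mb,
        -- whichever comes first (per writer task)
        'es.batch.size.entries = 1000',
        'es.batch.size.bytes = 1mb',
        -- retry rejected bulk documents instead of bailing out quickly
        'es.batch.write.retry.count = 6',
        'es.batch.write.retry.wait = 30s');
```

Note that the batch settings apply per task, so total load on the cluster scales with the number of concurrent mappers; capping the number of map tasks writing to ES at once can matter as much as the batch sizes themselves.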