[ https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572043#comment-16572043 ]
Jaume M commented on TEZ-3976: ------------------------------ [~jlowe] for clarification the OOM is in the AM. Some tasks are killed by the AM and others seem to finish fine (despite a huge number of \{{NUM_FAILED_SHUFFLE_INPUTS}}). I see the AM killing some tasks but I think it just can't keep up with all the incoming evens. For example task {{1529044441963_0021_34_04_000000_4}} runs for around one minute and reports {{NUM_FAILED_SHUFFLE_INPUTS=858965}}. I think we should to make you second point: bq. The downstream task decides that it is getting far too many errors and insufficient overall shuffle progress, so it declares itself unhealthy and fails. happen as well for the {{ShuffleManager}}. > ShuffleManager reporting too many errors > ---------------------------------------- > > Key: TEZ-3976 > URL: https://issues.apache.org/jira/browse/TEZ-3976 > Project: Apache Tez > Issue Type: Bug > Reporter: Jaume M > Priority: Major > > The symptoms are a lot of these logs are being shown: > {code:java} > 2018-06-15T18:09:35,811 INFO [Fetcher_B {Reducer_5} #0 ()] > org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: > Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, > attemptNumber=0, > pathComponent=attempt_1529044441963_0021_34_01_000701_0_12541_0, spillType=2, > spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, > attemptNumber=0, > pathComponent=attempt_1529044441963_0021_34_01_000701_0_12541_0, spillType=2, > spillId=0], connectFailed: true > 2018-06-15T18:09:35,811 WARN [Fetcher_B {Reducer_5} #1 ()] > org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for > tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, > pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2, > spillId=0]] > 2018-06-15T18:09:35,811 INFO [Fetcher_B {Reducer_5} #1 ()] > org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: > Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, > attemptNumber=0, > pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2, > spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, > attemptNumber=0, > pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2, > spillId=0], connectFailed: true > {code} > Each of those translate into an event in the AM which finally crashes due to > OOM after around 30 minutes and around 10 million shuffle input errors (and > 10 million lines like the previous ones). When the ShufflerManager is closed > and the counters reported there are many shuffle input errors, some of those > logs are: > {code:java} > 2018-06-15T17:46:30,988 INFO [TezTR-441963_21_34_4_0_4 > (1529044441963_0021_34_04_000000_4)] runtime.LogicalIOProcessorRuntimeTask: > Final Counters for attempt_1529044441963_0021_34_04_000000_4: Counters: 43 > [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, > NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, > INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, > OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, > OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, > ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, > SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, > SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, > SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, > FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE > RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, > RECORDS_OUT_OPERATOR_GBY_159=1, > RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11 > FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, > LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, > NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, > SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, > SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, > SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1 > ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, > ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, > OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, > SPILLED_RECORDS=0]] > 2018-06-15T17:46:32,271 INFO [TezTR-441963_21_34_3_15_1 ()] > org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for > attempt_1529044441963_0021_34_03_000015_1: Counters: 87 [[File System > Counters FILE_BYTES_READ=0, FILE_BYTES_WRITTEN=0, FILE_READ_OPS=0, > FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=2344929, > HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=5, HDFS_LARGE_READ_OPS=0, > HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter > SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=1, NUM_FAILED_SHUFFLE_INPUTS=105195, > INPUT_RECORDS_PROCESSED=397, INPUT_SPLIT_LENGTH_BYTES=21563271, > OUTPUT_RECORDS=15737, OUTPUT_LARGE_RECORDS=0, OUTPUT_BYTES=1235818, > OUTPUT_BYTES_WITH_OVERHEAD=1267307, OUTPUT_BYTES_PHYSICAL=357520, > ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILLS_BYTES_READ=0, > ADDITIONAL_SPILL_COUNT=0, SHUFFLE_BYTES=31, SHUFFLE_BYTES_DECOMPRESSED=17, > SHUFFLE_BYTES_TO_MEM=31, SHUFFLE_BYTES_TO_DISK=0, > SHUFFLE_BYTES_DISK_DIRECT=0, SHUFFLE_PHASE_TIME=50525, > FIRST_EVENT_RECEIVED=9, LAST_EVENT_RECEIVED=61][HIVE DESERIALIZE_ERRORS=0, > RECORDS_IN_Map_11=395611, RECORDS_OUT_INTERMEDIATE_Map_11=15737, > RECORDS_OUT_OPERATOR_FIL_152=395611, RECORDS_OUT_OPERATOR_GBY_157=1, > RECORDS_OUT_OPERATOR_MAPJOIN_154=15736, RECORDS_OUT_OPERATOR_MAP_0=0, > RECORDS_OUT_OPERATOR_RS_155=15736, RECORDS_OUT_OPERATOR_RS_158=1, > RECORDS_OUT_OPERATOR_SEL_153=395611, RECORDS_OUT_OPERATOR_SEL_156=15736, > RECORDS_OUT_OPERATOR_TS_26=395611][TaskCounter_Map_11_INPUT_Map_13 > FIRST_EVENT_RECEIVED=9, INPUT_RECORDS_PROCESSED=1, LAST_EVENT_RECEIVED=61, > NUM_FAILED_SHUFFLE_INPUTS=105195, NUM_SHUFFLED_INPUTS=1, SHUFFLE_BYTES=31, > SHUFFLE_BYTES_DECOMPRESSED=17, SHUFFLE_BYTES_DISK_DIRECT=0, > SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=31, > SHUFFLE_PHASE_TIME=50525][TaskCounter_Map_11_INPUT_supplier > INPUT_RECORDS_PROCESSED=396, > INPUT_SPLIT_LENGTH_BYTES=21563271][TaskCounter_Map_11_OUTPUT_Reducer_12 > ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, > ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779474, OUTPUT_BYTES_PHYSICAL=164787, > OUTPUT_BYTES_WITH_OVERHEAD=779485, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, > SPILLED_RECORDS=0][TaskCounter_Map_11_OUTPUT_Reducer_6 > ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, > ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=456344, OUTPUT_BYTES_PHYSICAL=192733, > OUTPUT_BYTES_WITH_OVERHEAD=487822, OUTPUT_LARGE_RECORDS=0, > OUTPUT_RECORDS=15736, > SPILLED_RECORDS=0][org.apache.hadoop.hive.llap.counters.LlapIOCounters > ALLOCATED_BYTES=9633792, ALLOCATED_USED_BYTES=7976706, CACHE_HIT_BYTES=0, > CACHE_MISS_BYTES=2344364, CONSUMER_TIME_NS=1136392475, > DECODE_TIME_NS=140377915, HDFS_TIME_NS=145825282, METADATA_CACHE_MISS=4, > NUM_DECODED_BATCHES=41, NUM_VECTOR_BATCHES=396, ROWS_EMITTED=395611, > SELECTED_ROWGROUPS=41, TOTAL_IO_TIME_NS=1234990631]] > {code} > I think this is happening because the fetcher is in [this > loop|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L366] > which is started > [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L69] > and not stopped until the [processor is > closed|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java#L387] > (which is called from > [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L83] > or > [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L110]). > In {{ShuffleManager}} we seem to keep track of > {{TaskCounter.NUM_FAILED_SHUFFLE_INPUTS}} but nothing is done when the value > gets too high. Maybe something similar to {{ShuffleScheduler}} should be done > where the fetchers are only retried a certain amount of times. -- This message was sent by Atlassian JIRA (v7.6.3#76005)