[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

Jason Lowe (JIRA) Tue, 07 Aug 2018 07:36:09 -0700


    [ 
https://issues.apache.org/jira/browse/TEZ-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571755#comment-16571755
 ]


Jason Lowe commented on TEZ-3976:
---------------------------------

bq. Each of those translate into an event in the AM which finally crashes due 
to OOM after around 30 minutes and around 10 million shuffle input errors (and 
10 million lines like the previous ones).

To clarify, this is the task crashing with an OOM and not the AM?  If so it 
looks like yet another flow control issue with the RPC layer where the task is 
generating events faster than the network can push them out.

Normally this scenario does not occur because one of the following triggers 
first:
# The AM sees enough complaints about the inability to fetch an upstream task's 
inputs that it decides to retroactively fail the upstream task and re-run it.  
When that is decided the AM informs the downstream tasks about the now obsolete 
shuffle inputs, and the downstream tasks will stop trying to fetch it.
# The downstream task decides that it is getting far too many errors and 
insufficient overall shuffle progress, so it declares itself unhealthy and 
fails.

It would be useful to understand why neither of those is happening in this case.


> ShuffleManager reporting too many errors
> ----------------------------------------
>
>                 Key: TEZ-3976
>                 URL: https://issues.apache.org/jira/browse/TEZ-3976
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jaume M
>            Priority: Major
>
> The symptoms are a lot of these logs are being shown:
> {code:java}
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #0 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_1529044441963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=701, 
> attemptNumber=0, 
> pathComponent=attempt_1529044441963_0021_34_01_000701_0_12541_0, spillType=2, 
> spillId=0], connectFailed: true
> 2018-06-15T18:09:35,811 WARN  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.Fetcher: copyInputs failed for 
> tasks [InputAttemptIdentifier [inputIdentifier=589, attemptNumber=0, 
> pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]]
> 2018-06-15T18:09:35,811 INFO  [Fetcher_B {Reducer_5} #1 ()] 
> org.apache.tez.runtime.library.common.shuffle.impl.ShuffleManager: Reducer_5: 
> Fetch failed for src: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0]InputIdentifier: InputAttemptIdentifier [inputIdentifier=589, 
> attemptNumber=0, 
> pathComponent=attempt_1529044441963_0021_34_01_000589_0_12445_0, spillType=2, 
> spillId=0], connectFailed: true
> {code}
> Each of those translate into an event in the AM which finally crashes due to 
> OOM after around 30 minutes and around 10 million shuffle input errors (and 
> 10 million lines like the previous ones). When the ShufflerManager is closed 
> and the counters reported there are many shuffle input errors, some of those 
> logs are:
> {code:java}
> 2018-06-15T17:46:30,988  INFO [TezTR-441963_21_34_4_0_4 
> (1529044441963_0021_34_04_000000_4)] runtime.LogicalIOProcessorRuntimeTask: 
> Final Counters for attempt_1529044441963_0021_34_04_000000_4: Counters: 43 
> [[org.apache.tez.common.counters.TaskCounter SPILLED_RECORDS=0, 
> NUM_SHUFFLED_INPUTS=26, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> INPUT_RECORDS_PROCESSED=26, OUTPUT_RECORDS=1, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_BYTES=779472, OUTPUT_BYTES_WITH_OVERHEAD=779483, 
> OUTPUT_BYTES_PHYSICAL=780146, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILL_COUNT=0, 
> SHUFFLE_BYTES=4207563, SHUFFLE_BYTES_DECOMPRESSED=20266603, 
> SHUFFLE_BYTES_TO_MEM=3380616, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=826947, SHUFFLE_PHASE_TIME=52516, 
> FIRST_EVENT_RECEIVED=1, LAST_EVENT_RECEIVED=1185][HIVE 
> RECORDS_OUT_INTERMEDIATE_^[[1;35;40m^[[KReducer_12^[[m^[[K=1, 
> RECORDS_OUT_OPERATOR_GBY_159=1, 
> RECORDS_OUT_OPERATOR_RS_160=1][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_INPUT_Map_11
>  FIRST_EVENT_RECEIVED=1, INPUT_RECORDS_PROCESSED=26, 
> LAST_EVENT_RECEIVED=1185, NUM_FAILED_SHUFFLE_INPUTS=858965, 
> NUM_SHUFFLED_INPUTS=26, SHUFFLE_BYTES=4207563, 
> SHUFFLE_BYTES_DECOMPRESSED=20266603, SHUFFLE_BYTES_DISK_DIRECT=826947, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=3380616, 
> SHUFFLE_PHASE_TIME=52516][TaskCounter_^[[1;35;40m^[[KReducer_12^[[m^[[K_OUTPUT_Map_1
>  ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779472, OUTPUT_BYTES_PHYSICAL=780146, 
> OUTPUT_BYTES_WITH_OVERHEAD=779483, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, 
> SPILLED_RECORDS=0]]
> 2018-06-15T17:46:32,271 INFO  [TezTR-441963_21_34_3_15_1 ()] 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask: Final Counters for 
> attempt_1529044441963_0021_34_03_000015_1: Counters: 87 [[File System 
> Counters FILE_BYTES_READ=0, FILE_BYTES_WRITTEN=0, FILE_READ_OPS=0, 
> FILE_LARGE_READ_OPS=0, FILE_WRITE_OPS=0, HDFS_BYTES_READ=2344929, 
> HDFS_BYTES_WRITTEN=0, HDFS_READ_OPS=5, HDFS_LARGE_READ_OPS=0, 
> HDFS_WRITE_OPS=0][org.apache.tez.common.counters.TaskCounter 
> SPILLED_RECORDS=0, NUM_SHUFFLED_INPUTS=1, NUM_FAILED_SHUFFLE_INPUTS=105195, 
> INPUT_RECORDS_PROCESSED=397, INPUT_SPLIT_LENGTH_BYTES=21563271, 
> OUTPUT_RECORDS=15737, OUTPUT_LARGE_RECORDS=0, OUTPUT_BYTES=1235818, 
> OUTPUT_BYTES_WITH_OVERHEAD=1267307, OUTPUT_BYTES_PHYSICAL=357520, 
> ADDITIONAL_SPILLS_BYTES_WRITTEN=0, ADDITIONAL_SPILLS_BYTES_READ=0, 
> ADDITIONAL_SPILL_COUNT=0, SHUFFLE_BYTES=31, SHUFFLE_BYTES_DECOMPRESSED=17, 
> SHUFFLE_BYTES_TO_MEM=31, SHUFFLE_BYTES_TO_DISK=0, 
> SHUFFLE_BYTES_DISK_DIRECT=0, SHUFFLE_PHASE_TIME=50525, 
> FIRST_EVENT_RECEIVED=9, LAST_EVENT_RECEIVED=61][HIVE DESERIALIZE_ERRORS=0, 
> RECORDS_IN_Map_11=395611, RECORDS_OUT_INTERMEDIATE_Map_11=15737, 
> RECORDS_OUT_OPERATOR_FIL_152=395611, RECORDS_OUT_OPERATOR_GBY_157=1, 
> RECORDS_OUT_OPERATOR_MAPJOIN_154=15736, RECORDS_OUT_OPERATOR_MAP_0=0, 
> RECORDS_OUT_OPERATOR_RS_155=15736, RECORDS_OUT_OPERATOR_RS_158=1, 
> RECORDS_OUT_OPERATOR_SEL_153=395611, RECORDS_OUT_OPERATOR_SEL_156=15736, 
> RECORDS_OUT_OPERATOR_TS_26=395611][TaskCounter_Map_11_INPUT_Map_13 
> FIRST_EVENT_RECEIVED=9, INPUT_RECORDS_PROCESSED=1, LAST_EVENT_RECEIVED=61, 
> NUM_FAILED_SHUFFLE_INPUTS=105195, NUM_SHUFFLED_INPUTS=1, SHUFFLE_BYTES=31, 
> SHUFFLE_BYTES_DECOMPRESSED=17, SHUFFLE_BYTES_DISK_DIRECT=0, 
> SHUFFLE_BYTES_TO_DISK=0, SHUFFLE_BYTES_TO_MEM=31, 
> SHUFFLE_PHASE_TIME=50525][TaskCounter_Map_11_INPUT_supplier 
> INPUT_RECORDS_PROCESSED=396, 
> INPUT_SPLIT_LENGTH_BYTES=21563271][TaskCounter_Map_11_OUTPUT_Reducer_12 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=779474, OUTPUT_BYTES_PHYSICAL=164787, 
> OUTPUT_BYTES_WITH_OVERHEAD=779485, OUTPUT_LARGE_RECORDS=0, OUTPUT_RECORDS=1, 
> SPILLED_RECORDS=0][TaskCounter_Map_11_OUTPUT_Reducer_6 
> ADDITIONAL_SPILLS_BYTES_READ=0, ADDITIONAL_SPILLS_BYTES_WRITTEN=0, 
> ADDITIONAL_SPILL_COUNT=0, OUTPUT_BYTES=456344, OUTPUT_BYTES_PHYSICAL=192733, 
> OUTPUT_BYTES_WITH_OVERHEAD=487822, OUTPUT_LARGE_RECORDS=0, 
> OUTPUT_RECORDS=15736, 
> SPILLED_RECORDS=0][org.apache.hadoop.hive.llap.counters.LlapIOCounters 
> ALLOCATED_BYTES=9633792, ALLOCATED_USED_BYTES=7976706, CACHE_HIT_BYTES=0, 
> CACHE_MISS_BYTES=2344364, CONSUMER_TIME_NS=1136392475, 
> DECODE_TIME_NS=140377915, HDFS_TIME_NS=145825282, METADATA_CACHE_MISS=4, 
> NUM_DECODED_BATCHES=41, NUM_VECTOR_BATCHES=396, ROWS_EMITTED=395611, 
> SELECTED_ROWGROUPS=41, TOTAL_IO_TIME_NS=1234990631]]
> {code}
> I think this is happening because the fetcher is in [this 
> loop|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/impl/ShuffleManager.java#L366]
>  which is started 
> [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L69]
>  and not stopped until the [processor is 
> closed|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/LogicalIOProcessorRuntimeTask.java#L387]
>  (which is called from 
> [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L83]
>  or 
> [here|https://github.com/apache/tez/blob/fe22f3276d6d97f6b5dfab24490ee2ca32bf73c3/tez-runtime-internals/src/main/java/org/apache/tez/runtime/task/TaskRunner2Callable.java#L110]).
> In {{ShuffleManager}} we seem to keep track of 
> {{TaskCounter.NUM_FAILED_SHUFFLE_INPUTS}} but nothing is done when the value 
> gets too high. Maybe something similar to {{ShuffleScheduler}} should be done 
> where the fetchers are only retried a certain amount of times.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (TEZ-3976) ShuffleManager reporting too many errors

Reply via email to