[jira] [Commented] (TEZ-3846) Tez AM may not clean up properly on an internal error
[ https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256036#comment-16256036 ]

Eric Wohlstadter commented on TEZ-3846:
---------------------------------------

[~aplusplus] Sure, I'll take a look at this.

> Tez AM may not clean up properly on an internal error
> -----------------------------------------------------
>
>                 Key: TEZ-3846
>                 URL: https://issues.apache.org/jira/browse/TEZ-3846
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Zhiyuan Yang
>
> Normally, in Hive we blindly reopen the session on any submit error; however I accidentally broke that, and while investigating noticed a new error before reopen that claims that session where a DAG has failed is still running a DAG. Looks like it should either clean up, or if we assume OOM is not clean-up-able, die completely.
> {noformat}
> 2017-09-28T01:07:12,352 INFO [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] client.TezClient: Submitted dag to TezSession, sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, applicationId=application_1506585924598_0001, dagId=dag_1506585924598_0001_53, dagName=SELECT count(1) FROM (
> ...
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] SessionState: Status: Failed
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] SessionState: Vertex failed, vertexName=Map 61, vertexId=vertex_1506585924598_0001_53_01, diagnostics=[Vertex vertex_1506585924598_0001_53_01 [Map 61] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: src initializer failed, vertex=vertex_1506585924598_0001_53_01 [Map 61], java.lang.OutOfMemoryError: GC overhead limit exceeded
> 2017-09-28T01:07:25,787 ERROR [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] SessionState: Invalid event V_INTERNAL_ERROR on Vertex vertex_1506585924598_0001_53_00 [Map 60]
> 2017-09-28T01:07:25,787 DEBUG [3d4e3f44-40c5-431a-b3de-801d60c1c579 main] log.PerfLogger: end=1506586045787 duration=13435 from=org.apache.hadoop.hive.ql.exec.tez.monitoring.TezJobMonitor>
> ... [reuse]
> 2017-09-28T01:07:28,459 INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] client.TezClient: Submitting dag to TezSession, sessionName=HIVE-35a0e5c9-ce27-4b27-824c-ce9bc0fe104d, applicationId=application_1506585924598_0001, dagName=insert overwrite table orc_ppd_staging s...s(Stage-1), callerContext={ context=HIVE, callerType=HIVE_QUERY_ID, callerId=hiveptest_20170928010728_58f19d98-85da-4fad-83a7-7bf3aa0252a7 }
> 2017-09-28T01:07:35,259 INFO [11108166-069e-43d7-9e21-25b9214d01a4 main] exec.Task: Dag submit failed due to App master already running a DAG
> {noformat}
> Session continues living and failing like that multiple times.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
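For readers unfamiliar with the client-side behaviour the description refers to ("blindly reopen the session on any submit error"), here is a minimal sketch of that pattern. It is not Hive's or Tez's actual code: `DagSession`, `SessionFactory`, and `ReopeningSubmitter` are hypothetical names invented for illustration, and the real `TezClient` API is deliberately not used.

```java
// Hypothetical stand-in for a session client; this is NOT the real TezClient API.
interface DagSession {
    // Returns an id for the submitted DAG, or throws if the AM rejects the submit.
    String submit(String dagName);
    void close();
}

// Hypothetical factory that opens a fresh session (and thus a fresh AM).
interface SessionFactory {
    DagSession open();
}

final class ReopeningSubmitter {
    private final SessionFactory factory;
    private DagSession session;
    private int reopenCount = 0;

    ReopeningSubmitter(SessionFactory factory) {
        this.factory = factory;
        this.session = factory.open();
    }

    // Submit once; on any submit error, discard the session, open a new one,
    // and retry exactly once. A second failure propagates to the caller.
    String submit(String dagName) {
        try {
            return session.submit(dagName);
        } catch (RuntimeException firstFailure) {
            try {
                session.close();
            } catch (RuntimeException ignored) {
                // The old session may already be unusable; reopen regardless.
            }
            session = factory.open();
            reopenCount++;
            return session.submit(dagName);
        }
    }

    int reopenCount() {
        return reopenCount;
    }
}
```

With a wrapper like this, a session stuck in the "App master already running a DAG" state would be replaced after the first failed submit. That masks the problem on the client side only; the question raised in this issue, whether the AM should clean up or die, is unaffected.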
[jira] [Commented] (TEZ-3846) Tez AM may not clean up properly on an internal error
[ https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256030#comment-16256030 ]

Zhiyuan Yang commented on TEZ-3846:
-----------------------------------

[~EricWohlstadter] It's done in TEZ-3858.
[jira] [Commented] (TEZ-3846) Tez AM may not clean up properly on an internal error
[ https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16256026#comment-16256026 ]

Eric Wohlstadter commented on TEZ-3846:
---------------------------------------

[~aplusplus] Did you file another jira to correct the incorrect log message? I want to add that one to the backlog.
[jira] [Commented] (TEZ-3846) Tez AM may not clean up properly on an internal error
[ https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16215529#comment-16215529 ]

Zhiyuan Yang commented on TEZ-3846:
-----------------------------------

bq. SessionState: Invalid event V_INTERNAL_ERROR on Vertex vertex_1506585924598_0001_53_00 \[Map 60\]

@Kuhu Shukla This is a misleading DAG-level diagnostic. It's printed by InternalErrorTransition rather than by the state machine code. What it actually means is that this vertex got an invalid event, but that event is not V_INTERNAL_ERROR. I'll file another jira to correct this.
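To illustrate how a diagnostic like "Invalid event V_INTERNAL_ERROR" can end up naming the wrong event, here is a small sketch of one way it could happen. This is an assumption made for illustration only, not the actual Tez `VertexImpl` or `InternalErrorTransition` code: the class `VertexSketch`, the enum `VertexEventType`, and the rule that `V_TASK_COMPLETED` is invalid are all hypothetical.

```java
// Hypothetical event types for this sketch; only V_INTERNAL_ERROR appears in the log.
enum VertexEventType { V_START, V_TASK_COMPLETED, V_INTERNAL_ERROR }

final class VertexSketch {
    private final String name;
    private String diagnostic = "";

    VertexSketch(String name) {
        this.name = name;
    }

    // Hypothetical rule: pretend V_TASK_COMPLETED is not valid in the current state.
    private static boolean isValid(VertexEventType type) {
        return type != VertexEventType.V_TASK_COMPLETED;
    }

    // Misleading variant: the invalid event is replaced by V_INTERNAL_ERROR, and the
    // error handler reports the type of the event it was handed.
    void handleMisleading(VertexEventType type) {
        VertexEventType delivered = isValid(type) ? type : VertexEventType.V_INTERNAL_ERROR;
        if (delivered == VertexEventType.V_INTERNAL_ERROR) {
            diagnostic = "Invalid event " + delivered + " on Vertex " + name;
        }
    }

    // Clearer variant: the diagnostic names the event that was actually rejected.
    void handleClear(VertexEventType type) {
        if (!isValid(type)) {
            diagnostic = "Invalid event " + type + " on Vertex " + name;
        }
    }

    String diagnostic() {
        return diagnostic;
    }
}
```

In the misleading variant the original event type is lost by the time the message is built, so the diagnostic always says `V_INTERNAL_ERROR`. Carrying the rejected event type through to the handler, as in the second variant, is one possible way to make the message accurate.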
[jira] [Commented] (TEZ-3846) Tez AM may not clean up properly on an internal error
[ https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186309#comment-16186309 ]

Sergey Shelukhin commented on TEZ-3846:
---------------------------------------

Tez version was 0.9.0 (the one Hive is using on master). Unfortunately I don't have vertex logs.
[jira] [Commented] (TEZ-3846) Tez AM may not clean up properly on an internal error
[ https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186024#comment-16186024 ]

Kuhu Shukla commented on TEZ-3846:
----------------------------------

[~sershe], thank you for reporting the issue. On what version of Tez was this issue seen? I'm wondering whether any of the recent fixes or JIRAs might be related here, e.g. TEZ-3817.
[jira] [Commented] (TEZ-3846) Tez AM may not clean up properly on an internal error
[ https://issues.apache.org/jira/browse/TEZ-3846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185082#comment-16185082 ]

Zhiyuan Yang commented on TEZ-3846:
-----------------------------------

I'll take a look soon.