[jira] [Comment Edited] (TEZ-2307) Possible wrong error message when submitting new dag
[ https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122733#comment-15122733 ]

Jeff Zhang edited comment on TEZ-2307 at 1/29/16 1:56 AM:
--

bq. I think it'll be better to move the DAGAppMaster into IDLE state only after the cleanup is done.

I thought about that, but it would confuse users: the last DAG is completed, yet they still cannot submit another DAG because the AM is still in RUNNING. For now it seems DAG cleanup won't take too long. Have you thought about putting it in DAGImpl.finish? I think the root cause is that the DAG state view on the client side is not consistent with the one on the AM side. So if we put DAG cleanup in DAGImpl.finish, the two sides are consistent.

was (Author: zjffdu):
bq. I think it'll be better to move the DAGAppMaster into IDLE state only after the cleanup is done.

I thought about that, but it would confuse users: the last DAG is completed, yet they still cannot submit another DAG because the AM is still in RUNNING. For now it seems DAG cleanup won't take too long. Have you thought about putting it in DAGImpl.finish? I think the root cause is that the DAG view on the client side is not consistent with the one on the AM side. So if we put DAG cleanup in DAGImpl.finish, the two sides are consistent.

> Possible wrong error message when submitting new dag
>
> Key: TEZ-2307
> URL: https://issues.apache.org/jira/browse/TEZ-2307
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2307-1.patch, TEZ-2307-2.patch, TEZ-2307-3.patch, TEZ-2307-4.patch
>
> In the following 2 cases, AM would propagate wrong error message to client ("App master already running a DAG")
> * The last dag is completed but AM is still in RUNNING state
> * AM is in shutting down.
> {code}
> 2015-04-10 06:01:50,369 INFO [IPC Server handler 0 on 46821] ipc.Server (Server.java:run(2070)) - IPC Server handler 0 on 46821, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.submitDAG from 10.0.0.223:48581 Call#411 Retry#0
> org.apache.tez.dag.api.TezException: App master already running a DAG
>     at org.apache.tez.dag.app.DAGAppMaster.submitDAGToAppMaster(DAGAppMaster.java:1131)
>     at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:118)
>     at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:163)
>     at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7471)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:415)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
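
For readers following the thread, the two cases in the issue description can be made concrete with a small sketch. This is not the Tez code or any of the attached patches. The `SubmitCheck` class, the `AmState` enum, the `dagStillActive` flag and every message except "App master already running a DAG" are assumptions made up for illustration. The point is only that the rejection reason is chosen from the actual AM situation, so the "already running" text is returned only when a DAG really is running.

```java
import java.util.Optional;

/** Illustrative only: these states and messages are assumptions, not Tez's real ones. */
final class SubmitCheck {
    enum AmState { IDLE, RUNNING, SHUTTING_DOWN }

    /**
     * Returns an empty Optional when a new DAG can be accepted, otherwise a
     * rejection reason that matches the actual situation.
     */
    static Optional<String> rejectionReason(AmState state, boolean dagStillActive) {
        switch (state) {
            case IDLE:
                return Optional.empty();
            case SHUTTING_DOWN:
                // Hypothetical wording for the "AM is shutting down" case.
                return Optional.of("App master is shutting down and cannot accept a new DAG");
            case RUNNING:
                if (dagStillActive) {
                    // The only case where the message from the issue is accurate.
                    return Optional.of("App master already running a DAG");
                }
                // Hypothetical wording for "last DAG completed, AM still RUNNING".
                return Optional.of("Previous DAG has completed but the app master has not returned to IDLE yet");
            default:
                throw new IllegalStateException("Unknown state: " + state);
        }
    }
}
```

A caller on the submit path would throw an exception carrying the returned reason when the Optional is non-empty. Whether the second RUNNING case should be rejected at all, or made to wait, is exactly what the comments in this thread debate.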
[jira] [Comment Edited] (TEZ-2307) Possible wrong error message when submitting new dag
[ https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122733#comment-15122733 ]

Jeff Zhang edited comment on TEZ-2307 at 1/29/16 1:56 AM:
--

bq. I think it'll be better to move the DAGAppMaster into IDLE state only after the cleanup is done.

I thought about that, but it would confuse users: the last DAG is completed, yet they still cannot submit another DAG because the AM is still in RUNNING. For now it seems DAG cleanup won't take too long. Have you thought about putting it in DAGImpl.finish? I think the root cause is that the DAG view on the client side is not consistent with the one on the AM side. So if we put DAG cleanup in DAGImpl.finish, the two sides are consistent.

was (Author: zjffdu):
bq. I think it'll be better to move the DAGAppMaster into IDLE state only after the cleanup is done.

I thought about that, but it would confuse users: the last DAG is completed, yet they still cannot submit another DAG because the AM is still in RUNNING. For now it seems DAG cleanup won't take too long. Have you thought about putting it in DAGImpl.finish?

> Possible wrong error message when submitting new dag
>
> Key: TEZ-2307
> URL: https://issues.apache.org/jira/browse/TEZ-2307
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2307-1.patch, TEZ-2307-2.patch, TEZ-2307-3.patch, TEZ-2307-4.patch
>
> In the following 2 cases, AM would propagate wrong error message to client ("App master already running a DAG")
> * The last dag is completed but AM is still in RUNNING state
> * AM is in shutting down.
[jira] [Comment Edited] (TEZ-2307) Possible wrong error message when submitting new dag
[ https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122063#comment-15122063 ]

Siddharth Seth edited comment on TEZ-2307 at 1/28/16 6:36 PM:
--

bq. I think making the submit RPC call wait might not be a good option, because it is confusing that the user cannot submit a new DAG even after the previous DAG is completed. So I suggest that the user can still submit a new DAG, but we keep the DAG in NEW state until the cleanup of the previous DAG is done.

This is an option. A couple of things will need to be considered though. The user will consider submitDag as successful. What happens if there's an error during the cleanup of the previous DAG? That would have to be sent back as part of DAG status monitoring. This can get fairly confusing for users - DAG accepted, but then notified about failure due to a cleanup error from the previous DAG.
Also, in case of an error during previous DAG cleanup, we should send back a specific error which the user can act on: SessionNotRunning itself, or a new Exception, which users can use to launch a new application.

On the patch itself. Instead of using a field - dagCleanupDone - I think it'll be better to move the DAGAppMaster into IDLE state only after the cleanup is done. My bad here, I should have fixed this in the patch which added the cleanup state. submitDag can wait on the DAG entering IDLE state instead of waiting on dagCleanup. A notification can be sent out once the DAG enters cleanup state. This also gets rid of the call from DAGImpl to set the dagCleanupedFlag to false.
- In the current patch, calling setDagCleanupDone races with handling of the DAGCleanupEvent if concurrent dispatchers are used. It'd be better to avoid this for when we support concurrent dispatchers as the default.
- A boolean field (maybe volatile) is sufficient instead of an AtomicBoolean since we're synchronizing on it.

was (Author: sseth):
bq. I think making the submit RPC call wait might not be a good option, because it is confusing that the user cannot submit a new DAG even after the previous DAG is completed. So I suggest that the user can still submit a new DAG, but we keep the DAG in NEW state until the cleanup of the previous DAG is done.

This is an option. A couple of things will need to be considered though. The user will consider submitDag as successful. What happens if there's an error during the cleanup of the previous DAG? That would have to be sent back as part of DAG status monitoring. This can get fairly confusing for users - DAG accepted, but then notified about failure due to a cleanup error from the previous DAG.

On the patch itself. Instead of using a field - dagCleanupDone - I think it'll be better to move the DAGAppMaster into IDLE state only after the cleanup is done. My bad here, I should have fixed this in the patch which added the cleanup state. submitDag can wait on the DAG entering IDLE state instead of waiting on dagCleanup. A notification can be sent out once the DAG enters cleanup state. This also gets rid of the call from DAGImpl to set the dagCleanupedFlag to false.
- In the current patch, calling setDagCleanupDone races with handling of the DAGCleanupEvent if concurrent dispatchers are used. It'd be better to avoid this for when we support concurrent dispatchers as the default.
- A boolean field (maybe volatile) is sufficient instead of an AtomicBoolean since we're synchronizing on it.

> Possible wrong error message when submitting new dag
>
> Key: TEZ-2307
> URL: https://issues.apache.org/jira/browse/TEZ-2307
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2307-1.patch, TEZ-2307-2.patch, TEZ-2307-3.patch, TEZ-2307-4.patch
>
> In the following 2 cases, AM would propagate wrong error message to client ("App master already running a DAG")
> * The last dag is completed but AM is still in RUNNING state
> * AM is in shutting down.
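
The suggestion in the comment above (become IDLE only after cleanup, let the submit path wait for IDLE, and keep the flag as a plain field because it is only touched under a lock) could look roughly like the following. It is a minimal sketch, not the DAGAppMaster code or the TEZ-2307 patch: `AmStateGate`, its `State` enum and all method names are hypothetical, and it uses an enum where the comment talks about a boolean.

```java
/** Hypothetical stand-in for the AM-side state; names are illustrative, not Tez's API. */
final class AmStateGate {
    enum State { RUNNING, CLEANUP, IDLE }

    private final Object lock = new Object();
    // Plain field: every read and write happens while holding `lock`,
    // so neither volatile nor an Atomic* wrapper is needed for visibility.
    private State state = State.IDLE;

    void dagStarted() { synchronized (lock) { state = State.RUNNING; } }

    void dagFinished() { synchronized (lock) { state = State.CLEANUP; } }

    /** Called when cleanup completes; only now does the AM become IDLE. */
    void cleanupDone() {
        synchronized (lock) {
            state = State.IDLE;
            lock.notifyAll(); // wake any submit call blocked in awaitIdle
        }
    }

    /** Submit path: block until IDLE or until the timeout expires. */
    boolean awaitIdle(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            while (state != State.IDLE) { // loop guards against spurious wakeups
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false;
                }
                try {
                    // wait(0) would block forever, so wait at least 1 ms
                    lock.wait(Math.max(1L, remainingNanos / 1_000_000L));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return true;
        }
    }

    State current() { synchronized (lock) { return state; } }
}
```

Because the state change and the notification happen inside the same synchronized block, a waiter cannot miss the transition to IDLE. That is the property the comment is after when it warns about setDagCleanupDone racing with the cleanup event handler.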
[jira] [Comment Edited] (TEZ-2307) Possible wrong error message when submitting new dag
[ https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118448#comment-15118448 ]

Jeff Zhang edited comment on TEZ-2307 at 1/27/16 6:34 AM:
--

I think making the submit RPC call wait might not be a good option, because it is confusing that the user cannot submit a new DAG even after the previous DAG is completed. So I suggest that the user can still submit a new DAG, but we keep the DAG in NEW state until the cleanup of the previous DAG is done. The only issue is that the TezDAGId cache is not cleared, but it should be fine.

[~sseth] What do you think?

was (Author: zjffdu):
I think making the submit RPC call wait might not be a good option, because it is confusing that the user cannot submit a new DAG even after the previous DAG is completed. So I suggest that the user can still submit a new DAG, but we keep the DAG in NEW state until the cleanup of the previous DAG is done.

[~sseth] What do you think?

> Possible wrong error message when submitting new dag
>
> Key: TEZ-2307
> URL: https://issues.apache.org/jira/browse/TEZ-2307
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2307-1.patch
>
> In the following 2 cases, AM would propagate wrong error message to client ("App master already running a DAG")
> * The last dag is completed but AM is still in RUNNING state
> * AM is in shutting down.