[jira] [Comment Edited] (TEZ-2307) Possible wrong error message when submitting new dag

2016-01-28 Thread Jeff Zhang (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122733#comment-15122733 ]

Jeff Zhang edited comment on TEZ-2307 at 1/29/16 1:56 AM:
--

bq.  I think it'll be better to move the DAGAppMaster into IDLE state only
after the cleanup is done.
I thought about that, but it would confuse users: the last DAG is completed,
yet they still cannot submit another DAG because the AM is still in RUNNING
state. For now it seems DAG cleanup won't take long. Have you thought about
putting it in DAGImpl.finish? I think the root cause is that the DAG state
view on the client side is not consistent with that on the AM side. So if we
put DAG cleanup in DAGImpl.finish, the two sides stay consistent.
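
As a rough illustration of the DAGImpl.finish idea, here is a minimal sketch. Everything in it is hypothetical (the class FinishOrderSketch, the cleanup Runnable, the two state names); it is not the real DAGImpl. It only shows the ordering argument: if cleanup runs before the terminal state becomes visible, a client that observes the DAG as finished can assume cleanup has already happened.

```java
// Hypothetical illustration; not the real org.apache.tez DAGImpl.
final class FinishOrderSketch {
    enum DagState { RUNNING, SUCCEEDED }

    // volatile so a status-polling thread sees the terminal state promptly
    private volatile DagState state = DagState.RUNNING;
    private final Runnable cleanup; // assumed stand-in for per-DAG cleanup work

    FinishOrderSketch(Runnable cleanup) {
        this.cleanup = cleanup;
    }

    // Run cleanup first, then publish the terminal state.
    void finish() {
        cleanup.run();
        state = DagState.SUCCEEDED;
    }

    DagState getState() {
        return state;
    }

    public static void main(String[] args) {
        FinishOrderSketch dag =
            new FinishOrderSketch(() -> System.out.println("cleaning up"));
        dag.finish();
        System.out.println("state after finish: " + dag.getState());
    }
}
```

The trade-off is that a slow cleanup delays the point at which the client sees the DAG as completed.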


was (Author: zjffdu):
bq.  I think it'll be better to move the DAGAppMaster into IDLE state only
after the cleanup is done.
I thought about that, but it would confuse users: the last DAG is completed,
yet they still cannot submit another DAG because the AM is still in RUNNING
state. For now it seems DAG cleanup won't take long. Have you thought about
putting it in DAGImpl.finish? I think the root cause is that the DAG view on
the client side is not consistent with that on the AM side. So if we put DAG
cleanup in DAGImpl.finish, the two sides stay consistent.

> Possible wrong error message when submitting new dag
> 
>
> Key: TEZ-2307
> URL: https://issues.apache.org/jira/browse/TEZ-2307
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2307-1.patch, TEZ-2307-2.patch, TEZ-2307-3.patch, 
> TEZ-2307-4.patch
>
>
> In the following 2 cases, AM would propagate wrong error message to client 
> ("App master already running a DAG")
> * The last dag is completed but AM is still in RUNNING state
> * AM is in shutting down. 
> {code}
> 2015-04-10 06:01:50,369 INFO  [IPC Server handler 0 on 46821] ipc.Server (Server.java:run(2070)) - IPC Server handler 0 on 46821, call org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPB.submitDAG from 10.0.0.223:48581 Call#411 Retry#0
> org.apache.tez.dag.api.TezException: App master already running a DAG
>   at org.apache.tez.dag.app.DAGAppMaster.submitDAGToAppMaster(DAGAppMaster.java:1131)
>   at org.apache.tez.dag.api.client.DAGClientHandler.submitDAG(DAGClientHandler.java:118)
>   at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.submitDAG(DAGClientAMProtocolBlockingPBServerImpl.java:163)
>   at org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(DAGClientAMProtocolRPC.java:7471)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {code}
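
The two cases in the description can be told apart with a small state check. The sketch below is an assumption-laden illustration, not the actual DAGAppMaster code: the class SubmitCheckSketch, the SHUTTING_DOWN state name, the dagActive flag, and the two new message strings are all made up for the example. Only the message "App master already running a DAG" comes from the report.

```java
// Hypothetical illustration only; class, states, and new messages are
// assumptions, not the actual DAGAppMaster implementation.
final class SubmitCheckSketch {
    enum AmState { IDLE, RUNNING, SHUTTING_DOWN }

    private AmState state = AmState.IDLE;
    private boolean dagActive = false; // true while a DAG is really executing

    // Returns "accepted <name>" or a message naming the actual reason for rejection.
    synchronized String trySubmit(String dagName) {
        if (state == AmState.SHUTTING_DOWN) {
            return "App master is shutting down";
        }
        if (state == AmState.RUNNING) {
            return dagActive
                ? "App master already running a DAG"
                : "Previous DAG completed, app master still cleaning up";
        }
        state = AmState.RUNNING;
        dagActive = true;
        return "accepted " + dagName;
    }

    synchronized void dagCompleted() {
        dagActive = false; // AM stays in RUNNING until cleanup finishes
    }

    synchronized void cleanupDone() {
        if (state == AmState.RUNNING) {
            state = AmState.IDLE;
        }
    }

    synchronized void shutdown() {
        state = AmState.SHUTTING_DOWN;
    }

    public static void main(String[] args) {
        SubmitCheckSketch am = new SubmitCheckSketch();
        System.out.println(am.trySubmit("dag1"));
        am.dagCompleted();
        System.out.println(am.trySubmit("dag2"));
        am.cleanupDone();
        System.out.println(am.trySubmit("dag2"));
        am.shutdown();
        System.out.println(am.trySubmit("dag3"));
    }
}
```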



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TEZ-2307) Possible wrong error message when submitting new dag

2016-01-28 Thread Jeff Zhang (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122733#comment-15122733 ]

Jeff Zhang edited comment on TEZ-2307 at 1/29/16 1:56 AM:
--

bq.  I think it'll be better to move the DAGAppMaster into IDLE state only
after the cleanup is done.
I thought about that, but it would confuse users: the last DAG is completed,
yet they still cannot submit another DAG because the AM is still in RUNNING
state. For now it seems DAG cleanup won't take long. Have you thought about
putting it in DAGImpl.finish? I think the root cause is that the DAG view on
the client side is not consistent with that on the AM side. So if we put DAG
cleanup in DAGImpl.finish, the two sides stay consistent.


was (Author: zjffdu):
bq.  I think it'll be better to move the DAGAppMaster into IDLE state only
after the cleanup is done.
I thought about that, but it would confuse users: the last DAG is completed,
yet they still cannot submit another DAG because the AM is still in RUNNING
state. For now it seems DAG cleanup won't take long. Have you thought about
putting it in DAGImpl.finish?



[jira] [Comment Edited] (TEZ-2307) Possible wrong error message when submitting new dag

2016-01-28 Thread Siddharth Seth (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122063#comment-15122063 ]

Siddharth Seth edited comment on TEZ-2307 at 1/28/16 6:36 PM:
--

bq. I think making the submit RPC call wait might not be a good option, because
it is confusing that a user cannot submit a new DAG even after the previous DAG
is completed. So I suggest that the user can still submit a new DAG, but we
keep it in NEW state until the cleanup of the previous DAG is done.
This is an option. A couple of things will need to be considered, though.
The user will consider submitDag successful. What happens if there's an
error during the cleanup of the previous DAG? That would have to be sent back
as part of DAG status monitoring. This can get fairly confusing for users: the
DAG is accepted, but then they are notified about a failure due to a cleanup
error from the previous DAG.
Also, in case of an error during the previous DAG's cleanup, we should send
back a specific error which the user can act on: SessionNotRunning itself, or a
new Exception, which users can use to launch a new application.
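
To make the trade-off concrete, here is a minimal sketch of the "accept now, hold in NEW" option. All names are hypothetical (DeferredSubmitSketch, previousCleanupFinished, the diagnostics text) and none of this is Tez's API. It assumes a single pending DAG and shows how a cleanup failure from the previous DAG would only surface later, through status monitoring of the new DAG.

```java
// Hypothetical illustration; not Tez code. One pending DAG at most.
final class DeferredSubmitSketch {
    enum DagStatus { NEW, RUNNING, FAILED }

    private boolean previousCleanupPending;
    private DagStatus status;        // status of the newly submitted DAG, null before submit
    private String diagnostics = ""; // what status monitoring would report

    DeferredSubmitSketch(boolean previousCleanupPending) {
        this.previousCleanupPending = previousCleanupPending;
    }

    // submit always "succeeds"; the DAG just may not start yet.
    synchronized void submit(String dagName) {
        status = previousCleanupPending ? DagStatus.NEW : DagStatus.RUNNING;
    }

    // Called when cleanup of the previous DAG ends; error is null on success.
    synchronized void previousCleanupFinished(Exception error) {
        previousCleanupPending = false;
        if (status != DagStatus.NEW) {
            return; // nothing is waiting on the cleanup
        }
        if (error == null) {
            status = DagStatus.RUNNING;
        } else {
            status = DagStatus.FAILED;
            diagnostics = "Cleanup of previous DAG failed: " + error.getMessage();
        }
    }

    synchronized DagStatus getStatus() {
        return status;
    }

    synchronized String getDiagnostics() {
        return diagnostics;
    }

    public static void main(String[] args) {
        DeferredSubmitSketch am = new DeferredSubmitSketch(true);
        am.submit("dag2");
        System.out.println(am.getStatus());
        am.previousCleanupFinished(new RuntimeException("simulated cleanup error"));
        System.out.println(am.getStatus() + ": " + am.getDiagnostics());
    }
}
```

In the failure path the user sees an accepted DAG fail for a reason unrelated to that DAG, which is the confusion described above.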

On the patch itself.
Instead of using a field - dagCleanupDone, I think it'll be better to move the
DAGAppMaster into IDLE state only after the cleanup is done. My bad here, I
should have fixed this in the patch which added the cleanup state. submitDag
can wait on the DAGAppMaster entering IDLE state instead of waiting on
dagCleanup. A notification can be sent out once the DAG enters cleanup state.
This also gets rid of the call from DAGImpl to set the dagCleanupedFlag to
false.
- In the current patch, calling setDagCleanupDone races with handling of the 
DAGCleanupEvent if concurrent dispatchers are used. It'd be better to avoid 
this for when we support concurrent dispatchers as the default.
- A boolean field (maybe volatile) is sufficient instead of an AtomicBoolean 
since we're synchronizing on it.
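
A minimal sketch of the "wait for IDLE" approach, under stated assumptions: IdleGateSketch, onDagCleanupDone, and the timeout parameter are hypothetical names, not the patch's code. It uses a plain field guarded by a monitor, which is why neither an AtomicBoolean nor volatile is needed here: every read and write happens inside synchronized (lock).

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not the actual DAGAppMaster or the TEZ-2307 patch.
final class IdleGateSketch {
    enum AmState { IDLE, RUNNING }

    private final Object lock = new Object();
    private AmState state = AmState.IDLE; // guarded by lock

    // Waits up to timeoutMs for IDLE, then claims the AM for the new DAG.
    // Returns false on timeout or interruption.
    boolean submitDag(String dagName, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (lock) {
            while (state != AmState.IDLE) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    return false; // still cleaning up (or running) after the timeout
                }
                try {
                    lock.wait(remainingMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            state = AmState.RUNNING;
            return true;
        }
    }

    // The single place where the AM becomes IDLE: after cleanup is done.
    void onDagCleanupDone() {
        synchronized (lock) {
            state = AmState.IDLE;
            lock.notifyAll(); // wake any submitDag callers waiting for IDLE
        }
    }

    public static void main(String[] args) {
        IdleGateSketch am = new IdleGateSketch();
        System.out.println(am.submitDag("dag1", 100));
        am.onDagCleanupDone();
        System.out.println(am.submitDag("dag2", 100));
    }
}
```

Because the state changes in only one place and under one lock, there is no separate flag that could race with the cleanup event handler.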


was (Author: sseth):
bq. I think making the submit RPC call wait might not be a good option, because
it is confusing that a user cannot submit a new DAG even after the previous DAG
is completed. So I suggest that the user can still submit a new DAG, but we
keep it in NEW state until the cleanup of the previous DAG is done.
This is an option. A couple of things will need to be considered, though.
The user will consider submitDag successful. What happens if there's an
error during the cleanup of the previous DAG? That would have to be sent back
as part of DAG status monitoring. This can get fairly confusing for users: the
DAG is accepted, but then they are notified about a failure due to a cleanup
error from the previous DAG.

On the patch itself.
Instead of using a field - dagCleanupDone, I think it'll be better to move the
DAGAppMaster into IDLE state only after the cleanup is done. My bad here, I
should have fixed this in the patch which added the cleanup state. submitDag
can wait on the DAGAppMaster entering IDLE state instead of waiting on
dagCleanup. A notification can be sent out once the DAG enters cleanup state.
This also gets rid of the call from DAGImpl to set the dagCleanupedFlag to
false.
- In the current patch, calling setDagCleanupDone races with handling of the 
DAGCleanupEvent if concurrent dispatchers are used. It'd be better to avoid 
this for when we support concurrent dispatchers as the default.
- A boolean field (maybe volatile) is sufficient instead of an AtomicBoolean 
since we're synchronizing on it.


[jira] [Comment Edited] (TEZ-2307) Possible wrong error message when submitting new dag

2016-01-26 Thread Jeff Zhang (JIRA)

[ https://issues.apache.org/jira/browse/TEZ-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118448#comment-15118448 ]

Jeff Zhang edited comment on TEZ-2307 at 1/27/16 6:34 AM:
--

I think making the submit RPC call wait might not be a good option, because it
is confusing that a user cannot submit a new DAG even after the previous DAG
is completed. So I suggest that the user can still submit a new DAG, but we
keep it in NEW state until the cleanup of the previous DAG is done. The only
issue is that the TezDAGId cache is not cleared, but that should be fine.
[~sseth] What do you think?


was (Author: zjffdu):
I think making the submit RPC call wait might not be a good option, because it
is confusing that a user cannot submit a new DAG even after the previous DAG
is completed. So I suggest that the user can still submit a new DAG, but we
keep it in NEW state until the cleanup of the previous DAG is done. [~sseth]
What do you think?

> Possible wrong error message when submitting new dag
> 
>
> Key: TEZ-2307
> URL: https://issues.apache.org/jira/browse/TEZ-2307
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2307-1.patch
>
>