[jira] [Commented] (YARN-9721) An easy method to exclude a nodemanager from the yarn cluster cleanly

2019-08-07 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902027#comment-16902027
 ] 

Zac Zhou commented on YARN-9721:


Thanks, [~tangzhankun],

I prefer methods 1 and 2.

Method 2 can work the same way as method 3 when the time period parameter is
set to 0.

Method 1 needs to add a member variable to RefreshNodesRequest and RMNode,
which would involve a bit more work.

I'm ok with both methods~
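For reference, a minimal sketch of what method 1 (from the list in the comment
below) might look like; the class, field, and method names here are assumptions
for illustration only, not the actual YARN API:
{code:java}
// Hypothetical sketch of method 1: carry a "prune" flag on the refresh request
// and mirror it on the inactive node record so the removalTimer may drop it.
public final class RefreshNodesRequestSketch {

  // assumed new member variable set by "rmadmin -refreshNodes --prune-nodes"
  private final boolean pruneDecommissionedNodes;

  public RefreshNodesRequestSketch(boolean pruneDecommissionedNodes) {
    this.pruneDecommissionedNodes = pruneDecommissionedNodes;
  }

  public boolean isPruneDecommissionedNodes() {
    return pruneDecommissionedNodes;
  }

  // assumed counterpart flag on the node record; once set, the removalTimer
  // is allowed to remove the node from the inactive map
  public static final class InactiveNodeRecord {
    private volatile boolean prunable;

    public void markPrunable() {
      prunable = true;
    }

    public boolean isPrunable() {
      return prunable;
    }
  }
}
{code}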

> An easy method to exclude a nodemanager from the yarn cluster cleanly
> -
>
> Key: YARN-9721
> URL: https://issues.apache.org/jira/browse/YARN-9721
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Priority: Major
> Attachments: decommission nodes.png
>
>
> If we want to take a nodemanager server offline, the nodes.exclude-path file
> and the "rmadmin -refreshNodes" command are used to decommission the server.
> But this method cannot clean up the node completely: nodemanager servers are
> still listed under Decommissioned Nodes, as the attachment shows.
>   !decommission nodes.png!
> YARN-4311 enables a removalTimer to clean up untracked nodes.
> But the logic of the isUntrackedNode method is too restrictive: if include-path
> is not used, no servers can meet the criteria, and using an include file would
> pose a potential maintenance risk.
> If the yarn cluster is deployed on the cloud, nodemanager servers are created
> and deleted frequently. We need a way to exclude a nodemanager from the yarn
> cluster cleanly. Otherwise, the map returned by rmContext.getInactiveRMNodes()
> would keep growing, which would cause a memory issue in the RM.
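For context, a simplified illustration of the check described above (not the
exact YARN source): a node only qualifies as untracked when an include file is
configured, which is why nothing is ever cleaned up without one.
{code:java}
import java.util.Set;

// Simplified illustration: without an include list, no host can ever be
// considered "untracked", so the removalTimer never removes anything.
public class UntrackedNodeCheckSketch {
  public static boolean isUntrackedNode(String host,
                                        String includeFilePath,
                                        Set<String> includedHosts,
                                        Set<String> excludedHosts) {
    boolean includeListConfigured =
        includeFilePath != null && !includeFilePath.isEmpty();
    return includeListConfigured
        && !includedHosts.contains(host)
        && !excludedHosts.contains(host);
  }
}
{code}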






[jira] [Commented] (YARN-9721) An easy method to exclude a nodemanager from the yarn cluster cleanly

2019-08-06 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900727#comment-16900727
 ] 

Zac Zhou commented on YARN-9721:


 

[~sunilg]

Thanks a lot for your comments~

Maybe one of the following methods could be used to clean up the inactive list
(see the sketch after this list):
 # Add a parameter like "--prune-nodes" to the "rmadmin -refreshNodes" command.
A flag named something like "prunable" can be added to RMNode. When "rmadmin
-refreshNodes --prune-nodes" is executed, the prunable flag of the RMNodes
should be set to true, and those RMNodes will then be deleted by the
removalTimer.
 # Add a time period parameter to the yarn configuration. If an RMNode stays in
the inactive list longer than that time period, delete it.
 # Add a parameter to the yarn configuration. If the parameter is true, delete
the RMNodes from the inactive list directly.
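For illustration only, here is a minimal self-contained sketch of method 2,
using plain Java collections rather than the real RM classes (the class name,
retention setting, and sweep interval below are assumptions):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of method 2: periodically drop entries that have stayed in the
// "inactive" map longer than a configured retention period.
public class InactiveNodePruner {
  // nodeId -> time (ms) when the node became inactive
  private final Map<String, Long> inactiveNodes = new ConcurrentHashMap<>();
  private final long retentionMs;   // hypothetical configuration value
  private final ScheduledExecutorService timer =
      Executors.newSingleThreadScheduledExecutor();

  public InactiveNodePruner(long retentionMs) {
    this.retentionMs = retentionMs;
    timer.scheduleWithFixedDelay(this::prune, 1, 1, TimeUnit.MINUTES);
  }

  public void nodeBecameInactive(String nodeId) {
    inactiveNodes.put(nodeId, System.currentTimeMillis());
  }

  private void prune() {
    long now = System.currentTimeMillis();
    inactiveNodes.entrySet()
        .removeIf(e -> now - e.getValue() >= retentionMs);
  }
}
{code}
With the retention period set to 0, the next sweep removes every inactive node
immediately, which is the behaviour described for method 3.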

[~sunilg], [~leftnoteasy], [~cheersyang], [~tangzhankun], any ideas~

 

> An easy method to exclude a nodemanager from the yarn cluster cleanly
> -
>
> Key: YARN-9721
> URL: https://issues.apache.org/jira/browse/YARN-9721
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Priority: Major
> Attachments: decommission nodes.png
>
>
> If we want to take a nodemanager server offline, the nodes.exclude-path file
> and the "rmadmin -refreshNodes" command are used to decommission the server.
> But this method cannot clean up the node completely: nodemanager servers are
> still listed under Decommissioned Nodes, as the attachment shows.
>   !decommission nodes.png!
> YARN-4311 enables a removalTimer to clean up untracked nodes.
> But the logic of the isUntrackedNode method is too restrictive: if include-path
> is not used, no servers can meet the criteria, and using an include file would
> pose a potential maintenance risk.
> If the yarn cluster is deployed on the cloud, nodemanager servers are created
> and deleted frequently. We need a way to exclude a nodemanager from the yarn
> cluster cleanly. Otherwise, the map returned by rmContext.getInactiveRMNodes()
> would keep growing, which would cause a memory issue in the RM.






[jira] [Updated] (YARN-9721) An easy method to exclude a nodemanager from the yarn cluster cleanly

2019-08-05 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9721:
---
Description: 
If we want to take a nodemanager server offline, the nodes.exclude-path file
and the "rmadmin -refreshNodes" command are used to decommission the server.
But this method cannot clean up the node completely: nodemanager servers are
still listed under Decommissioned Nodes, as the attachment shows.

  !decommission nodes.png!

YARN-4311 enables a removalTimer to clean up untracked nodes.
But the logic of the isUntrackedNode method is too restrictive: if include-path
is not used, no servers can meet the criteria, and using an include file would
pose a potential maintenance risk.

If the yarn cluster is deployed on the cloud, nodemanager servers are created
and deleted frequently. We need a way to exclude a nodemanager from the yarn
cluster cleanly. Otherwise, the map returned by rmContext.getInactiveRMNodes()
would keep growing, which would cause a memory issue in the RM.

  was:
If we want to take a nodemanager server offline, the nodes.exclude-path file
and the "rmadmin -refreshNodes" command are used to decommission the server.
But this method cannot clean up the node completely: nodemanager servers are
still listed under Decommissioned Nodes, as the attachment shows.

YARN-4311 enables a removalTimer to clean up untracked nodes.
But the logic of the isUntrackedNode method is too restrictive: if include-path
is not used, no servers can meet the criteria, and using an include file would
pose a potential maintenance risk.

If the yarn cluster is deployed on the cloud, nodemanager servers are created
and deleted frequently. We need a way to exclude a nodemanager from the yarn
cluster cleanly. Otherwise, the map returned by rmContext.getInactiveRMNodes()
would keep growing, which would cause a memory issue in the RM.


> An easy method to exclude a nodemanager from the yarn cluster cleanly
> -
>
> Key: YARN-9721
> URL: https://issues.apache.org/jira/browse/YARN-9721
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Priority: Major
> Attachments: decommission nodes.png
>
>
> If we want to take a nodemanager server offline, the nodes.exclude-path file
> and the "rmadmin -refreshNodes" command are used to decommission the server.
> But this method cannot clean up the node completely: nodemanager servers are
> still listed under Decommissioned Nodes, as the attachment shows.
>   !decommission nodes.png!
> YARN-4311 enables a removalTimer to clean up untracked nodes.
> But the logic of the isUntrackedNode method is too restrictive: if include-path
> is not used, no servers can meet the criteria, and using an include file would
> pose a potential maintenance risk.
> If the yarn cluster is deployed on the cloud, nodemanager servers are created
> and deleted frequently. We need a way to exclude a nodemanager from the yarn
> cluster cleanly. Otherwise, the map returned by rmContext.getInactiveRMNodes()
> would keep growing, which would cause a memory issue in the RM.






[jira] [Updated] (YARN-9721) An easy method to exclude a nodemanager from the yarn cluster cleanly

2019-08-05 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9721:
---
Description: 
If we want to take a nodemanager server offline, the nodes.exclude-path file
and the "rmadmin -refreshNodes" command are used to decommission the server.
But this method cannot clean up the node completely: nodemanager servers are
still listed under Decommissioned Nodes, as the attachment shows.

YARN-4311 enables a removalTimer to clean up untracked nodes.
But the logic of the isUntrackedNode method is too restrictive: if include-path
is not used, no servers can meet the criteria, and using an include file would
pose a potential maintenance risk.

If the yarn cluster is deployed on the cloud, nodemanager servers are created
and deleted frequently. We need a way to exclude a nodemanager from the yarn
cluster cleanly. Otherwise, the map returned by rmContext.getInactiveRMNodes()
would keep growing, which would cause a memory issue in the RM.

  was:
If we want to take a nodemanager server offline, the nodes.exclude-path file
and the "rmadmin -refreshNodes" command are used to decommission the server.
But this method cannot clean up the node completely: nodemanager servers are
still listed under Decommissioned Nodes, as the attachment shows.

[YARN-4311|https://issues.apache.org/jira/browse/YARN-4311] enables a
removalTimer to clean up untracked nodes.
But the logic of the isUntrackedNode method is too restrictive: if include-path
is not used, no servers can meet the criteria, and using an include file would
pose a potential maintenance risk.

If the yarn cluster is deployed on the cloud, nodemanager servers are created
and deleted frequently. We need a way to exclude a nodemanager from the yarn
cluster cleanly. Otherwise, the map returned by rmContext.getInactiveRMNodes()
would keep growing, which would cause a memory issue in the RM.


> An easy method to exclude a nodemanager from the yarn cluster cleanly
> -
>
> Key: YARN-9721
> URL: https://issues.apache.org/jira/browse/YARN-9721
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Priority: Major
> Attachments: decommission nodes.png
>
>
> If we want to take a nodemanager server offline, the nodes.exclude-path file
> and the "rmadmin -refreshNodes" command are used to decommission the server.
> But this method cannot clean up the node completely: nodemanager servers are
> still listed under Decommissioned Nodes, as the attachment shows.
>  
> YARN-4311 enables a removalTimer to clean up untracked nodes.
> But the logic of the isUntrackedNode method is too restrictive: if include-path
> is not used, no servers can meet the criteria, and using an include file would
> pose a potential maintenance risk.
> If the yarn cluster is deployed on the cloud, nodemanager servers are created
> and deleted frequently. We need a way to exclude a nodemanager from the yarn
> cluster cleanly. Otherwise, the map returned by rmContext.getInactiveRMNodes()
> would keep growing, which would cause a memory issue in the RM.






[jira] [Created] (YARN-9721) An easy method to exclude a nodemanager from the yarn cluster cleanly

2019-08-05 Thread Zac Zhou (JIRA)
Zac Zhou created YARN-9721:
--

 Summary: An easy method to exclude a nodemanager from the yarn 
cluster cleanly
 Key: YARN-9721
 URL: https://issues.apache.org/jira/browse/YARN-9721
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zac Zhou
 Attachments: decommission nodes.png

If we want to take a nodemanager server offline, the nodes.exclude-path file
and the "rmadmin -refreshNodes" command are used to decommission the server.
But this method cannot clean up the node completely: nodemanager servers are
still listed under Decommissioned Nodes, as the attachment shows.

[YARN-4311|https://issues.apache.org/jira/browse/YARN-4311] enables a
removalTimer to clean up untracked nodes.
But the logic of the isUntrackedNode method is too restrictive: if include-path
is not used, no servers can meet the criteria, and using an include file would
pose a potential maintenance risk.

If the yarn cluster is deployed on the cloud, nodemanager servers are created
and deleted frequently. We need a way to exclude a nodemanager from the yarn
cluster cleanly. Otherwise, the map returned by rmContext.getInactiveRMNodes()
would keep growing, which would cause a memory issue in the RM.






[jira] [Commented] (YARN-9447) RM Crashes with NPE at TimelineServiceV2Publisher.putEntity

2019-04-14 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817505#comment-16817505
 ] 

Zac Zhou commented on YARN-9447:


[~Prabhu Joseph],

Thanks for fixing it. I think this issue is the same as 
[YARN-6695|https://issues.apache.org/jira/browse/YARN-6695]

> RM Crashes with NPE at TimelineServiceV2Publisher.putEntity
> ---
>
> Key: YARN-9447
> URL: https://issues.apache.org/jira/browse/YARN-9447
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: ATSv2
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Critical
> Attachments: YARN-9447-001.patch, rm.log
>
>
> ResourceManager crashes with NullPointerException when 
> TimelineServiceV2Publisher does putEntity after the timeline collector 
> service for application is removed. This happened when killing a mapreduce 
> job.
> {code}
> 2019-04-05 14:53:24,728 INFO 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorManager:
>  The collector service for application_1553788280931_0013 was removed
> 2019-04-05 14:53:24,734 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:461)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:73)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:496)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:485)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> 2019-04-05 14:53:24,743 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Exiting, bbye..
> 2019-04-05 14:53:24,758 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler:
>  Container container_e30_1553788280931_0013_01_01 completed with event 
> FINISHED, but corresponding RMContainer doesn't exist.
> {code}






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-02-04 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759785#comment-16759785
 ] 

Zac Zhou commented on YARN-9161:


The newly failed test case seems unrelated to this patch. I ran the test case
locally and it succeeded.

[~sunilg], can you help review it when you are free?

Thanks a lot~

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch, YARN-9161.010.patch, YARN-9161.011.patch, 
> YARN-9161.012.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-02-03 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759404#comment-16759404
 ] 

Zac Zhou commented on YARN-9161:


Since [YARN-9262|https://issues.apache.org/jira/browse/YARN-9262] fixes the NPE
in TestRMAppAttemptTransitions, I updated the patch to resolve the conflicts.

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch, YARN-9161.010.patch, YARN-9161.011.patch, 
> YARN-9161.012.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-02-03 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.012.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch, YARN-9161.010.patch, YARN-9161.011.patch, 
> YARN-9161.012.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-02-02 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.011.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch, YARN-9161.010.patch, YARN-9161.011.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Comment Edited] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-02-02 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758973#comment-16758973
 ] 

Zac Zhou edited comment on YARN-9161 at 2/2/19 11:59 AM:
-

The failed test cases of TestRMAppAttemptTransitions seem related to
HADOOP-14178, and I'll fix them the same way. Something like:
{code:java}
 verify(scheduler, times(2)).allocate(any(ApplicationAttemptId.class),
-any(List.class), any(List.class), any(List.class), any(List.class), any(List.class),
+any(List.class), any(), any(List.class), any(), any(),
 any(ContainerUpdates.class));
{code}
For the timeout of distributedshell, it can be fixed by YARN-9231.


was (Author: yuan_zac):
The failed test cases of TestRMAppAttemptTransitions seem related to
HADOOP-14178, and I'll fix them the same way.

For the timeout of distributedshell, it can be fixed by YARN-9231.

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch, YARN-9161.010.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-02-02 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758973#comment-16758973
 ] 

Zac Zhou commented on YARN-9161:


The failed test cases of TestRMAppAttemptTransitions seem related to
[HADOOP-14178|https://issues.apache.org/jira/browse/HADOOP-14178], and I'll fix
them the same way.

For the timeout of distributedshell, it can be fixed by
[YARN-9231|https://issues.apache.org/jira/browse/YARN-9231].

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch, YARN-9161.010.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Comment Edited] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-02-02 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758973#comment-16758973
 ] 

Zac Zhou edited comment on YARN-9161 at 2/2/19 11:56 AM:
-

The failed test cases of TestRMAppAttemptTransitions seem related to
HADOOP-14178, and I'll fix them the same way.

For the timeout of distributedshell, it can be fixed by YARN-9231.


was (Author: yuan_zac):
The failed test cases of TestRMAppAttemptTransitions seem related to
[HADOOP-14178|https://issues.apache.org/jira/browse/HADOOP-14178], and I'll fix
them the same way.

For the timeout of distributedshell, it can be fixed by
[YARN-9231|https://issues.apache.org/jira/browse/YARN-9231].

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch, YARN-9161.010.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-02-01 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758113#comment-16758113
 ] 

Zac Zhou commented on YARN-9161:


Thanks, [~sunilg]

Yeah. In the latest patch, I just changed TestCapacitySchedulerMetrics to check
csMetrics.getNumOfCommitSuccess a few times if it has not been updated yet.

I tried to run the test locally, but it never failed with the mvn command.

So I made this modification to see whether it fixes the Jenkins issue.
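As a rough illustration of that kind of change (not the actual test code; the
getter reference and retry counts below are assumptions), the pattern is to
poll the metric a few times before asserting:
{code:java}
import java.util.function.LongSupplier;

// Illustrative poll-and-retry helper for an eventually-updated metric,
// e.g. the commit-success counter discussed above.
public final class MetricWait {
  public static boolean waitForAtLeast(LongSupplier metric,
                                       long expected,
                                       int attempts,
                                       long sleepMs) throws InterruptedException {
    for (int i = 0; i < attempts; i++) {
      if (metric.getAsLong() >= expected) {
        return true;           // metric caught up, safe to assert now
      }
      Thread.sleep(sleepMs);   // give the async commit thread time to publish
    }
    return metric.getAsLong() >= expected;
  }
}
{code}
A test could then call something like waitForAtLeast(csMetrics::getNumOfCommitSuccess,
expected, 10, 100) before asserting, assuming the metric only ever increases.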

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch, YARN-9161.010.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-31 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.010.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch, YARN-9161.010.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-31 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757911#comment-16757911
 ] 

Zac Zhou commented on YARN-9161:


Thank you [~sunilg] for triggering Jenkins.

Distributedshell is OK now, but TestCapacitySchedulerMetrics failed again. I'll
look into it shortly~

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-28 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16754599#comment-16754599
 ] 

Zac Zhou commented on YARN-9161:


YARN-9161.009.patch fixes some checkstyle and findbugs issues. I ran
TestCapacitySchedulerMetrics locally and it succeeded.

[~sunilg], [~leftnoteasy], [~tangzhankun], can you help review when you are
free~

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-28 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.009.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch, 
> YARN-9161.009.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9231) TestDistributedShell fix timeout

2019-01-27 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753696#comment-16753696
 ] 

Zac Zhou commented on YARN-9231:


[~Prabhu Joseph],

I ran into the same problem. Thanks a lot for fixing it.

LGTM, +1

But I'm not sure if 2500 seconds is OK.

[~sunilg], any comments~

> TestDistributedShell fix timeout
> 
>
> Key: YARN-9231
> URL: https://issues.apache.org/jira/browse/YARN-9231
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: distributed-shell
>Affects Versions: 3.1.0
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>Priority: Major
> Attachments: 0001-YARN-9231.patch
>
>
> TestDistributedShell test cases time out with - "There was a timeout or other 
> error in the fork"






[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-27 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.008.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch, YARN-9161.008.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-25 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.007.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch, YARN-9161.007.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-24 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751938#comment-16751938
 ] 

Zac Zhou commented on YARN-9161:


There are some code conflicts with
[YARN-9116|https://issues.apache.org/jira/browse/YARN-9116]. I'll submit a
patch to resolve them shortly.

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-24 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751934#comment-16751934
 ] 

Zac Zhou commented on YARN-9161:


[~sunilg], Thanks a lot for your comments~
{quote}I have some doubts here. Do we need this filtering?

If we use the correct resource names in resource-types.xml for gpu and fpga, the
below code (in CapacitySchedulerConfiguration) could pick them up.
{quote}
I think the filter is needed. The resourceTypes parameter of
updateResourceValuesFromConfig used to be built from AbsoluteResourceType; now
it comes from the following code in the updateConfigurableResourceRequirement
method of AbstractCSQueue:
{code:java}
Set<String> resources = Arrays.stream(clusterResource.getResources())
    .map(x -> x.getName()).collect(Collectors.toSet());
{code}
With the resources variable, we can get the resources declared in
resource-types.xml.

If an incorrect resource is specified in capacity-scheduler.xml,
CapacitySchedulerConfiguration.updateResourceValuesFromConfig would ignore it,
so the scheduler would not use the incorrect resource for container allocation.

Hope I make it clear~
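To make the effect concrete, here is a small standalone illustration (plain
Java, not the actual scheduler code; the resource names are examples only):
filtering the configured absolute resources against the cluster's declared
resource names keeps yarn.io/gpu, whereas filtering against a fixed
{memory, vcores} set would drop it.
{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Standalone illustration of the filtering discussed above.
public class AbsoluteResourceFilterDemo {
  public static void main(String[] args) {
    // Absolute capacities configured for a queue, e.g. [memory=10240,vcores=10,yarn.io/gpu=4]
    Map<String, Long> configured = new HashMap<>();
    configured.put("memory-mb", 10240L);
    configured.put("vcores", 10L);
    configured.put("yarn.io/gpu", 4L);
    configured.put("not-a-real-resource", 1L);   // typo in capacity-scheduler.xml

    // Old behaviour: only the members of the AbsoluteResourceType enum survive.
    Set<String> enumNames = new HashSet<>(Arrays.asList("memory-mb", "vcores"));

    // New behaviour: every resource declared on the cluster (resource-types.xml) survives.
    Set<String> clusterNames =
        new HashSet<>(Arrays.asList("memory-mb", "vcores", "yarn.io/gpu"));

    System.out.println("enum filter keeps:    " + keep(configured, enumNames));
    System.out.println("cluster filter keeps: " + keep(configured, clusterNames));
  }

  private static Map<String, Long> keep(Map<String, Long> configured, Set<String> allowed) {
    Map<String, Long> kept = new HashMap<>(configured);
    kept.keySet().retainAll(allowed);   // unknown or incorrect names are ignored
    return kept;
  }
}
{code}
Here yarn.io/gpu survives the cluster-name filter but not the enum filter, and
the misspelled entry is dropped by both, which matches the behaviour described
above.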

 

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-22 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749522#comment-16749522
 ] 

Zac Zhou commented on YARN-9161:


[~sunilg], Thanks a lot for your comments.

I checked with [~tangzhankun] offline; it is a different issue.

The root cause of YARN-9205 is that the configuration doesn't load
resource-types.xml.

Here are the key modifications in this patch (see the sketch after this list):
 # In AbstractCSQueue.updateConfigurableResourceRequirement, don't use
AbsoluteResourceType to filter out custom resources.
 # In
AbstractCSQueue.getCurrentLimitResource/ParentQueue.calculateEffectiveResourcesAndCapacity,
use Resources.componentwiseMin instead of Resources.min.
 # Add a method, ResourceUtils.parseResourcesString, to refactor the resource
string parsing logic.
 # Add unit tests for container allocation and absolute resource configuration.
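For illustration only, a rough standalone sketch of the kind of resource-string
parsing such a helper performs (the bracketed format shown here is an
assumption based on the absolute-resource syntax, not the actual ResourceUtils
code):
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch: parse an absolute-resource string such as
// "[memory=10240,vcores=10,yarn.io/gpu=4]" into name -> value pairs.
public final class ResourceStringSketch {
  public static Map<String, Long> parse(String spec) {
    Map<String, Long> values = new LinkedHashMap<>();
    String body = spec.trim();
    if (body.startsWith("[") && body.endsWith("]")) {
      body = body.substring(1, body.length() - 1);
    }
    for (String entry : body.split(",")) {
      String[] parts = entry.split("=", 2);
      if (parts.length == 2) {
        values.put(parts[0].trim(), Long.parseLong(parts[1].trim()));
      }
    }
    return values;
  }

  public static void main(String[] args) {
    System.out.println(parse("[memory=10240,vcores=10,yarn.io/gpu=4]"));
    // -> {memory=10240, vcores=10, yarn.io/gpu=4}
  }
}
{code}
The real helper in the patch may accept units and more formats; this only shows
the name=value splitting idea.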

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-6695) Race condition in RM for publishing container events vs appFinished events causes NPE

2019-01-22 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-6695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749497#comment-16749497
 ] 

Zac Zhou commented on YARN-6695:


I checked the logic of TimelineServiceV2Publisher. If
yarn.rm.system-metrics-publisher.emit-container-events is set to true, the RM
only publishes two events, containerCreated and containerFinished, to timeline
server v2. I think the workload is affordable.

And in the TimelineServiceV2 guide,
yarn.rm.system-metrics-publisher.emit-container-events is set to true as a
basic configuration to enable timeline v2.

This Jira is useful. I suggest committing it to trunk soon~

 

> Race condition in RM for publishing container events vs appFinished events 
> causes NPE 
> --
>
> Key: YARN-6695
> URL: https://issues.apache.org/jira/browse/YARN-6695
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Rohith Sharma K S
>Priority: Critical
> Attachments: YARN-6695.001.patch
>
>
> When the RM publishes container events, i.e. by enabling
> *yarn.rm.system-metrics-publisher.emit-container-events*, there is a race
> condition between processing those events and the appFinished event that
> removes the appId from the collector list, which causes an NPE.
> Look at the trace below, where the appId is removed from the collectors first
> and then the corresponding events are processed.
> {noformat}
> 2017-06-06 19:28:48,896 INFO  capacity.ParentQueue 
> (ParentQueue.java:removeApplication(472)) - Application removed - appId: 
> application_1496758895643_0005 user: root leaf-queue of parent: root 
> #applications: 0
> 2017-06-06 19:28:48,921 INFO  collector.TimelineCollectorManager 
> (TimelineCollectorManager.java:remove(190)) - The collector service for 
> application_1496758895643_0005 was removed
> 2017-06-06 19:28:48,922 ERROR metrics.TimelineServiceV2Publisher 
> (TimelineServiceV2Publisher.java:putEntity(451)) - Error when publishing 
> entity TimelineEntity[type='YARN_CONTAINER', 
> id='container_e01_1496758895643_0005_01_02']
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(TimelineServiceV2Publisher.java:448)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(TimelineServiceV2Publisher.java:72)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:480)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2EventHandler.handle(TimelineServiceV2Publisher.java:469)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:201)
>   at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:127)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-22 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16749498#comment-16749498
 ] 

Zac Zhou commented on YARN-9161:


[~sunilg], [~leftnoteasy], [~tangzhankun],

please feel free to add comments whenever you are available~

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-21 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16748468#comment-16748468
 ] 

Zac Zhou commented on YARN-9161:


I tested TestCapacitySchedulerSurgicalPreemption locally, and it doesn't fail.

I used the maven command to run all the test cases of the sub-module
hadoop-yarn-applications-distributedshell without the patch, and it failed with
the same timeout error as in the test report.

I also ran TestDistributedShell directly, and it succeeded on my server.

So the failed tests seem unrelated to the patch.

[~sunilg], could you help review the patch when you are free?

Thanks a lot~

 

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-17 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745779#comment-16745779
 ] 

Zac Zhou commented on YARN-9161:


Updated the patch to fix the failed UT cases.

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-17 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.006.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch, 
> YARN-9161.006.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two
> elements, memory and vcores, which filters out the absolute resource
> configuration of gpu and fpga in
> AbstractCSQueue.updateConfigurableResourceRequirement.
> This issue means gpu and fpga cannot be allocated correctly.






[jira] [Commented] (YARN-9205) When using custom resource type, application will fail to run due to the CapacityScheduler throws InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION)

2019-01-17 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16745759#comment-16745759
 ] 

Zac Zhou commented on YARN-9205:


[~tangzhankun]
{quote}private static Map<String, ResourceInformation> getResourceInformationMapFromConfig(
...
// NULL value here!
String[] resourceNames = conf.getStrings(YarnConfiguration.RESOURCE_TYPES);{quote}
[YARN-9161|https://issues.apache.org/jira/browse/YARN-9161] fixes this issue as
well. Hope the two Jiras will not conflict~
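For context, a minimal illustration of the failure mode quoted above (assuming
only that the key is absent when resource-types.xml has not been loaded; this
is not the actual fix):
{code:java}
import org.apache.hadoop.conf.Configuration;

// When resource-types.xml has not been loaded into the Configuration,
// getStrings(...) returns null and only the mandatory resources
// (memory, vcores) can be derived downstream.
public class ResourceTypesLookupDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);   // nothing loaded
    String[] resourceNames = conf.getStrings("yarn.resource-types");
    if (resourceNames == null) {
      System.out.println("yarn.resource-types is unset -> custom types are invisible");
    }
  }
}
{code}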

 

 

> When using custom resource type, application will fail to run due to the 
> CapacityScheduler throws 
> InvalidResourceRequestException(GREATER_THEN_MAX_ALLOCATION) 
> ---
>
> Key: YARN-9205
> URL: https://issues.apache.org/jira/browse/YARN-9205
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Zhankun Tang
>Assignee: Zhankun Tang
>Priority: Critical
> Attachments: YARN-9205-trunk.001.patch
>
>
> In a non-secure cluster. Reproduce it as follows:
>  # Set capacity scheduler in yarn-site.xml
>  # Use default capacity-scheduler.xml
>  # Set custom resource type "cmp.com/hdw" in resource-types.xml
>  # Set a value say 10 in node-resources.xml
>  # Start cluster
>  # Submit a distribute shell application which requests some "cmp.com/hdw"
> The AM will get an exception from CapacityScheduler and then fail. This bug
> doesn't exist in FairScheduler.
> {code:java}
> 2019-01-17 22:12:11,286 INFO distributedshell.ApplicationMaster: Requested 
> container ask: Capability[ 2>]Priority[0]AllocationRequestId[0]ExecutionTypeRequest[{Execution Type: 
> GUARANTEED, Enforce Execution Type: false}]Resource Profile[]
> 2019-01-17 22:12:12,326 ERROR impl.AMRMClientAsyncImpl: Exception on heartbeat
> org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid 
> resource request! Cannot allocate containers as requested resource is greater 
> than maximum allowed allocation. Requested resource type=[cmp.com/hdw], 
> Requested resource=, maximum allowed 
> allocation=, please note that maximum allowed 
> allocation is calculated by scheduler based on maximum resource of registered 
> NodeManagers, which might be less than configured maximum 
> allocation=
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.throwInvalidResourceException(SchedulerUtils.java:492)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.checkResourceRequestAgainstAvailableResource(SchedulerUtils.java:388)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:315)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:293)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:301)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.normalizeAndValidateRequests(RMServerUtils.java:250)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.DefaultAMSProcessor.allocate(DefaultAMSProcessor.java:240)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.constraint.processor.DisabledPlacementProcessor.allocate(DisabledPlacementProcessor.java:75)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.AMSProcessingChain.allocate(AMSProcessingChain.java:92)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:424)
> at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationMasterProtocolPBServiceImpl.allocate(ApplicationMasterProtocolPBServiceImpl.java:60)
> ...{code}
> After some rough debugging, the method below returns the wrong maximum capacity.
> DefaultAMSProcessor.java, Line 234.
> {code:java}
> Resource maximumCapacity =
>  getScheduler().getMaximumResourceCapability(app.getQueue());{code}
> The above code seems should return "" 
> but returns "".
> This might be caused by queue maximum allocation calculation:
> AbstractCSQueue.java Line364
> {code:java}
> this.maximumAllocation =
>  configuration.getMaximumAllocationPerQueue(
>  getQueuePath());{code}
> And this invokes CapacitySchedulerConfiguration.java Line 895:
> {code:java}
> Resource clusterMax = ResourceUtils.fetchMaximumAllocationFromConfig(this);
> {code}
> Passing "this", which is not a YarnConfiguration instance, causes the code below 
> to return null for resource names, so only the mandatory resources remain. 
> This might be the root cause.
> {code:java}
> private static Map 
> getResourceInformationMapFromConfig(
> ...
> // NULL value here!
> String[] resourceNames = 

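(For illustration only: a minimal sketch of the failure mode described in the truncated report above. If the Configuration object handed to the resource lookup was never populated with the custom resource types, the lookup falls back to the mandatory resources. The property name and fallback list below are assumptions, not the exact Hadoop code.)

{code:java}
import org.apache.hadoop.conf.Configuration;

import java.util.Arrays;
import java.util.List;

public class ResourceNamesSketch {
  // Hypothetical helper: mirrors the idea that a plain Configuration, unlike a
  // fully initialized YarnConfiguration, may not know about custom resource
  // types such as "cmp.com/hdw".
  static List<String> resourceNames(Configuration conf) {
    String[] configured = conf.getStrings("yarn.resource-types"); // assumed property name
    if (configured == null) {
      // Only the mandatory resources survive, so the computed maximum
      // allocation silently drops the custom resource type.
      return Arrays.asList("memory-mb", "vcores");
    }
    return Arrays.asList(configured);
  }

  public static void main(String[] args) {
    Configuration plainConf = new Configuration();          // resource-types.xml not loaded
    System.out.println(resourceNames(plainConf));            // [memory-mb, vcores]

    Configuration withTypes = new Configuration();
    withTypes.set("yarn.resource-types", "cmp.com/hdw");     // simulate resource-types.xml
    System.out.println(resourceNames(withTypes));            // [cmp.com/hdw]
  }
}
{code}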
[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-16 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744183#comment-16744183
 ] 

Zac Zhou commented on YARN-9161:


Add UT for container allocation and fix some code style issues

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-16 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.005.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch, YARN-9161.005.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-16 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.004.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch, YARN-9161.004.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2019-01-10 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739331#comment-16739331
 ] 

Zac Zhou commented on YARN-8489:


[~leftnoteasy], [~suma.shivaprasad]

When you get spare time, can you help to review the patch?

Thanks a lot~

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8489.001.patch, YARN-8489.002.patch, 
> YARN-8489.003.patch, YARN-8489.004.patch, YARN-8489.005.patch
>
>
> The existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. In short, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-09 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738921#comment-16738921
 ] 

Zac Zhou edited comment on YARN-9161 at 1/10/19 2:36 AM:
-

The test case is broken by HDFS-14084 and resolved by YARN-9183


was (Author: yuan_zac):
The test case is broken by 
[YARN-9183|https://issues.apache.org/jira/browse/YARN-9183]

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-09 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738921#comment-16738921
 ] 

Zac Zhou commented on YARN-9161:


The test case is broken by 
[YARN-9183|https://issues.apache.org/jira/browse/YARN-9183]

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-09 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738327#comment-16738327
 ] 

Zac Zhou commented on YARN-9161:


[~sunilg],

Thanks for your comments~

From my understanding, the failed unit tests are not related to this patch.

 

 

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-08 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.003.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch, 
> YARN-9161.003.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9185) TimelineServiceV2Publisher throws NPE when app is finished before container metrics updated

2019-01-08 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737830#comment-16737830
 ] 

Zac Zhou commented on YARN-9185:


[~eyang]

Thanks a lot for fixing it~

> TimelineServiceV2Publisher throws NPE when app is finished before container 
> metrics updated
> ---
>
> Key: YARN-9185
> URL: https://issues.apache.org/jira/browse/YARN-9185
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> When the dominant component feature is enabled, the ResourceManager throws an NPE 
> after the app is finished.
> The stack is as follows:
> 2019-01-08 19:54:48,788 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(Timel
> ineServiceV2Publisher.java:459)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(Time
> lineServiceV2Publisher.java:73)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2Event
> Handler.handle(TimelineServiceV2Publisher.java:494)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2Event
> Handler.handle(TimelineServiceV2Publisher.java:483)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2019-01-08 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737791#comment-16737791
 ] 

Zac Zhou commented on YARN-8489:


[~suma.shivaprasad]

Thanks a lot for your comments~

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8489.001.patch, YARN-8489.002.patch, 
> YARN-8489.003.patch, YARN-8489.004.patch, YARN-8489.005.patch
>
>
> The existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. In short, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8489) Need to support "dominant" component concept inside YARN service

2019-01-08 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8489:
---
Attachment: YARN-8489.005.patch

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8489.001.patch, YARN-8489.002.patch, 
> YARN-8489.003.patch, YARN-8489.004.patch, YARN-8489.005.patch
>
>
> The existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. In short, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8489) Need to support "dominant" component concept inside YARN service

2019-01-08 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8489:
---
Attachment: YARN-8489.004.patch

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8489.001.patch, YARN-8489.002.patch, 
> YARN-8489.003.patch, YARN-8489.004.patch
>
>
> The existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. In short, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2019-01-08 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737084#comment-16737084
 ] 

Zac Zhou commented on YARN-8489:


[~leftnoteasy], [~suma.shivaprasad]

Thanks a lot for your comments.
{quote}3) Changes of TimelineServiceV2Publisher, is it a specific issue related 
to this change? If it is a corner case we need to take care, I suggest to file 
a separate JIRA and add unit test.
{quote}
Yes, I think it's somewhat related to this patch. It would cause an NPE 
in the ResourceManager. The root cause seems to be that the application has 
finished, but its container metrics still need to be updated through its 
TimelineServiceV2Publisher.
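
(For illustration only: a minimal sketch of the defensive pattern implied here, not the actual TimelineServiceV2Publisher code. All names are assumptions.)

{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch: skip publishing container metrics when the application has already
// finished and its app-level context is gone, instead of dereferencing null
// and throwing an NPE in the dispatcher thread.
public class PublisherGuardSketch {
  private final Map<String, Object> activeAppContexts = new HashMap<>();

  void publishContainerMetrics(String appId, String entity) {
    Object appContext = activeAppContexts.get(appId);
    if (appContext == null) {
      // App finished before this container event arrived; drop it quietly.
      System.out.println("Skipping metrics for finished app " + appId);
      return;
    }
    System.out.println("Publishing " + entity + " for " + appId);
  }

  public static void main(String[] args) {
    PublisherGuardSketch sketch = new PublisherGuardSketch();
    sketch.publishContainerMetrics("app_1", "container_1_metrics"); // unknown app -> skipped
  }
}
{code}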

I'll create a separate  Jira to track it.

 

 

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8489.001.patch, YARN-8489.002.patch, 
> YARN-8489.003.patch
>
>
> The existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and NEVER 
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. In short, it means a dominant component whose final state will 
> determine the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a 
> final state, no matter whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9185) TimelineServiceV2Publisher throws NPE when app is finished before container metrics updated

2019-01-08 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737091#comment-16737091
 ] 

Zac Zhou commented on YARN-9185:


When the following configuration is specified:

<property>
  <name>yarn.timeline-service.generic-application-history.save-non-am-container-meta-info</name>
  <value>false</value>
</property>

The NPE doesn't occur.
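
(For illustration only: a minimal sketch of how such a flag could be consulted before publishing per-container metadata. The property name is taken from the comment above; the guard and the default value shown are assumptions, not the actual publisher code.)

{code:java}
import org.apache.hadoop.conf.Configuration;

public class SaveContainerMetaSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean(
        "yarn.timeline-service.generic-application-history.save-non-am-container-meta-info",
        false);

    boolean saveNonAmContainerMeta = conf.getBoolean(
        "yarn.timeline-service.generic-application-history.save-non-am-container-meta-info",
        true); // assumed default

    if (!saveNonAmContainerMeta) {
      // Non-AM container events are not published at all, so the NPE path is never hit.
      System.out.println("Skipping non-AM container meta info");
    }
  }
}
{code}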

> TimelineServiceV2Publisher throws NPE when app is finished before container 
> metrics updated
> ---
>
> Key: YARN-9185
> URL: https://issues.apache.org/jira/browse/YARN-9185
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> When the dominant component feature is enabled, the ResourceManager throws an NPE 
> after the app is finished.
> The stack is as follows:
> 2019-01-08 19:54:48,788 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(Timel
> ineServiceV2Publisher.java:459)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(Time
> lineServiceV2Publisher.java:73)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2Event
> Handler.handle(TimelineServiceV2Publisher.java:494)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2Event
> Handler.handle(TimelineServiceV2Publisher.java:483)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9185) TimelineServiceV2Publisher throws NPE when app is finished before container metrics updated

2019-01-08 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9185:
---
Issue Type: Sub-task  (was: Bug)
Parent: YARN-8489

> TimelineServiceV2Publisher throws NPE when app is finished before container 
> metrics updated
> ---
>
> Key: YARN-9185
> URL: https://issues.apache.org/jira/browse/YARN-9185
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> When the dominant component feature is enabled, the ResourceManager throws an NPE 
> after the app is finished.
> The stack is as follows:
> 2019-01-08 19:54:48,788 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
> Error in dispatcher thread
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(Timel
> ineServiceV2Publisher.java:459)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(Time
> lineServiceV2Publisher.java:73)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2Event
> Handler.handle(TimelineServiceV2Publisher.java:494)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2Event
> Handler.handle(TimelineServiceV2Publisher.java:483)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>  at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9185) TimelineServiceV2Publisher throws NPE when app is finished before container metrics updated

2019-01-08 Thread Zac Zhou (JIRA)
Zac Zhou created YARN-9185:
--

 Summary: TimelineServiceV2Publisher throws NPE when app is 
finished before container metrics updated
 Key: YARN-9185
 URL: https://issues.apache.org/jira/browse/YARN-9185
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zac Zhou
Assignee: Zac Zhou


When the dominant component feature is enabled, the ResourceManager throws an NPE 
after the app is finished.

The stack is as follows:

2019-01-08 19:54:48,788 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: 
Error in dispatcher thread
java.lang.NullPointerException
 at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.putEntity(Timel
ineServiceV2Publisher.java:459)
 at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher.access$100(Time
lineServiceV2Publisher.java:73)
 at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2Event
Handler.handle(TimelineServiceV2Publisher.java:494)
 at 
org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV2Publisher$TimelineV2Event
Handler.handle(TimelineServiceV2Publisher.java:483)
 at 
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
 at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
 at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-07 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.002.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch, YARN-9161.002.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9155) Can't re-run a submarine job, if the previous job with the same service name has finished

2019-01-07 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735602#comment-16735602
 ] 

Zac Zhou commented on YARN-9155:


[~leftnoteasy], Thanks a lot for your comments. It makes sense to me. I'll work 
on it ~

> Can't re-run a submarine job, if the previous job with the same service name 
> has finished
> -
>
> Key: YARN-9155
> URL: https://issues.apache.org/jira/browse/YARN-9155
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> YARN native service doesn't clean up its HDFS service path when it 
> finishes.
> So if we don't execute the "yarn app -destroy " command before the next run of a 
> submarine job, we would get the following exception:
> 2018-12-24 11:38:02,493 ERROR 
> org.apache.hadoop.yarn.service.utils.CoreFileSystem: Dir 
> /user/hadoop//services/distributed-tf-gpu-ml4/${service_name}.json 
> exists: hdfs://mldev/user/hadoop/**
> /services/distributed-tf-gpu-ml4/${service_name}.json 8472
> 2018-12-24 11:38:02,494 ERROR 
> org.apache.hadoop.yarn.service.webapp.ApiServer: Failed to create service 
> ${service_name}: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.createService(ApiServer.java:131)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOu
> tInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJav
> aMethodDispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:8
> 4)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 19)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 09)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:1
> 79)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>  at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> 

[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-02 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732613#comment-16732613
 ] 

Zac Zhou commented on YARN-9161:


Thanks a lot, [~leftnoteasy], [~sunilg]
{code:java}
// W/o unit for memory means bits, and 20 bits will be rounded to 0
res = CliUtils.createResourceFromString("memory=20,vcores=3",
ResourceUtils.getResourcesTypeInfo());
Assert.assertEquals(Resources.createResource(0, 3), res);
{code}
This test case is in the submarine sub-project. The immediate cause of its failure is 
that I replaced the submarine client-side resource parsing logic with a new common 
method, ResourceUtils.getStandardYarnResource().

As this test case is only related to SUBMARINE, do you think it is ok to modify 
it to ensure backward compatibility, so that the memory API for SUBMARINE 
and YARN is the same as in 
[YARN-7242|https://issues.apache.org/jira/browse/YARN-7242]?

 

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2019-01-02 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732006#comment-16732006
 ] 

Zac Zhou commented on YARN-9161:


I checked the failed test case TestRunJobCliParsing and found that it's related 
to the memory resource API.

In the test case:
{code:java}
// W/o unit for memory means bits, and 20 bits will be rounded to 0
res = CliUtils.createResourceFromString("memory=20,vcores=3",
ResourceUtils.getResourcesTypeInfo());
Assert.assertEquals(Resources.createResource(0, 3), res);
{code}
memory=20 should return 0 as it doesn't have a unit.

But YARN-5881 and YARN-7242 support the API "memory=20", assuming MB is used by 
default. 
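
(For illustration, a minimal sketch of the "bare number defaults to MiB" behaviour discussed here; the parsing below is simplified and is not the actual CliUtils/UnitsConversionUtil code.)

{code:java}
public class MemoryArgSketch {
  // Parse "memory=20" style values; a bare number is treated as MiB, matching
  // the defaulting behaviour described above (simplified, illustrative only).
  static long toMiB(String value) {
    if (value.endsWith("Gi")) {
      return Long.parseLong(value.substring(0, value.length() - 2)) * 1024L;
    }
    if (value.endsWith("Mi")) {
      return Long.parseLong(value.substring(0, value.length() - 2));
    }
    return Long.parseLong(value); // no unit: assume MiB rather than rounding to 0
  }

  public static void main(String[] args) {
    System.out.println(toMiB("20"));   // 20 (MiB assumed)
    System.out.println(toMiB("2Gi"));  // 2048
  }
}
{code}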

[~leftnoteasy], could you help give some advice about how the memory 
resource API should be used?

Thanks a lot ~

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2018-12-31 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: (was: YARN-9161.001.patch)

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2018-12-31 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9161:
---
Attachment: YARN-9161.001.patch

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9161.001.patch
>
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2018-12-28 Thread Zac Zhou (JIRA)
Zac Zhou created YARN-9161:
--

 Summary: Absolute resources of capacity scheduler doesn't support 
GPU and FPGA
 Key: YARN-9161
 URL: https://issues.apache.org/jira/browse/YARN-9161
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zac Zhou
Assignee: Zac Zhou


The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
elements, memory and vcores, which filters out the absolute resource 
configuration of GPU and FPGA in 
AbstractCSQueue.updateConfigurableResourceRequirement. 

This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9161) Absolute resources of capacity scheduler doesn't support GPU and FPGA

2018-12-28 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730541#comment-16730541
 ] 

Zac Zhou commented on YARN-9161:


A patch will be attached shortly.

> Absolute resources of capacity scheduler doesn't support GPU and FPGA
> -
>
> Key: YARN-9161
> URL: https://issues.apache.org/jira/browse/YARN-9161
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> The enum CapacitySchedulerConfiguration.AbsoluteResourceType only has two 
> elements, memory and vcores, which filters out the absolute resource 
> configuration of GPU and FPGA in 
> AbstractCSQueue.updateConfigurableResourceRequirement. 
> This issue prevents GPU and FPGA resources from being allocated correctly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9003) Support multi-homed network for docker container

2018-12-26 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728938#comment-16728938
 ] 

Zac Zhou commented on YARN-9003:


I'm not sure whether --net can specify multiple networks. 

I tried the bridge and calico networks; it looks like only one network takes effect.

What I got is as follows:
h6. bridge only

 
{code:java}
 docker run -it --net=bridge --name c1 --rm busybox 
/ # ifconfig
eth0 Link encap:Ethernet HWaddr 02:42:AC:12:00:02 
inet addr:172.18.0.2 Bcast:0.0.0.0 Mask:255.255.0.0
inet6 addr: fe80::42:acff:fe12:2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0 
RX bytes:418 (418.0 B) TX bytes:508 (508.0 B)
lo Link encap:Local Loopback 
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1 
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
{code}
 
h6. calico only

 
{code:java}
docker run -it --net=calico-network --name c1 --rm busybox 
/ # ifconfig
cali0 Link encap:Ethernet HWaddr EE:EE:EE:EE:EE:EE 
inet addr:192.20.24.39 Bcast:0.0.0.0 Mask:255.255.255.255
inet6 addr: fe80::ecee:eeff:feee:/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:5 errors:0 dropped:0 overruns:0 frame:0
TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0 
RX bytes:418 (418.0 B) TX bytes:258 (258.0 B)
lo Link encap:Local Loopback 
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1 
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
{code}
 
h6. Both calico and bridge

 
{code:java}
docker run -it --net=bridge --net=calico-network --name c1 --rm busybox 
/ # ifconfig
cali0 Link encap:Ethernet HWaddr EE:EE:EE:EE:EE:EE 
 inet addr:192.20.24.41 Bcast:0.0.0.0 Mask:255.255.255.255
 inet6 addr: fe80::ecee:eeff:feee:/64 Scope:Link
 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
 RX packets:6 errors:0 dropped:0 overruns:0 frame:0
 TX packets:5 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:0 
 RX bytes:508 (508.0 B) TX bytes:418 (418.0 B)
lo Link encap:Local Loopback 
 inet addr:127.0.0.1 Mask:255.0.0.0
 inet6 addr: ::1/128 Scope:Host
 UP LOOPBACK RUNNING MTU:65536 Metric:1
 RX packets:0 errors:0 dropped:0 overruns:0 frame:0
 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
 collisions:0 txqueuelen:1 
 RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
{code}
 

Should we use "docker network connect" instead?

 

> Support multi-homed network for docker container
> 
>
> Key: YARN-9003
> URL: https://issues.apache.org/jira/browse/YARN-9003
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: docker
> Attachments: YARN-9003.001.patch, YARN-9003.002.patch
>
>
> A Docker network can be defined via the configuration property docker.network to 
> set up a docker container to connect to a specific network in a YARN service.  
> Docker can run a multi-homed network by specifying --net=bridge 
> --net=private-net.  This is useful to expose a small number of front-end 
> containers and ports, while the rest of the infrastructure runs in a private 
> network.  This task is to add support for specifying multiple docker networks 
> in YARN service and docker support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9155) Can't re-run a submarine job, if the previous job with the same service name has finished

2018-12-24 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728293#comment-16728293
 ] 

Zac Zhou commented on YARN-9155:


Or should we add an after-service listener interface to YARN native service, 
so that users can specify what they want to do after the YARN service is finished? 
If we go this way, YARN-8725 can be resolved easily as well.
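
(For illustration only: a hypothetical shape for such a listener interface. The names below are assumptions, not an existing YARN API.)

{code:java}
// Hypothetical listener: invoked by the service framework after a YARN service
// reaches a terminal state, so callers can plug in their own cleanup
// (e.g. removing the HDFS service dir or ZooKeeper nodes). Names are illustrative.
public interface ServiceFinishedListener {

  enum FinalState { SUCCEEDED, FAILED, KILLED }

  /**
   * Called once after the service has stopped.
   *
   * @param serviceName the YARN service name
   * @param finalState  the terminal state the service ended in
   */
  void onServiceFinished(String serviceName, FinalState finalState);
}
{code}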

> Can't re-run a submarine job, if the previous job with the same service name 
> has finished
> -
>
> Key: YARN-9155
> URL: https://issues.apache.org/jira/browse/YARN-9155
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> YARN native service doesn't clean up its HDFS service path when it 
> finishes.
> So if we don't execute the "yarn app -destroy " command before the next run of a 
> submarine job, we would get the following exception:
> 2018-12-24 11:38:02,493 ERROR 
> org.apache.hadoop.yarn.service.utils.CoreFileSystem: Dir 
> /user/hadoop//services/distributed-tf-gpu-ml4/${service_name}.json 
> exists: hdfs://mldev/user/hadoop/**
> /services/distributed-tf-gpu-ml4/${service_name}.json 8472
> 2018-12-24 11:38:02,494 ERROR 
> org.apache.hadoop.yarn.service.webapp.ApiServer: Failed to create service 
> ${service_name}: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.createService(ApiServer.java:131)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOu
> tInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJav
> aMethodDispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:8
> 4)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 19)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 09)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:1
> 79)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>  at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>  at 
> 

[jira] [Updated] (YARN-9155) Can't re-run a submarine job, if a previous job with the same service name has finished

2018-12-24 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9155:
---
Summary: Can't re-run a submarine job, if a previous job with the same 
service name has finished  (was: Can't submit a submarine job, if a previous 
job with the same service name has finished)

> Can't re-run a submarine job, if a previous job with the same service name 
> has finished
> ---
>
> Key: YARN-9155
> URL: https://issues.apache.org/jira/browse/YARN-9155
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> YARN native service doesn't clean up its HDFS service path when it 
> finishes.
> So if we don't execute the "yarn app -destroy " command before the next run of a 
> submarine job, we would get the following exception:
> 2018-12-24 11:38:02,493 ERROR 
> org.apache.hadoop.yarn.service.utils.CoreFileSystem: Dir 
> /user/hadoop//services/distributed-tf-gpu-ml4/${service_name}.json 
> exists: hdfs://mldev/user/hadoop/**
> /services/distributed-tf-gpu-ml4/${service_name}.json 8472
> 2018-12-24 11:38:02,494 ERROR 
> org.apache.hadoop.yarn.service.webapp.ApiServer: Failed to create service 
> ${service_name}: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.createService(ApiServer.java:131)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOu
> tInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJav
> aMethodDispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:8
> 4)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 19)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 09)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:1
> 79)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>  at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>  at 
> 

[jira] [Updated] (YARN-9155) Can't re-run a submarine job, if the previous job with the same service name has finished

2018-12-24 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9155:
---
Summary: Can't re-run a submarine job, if the previous job with the same 
service name has finished  (was: Can't re-run a submarine job, if a previous 
job with the same service name has finished)

> Can't re-run a submarine job, if the previous job with the same service name 
> has finished
> -
>
> Key: YARN-9155
> URL: https://issues.apache.org/jira/browse/YARN-9155
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> YARN native service doesn't clean up its HDFS service path when it 
> finishes.
> So if we don't execute the "yarn app -destroy " command before the next run of a 
> submarine job, we would get the following exception:
> 2018-12-24 11:38:02,493 ERROR 
> org.apache.hadoop.yarn.service.utils.CoreFileSystem: Dir 
> /user/hadoop//services/distributed-tf-gpu-ml4/${service_name}.json 
> exists: hdfs://mldev/user/hadoop/**
> /services/distributed-tf-gpu-ml4/${service_name}.json 8472
> 2018-12-24 11:38:02,494 ERROR 
> org.apache.hadoop.yarn.service.webapp.ApiServer: Failed to create service 
> ${service_name}: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.createService(ApiServer.java:131)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOu
> tInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJav
> aMethodDispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:8
> 4)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 19)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 09)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:1
> 79)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>  at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>  at 
> 

[jira] [Commented] (YARN-9155) Can't submit a submarine job, if a previous job with the same service name has finished

2018-12-24 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728276#comment-16728276
 ] 

Zac Zhou commented on YARN-9155:


Could we add a parameter to the Service API to indicate that the service should 
clean up its HDFS path and ZooKeeper node when the job is finished? The cleanup 
logic could be added in ServiceUtils.ProcessTerminationHandler.
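
For illustration only, here is a minimal sketch of what the HDFS side of such 
an opt-in cleanup could look like. The cleanupOnTermination flag, the class 
name and the helper method are assumptions made for this discussion, not part 
of the existing Service API:

{code:java}
// Hypothetical sketch of an opt-in cleanup hook for a finished service.
// The cleanupOnTermination flag and this helper are assumptions for
// illustration; they are not part of the current Service API.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ServiceCleanupSketch {

  // Best-effort removal of the persisted service directory
  // (e.g. /user/<user>/services/<service-name>) once the service reaches a
  // terminal state and the user opted in to cleanup.
  public static void maybeCleanupServiceDir(Configuration conf,
      Path serviceDir, boolean cleanupOnTermination) {
    if (!cleanupOnTermination) {
      return;
    }
    try {
      FileSystem fs = serviceDir.getFileSystem(conf);
      if (fs.exists(serviceDir)) {
        // Recursive delete, so a job with the same service name can be
        // submitted again without running "yarn app -destroy" first.
        fs.delete(serviceDir, true);
      }
    } catch (IOException e) {
      // Cleanup is best-effort; do not fail the job because of it.
      System.err.println("Failed to clean up " + serviceDir + ": " + e);
    }
  }
}
{code}

A similar best-effort step would be needed for the ZooKeeper registry node.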

[~leftnoteasy], [~tangzhankun], [~liuxun323] it would be nice if you could give 
some comments/advice.

Thanks

> Can't submit a submarine job, if a previous job with the same service name 
> has finished
> ---
>
> Key: YARN-9155
> URL: https://issues.apache.org/jira/browse/YARN-9155
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> Yarn native service doesn't clean up its HDFS service path when it is 
> finished.
> So if we don't execute "yarn app -destroy " command before the next run of a 
> submarine job. we would get the following exception:
> 2018-12-24 11:38:02,493 ERROR 
> org.apache.hadoop.yarn.service.utils.CoreFileSystem: Dir 
> /user/hadoop//services/distributed-tf-gpu-ml4/${service_name}.json 
> exists: hdfs://mldev/user/hadoop/**
> /services/distributed-tf-gpu-ml4/${service_name}.json 8472
> 2018-12-24 11:38:02,494 ERROR 
> org.apache.hadoop.yarn.service.webapp.ApiServer: Failed to create service 
> ${service_name}: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.createService(ApiServer.java:131)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOu
> tInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJav
> aMethodDispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:8
> 4)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
> 473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 19)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
> 09)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:1
> 79)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at 

[jira] [Created] (YARN-9155) Can't submit a submarine job, if a previous job with the same service name has finished

2018-12-24 Thread Zac Zhou (JIRA)
Zac Zhou created YARN-9155:
--

 Summary: Can't submit a submarine job, if a previous job with the 
same service name has finished
 Key: YARN-9155
 URL: https://issues.apache.org/jira/browse/YARN-9155
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zac Zhou
Assignee: Zac Zhou


Yarn native service doesn't clean up its HDFS service path when it is finished.

So if we don't execute "yarn app -destroy " command before the next run of a 
submarine job. we would get the following exception:

2018-12-24 11:38:02,493 ERROR 
org.apache.hadoop.yarn.service.utils.CoreFileSystem: Dir 
/user/hadoop//services/distributed-tf-gpu-ml4/${service_name}.json exists: 
hdfs://mldev/user/hadoop/**
/services/distributed-tf-gpu-ml4/${service_name}.json 8472

2018-12-24 11:38:02,494 ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: 
Failed to create service ${service_name}: {}
java.lang.reflect.UndeclaredThrowableException
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
 at 
org.apache.hadoop.yarn.service.webapp.ApiServer.createService(ApiServer.java:131)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
 at 
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOu
tInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
 at 
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJav
aMethodDispatcher.java:75)
 at 
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
 at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
 at 
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
 at 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
 at 
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:8
4)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
542)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1
473)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
19)
 at 
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:14
09)
 at 
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
 at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
 at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:1
79)
 at 
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
 at 
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
 at 
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
 at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
 at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
 at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
 at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
 at 
org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
 at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilte
r.java:644)
 at 
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilte
r.java:592)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
 at 

[jira] [Commented] (YARN-9144) WebAppProxyServlet can't redirect to ATS V1.5 when a yarn native service app is finished

2018-12-18 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724723#comment-16724723
 ] 

Zac Zhou commented on YARN-9144:


Fix code style.

> WebAppProxyServlet can't redirect to ATS V1.5 when a yarn native service app 
> is finished
> 
>
> Key: YARN-9144
> URL: https://issues.apache.org/jira/browse/YARN-9144
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9144.001.patch, YARN-9144.002.patch
>
>
> When a yarn native service app is finished, RM web UI v1 should redirect the 
> tracking URL to ATS V1.5 if it is enabled, so that users can check the app 
> logs like MR jobs. But the tracking URL points to the RM app page.
> The root cause is that WebAppProxyServlet may get the app report from RM, as 
> RM caches a small amount of app status. Then WebAppProxyServlet redirects to 
> RM, not ATS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9144) WebAppProxyServlet can't redirect to ATS V1.5 when a yarn native service app is finished

2018-12-18 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9144:
---
Attachment: YARN-9144.002.patch

> WebAppProxyServlet can't redirect to ATS V1.5 when a yarn native service app 
> is finished
> 
>
> Key: YARN-9144
> URL: https://issues.apache.org/jira/browse/YARN-9144
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9144.001.patch, YARN-9144.002.patch
>
>
> When a yarn native service app is finished, RM web UI v1 should redirect the 
> tracking URL to ATS V1.5 if it is enabled, so that users can check the app 
> logs like MR jobs. But the tracking URL points to the RM app page.
> The root cause is that WebAppProxyServlet may get the app report from RM, as 
> RM caches a small amount of app status. Then WebAppProxyServlet redirects to 
> RM, not ATS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9144) WebAppProxyServlet can't redirect to ATS V1.5 when a yarn native service app is finished

2018-12-18 Thread Zac Zhou (JIRA)
Zac Zhou created YARN-9144:
--

 Summary: WebAppProxyServlet can't redirect to ATS V1.5 when a yarn 
native service app is finished
 Key: YARN-9144
 URL: https://issues.apache.org/jira/browse/YARN-9144
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zac Zhou
Assignee: Zac Zhou


When a yarn native service app is finished, RM web UI v1 should redirect the 
tracking URL to ATS V1.5 if it is enabled, so that users can check the app logs 
like MR jobs. But the tracking URL points to the RM app page.

The root cause is that WebAppProxyServlet may get the app report from RM, as RM 
caches a small amount of app status. Then WebAppProxyServlet redirects to RM, 
not ATS.
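
As a rough illustration of the intended behaviour (not the actual 
WebAppProxyServlet code), the proxy could prefer the history/ATS URL once the 
report says the app is in a terminal state, even when the report still comes 
from the RM cache. The historyUrl argument below is assumed to be built by the 
caller:

{code:java}
// Hedged sketch only; not the actual WebAppProxyServlet implementation.
import java.util.EnumSet;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.YarnApplicationState;

public class TrackingUrlChooser {

  private static final EnumSet<YarnApplicationState> TERMINAL =
      EnumSet.of(YarnApplicationState.FINISHED,
          YarnApplicationState.FAILED,
          YarnApplicationState.KILLED);

  // report:     app report, possibly served from the RM cache
  // historyUrl: the ATS V1.5 / app history page for this app, assumed to be
  //             built by the caller
  public static String chooseTrackingUrl(ApplicationReport report,
      String historyUrl) {
    if (TERMINAL.contains(report.getYarnApplicationState())) {
      // Finished app: send the user to the history service, not the RM page.
      return historyUrl;
    }
    // Running app: follow the tracking URL the AM registered with the RM.
    return report.getTrackingUrl();
  }
}
{code}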



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-12-17 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722923#comment-16722923
 ] 

Zac Zhou commented on YARN-8489:


[~suma.shivaprasad] , Thanks a lot.

The failed test case does not seem to be related to this patch.

[~leftnoteasy], [~eyang], could you help review the patch? 

Thanks.

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8489.001.patch, YARN-8489.002.patch, 
> YARN-8489.003.patch
>
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and 
> NEVER means that once all components have terminated, the service will be 
> terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But in short, it means a dominant component whose final state determines 
> the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, no matter whether it succeeded or failed, we should 
> terminate ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9141) [submarine] JobStatus outputs with system UTC clock, not local clock

2018-12-16 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9141:
---
Description: 
The current time is Mon Dec 17 12:26:31 CST 2018.

But submarine job status output is like this:

Job Name=distributed-tf-gpu-ml4, status=RUNNING time=2018-12-17T04:29:20.873Z
Components:
--

The time is not local time.
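
For reference, a small standalone example of the difference, assuming the local 
zone is Asia/Shanghai (which matches the CST time above); the class name is 
made up:

{code:java}
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class JobStatusTimeDemo {
  public static void main(String[] args) {
    // The value printed by the current job status output.
    Instant ts = Instant.parse("2018-12-17T04:29:20.873Z");
    System.out.println("UTC:   " + ts);

    // What a user in Asia/Shanghai (CST, UTC+8) would expect to see.
    DateTimeFormatter local = DateTimeFormatter
        .ofPattern("yyyy-MM-dd HH:mm:ss zzz")
        .withZone(ZoneId.of("Asia/Shanghai"));
    System.out.println("Local: " + local.format(ts)); // 2018-12-17 12:29:20 CST
  }
}
{code}

In a fix, ZoneId.systemDefault() would presumably be used instead of a 
hard-coded zone.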

 

> [submarine] JobStatus outputs with system UTC clock, not local clock
> 
>
> Key: YARN-9141
> URL: https://issues.apache.org/jira/browse/YARN-9141
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> The current time is Mon Dec 17 12:26:31 CST 2018.
> But submarine job status output is like this:
> Job Name=distributed-tf-gpu-ml4, status=RUNNING time=2018-12-17T04:29:20.873Z
> Components:
> --
> The time is not local time.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9141) [submarine] JobStatus outputs with system UTC clock, not local clock

2018-12-16 Thread Zac Zhou (JIRA)
Zac Zhou created YARN-9141:
--

 Summary: [submarine] JobStatus outputs with system UTC clock, not 
local clock
 Key: YARN-9141
 URL: https://issues.apache.org/jira/browse/YARN-9141
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zac Zhou
Assignee: Zac Zhou






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-12-16 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8489:
---
Attachment: YARN-8489.003.patch

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8489.001.patch, YARN-8489.002.patch, 
> YARN-8489.003.patch
>
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and 
> NEVER means that once all components have terminated, the service will be 
> terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But in short, it means a dominant component whose final state determines 
> the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, no matter whether it succeeded or failed, we should 
> terminate ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-12-15 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8489:
---
Attachment: YARN-8489.002.patch

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8489.001.patch, YARN-8489.002.patch
>
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and 
> NEVER means that once all components have terminated, the service will be 
> terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But in short, it means a dominant component whose final state determines 
> the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, no matter whether it succeeded or failed, we should 
> terminate ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-12-11 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716694#comment-16716694
 ] 

Zac Zhou commented on YARN-8489:


@[~suma.shivaprasad] any updates? Or would you mind if I take it over, as this 
jira blocks terminating submarine jobs gracefully?

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and 
> NEVER means that once all components have terminated, the service will be 
> terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But in short, it means a dominant component whose final state determines 
> the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, no matter whether it succeeded or failed, we should 
> terminate ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-12-11 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716694#comment-16716694
 ] 

Zac Zhou edited comment on YARN-8489 at 12/11/18 9:47 AM:
--

[~suma.shivaprasad] any updates? Or would you mind if I take it over, as this 
jira blocks terminating submarine jobs gracefully?


was (Author: yuan_zac):
@[~suma.shivaprasad] any Updates? Or would you mind if I take it, as this jira 
blocks terminating submarine job gracefully.

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Major
>
> Existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and 
> NEVER means that once all components have terminated, the service will be 
> terminated.
> The name "dominant" might not be the most appropriate; we can figure out better 
> names. But in short, it means a dominant component whose final state determines 
> the job's final state regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, no matter whether it succeeded or failed, we should 
> terminate ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-12-04 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708652#comment-16708652
 ] 

Zac Zhou commented on YARN-8960:


I think it should be ok, [~leftnoteasy] any comments?

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch, 
> YARN-8960.006.patch, YARN-8960.007.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-12-04 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708649#comment-16708649
 ] 

Zac Zhou commented on YARN-9001:


Yup, I think it can be applied to 3.2.0. Since this patch uses APIs from 3.1.0, 
it should be ok~

> [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
> Attachments: YARN-9001-branch-3.2.001.patch, YARN-9001.001.patch, 
> YARN-9001.002.patch, YARN-9001.003.patch, YARN-9001.004.patch, 
> YARN-9001.005.patch
>
>
> For now, submarine submits a service to yarn by using ServiceClient. We should 
> change it to AppAdminClient. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-14 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8960:
---
Attachment: YARN-8960.007.patch

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch, 
> YARN-8960.006.patch, YARN-8960.007.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-14 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687457#comment-16687457
 ] 

Zac Zhou edited comment on YARN-8960 at 11/15/18 4:18 AM:
--

Add a parameter named distribute_keytab, which can be used to specify whether 
to distribute the local keytab across the cluster. 

A submarine job can be submitted like this:
{code:java}
./yarn jar 
/home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
 job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
--input_path hdfs://mldev/tmp/cifar-10-data \
--checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
--num_ps 1 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
--ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 2 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
--eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
--keytab /tmp/keytabs/hadoop.keytab \
--principal hadoop/ad...@corp.com \
--distribute_keytab{code}




 

 


was (Author: yuan_zac):
Add a parameter, named distribute_keytab, which can be used to specify whether 
to distribute local keytab across the cluster. 

A submarine job can be submitted like this:

./yarn jar 
/home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
 job run \
 --env DOCKER_JAVA_HOME=/opt/java \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
 --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
 --worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
 --input_path hdfs://mldev/tmp/cifar-10-data \
 --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
 --num_ps 1 \
 --ps_resources memory=4G,vcores=2,gpu=0 \
 --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
 --ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
 --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
 --num_workers 2 \
 --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
--eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
 --keytab /tmp/keytabs/hadoop.keytab \
 --principal hadoop/ad...@corp.com \
 --distribute_keytab

 

 

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch, 
> YARN-8960.006.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> 

[jira] [Commented] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-14 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687457#comment-16687457
 ] 

Zac Zhou commented on YARN-8960:


Add a parameter named distribute_keytab, which can be used to specify whether 
to distribute the local keytab across the cluster. 

A submarine job can be submitted like this:

./yarn jar 
/home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
 job run \
 --env DOCKER_JAVA_HOME=/opt/java \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
 --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
 --worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
 --input_path hdfs://mldev/tmp/cifar-10-data \
 --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
 --num_ps 1 \
 --ps_resources memory=4G,vcores=2,gpu=0 \
 --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
 --ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
 --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
 --num_workers 2 \
 --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
--eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
 --keytab /tmp/keytabs/hadoop.keytab \
 --principal hadoop/ad...@corp.com \
 --distribute_keytab

 

 

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch, 
> YARN-8960.006.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-14 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8960:
---
Attachment: YARN-8960.006.patch

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch, 
> YARN-8960.006.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-14 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8960:
---
Attachment: YARN-8960.006.patch

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-14 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8960:
---
Attachment: (was: YARN-8960.006.patch)

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-14 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686692#comment-16686692
 ] 

Zac Zhou commented on YARN-8960:


 

Thanks, [~leftnoteasy]

For comment 1
{quote}1) doLoginIfSecure, could u print login user if keytab/principal is 
empty? (Assume the user has login using kinit). We should fail the job 
submission if user doesn't login using kinit AND no keytab/principal specified 
AND security is enabled. And suggest to use Log.info instead of debug.
{quote}
doLoginIfSecure is changed.
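
For reference, a minimal sketch of the kind of check described in comment 1; 
this is not the exact submarine code, and the method and variable names are 
only illustrative:

{code:java}
// Illustrative sketch only, not the exact submarine implementation.
import java.io.IOException;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.security.UserGroupInformation;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class SecureLoginSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(SecureLoginSketch.class);

  public static void doLoginIfSecure(String keytab, String principal)
      throws IOException {
    if (!UserGroupInformation.isSecurityEnabled()) {
      return; // nothing to do on an insecure cluster
    }
    if (StringUtils.isEmpty(keytab) || StringUtils.isEmpty(principal)) {
      UserGroupInformation user = UserGroupInformation.getCurrentUser();
      if (!user.hasKerberosCredentials()) {
        // Security is on, but no keytab/principal was given and the user
        // has not logged in via kinit: fail the submission early.
        throw new IOException("Security is enabled but no keytab/principal"
            + " was specified and no kinit login was found");
      }
      LOG.info("Security is enabled, using existing login of user {}", user);
      return;
    }
    LOG.info("Logging in from keytab {} as principal {}", keytab, principal);
    UserGroupInformation.loginUserFromKeytab(principal, keytab);
  }
}
{code}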

For comment 2
{quote}2) Regarding to upload keytab, I'm a bit concerned about this behavior, 
instead of doing that, should we assume keytabs will be placed under all 
machine's directory? For example, if "zac" user has 
/security/keytabs/zac.keytab, the remote machine should have the same keytab on 
the same folder. Passing around keytab could be a high risk of the cluster.

If you think #2 is necessary, please at least make uploading keytab to an 
optional parameter, and add a note to command line description (Such as 
"distributing keytab to other machines is a risky operation to your 
credentials. Please consider options pre-distribute your keytab by admin as an 
alternative and more safety solution").
{quote}

Yeah, I agree with you. Publishing the keytab to the cluster is a risk, but I 
think we need to support it, as it makes it easier for users to submit a 
submarine job. I checked the Spark code (Client.prepareLocalResource) for its 
--keytab and --principal parameters. Spark uploads the user's keytab to HDFS to 
resolve the AM delegation token renewal issue for long-running apps 
(AMDelegationTokenRenewer). As the keytab is uploaded to the user's home 
directory, we can set its permission to 400 so that others cannot read it. If 
[YARN-8725|https://issues.apache.org/jira/browse/YARN-8725] is done, the 
staging dir will be cleaned up after the job is done. I think it's a 
controllable risk.
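
For illustration, a minimal sketch of the upload step under the assumptions 
above; the staging path layout and the method name are made up for this 
example:

{code:java}
// Illustrative sketch; the staging path layout below is an assumption.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class KeytabUploadSketch {

  // Copies the local keytab into the user's HDFS home directory, restricted
  // to permission 400 (owner read-only), and returns the remote path.
  public static Path uploadKeytab(Configuration conf, String localKeytab,
      String jobName) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path src = new Path(localKeytab);
    Path target = new Path(fs.getHomeDirectory(),
        "submarine/" + jobName + "/" + src.getName());
    fs.copyFromLocalFile(false, true, src, target);
    fs.setPermission(target, new FsPermission((short) 0400));
    return target;
  }
}
{code}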

Your advice is great. Keytab uploading is changed to be optional, and warnings 
are added.

Thanks

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-14 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8960:
---
Attachment: YARN-8960.005.patch

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch, YARN-8960.005.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> {code}
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ...
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: No principal 
> specified in the persisted service definitio
> n, fail to connect to AM.
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.createAMProxy(ServiceClient.java:1500)
>  at 
> org.apache.hadoop.yarn.service.client.ServiceClient.getStatus(ServiceClient.java:1376)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.lambda$getServiceFromClient$4(ApiServer.java:804)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  ... 68 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-11-13 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686076#comment-16686076
 ] 

Zac Zhou commented on YARN-9001:


Thanks, [~leftnoteasy]

I checked the code and found there were no differences between branch-3.3 and 
branch-3.2. So all changes should have been pushed to branch-3.2.

I found an issue in the patch which would cause the merge to fail on branch-3.2.

The git diff was:

 
{code:java}
+ String appStatus=serviceClient.getStatusString(jobName);
+ Service serviceSpec= ServiceApiUtil.jsonSerDeser.fromJson(appStatus);
+ JobStatus jobStatus = JobStatusBuilder.fromServiceSpec(serviceSpec);
+ return jobStatus;
+ }
 
- Service serviceSpec = this.serviceClient.getStatus(jobName);
- return JobStatusBuilder.fromServiceSpec(serviceSpec);{code}
 

and it should be:

 
{code:java}
- Service serviceSpec = this.serviceClient.getStatus(jobName);
+ String appStatus=serviceClient.getStatusString(jobName);
+ Service serviceSpec= ServiceApiUtil.jsonSerDeser.fromJson(appStatus);
 JobStatus jobStatus = JobStatusBuilder.fromServiceSpec(serviceSpec);
 return jobStatus;{code}
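
For clarity, here is a minimal sketch (not the actual patch) of how the status 
lookup reads once the corrected diff above is applied. The wrapper method, its 
signature and the surrounding class are assumptions of mine; imports are omitted, 
and serviceClient is the client field already referenced in the diff:

{code:java}
// Hypothetical sketch only: the method name and signature are assumed, not taken
// from the patch. serviceClient is the existing client field (an AppAdminClient
// after YARN-9001); imports are omitted.
public JobStatus getJobStatus(String jobName) throws IOException, YarnException {
  // Fetch the service status as a JSON string rather than a Service object
  String appStatus = serviceClient.getStatusString(jobName);
  // Deserialize the JSON into a Service spec
  Service serviceSpec = ServiceApiUtil.jsonSerDeser.fromJson(appStatus);
  // Convert the service spec into a submarine JobStatus
  return JobStatusBuilder.fromServiceSpec(serviceSpec);
}
{code}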
 

I checked the code in trunk; the patch was merged there correctly.

So I generated a patch for branch-3.2, named YARN-9001-branch-3.2.001.patch, 
and tested it. It should work on branch-3.2.

I'm not sure why Jenkins didn't fail when patches were attached. ~

In the future, I'll check if patches can be applied to trunk and branch 3.2.

Thanks

> [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9001-branch-3.2.001.patch, YARN-9001.001.patch, 
> YARN-9001.002.patch, YARN-9001.003.patch, YARN-9001.004.patch, 
> YARN-9001.005.patch
>
>
> For now, submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-11-13 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9001:
---
Attachment: YARN-9001-branch-3.2.001.patch

> [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: YARN-9001-branch-3.2.001.patch, YARN-9001.001.patch, 
> YARN-9001.002.patch, YARN-9001.003.patch, YARN-9001.004.patch, 
> YARN-9001.005.patch
>
>
> For now, submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-13 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8960:
---
Attachment: YARN-8960.004.patch

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch, YARN-8960.004.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker
> ._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodD
> ispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:179)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>  at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> 

[jira] [Comment Edited] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-12 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684788#comment-16684788
 ] 

Zac Zhou edited comment on YARN-8960 at 11/13/18 6:53 AM:
--

As discussed offline, we can use the same Kerberos parameter for both the 
service and the user.

Two parameters, --keytab and --principal, are added to the submarine job.

We can submit a submarine job like this:

./yarn jar 
/home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
 job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
--input_path hdfs://mldev/tmp/cifar-10-data \
--checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
--num_ps 1 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
--ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 2 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
--eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
 *--keytab* /tmp/keytabs/hadoop.keytab \
 *--principal* hadoop/ad...@corp.com

 


was (Author: yuan_zac):
As discussion offline, we can use the same kerberos keytab parameter for both 
service and user.

Two parameters --keytab, --principal are added to the submarine job.

We can submit a submarine job like this:

./yarn jar 
/home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
 job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
--input_path hdfs://mldev/tmp/cifar-10-data \
--checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
--num_ps 1 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
--ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 2 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
--eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
*--keytab* /tmp/keytabs/hadoop.keytab \
*--principal* hadoop/ad...@corp.com

 

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker
> ._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodD
> ispatcher.java:75)
>  

[jira] [Commented] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-12 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684788#comment-16684788
 ] 

Zac Zhou commented on YARN-8960:


As discussed offline, we can use the same Kerberos keytab parameter for both 
the service and the user.

Two parameters, --keytab and --principal, are added to the submarine job.

We can submit a submarine job like this:

./yarn jar 
/home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
 job run \
 --env DOCKER_JAVA_HOME=/opt/java \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
 --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
 --worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
 --input_path hdfs://mldev/tmp/cifar-10-data \
 --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
 --num_ps 1 \
 --ps_resources memory=4G,vcores=2,gpu=0 \
 --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
 --ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
 --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
 --num_workers 2 \
 --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
--eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
 --keytab /tmp/keytabs/hadoop.keytab \
 --principal hadoop/ad...@corp.com

 

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker
> ._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodD
> ispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> 

[jira] [Comment Edited] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-12 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684788#comment-16684788
 ] 

Zac Zhou edited comment on YARN-8960 at 11/13/18 6:50 AM:
--

As discussed offline, we can use the same Kerberos keytab parameter for both 
the service and the user.

Two parameters, --keytab and --principal, are added to the submarine job.

We can submit a submarine job like this:

./yarn jar 
/home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
 job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
--input_path hdfs://mldev/tmp/cifar-10-data \
--checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
--num_ps 1 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
--ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 2 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
--eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
*--keytab* /tmp/keytabs/hadoop.keytab \
*--principal* hadoop/ad...@corp.com

 


was (Author: yuan_zac):
As discussion offline, we can use the same kerberos keytab parameter for both 
service and user.

Two parameters --keytab, --principal are added to the submarine job.

We can submit a submarine job like this:

./yarn jar 
/home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
 job run \
 --env DOCKER_JAVA_HOME=/opt/java \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
 --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
 --worker_docker_image 0.0.0.0:5000/gpu-cuda9.0-tf1.8.0-with-models \
 --input_path hdfs://mldev/tmp/cifar-10-data \
 --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
 --num_ps 1 \
 --ps_resources memory=4G,vcores=2,gpu=0 \
 --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
 --ps_docker_image 0.0.0.0:5000/dockerfile-cpu-tf1.8.0-with-models \
 --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
 --num_workers 2 \
 --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py 
--data-dir=hdfs://mldev/tmp/cifar-10-data 
--job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 
--eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" \
 --keytab /tmp/keytabs/hadoop.keytab \
 --principal hadoop/ad...@corp.com

 

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker
> ._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodD
> 

[jira] [Comment Edited] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-12 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684664#comment-16684664
 ] 

Zac Zhou edited comment on YARN-8714 at 11/13/18 3:14 AM:
--

Looks great. It would be convenient for a notebook app, like Zeppelin, to submit 
the job if local files are supported.

I'm not sure if the parameter name, localization, is OK. Would it be easier to 
understand if we used a parameter like "files" or "libjars", as in MapReduce 
jobs?

Thanks,


was (Author: yuan_zac):
Looks great, it would be convenient for notebook app, like Zeppline, to submit 
the job if local files are supported.

I'm not sure if the parameter name, localization, is ok. Is it easier to 
understand if we use some parameter like '''--files' or "--libjars" used in map 
reduce job?

Thanks,

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-12 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684664#comment-16684664
 ] 

Zac Zhou edited comment on YARN-8714 at 11/13/18 3:14 AM:
--

Looks great. It would be convenient for a notebook app, like Zeppelin, to submit 
the job if local files are supported.

I'm not sure if the parameter name, localization, is OK. Would it be easier to 
understand if we used a parameter like "files" or "libjars", as in MapReduce 
jobs?

Thanks,


was (Author: yuan_zac):
Looks great, it would be convenient for notebook app, like Zeppline, to submit 
the job if local files are supported.

I'm not sure if the parameter name, localization, is ok. Is it easier to 
understand if we use some parameter like "files" or "libjars" used in map 
reduce job?

Thanks,

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-12 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684664#comment-16684664
 ] 

Zac Zhou edited comment on YARN-8714 at 11/13/18 3:13 AM:
--

Looks great. It would be convenient for a notebook app, like Zeppelin, to submit 
the job if local files are supported.

I'm not sure if the parameter name, localization, is OK. Would it be easier to 
understand if we used a parameter like "--files" or "--libjars", as in MapReduce 
jobs?

Thanks,


was (Author: yuan_zac):
Looks great, it would be convenient for notebook app, like Zeppline, to submit 
the job if local files are supported.

I'm not sure if the parameter name, localization, is ok. Is it easier to 
understand if we use some parameter like '''--files' or "--libjars" used in map 
reduce job?

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8714) [Submarine] Support files/tarballs to be localized for a training job.

2018-11-12 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684664#comment-16684664
 ] 

Zac Zhou commented on YARN-8714:


Looks great. It would be convenient for a notebook app, like Zeppelin, to submit 
the job if local files are supported.

I'm not sure if the parameter name, localization, is OK. Would it be easier to 
understand if we used a parameter like "--files" or "--libjars", as in MapReduce 
jobs?

> [Submarine] Support files/tarballs to be localized for a training job.
> --
>
> Key: YARN-8714
> URL: https://issues.apache.org/jira/browse/YARN-8714
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Wangda Tan
>Assignee: Zhankun Tang
>Priority: Major
> Attachments: YARN-8714-WIP1-trunk-001.patch
>
>
> See 
> https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7,
>  {{job run --localizations ...}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-11-12 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16684639#comment-16684639
 ] 

Zac Zhou commented on YARN-9001:


Sure, Wangda

The following test cases have been executed (see the rough command sketch after 
the list):
 # submarine run job command with and without "wait_job_finish" parameter
 # submarine show job command
 # yarn app -status command
 # yarn app -destroy command
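
A rough sketch of how these checks could be run from the command line is below. 
The job name, the jar path, and the exact submarine CLI flags (the show 
subcommand and the wait_job_finish switch) are assumptions of mine, not quoted 
from this issue:

{code}
# 1. Run a job and return immediately, then run again and block until it finishes
#    (full job run options as shown earlier in this thread; the --wait_job_finish
#    switch name is assumed from the test description)
yarn jar hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run --name tf-job-001 ...
yarn jar hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run --name tf-job-001 ... --wait_job_finish

# 2. Show the submitted job (subcommand and flag assumed)
yarn jar hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job show --name tf-job-001

# 3. Query the underlying YARN service status
yarn app -status tf-job-001

# 4. Destroy the service when done
yarn app -destroy tf-job-001
{code}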

Thanks, 

 

> [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch, YARN-9001.004.patch
>
>
> For now, submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-11-12 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16683661#comment-16683661
 ] 

Zac Zhou commented on YARN-9001:


The UT error does not seem to be related to the patch; resubmitting the patch.

> [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch, YARN-9001.004.patch
>
>
> For now, submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-11-12 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9001:
---
Attachment: YARN-9001.004.patch

> [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch, YARN-9001.004.patch
>
>
> For now, submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-11-11 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9001:
---
Attachment: YARN-9001.003.patch

> [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch, 
> YARN-9001.003.patch
>
>
> For now, submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-10 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682419#comment-16682419
 ] 

Zac Zhou commented on YARN-8960:


Maybe we need both of them.

To enable "yarn app -status", the submarine service should have a service 
principal.

And if we want to make it convenient for a notebook app, like Zeppelin, to 
submit submarine apps for different users, as it does for Spark apps, we need 
user principal parameters to specify who submits the job.

Or we could just have one principal parameter and use it as both the service 
principal and the user principal?

[~leftnoteasy], [~sunilg] any comments?

 

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker
> ._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodD
> ispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:179)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at 

[jira] [Updated] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-11-10 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-9001:
---
Attachment: YARN-9001.002.patch

> [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-9001.001.patch, YARN-9001.002.patch
>
>
> For now, submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-11-09 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-9001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681388#comment-16681388
 ] 

Zac Zhou commented on YARN-9001:


I'll submit a patch shortly

> [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs
> --
>
> Key: YARN-9001
> URL: https://issues.apache.org/jira/browse/YARN-9001
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
>
> For now, submarine submits a service to YARN by using ServiceClient. We should 
> change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-9001) [Submarine] Use AppAdminClient instead of ServiceClient to sumbit jobs

2018-11-09 Thread Zac Zhou (JIRA)
Zac Zhou created YARN-9001:
--

 Summary: [Submarine] Use AppAdminClient instead of ServiceClient 
to sumbit jobs
 Key: YARN-9001
 URL: https://issues.apache.org/jira/browse/YARN-9001
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zac Zhou
Assignee: Zac Zhou


For now, submarine submits a service to YARN by using ServiceClient. We should 
change it to AppAdminClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8960) [Submarine] Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-01 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8960:
---
Summary: [Submarine] Can't get submarine service status using the command 
of "yarn app -status" under security environment  (was: {Submarine} Can't get 
submarine service status using the command of "yarn app -status" under security 
environment)

> [Submarine] Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker
> ._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodD
> ispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:179)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>  at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> 

[jira] [Updated] (YARN-8960) {Submarine} Can't get submarine service status using the command of "yarn app -status" under security environment

2018-11-01 Thread Zac Zhou (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zac Zhou updated YARN-8960:
---
Summary: {Submarine} Can't get submarine service status using the command 
of "yarn app -status" under security environment  (was: Can't get submarine 
service status using the command of "yarn app -status" under security 
environment)

> {Submarine} Can't get submarine service status using the command of "yarn app 
> -status" under security environment
> -
>
> Key: YARN-8960
> URL: https://issues.apache.org/jira/browse/YARN-8960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Zac Zhou
>Assignee: Zac Zhou
>Priority: Major
> Attachments: YARN-8960.001.patch, YARN-8960.002.patch, 
> YARN-8960.003.patch
>
>
> After submitting a submarine job, we tried to get service status using the 
> following command:
> yarn app -status ${service_name}
> But we got the following error:
> HTTP error code : 500
>  
> The stack in resourcemanager log is :
> ERROR org.apache.hadoop.yarn.service.webapp.ApiServer: Get service failed: {}
> java.lang.reflect.UndeclaredThrowableException
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getServiceFromClient(ApiServer.java:800)
>  at 
> org.apache.hadoop.yarn.service.webapp.ApiServer.getService(ApiServer.java:186)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker
> ._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>  at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodD
> ispatcher.java:75)
>  at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>  at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>  at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>  at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>  at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:89)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.webapp.RMWebAppFilter.doFilter(RMWebAppFilter.java:179)
>  at 
> com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>  at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>  at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>  at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>  at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>  at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>  at 
> 
