[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-09-23 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17419203#comment-17419203
 ] 

Matthew Sharp commented on YARN-10178:
--

We have been running this patch in our production cluster for the last week and 
this has fixed our issue.  We were seeing 1-2 failovers a day on average from 
this issue before patching.  Over the last week we have not had any and our 
logs show no hits for this specific error message any more.  

+1 (non-binding) for merging this in.  Thanks all for the work on this patch!

> Global Scheduler async thread crash caused by 'Comparison method violates its 
> general contract'
> ---
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler
>Affects Versions: 3.2.1
>Reporter: tuyu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, 
> YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received 
> RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, 
> Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: 
> Comparison method violates its general contract!  
>     at java.util.TimSort.mergeHi(TimSort.java:899)
>     at java.util.TimSort.mergeAt(TimSort.java:516)
>     at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
>     at java.util.TimSort.sort(TimSort.java:254)
>     at java.util.Arrays.sort(Arrays.java:1512)
>     at java.util.ArrayList.sort(ArrayList.java:1462)
>     at java.util.Collections.sort(Collections.java:177)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> In Java 8, Arrays.sort uses the TimSort algorithm by default for object
> arrays, and TimSort expects the comparison method to satisfy these
> requirements:
> {code:java}
> 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> 2. x > y, y > z --> x > z
> 3. x.compareTo(y) == 0 --> sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the comparison method does not satisfy these requirements, TimSort may
> throw 'java.lang.IllegalArgumentException'.
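The three requirements can be checked by brute force on a small sample. The sketch below is only an illustration: `ComparatorContract` and `holds` are hypothetical names, not part of Hadoop or the JDK, and the deliberately broken comparator is invented for the example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of Hadoop): brute-force check of the comparator contract. */
final class ComparatorContract {
  private ComparatorContract() {}

  /** Returns true if cmp satisfies all three requirements for every combination of samples. */
  static <T> boolean holds(Comparator<T> cmp, List<T> samples) {
    for (T x : samples) {
      for (T y : samples) {
        int xy = Integer.signum(cmp.compare(x, y));
        // Requirement 1: sgn(compare(x, y)) == -sgn(compare(y, x))
        if (xy != -Integer.signum(cmp.compare(y, x))) {
          return false;
        }
        for (T z : samples) {
          int yz = Integer.signum(cmp.compare(y, z));
          int xz = Integer.signum(cmp.compare(x, z));
          // Requirement 2: x > y and y > z implies x > z
          if (xy > 0 && yz > 0 && xz <= 0) {
            return false;
          }
          // Requirement 3: x == y implies sgn(compare(x, z)) == sgn(compare(y, z))
          if (xy == 0 && xz != yz) {
            return false;
          }
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<Float> usage = Arrays.asList(0.10f, 0.25f, 0.25f, 0.80f);
    Comparator<Float> good = Float::compare;
    // Broken on purpose: never returns 0, so compare(x, x) contradicts itself.
    Comparator<Float> bad = (a, b) -> a < b ? -1 : 1;
    System.out.println("good comparator holds: " + holds(good, usage));
    System.out.println("bad comparator holds:  " + holds(bad, usage));
  }
}
```

A comparator that reads values which change between two calls can fail this kind of check even though its code looks correct, because the same pair may compare differently from one call to the next.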
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can
> see that the Capacity Scheduler compares queues using these resource usage
> values:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
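The comparator reads these values from live queue objects. If another thread updates them in the middle of a sort, two comparisons of the same pair can disagree. One way to avoid that is to copy the values before sorting. The sketch below assumes a single `usedCapacity` field and uses hypothetical class names; it is not taken from the attached patches.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: none of these names come from the Hadoop code base. */
final class SnapshotSort {
  private SnapshotSort() {}

  /** Hypothetical queue whose usage may be updated by another thread. */
  static final class LiveQueue {
    final String name;
    volatile float usedCapacity;

    LiveQueue(String name, float usedCapacity) {
      this.name = name;
      this.usedCapacity = usedCapacity;
    }
  }

  /** Immutable copy of the one value the comparator reads. */
  private static final class Snapshot {
    final LiveQueue queue;
    final float usedCapacity;

    Snapshot(LiveQueue queue) {
      this.queue = queue;
      this.usedCapacity = queue.usedCapacity; // read exactly once per sort
    }
  }

  /** Sorts by the usage seen when the snapshot was taken; the input list is not modified. */
  static List<LiveQueue> sortByUsage(List<LiveQueue> queues) {
    List<Snapshot> snapshots = new ArrayList<>(queues.size());
    for (LiveQueue q : queues) {
      snapshots.add(new Snapshot(q));
    }
    // The comparator only sees frozen values, so its answers cannot change mid-sort.
    snapshots.sort(Comparator.comparingDouble((Snapshot s) -> s.usedCapacity));
    List<LiveQueue> sorted = new ArrayList<>(snapshots.size());
    for (Snapshot s : snapshots) {
      sorted.add(s.queue);
    }
    return sorted;
  }

  public static void main(String[] args) {
    List<LiveQueue> queues = new ArrayList<>();
    queues.add(new LiveQueue("b", 0.8f));
    queues.add(new LiveQueue("a", 0.1f));
    for (LiveQueue q : sortByUsage(queues)) {
      System.out.println(q.name + " " + q.usedCapacity);
    }
  }
}
```

The trade-off is that the resulting order reflects usage at snapshot time, which may already be slightly stale when the scheduler acts on it.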
> In the Capacity Scheduler, the Global Scheduler async thread uses
> PriorityUtilizationQueueOrderingPolicy to choose a queue for container
> assignment, constructs a CSAssignment, and calls submitResourceCommitRequest
> to add the CSAssignment to the backlogs.
> ResourceCommitterService then calls tryCommit on this CSAssignment. As the
> tryCommit function shows, this updates the queue resource usage:
> {code:java}
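> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
>     boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest request =
>       (ResourceCommitRequest) r;
>
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
>     FiCaSchedulerApp app = getApplicationAttempt(attemptId);
>     // Required sanity check for attemptId - when async-scheduling enabled,
>     // proposal might be outdated if AM failover just finished
>     // and proposal queue was not be consumed in time
>     if (app != null && 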

[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-09-15 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415693#comment-17415693
 ] 

Matthew Sharp commented on YARN-10178:
--

Thanks for the review and thoughts.  We are currently testing this to see if it 
fixes the issue in our cluster.  We should have enough evidence by early next 
week to confirm if it does.


[jira] [Commented] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

2021-09-14 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415025#comment-17415025
 ] 

Matthew Sharp commented on YARN-10178:
--

We have been seeing this frequently in our 3.3.0 cluster as well.  [~wangda], 
have there been any other internal discussions on the best approach to address 
this? 


[jira] [Created] (YARN-10932) Capacity Scheduler Deadlock - GuaranteedOrZeroCapacityOverTimePolicy

2021-09-01 Thread Matthew Sharp (Jira)
Matthew Sharp created YARN-10932:


 Summary: Capacity Scheduler Deadlock - 
GuaranteedOrZeroCapacityOverTimePolicy
 Key: YARN-10932
 URL: https://issues.apache.org/jira/browse/YARN-10932
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: Matthew Sharp
 Attachments: thread_dump.txt

We recently started testing out the GuaranteedOrZeroCapacityOverTimePolicy and 
ran into an interesting deadlock.

I attached the relevant portions from the thread dump we were able to grab.
{code:java}
 Found one Java-level deadlock:
 =
 "qtp2055501967-32194":
  waiting for ownable synchronizer 0x7f2521f6a850, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "IPC Server handler 0 on default port 8033"
 "IPC Server handler 0 on default port 8033":
  waiting for ownable synchronizer 0x7f2522026418, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "SchedulingMonitor (QueueManagementDynamicEditPolicy)"
 "SchedulingMonitor (QueueManagementDynamicEditPolicy)":
  waiting for ownable synchronizer 0x7f25220254c0, (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
  which is held by "IPC Server handler 0 on default port 8033" {code}
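The cycle reported in the thread dump can also be looked for from inside a JVM with the standard `java.lang.management` API. The sketch below is only an illustration of that API: `DeadlockProbe` is a hypothetical name, and nothing in this report says the ResourceManager runs such a check.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists threads that the JVM considers deadlocked. */
final class DeadlockProbe {
  private DeadlockProbe() {}

  /** Returns one description per deadlocked thread, or an empty list if none are found. */
  static List<String> deadlockedThreads() {
    ThreadMXBean mx = ManagementFactory.getThreadMXBean();
    // Covers object monitors and ownable synchronizers such as ReentrantReadWriteLock
    // write locks; returns null when no deadlock is detected.
    long[] ids = mx.findDeadlockedThreads();
    List<String> result = new ArrayList<>();
    if (ids == null) {
      return result;
    }
    for (ThreadInfo info : mx.getThreadInfo(ids)) {
      if (info != null) { // a thread may have terminated in the meantime
        result.add(info.getThreadName() + " waits for " + info.getLockName()
            + " held by " + info.getLockOwnerName());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(deadlockedThreads());
  }
}
```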



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-10493) RunC container repository v2

2021-04-27 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333457#comment-17333457
 ] 

Matthew Sharp commented on YARN-10493:
--

An exception is thrown for that to ensure it matches our storage format for the 
HDFS properties file.  This is the same logic as the Java CLI import tool from 
YARN-10494.

Example HDFS Path:  /runc-root/meta/[namespace]/<image>@<tag>.properties

In theory we could change that if there is a benefit in your opinion, but my 
initial reaction is that adding sub-directories to that namespace may make it 
harder to track images (cleanup, governance, perhaps even quotas, etc.).

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf, 
> runc-container-repository-v2-design_updated.pdf
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.






[jira] [Commented] (YARN-10493) RunC container repository v2

2021-04-27 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333276#comment-17333276
 ] 

Matthew Sharp commented on YARN-10493:
--

Thanks for the review.  I can submit an update for the feedback.  The string 
split portion with a negative limit will actually split at the delimiter as 
many times as possible.  I do this to help account for invalid input that 
otherwise would crash the NodeManager if it isn't caught.  A lot of the new 
test cases added try to account for various examples of accidental typos that 
could cause issues.
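For reference, this is how the limit argument of Java's `String.split` behaves. The sketch is illustrative: the `:` delimiter, the class name and `parseNameAndTag` are assumptions made for the example, not the code in the PR.

```java
import java.util.Arrays;

/** Illustrative only: shows why a negative limit makes malformed input visible. */
final class SplitLimitDemo {
  private SplitLimitDemo() {}

  /** Hypothetical validator: expects exactly "name:tag" with both parts non-empty. */
  static String[] parseNameAndTag(String input) {
    // Negative limit: split as many times as possible and keep trailing empty strings.
    String[] parts = input.split(":", -1);
    if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
      throw new IllegalArgumentException("Expected name:tag but got '" + input + "'");
    }
    return parts;
  }

  public static void main(String[] args) {
    // Default limit (0) silently drops the trailing empty token.
    System.out.println(Arrays.toString("busybox:".split(":")));
    // Negative limit keeps it, so the typo can be detected and rejected.
    System.out.println(Arrays.toString("busybox:".split(":", -1)));
    System.out.println(Arrays.toString(parseNameAndTag("busybox:latest")));
  }
}
```

With the default limit, `"busybox:"` looks like a single valid token; with a negative limit the empty second token is preserved and can be rejected with a clear error instead of failing later.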

For the dependency on YARN-10494 I can start to look at that next.

 

 




[jira] [Commented] (YARN-10493) RunC container repository v2

2021-04-01 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313424#comment-17313424
 ] 

Matthew Sharp commented on YARN-10493:
--

[~ebadger] The latest PR contains the namespace support that we had discussed.  
I also updated the design doc to outline that a bit more.




[jira] [Updated] (YARN-10493) RunC container repository v2

2021-03-25 Thread Matthew Sharp (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Sharp updated YARN-10493:
-
Attachment: runc-container-repository-v2-design_updated.pdf

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf, 
> runc-container-repository-v2-design_updated.pdf
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.






[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-24 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308201#comment-17308201
 ] 

Matthew Sharp commented on YARN-10493:
--

Good news for the test.  I can add that to the improvement list for the 
YARN-10494 tool, since that will break the localizer.  I have a working patch 
for the namespaces and am going to test a bit more on our cluster tomorrow; 
then I will update the PR and the PDF.

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.






[jira] [Comment Edited] (YARN-10493) RunC container repository v2

2021-03-24 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308201#comment-17308201
 ] 

Matthew Sharp edited comment on YARN-10493 at 3/24/21, 10:04 PM:
-

Good news for the test.  I can add that to the improvement list for the 
YARN-10494 tool, since that will break the localizer.  I have a working patch 
for the namespaces, going to test a bit more on our cluster tomorrow and then I 
will update the PR and update the pdf.


was (Author: matthewsharp):
Good news for the test.  I can add that to the improvement list for the 
YARN-10494 tool, since that will break the localizer.  I have a working patch 
the namespaces, going to test a bit more on our cluster tomorrow and then I 
will update the PR and update the pdf.




[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-24 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17308172#comment-17308172
 ] 

Matthew Sharp commented on YARN-10493:
--

No, I don't see that in our cluster.  It looks like we have 755 permissions on 
the directories and 644 on the files.  I also recently deleted the old 
/runc-root for testing with the v2 change, and I assume it was recreated with 
the default umask.  The CLI tool imports still run as our hdfs admin user (not 
ideal for now), but that seems fine for testing.




[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-24 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17307897#comment-17307897
 ] 

Matthew Sharp commented on YARN-10493:
--

I looked at the YARN-10494 (CLI tool) design and that was the missing piece.  I 
originally included a static directory that you can set for a single namespace 
(it defaults to library if you don't).  But the approach in the CLI tool design 
PDF and your usage example makes more sense: make the namespace user 
configurable and only fall back to a default namespace if none is provided.  I 
can work on an update to support that and also extend the design PDF to make 
that clearer.

 

YARN_CONTAINER_RUNTIME_RUNC_IMAGE=[namespace]/<image>[:tag]

 

Examples:
{code:java}
YARN_CONTAINER_RUNTIME_RUNC_IMAGE=busybox

/runc-root/meta/library/busybox@latest.properties{code}
 
{code:java}
YARN_CONTAINER_RUNTIME_RUNC_IMAGE=hadoop/busybox:latest 

/runc-root/meta/hadoop/busybox@latest.properties
{code}
 

 
{code:java}
YARN_CONTAINER_RUNTIME_RUNC_IMAGE=shared/busybox:1.0.0

/runc-root/meta/shared/busybox@1.0.0.properties
{code}
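The mapping shown in these examples can be sketched as a small function. This is an illustration under assumptions: the class and method names are hypothetical, the defaults (`library`, `latest`) are taken from the first example above, and rejecting extra `/` or `:` characters is a guess at the validation, not the behaviour of the actual patch.

```java
/** Illustrative mapping from an image reference to its metadata path. */
final class RuncMetaPath {
  private RuncMetaPath() {}

  static final String META_ROOT = "/runc-root/meta";
  static final String DEFAULT_NAMESPACE = "library"; // as in the busybox example
  static final String DEFAULT_TAG = "latest";        // as in the busybox example

  /** Maps "[namespace/]image[:tag]" to "/runc-root/meta/namespace/image@tag.properties". */
  static String toPropertiesPath(String image) {
    String namespace = DEFAULT_NAMESPACE;
    String rest = image;
    int slash = image.indexOf('/');
    if (slash >= 0) {
      namespace = image.substring(0, slash);
      rest = image.substring(slash + 1);
    }
    String name = rest;
    String tag = DEFAULT_TAG;
    int colon = rest.indexOf(':');
    if (colon >= 0) {
      name = rest.substring(0, colon);
      tag = rest.substring(colon + 1);
    }
    // Assumption: empty parts, nested directories and repeated ':' are invalid.
    if (namespace.isEmpty() || name.isEmpty() || tag.isEmpty()
        || namespace.indexOf(':') >= 0 || name.indexOf('/') >= 0
        || tag.indexOf('/') >= 0 || tag.indexOf(':') >= 0) {
      throw new IllegalArgumentException("Invalid image reference: " + image);
    }
    return META_ROOT + "/" + namespace + "/" + name + "@" + tag + ".properties";
  }

  public static void main(String[] args) {
    System.out.println(toPropertiesPath("busybox"));
    System.out.println(toPropertiesPath("hadoop/busybox:latest"));
    System.out.println(toPropertiesPath("shared/busybox:1.0.0"));
  }
}
```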
 




[jira] [Commented] (YARN-10493) RunC container repository v2

2021-03-19 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304979#comment-17304979
 ] 

Matthew Sharp commented on YARN-10493:
--

I have an initial PR to address the improvements outlined in the attached pdf.  
I have some thoughts around the manifest caching that I would like to address 
in a follow up Jira.  We have this running internally with the Java CLI tool 
from YARN-10494.  

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: nodemanager, yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: runc-container-repository-v2-design.pdf
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.






[jira] [Assigned] (YARN-10534) Enable runC container transformations

2021-03-19 Thread Matthew Sharp (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Sharp reassigned YARN-10534:


Assignee: Matthew Sharp

> Enable runC container transformations
> -
>
> Key: YARN-10534
> URL: https://issues.apache.org/jira/browse/YARN-10534
> Project: Hadoop YARN
>  Issue Type: New Feature
>Reporter: Matthew Sharp
>Assignee: Matthew Sharp
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The goal of this Jira is to provide an optional plugin to apply runC 
> container transformations. Enabling runC container transformations will 
> provide an easy way to apply site specific customizations to all containers.
> An example of one transformation that many clusters may need could be a 
> Kerberos transformation. This would apply cluster Kerberos configurations and 
> mount them to all runC containers that are submitted, without requiring users 
> to manage them within their own images.






[jira] [Assigned] (YARN-10494) CLI tool for docker-to-squashfs conversion (pure Java)

2021-03-17 Thread Matthew Sharp (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Sharp reassigned YARN-10494:


Assignee: Matthew Sharp  (was: mbsharp85)

> CLI tool for docker-to-squashfs conversion (pure Java)
> --
>
> Key: YARN-10494
> URL: https://issues.apache.org/jira/browse/YARN-10494
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.3.0
>Reporter: Craig Condit
>Assignee: Matthew Sharp
>Priority: Major
>  Labels: pull-request-available
> Attachments: YARN-10494.001.patch, 
> docker-to-squashfs-conversion-tool-design.pdf
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> *YARN-9564* defines a docker-to-squashfs image conversion tool that relies on 
> python2, multiple libraries, squashfs-tools and root access in order to 
> convert Docker images to squashfs images for use with the runc container 
> runtime in YARN.
> *YARN-9943* was created to investigate alternatives, as the response to 
> merging YARN-9564 has not been very positive. This proposal outlines the 
> design for a CLI conversion tool in 100% pure Java that will work out of the 
> box.






[jira] [Assigned] (YARN-10493) RunC container repository v2

2021-03-17 Thread Matthew Sharp (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Sharp reassigned YARN-10493:


Assignee: Matthew Sharp  (was: mbsharp85)

> RunC container repository v2
> 
>
> Key: YARN-10493
> URL: https://issues.apache.org/jira/browse/YARN-10493
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager, yarn
> Affects Versions: 3.3.0
> Reporter: Craig Condit
> Assignee: Matthew Sharp
> Priority: Major
> Attachments: runc-container-repository-v2-design.pdf
>
>
> The current runc container repository design has scalability and usability 
> issues which will likely limit widespread adoption. We should address this 
> with a new, V2 layout.






[jira] [Created] (YARN-10534) Enable runC container transformations

2020-12-15 Thread Matthew Sharp (Jira)
Matthew Sharp created YARN-10534:


Summary: Enable runC container transformations
Key: YARN-10534
URL: https://issues.apache.org/jira/browse/YARN-10534
Project: Hadoop YARN
Issue Type: New Feature
Reporter: Matthew Sharp


The goal of this Jira is to provide an optional plugin to apply runC container 
transformations. Enabling runC container transformations will provide an easy 
way to apply site-specific customizations to all containers.

One transformation that many clusters may need is a Kerberos transformation. 
This would apply cluster Kerberos configurations and mount them into all runC 
containers that are submitted, without requiring users to manage them within 
their own images.






[jira] [Commented] (YARN-10250) Container Relaunch - find: File system loop detected

2020-04-28 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094710#comment-17094710
 ] 

Matthew Sharp commented on YARN-10250:
--

I can submit a patch for this shortly based on the idea above.  I am open to 
other suggestions as well.

> Container Relaunch - find: File system loop detected
> 
>
> Key: YARN-10250
> URL: https://issues.apache.org/jira/browse/YARN-10250
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.2.0
> Reporter: Matthew Sharp
> Priority: Major
>
> The Hive LLAP YARN service tries to relaunch after a container failure, and 
> when it retries on the same node we see it fail with:
> {code}
> find: File system loop detected; ‘./lib/llap-27Apr2020.tar.gz’ is part of the 
> same file system loop as ‘./lib’.
> {code}
>  
> YARN-8667 attempted to clean up the prior symlinks before relaunching, but in 
> this case the loop still exists because the symlinks are recreated right 
> before the attempt to write to directory.info for logging.
>  
> The following line appears to be the culprit:  
> [https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java#L1346]






[jira] [Commented] (YARN-10250) Container Relaunch - find: File system loop detected

2020-04-28 Thread Matthew Sharp (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094707#comment-17094707
 ] 

Matthew Sharp commented on YARN-10250:
--

The launch-container script fails on any non-zero return code. Since these 
commands only produce debugging information, one quick approach is to force 
them to always return true so that the container relaunch is not impacted.
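
As a rough sketch of that idea (not the actual patch), the snippet below 
assumes the launch script runs with `set -e`. The names `debug_listing`, 
`info_file`, and `status` are hypothetical stand-ins for the debug command, 
for directory.info, and for the rest of the launch:

```shell
#!/usr/bin/env bash
# Sketch only: abort on any non-zero exit, as the launch script is assumed to.
set -e

# Hypothetical debug command that writes some output and then fails,
# standing in for a listing that hits a file system loop.
debug_listing() {
  echo "partial debug listing"
  return 1
}

# Hypothetical stand-in for directory.info.
info_file="$(mktemp)"

# "|| true" makes the command list succeed, so set -e does not abort here.
debug_listing 1>>"$info_file" || true

status="relaunch continues"
echo "$status"
```

Whatever the debug command managed to write is still kept in the file; only 
its exit status is ignored.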







[jira] [Created] (YARN-10250) Container Relaunch - find: File system loop detected

2020-04-28 Thread Matthew Sharp (Jira)
Matthew Sharp created YARN-10250:


Summary: Container Relaunch - find: File system loop detected
Key: YARN-10250
URL: https://issues.apache.org/jira/browse/YARN-10250
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.2.0
Reporter: Matthew Sharp


The Hive LLAP YARN service tries to relaunch after a container failure, and 
when it retries on the same node we see it fail with:
{code}
find: File system loop detected; ‘./lib/llap-27Apr2020.tar.gz’ is part of the 
same file system loop as ‘./lib’.
{code}

YARN-8667 attempted to clean up the prior symlinks before relaunching, but in 
this case the loop still exists because the symlinks are recreated right 
before the attempt to write to directory.info for logging.

The following line appears to be the culprit:  
[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java#L1346]
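
For illustration, a loop of this kind can be set up in a scratch directory. 
This is a sketch under assumptions: the `find -L . -maxdepth 5 -ls` listing is 
assumed to resemble the logging command at the linked line, the names 
`workdir`, `lib/self`, and `rc` are hypothetical, and the diagnostic and exit 
status depend on the find implementation (GNU find is expected to report the 
loop).

```shell
#!/usr/bin/env bash
# Scratch directory so nothing outside it is touched.
workdir="$(mktemp -d)"
mkdir "$workdir/lib"

# A symlink inside lib that resolves back to lib creates the loop.
ln -s ../lib "$workdir/lib/self"

# Follow symlinks while listing; record the exit status instead of aborting.
rc=0
(cd "$workdir" && find -L . -maxdepth 5 -ls >/dev/null 2>&1) || rc=$?

echo "find exit status: $rc"
```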


