[jira] [Commented] (YARN-10739) GenericEventHandler.printEventQueueDetails causes RM recovery to take too much time

2023-04-06 Thread Chris Nauroth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-10739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17709507#comment-17709507
 ] 

Chris Nauroth commented on YARN-10739:
--

I've seen trouble with this in 3.3 and 3.2 clusters. This patch does not depend 
on the larger YARN-10695 umbrella effort, so I'm planning to cherry-pick it to 
branch-3.3 and branch-3.2. I'll wait a day in case anyone has objections. See 
also YARN-11286.

> GenericEventHandler.printEventQueueDetails causes RM recovery to take too 
> much time
> ---
>
> Key: YARN-10739
> URL: https://issues.apache.org/jira/browse/YARN-10739
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Zhanqi Cai
>Assignee: Qi Zhu
>Priority: Critical
> Fix For: 3.4.0
>
> Attachments: YARN-10739-001.patch, YARN-10739-002.patch, 
> YARN-10739.003.patch, YARN-10739.003.patch, YARN-10739.004.patch, 
> YARN-10739.005.patch, YARN-10739.006.patch
>
>
> YARN-8995 and YARN-10642 added GenericEventHandler.printEventQueueDetails to 
> AsyncDispatcher. If the event queue grows very large, printEventQueueDetails 
> costs too much time and the RM takes a long time to process events.
> For example:
>  If the cluster has 4K nodes and 4K running apps and an RM switchover occurs, 
> every NodeManager re-registers with the RM, and the RM calls NodesListManager 
> to issue an RMAppNodeUpdateEvent per app, with code like the following:
> {code:java}
> for(RMApp app : rmContext.getRMApps().values()) {
>   if (!app.isAppFinalStateStored()) {
> this.rmContext
> .getDispatcher()
> .getEventHandler()
> .handle(
> new RMAppNodeUpdateEvent(app.getApplicationId(), eventNode,
> appNodeUpdateType));
>   }
> }{code}
> So the total number of events is 4K * 4K = 16 million. During this window, 
> GenericEventHandler.printEventQueueDetails is called frequently to print the 
> event queue details, and once the queue size reaches 1 million or more, 
> iterating the queue in printEventQueueDetails becomes very slow, as shown 
> below: 
> {code:java}
> private void printEventQueueDetails() {
>   Iterator<Event> iterator = eventQueue.iterator();
>   Map<Enum, Long> counterMap = new HashMap<>();
>   while (iterator.hasNext()) {
> Enum eventType = iterator.next().getType();
> {code}
> Then RM recovery will cost too much time.
>  Refer to our log:
> {code:java}
> 2021-04-14 20:35:34,432 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:handle(306)) - Size of event-queue is 1200
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: KILL, Event 
> record counter: 310836
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: NODE_UPDATE, 
> Event record counter: 1103
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: 
> NODE_REMOVED, Event record counter: 1
> 2021-04-14 20:35:35,818 INFO  event.AsyncDispatcher 
> (AsyncDispatcher.java:printEventQueueDetails(291)) - Event type: APP_REMOVED, 
> Event record counter: 1
> {code}
> Between the AsyncDispatcher.handle and printEventQueueDetails log lines, more 
> than one second is spent iterating the queue.
> I have uploaded a patch that ensures printEventQueueDetails is called at most 
> once per 30 seconds.
>  
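For illustration only, here is a minimal sketch of the throttling approach described above (not the attached patch); the class, field, and method names are assumptions:

{code:java}
// Sketch only: guard the costly eventQueue.iterator() walk so it runs at most
// once per 30 seconds. Names are illustrative, not the code in the patches.
public class ThrottledQueueDetailsPrinter {
  private static final long MIN_PRINT_INTERVAL_MS = 30 * 1000L;
  private volatile long lastPrintTimeMs = 0L;

  public void maybePrint(Runnable printEventQueueDetails) {
    long now = System.currentTimeMillis();
    if (now - lastPrintTimeMs >= MIN_PRINT_INTERVAL_MS) {
      lastPrintTimeMs = now;
      printEventQueueDetails.run(); // the expensive per-event-type counting
    }
  }
}
{code}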



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11286) Make AsyncDispatcher#printEventDetailsExecutor thread pool parameter configurable

2023-04-06 Thread Chris Nauroth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17709508#comment-17709508
 ] 

Chris Nauroth commented on YARN-11286:
--

I've seen trouble with this in 3.3 and 3.2 clusters. This patch does not depend 
on the larger YARN-10695 umbrella effort, so I'm planning to cherry-pick it to 
branch-3.3 and branch-3.2. I'll wait a day in case anyone has objections. See 
also YARN-10739.

> Make AsyncDispatcher#printEventDetailsExecutor thread pool parameter 
> configurable
> -
>
> Key: YARN-11286
> URL: https://issues.apache.org/jira/browse/YARN-11286
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 3.4.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> The AsyncDispatcher#printEventDetailsExecutor thread pool parameters are 
> hard-coded. Extract these hard-coded values into configurable properties in 
> the configuration file.
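A rough sketch of the kind of change being asked for, reading the pool size from configuration instead of hard-coding it; the property key and default below are placeholders for illustration, not the key defined by the pull request:

{code:java}
// Sketch only; "yarn.dispatcher.print-events-info.threadpool.size" is a
// hypothetical key used for illustration.
Configuration conf = new YarnConfiguration();
int poolSize = conf.getInt(
    "yarn.dispatcher.print-events-info.threadpool.size", 1);
ExecutorService printEventDetailsExecutor =
    Executors.newFixedThreadPool(poolSize);
{code}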



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11231) FSDownload set wrong permission in destinationTmp

2023-01-07 Thread Chris Nauroth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved YARN-11231.
--
  Assignee: Zhang Dongsheng
Resolution: Won't Fix

Hello [~skysider]. I noticed you closed pull request 
[#4629|https://github.com/apache/hadoop/pull/4629]. I assume you are abandoning 
this change, because 777 would be too dangerous, so I'm also closing this 
corresponding JIRA issue. (If I misunderstood, and you're still working on 
something for this, then the issue can be reopened.)

> FSDownload set wrong permission in destinationTmp
> -
>
> Key: YARN-11231
> URL: https://issues.apache.org/jira/browse/YARN-11231
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zhang Dongsheng
>Assignee: Zhang Dongsheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> FSDownload calls createDir in its call method to create the destinationTmp 
> directory, which is later used as the parent directory for the directory 
> dFinal. dFinal is used inside doAs to perform operations such as path 
> creation and path traversal. Since the identity used inside doAs cannot be 
> determined in advance, setting 755 permissions on destinationTmp is 
> problematic; I think it should be set to 777 permissions here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11388) Prevent resource leaks in TestClientRMService.

2022-12-28 Thread Chris Nauroth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved YARN-11388.
--
Fix Version/s: 3.4.0
   3.2.5
   3.3.9
   Resolution: Fixed

I have merged this to trunk, branch-3.3 and branch-3.2 (after resolving some 
minor merge conflicts). [~slfan1989] , thank you for your review!

> Prevent resource leaks in TestClientRMService.
> --
>
> Key: YARN-11388
> URL: https://issues.apache.org/jira/browse/YARN-11388
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: test
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5, 3.3.9
>
>
> While working on YARN-11360, I noticed a few problems in 
> {{TestClientRMService}} that made it difficult to work with. Tests do not 
> guarantee that servers they start up get shut down. If an individual test 
> fails, then it can leave TCP sockets bound, causing subsequent tests in the 
> suite to fail on their socket bind attempts for the same port. There is also 
> a file generated by a test that is leaking outside of the build directory 
> into the source tree.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11392) ClientRMService implemented getCallerUgi and verifyUserAccessForRMApp methods but forget to use sometimes, caused audit log missing.

2022-12-27 Thread Chris Nauroth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved YARN-11392.
--
Fix Version/s: 3.4.0
   3.2.5
   3.3.9
   Resolution: Fixed

I have committed this to trunk, branch-3.3 and branch-3.2. [~chino71], thank 
you for the contribution.

> ClientRMService implemented getCallerUgi and verifyUserAccessForRMApp methods 
> but forget to use sometimes, caused audit log missing.
> 
>
> Key: YARN-11392
> URL: https://issues.apache.org/jira/browse/YARN-11392
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.4
>Reporter: Beibei Zhao
>Assignee: Beibei Zhao
>Priority: Major
>  Labels: audit, log, pull-request-available, yarn
> Fix For: 3.4.0, 3.2.5, 3.3.9
>
>
> ClientRMService implemented getCallerUgi and verifyUserAccessForRMApp methods.
> {code:java}
> private UserGroupInformation getCallerUgi(ApplicationId applicationId,
>   String operation) throws YarnException {
> UserGroupInformation callerUGI;
> try {
>   callerUGI = UserGroupInformation.getCurrentUser();
> } catch (IOException ie) {
>   LOG.info("Error getting UGI ", ie);
>   RMAuditLogger.logFailure("UNKNOWN", operation, "UNKNOWN",
>   "ClientRMService", "Error getting UGI", applicationId);
>   throw RPCUtil.getRemoteException(ie);
> }
> return callerUGI;
>   }
> {code}
> *Privileged operations* like "getContainerReport" (which call checkAccess 
> before the operation) use these helpers and *record audit logs* when an 
> *exception* happens, but some code paths forget to use them, leaving the 
> audit log {*}missing{*}: 
> {code:java}
> // getApplicationReport
> UserGroupInformation callerUGI;
> try {
>   callerUGI = UserGroupInformation.getCurrentUser();
> } catch (IOException ie) {
>   LOG.info("Error getting UGI ", ie);
>      // a logFailure should be called here. 
>      throw RPCUtil.getRemoteException(ie);
> }
> {code}
> So, I will replace some code blocks like this with getCallerUgi or 
> verifyUserAccessForRMApp.
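As a hedged illustration of the proposed direction, the getApplicationReport block above could route through the existing helper so the failure path also produces an audit record; the operation-name string below is illustrative:

{code:java}
// getApplicationReport, sketched: getCallerUgi already calls
// RMAuditLogger.logFailure before rethrowing, so the audit record is kept.
UserGroupInformation callerUGI =
    getCallerUgi(applicationId, "Get Application Report"); // name illustrative
{code}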



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-11392) ClientRMService implemented getCallerUgi and verifyUserAccessForRMApp methods but forget to use sometimes, caused audit log missing.

2022-12-22 Thread Chris Nauroth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth reassigned YARN-11392:


Assignee: Beibei Zhao

> ClientRMService implemented getCallerUgi and verifyUserAccessForRMApp methods 
> but forget to use sometimes, caused audit log missing.
> 
>
> Key: YARN-11392
> URL: https://issues.apache.org/jira/browse/YARN-11392
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.3.4
>Reporter: Beibei Zhao
>Assignee: Beibei Zhao
>Priority: Major
>  Labels: audit, log, pull-request-available, yarn
>
> ClientRMService implemented getCallerUgi and verifyUserAccessForRMApp methods.
> {code:java}
> private UserGroupInformation getCallerUgi(ApplicationId applicationId,
>   String operation) throws YarnException {
> UserGroupInformation callerUGI;
> try {
>   callerUGI = UserGroupInformation.getCurrentUser();
> } catch (IOException ie) {
>   LOG.info("Error getting UGI ", ie);
>   RMAuditLogger.logFailure("UNKNOWN", operation, "UNKNOWN",
>   "ClientRMService", "Error getting UGI", applicationId);
>   throw RPCUtil.getRemoteException(ie);
> }
> return callerUGI;
>   }
> {code}
> *Privileged operations* like "getContainerReport" (which called checkAccess 
> before op) will call them and *record audit logs* when an *exception* 
> happens, but some code paths forget to use them, leaving the audit log 
> {*}missing{*}: 
> {code:java}
> // getApplicationReport
> UserGroupInformation callerUGI;
> try {
>   callerUGI = UserGroupInformation.getCurrentUser();
> } catch (IOException ie) {
>   LOG.info("Error getting UGI ", ie);
>      // a logFailure should be called here. 
>      throw RPCUtil.getRemoteException(ie);
> }
> {code}
> So, I will replace some code blocks like this with getCallerUgi or 
> verifyUserAccessForRMApp.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11397) Memory leak when reading aggregated logs from s3 (LogAggregationTFileController::readAggregatedLogs)

2022-12-16 Thread Chris Nauroth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17648770#comment-17648770
 ] 

Chris Nauroth commented on YARN-11397:
--

Is this the same {{S3AInstrumentation}} leak issue as HADOOP-18526, which is 
scheduled for inclusion in the upcoming 3.3.5 release?

CC: [~ste...@apache.org]

> Memory leak when reading aggregated logs from s3 
> (LogAggregationTFileController::readAggregatedLogs)
> 
>
> Key: YARN-11397
> URL: https://issues.apache.org/jira/browse/YARN-11397
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: log-aggregation
>Affects Versions: 3.2.2
> Environment: Remote logs dir on s3.
>Reporter: Maciej Smolenski
>Priority: Critical
> Attachments: YarnLogsS3Issue.scala
>
>
> Reproduction code in the attachment.
> When collecting aggregated logs from s3 in a loop (see reproduction code) we 
> can easily see that the number of 'S3AInstrumentation' is increasing although 
> the number of 'S3AFileSystem' is not increasing. It means that 
> 'S3AInstrumentation' is not released together with 'S3AFileSystem' as it 
> should be. The root cause of this seems to be the missing close on 
> S3AFileSystem.
> The issue seems similar to https://issues.apache.org/jira/browse/YARN-11039, 
> but this one is a 'memory leak' (not a 'thread leak') and the affected 
> version here is earlier (3.2.2).
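A hedged sketch of the missing-close pattern the report points at: obtain an uncached FileSystem for the remote log directory and close it when done, so its per-instance S3A resources (including S3AInstrumentation) are released. The path and ownership assumptions are illustrative, not the LogAggregationTFileController code:

{code:java}
Configuration conf = new Configuration();
Path remoteLogDir = new Path("s3a://bucket/app-logs"); // placeholder URI
try (FileSystem fs = FileSystem.newInstance(remoteLogDir.toUri(), conf)) {
  // ... read the aggregated log files under remoteLogDir ...
} // close() releases the per-instance S3A resources
{code}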



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11390) TestResourceTrackerService.testNodeRemovalNormally: Shutdown nodes should be 0 now expected: <1> but was: <0>

2022-12-08 Thread Chris Nauroth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved YARN-11390.
--
Fix Version/s: 3.4.0
   3.2.5
   3.3.9
   Resolution: Fixed

[~bkosztolnik] , thank you for the contribution. [~pszucs], thank you for 
reviewing. I have committed this to trunk, branch-3.3 and branch-3.2. For the 
cherry-picks to branch-3.3 and branch-3.2, I resolved some minor merge 
conflicts and confirmed a successful test run.

> TestResourceTrackerService.testNodeRemovalNormally: Shutdown nodes should be 
> 0 now expected: <1> but was: <0>
> -
>
> Key: YARN-11390
> URL: https://issues.apache.org/jira/browse/YARN-11390
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Bence Kosztolnik
>Assignee: Bence Kosztolnik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5, 3.3.9
>
>
> Sometimes TestResourceTrackerService.{*}testNodeRemovalNormally{*} fails 
> with the following message
> {noformat}
> java.lang.AssertionError: Shutdown nodes should be 0 now expected:<1> but 
> was:<0>
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtilDecomToUntracked(TestResourceTrackerService.java:1723)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalUtil(TestResourceTrackerService.java:1685)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testNodeRemovalNormally(TestResourceTrackerService.java:1530){noformat}
> This can happen when the hardcoded 1s sleep in the test is not enough for a 
> proper shutdown.
> To fix this, we should poll the cluster status with a timeout and verify that 
> the cluster reaches the expected state, as sketched below.
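A hedged sketch of that polling approach, assuming GenericTestUtils.waitFor is available to the test; the condition, poll interval, and timeout are illustrative:

{code:java}
// Poll instead of sleeping a fixed 1s; fail only after a real timeout.
GenericTestUtils.waitFor(
    () -> ClusterMetrics.getMetrics().getNumShutdownNMs() == 1,
    100,     // check every 100 ms
    10000);  // give up after 10 s
{code}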



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11388) Prevent resource leaks in TestClientRMService.

2022-12-05 Thread Chris Nauroth (Jira)
Chris Nauroth created YARN-11388:


 Summary: Prevent resource leaks in TestClientRMService.
 Key: YARN-11388
 URL: https://issues.apache.org/jira/browse/YARN-11388
 Project: Hadoop YARN
  Issue Type: Test
  Components: test
Reporter: Chris Nauroth
Assignee: Chris Nauroth


While working on YARN-11360, I noticed a few problems in 
{{TestClientRMService}} that made it difficult to work with. Tests do not 
guarantee that servers they start up get shut down. If an individual test fails, 
then it can leave TCP sockets bound, causing subsequent tests in the suite to 
fail on their socket bind attempts for the same port. There is also a file 
generated by a test that is leaking outside of the build directory into the 
source tree.
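A hedged sketch of the cleanup direction described above; the field and service type are illustrative rather than the exact shape of the final patch:

{code:java}
private ClientRMService rmService; // started by individual tests

@After
public void tearDown() {
  if (rmService != null) {
    rmService.stop();  // always release the bound TCP port, even on failure
    rmService = null;
  }
}
{code}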



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11363) Remove unused TimelineVersionWatcher and TimelineVersion from hadoop-yarn-server-tests

2022-11-01 Thread Chris Nauroth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved YARN-11363.
--
Fix Version/s: 3.3.5
   3.4.0
   Resolution: Fixed

> Remove unused TimelineVersionWatcher and TimelineVersion from 
> hadoop-yarn-server-tests 
> ---
>
> Key: YARN-11363
> URL: https://issues.apache.org/jira/browse/YARN-11363
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: test, yarn
>Affects Versions: 3.3.3, 3.3.4
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.5
>
>
> Verify and remove unused TimelineVersionWatcher and TimelineVersion from 
> hadoop-yarn-server-tests 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Resolved] (YARN-11360) Add number of decommissioning/shutdown nodes to YARN cluster metrics.

2022-10-28 Thread Chris Nauroth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved YARN-11360.
--
Fix Version/s: 3.4.0
   3.2.5
   3.3.9
 Hadoop Flags: Reviewed
   Resolution: Fixed

I have committed this to trunk, branch-3.3 and branch-3.2 (after resolving a 
minor merge conflict). [~mkonst], [~groot] and [~abmodi], thank you for the 
code reviews.

> Add number of decommissioning/shutdown nodes to YARN cluster metrics.
> -
>
> Key: YARN-11360
> URL: https://issues.apache.org/jira/browse/YARN-11360
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client, resourcemanager
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5, 3.3.9
>
>
> YARN cluster metrics expose counts of NodeManagers in various states 
> including active and decommissioned. However, these metrics don't expose 
> NodeManagers that are currently in the process of decommissioning. This can 
> look a little spooky to a consumer of these metrics. First, the node drops 
> out of the active count, so it seems like a node just vanished. Then, later 
> (possibly hours later with consideration of graceful decommission), it comes 
> back into existence in the decommissioned count.
> This issue tracks adding the decommissioning count to the metrics 
> ResourceManager RPC. This also enables exposing it in the {{yarn top}} 
> output. This metric is already visible through the REST API, so there isn't 
> any change required there.
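For context, a hedged sketch of how a client reads these counts through YarnClient; the decommissioning accessor is the one this issue introduces, so its exact name here is an assumption:

{code:java}
YarnClient client = YarnClient.createYarnClient();
client.init(new YarnConfiguration());
client.start();
try {
  YarnClusterMetrics metrics = client.getYarnClusterMetrics();
  int active = metrics.getNumNodeManagers();
  int decommissioned = metrics.getNumDecommissionedNodeManagers();
  int decommissioning = metrics.getNumDecommissioningNodeManagers(); // assumed name
} finally {
  client.stop();
}
{code}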



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11360) Add number of decommissioning/shutdown nodes to YARN cluster metrics.

2022-10-25 Thread Chris Nauroth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-11360:
-
Summary: Add number of decommissioning/shutdown nodes to YARN cluster 
metrics.  (was: Add number of decommissioning nodes to YARN cluster metrics.)

[~mkonst], thank you for the review. I've updated this to include the shutdown 
count like you suggested.

> Add number of decommissioning/shutdown nodes to YARN cluster metrics.
> -
>
> Key: YARN-11360
> URL: https://issues.apache.org/jira/browse/YARN-11360
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client, resourcemanager
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Major
>  Labels: pull-request-available
>
> YARN cluster metrics expose counts of NodeManagers in various states 
> including active and decommissioned. However, these metrics don't expose 
> NodeManagers that are currently in the process of decommissioning. This can 
> look a little spooky to a consumer of these metrics. First, the node drops 
> out of the active count, so it seems like a node just vanished. Then, later 
> (possibly hours later with consideration of graceful decommission), it comes 
> back into existence in the decommissioned count.
> This issue tracks adding the decommissioning count to the metrics 
> ResourceManager RPC. This also enables exposing it in the {{yarn top}} 
> output. This metric is already visible through the REST API, so there isn't 
> any change required there.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11360) Add number of decommissioning nodes to YARN cluster metrics.

2022-10-21 Thread Chris Nauroth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622392#comment-17622392
 ] 

Chris Nauroth commented on YARN-11360:
--

Changing the {{yarn top}} output could be viewed as a backward-incompatible 
change according to our policy:

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html#Command_Line_Interface_.28CLI.29

However, since {{yarn top}} is targeted at interactive use and doesn't seem 
usable in scripting anyway, I tend to think this is acceptable. I'll get input 
from others on this before committing. If necessary, I can split the {{yarn 
top}} part to a separate patch.

> Add number of decommissioning nodes to YARN cluster metrics.
> 
>
> Key: YARN-11360
> URL: https://issues.apache.org/jira/browse/YARN-11360
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client, resourcemanager
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Major
>  Labels: pull-request-available
>
> YARN cluster metrics expose counts of NodeManagers in various states 
> including active and decommissioned. However, these metrics don't expose 
> NodeManagers that are currently in the process of decommissioning. This can 
> look a little spooky to a consumer of these metrics. First, the node drops 
> out of the active count, so it seems like a node just vanished. Then, later 
> (possibly hours later with consideration of graceful decommission), it comes 
> back into existence in the decommissioned count.
> This issue tracks adding the decommissioning count to the metrics 
> ResourceManager RPC. This also enables exposing it in the {{yarn top}} 
> output. This metric is already visible through the REST API, so there isn't 
> any change required there.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-11360) Add number of decommissioning nodes to YARN cluster metrics.

2022-10-21 Thread Chris Nauroth (Jira)
Chris Nauroth created YARN-11360:


 Summary: Add number of decommissioning nodes to YARN cluster 
metrics.
 Key: YARN-11360
 URL: https://issues.apache.org/jira/browse/YARN-11360
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: client, resourcemanager
 Environment: YARN cluster metrics expose counts of NodeManagers in 
various states including active and decommissioned. However, these metrics 
don't expose NodeManagers that are currently in the process of decommissioning. 
This can look a little spooky to a consumer of these metrics. First, the node 
drops out of the active count, so it seems like a node just vanished. Then, 
later (possibly hours later with consideration of graceful decommission), it 
comes back into existence in the decommissioned count.

This issue tracks adding the decommissioning count to the metrics 
ResourceManager RPC. This also enables exposing it in the {{yarn top}} output. 
This metric is already visible through the REST API, so there isn't any change 
required there.

Reporter: Chris Nauroth
Assignee: Chris Nauroth






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-11360) Add number of decommissioning nodes to YARN cluster metrics.

2022-10-21 Thread Chris Nauroth (Jira)


 [ 
https://issues.apache.org/jira/browse/YARN-11360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-11360:
-
Description: 
YARN cluster metrics expose counts of NodeManagers in various states including 
active and decommissioned. However, these metrics don't expose NodeManagers 
that are currently in the process of decommissioning. This can look a little 
spooky to a consumer of these metrics. First, the node drops out of the active 
count, so it seems like a node just vanished. Then, later (possibly hours later 
with consideration of graceful decommission), it comes back into existence in 
the decommissioned count.

This issue tracks adding the decommissioning count to the metrics 
ResourceManager RPC. This also enables exposing it in the {{yarn top}} output. 
This metric is already visible through the REST API, so there isn't any change 
required there.

Environment: (was: YARN cluster metrics expose counts of NodeManagers 
in various states including active and decommissioned. However, these metrics 
don't expose NodeManagers that are currently in the process of decommissioning. 
This can look a little spooky to a consumer of these metrics. First, the node 
drops out of the active count, so it seems like a node just vanished. Then, 
later (possibly hours later with consideration of graceful decommission), it 
comes back into existence in the decommissioned count.

This issue tracks adding the decommissioning count to the metrics 
ResourceManager RPC. This also enables exposing it in the {{yarn top}} output. 
This metric is already visible through the REST API, so there isn't any change 
required there.
)

> Add number of decommissioning nodes to YARN cluster metrics.
> 
>
> Key: YARN-11360
> URL: https://issues.apache.org/jira/browse/YARN-11360
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client, resourcemanager
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Major
>
> YARN cluster metrics expose counts of NodeManagers in various states 
> including active and decommissioned. However, these metrics don't expose 
> NodeManagers that are currently in the process of decommissioning. This can 
> look a little spooky to a consumer of these metrics. First, the node drops 
> out of the active count, so it seems like a node just vanished. Then, later 
> (possibly hours later with consideration of graceful decommission), it comes 
> back into existence in the decommissioned count.
> This issue tracks adding the decommissioning count to the metrics 
> ResourceManager RPC. This also enables exposing it in the {{yarn top}} 
> output. This metric is already visible through the REST API, so there isn't 
> any change required there.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11231) FSDownload set wrong permission in destinationTmp

2022-07-27 Thread Chris Nauroth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572042#comment-17572042
 ] 

Chris Nauroth commented on YARN-11231:
--

777 is generally a very dangerous thing. This seems like it would open security 
risks of other users writing into the submitter's directories.

Can you provide more details about the problem and how 777 solves it? In an 
unsecured cluster, this all runs as the yarn user, so I don't see how there 
would be a problem there. In a Kerberos secured cluster, resource localization 
runs as the submitting user, which should be granted access with 755. Is there 
something unique in your configuration that causes a conflict?

> FSDownload set wrong permission in destinationTmp
> -
>
> Key: YARN-11231
> URL: https://issues.apache.org/jira/browse/YARN-11231
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Zhang Dongsheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> FSDownload calls createDir in its call method to create the destinationTmp 
> directory, which is later used as the parent directory for the directory 
> dFinal. dFinal is used inside doAs to perform operations such as path 
> creation and path traversal. Since the identity used inside doAs cannot be 
> determined in advance, setting 755 permissions on destinationTmp is 
> problematic; I think it should be set to 777 permissions here.
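For reference, a minimal sketch of creating a directory with 755 through the Hadoop FileContext API; the path and names are placeholders, not the FSDownload internals:

{code:java}
FileContext files = FileContext.getLocalFSFileContext();
Path destinationTmp = new Path("/tmp/example_tmp"); // placeholder path
// 0755: owner rwx, group/other r-x, the permission under discussion
files.mkdir(destinationTmp, new FsPermission((short) 0755), true);
{code}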



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11197) Backport YARN-9608 - DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2022-06-24 Thread Chris Nauroth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558701#comment-17558701
 ] 

Chris Nauroth commented on YARN-11197:
--

Thanks, [~groot]! I would offer to code review for you, but I'll be away from 
my computer for the next 2 weeks. I'll check back when I return, but if another 
committer wants to take it up, that's great too.

> Backport YARN-9608 - DecommissioningNodesWatcher should get lists of running 
> applications on node from RMNode.
> --
>
> Key: YARN-11197
> URL: https://issues.apache.org/jira/browse/YARN-11197
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.10.1, 2.10.2
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> There have been asks, both in the community and internally, to have YARN-9608 
> in hadoop-2.10 as well. 
> Evaluate and create a patch for it. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-11197) Backport YARN-9608 - DecommissioningNodesWatcher should get lists of running applications on node from RMNode.

2022-06-24 Thread Chris Nauroth (Jira)


[ 
https://issues.apache.org/jira/browse/YARN-11197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17558616#comment-17558616
 ] 

Chris Nauroth commented on YARN-11197:
--

This is good timing. Dataproc actually just recently backported this and tested 
internally. [~groot], I see you are the assignee for the issue right now. Are 
you actively working on this? If not, we could contribute our patch.

CC: [~abmodi], [~mkonst]

> Backport YARN-9608 - DecommissioningNodesWatcher should get lists of running 
> applications on node from RMNode.
> --
>
> Key: YARN-11197
> URL: https://issues.apache.org/jira/browse/YARN-11197
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.10.1, 2.10.2
>Reporter: Ashutosh Gupta
>Assignee: Ashutosh Gupta
>Priority: Major
>
> There have been asks, both in the community and internally, to have YARN-9608 
> in hadoop-2.10 as well. 
> Evaluate and create a patch for it. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5838) windows - environement variables aren't accessible on Yarn 3.0 alpha-1

2016-12-30 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15788415#comment-15788415
 ] 

Chris Nauroth commented on YARN-5838:
-

Hello [~rekha.du...@gmail.com].  Are you possibly looking for the 
{{yarn.nodemanager.admin-env}} setting in yarn-site.xml?  Here is a copy-paste 
of the default as defined in yarn-default.xml:

{code}
  <property>
    <description>Environment variables that should be forwarded from the
    NodeManager's environment to the container's.</description>
    <name>yarn.nodemanager.admin-env</name>
    <value>MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX</value>
  </property>
{code}


> windows - environement variables aren't accessible on Yarn 3.0 alpha-1
> --
>
> Key: YARN-5838
> URL: https://issues.apache.org/jira/browse/YARN-5838
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0-alpha1
> Environment: windows 7
>Reporter: Kanthirekha
>
> Windows environment variables aren't accessible on YARN 3.0 alpha-1.
> We tried fetching %Path% from the ApplicationMaster and from the container 
> script (after a container is allocated by the ApplicationMaster for task 
> execution):
> echo %Path%  
> The result is "ECHO is on.", i.e. it returns blank. 
> Could you please let us know the necessary steps to access environment 
> variables on the YARN 3.0 alpha-1 version? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-3514) Active directory usernames like domain\login cause YARN failures

2016-09-30 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-3514:

Assignee: (was: Chris Nauroth)

I'm not actively working on this, so I'm unassigning.

> Active directory usernames like domain\login cause YARN failures
> 
>
> Key: YARN-3514
> URL: https://issues.apache.org/jira/browse/YARN-3514
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.2.0
> Environment: CentOS6
>Reporter: john lilley
>Priority: Minor
>  Labels: BB2015-05-TBR
> Attachments: YARN-3514.001.patch, YARN-3514.002.patch
>
>
> We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is 
> Kerberos-enabled and uses an external AD domain controller for the KDC.  We 
> are able to authenticate, browse HDFS, etc.  However, YARN fails during 
> localization because it seems to get confused by the presence of a \ 
> character in the local user name.
> Our AD authentication on the nodes goes through sssd and is configured to 
> map AD users onto the form domain\username.  For example, our test user has a 
> Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user 
> "domain\hadoopuser".  We have no problem validating that user with PAM, 
> logging in as that user, su-ing to that user, etc.
> However, when we attempt to run a YARN application master, the localization 
> step fails when setting up the local cache directory for the AM.  The error 
> that comes out of the RM logs:
> 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: 
> ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, 
> diagnostics='Application application_1429295486450_0001 failed 1 times due to 
> AM Container for appattempt_1429295486450_0001_01 exited with  exitCode: 
> -1000 due to: Application application_1429295486450_0001 initialization 
> failed (exitCode=255) with output: main : command provided 0
> main : user is DOMAIN\hadoopuser
> main : requested yarn user is domain\hadoopuser
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create 
> directory: 
> /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
> at 
> org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
> .Failing this attempt.. Failing the application.'
> However, when we look on the node launching the AM, we see this:
> [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
> [root@rpb-cdh-kerb-2 usercache]# ls -l
> drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
> There appears to be different treatment of the \ character in different 
> places.  Something creates the directory as "domain\hadoopuser" but something 
> else later attempts to use it as "domain%5Chadoopuser".  I’m not sure where 
> or why the URL escapement converts the \ to %5C or why this is not consistent.
> I should also mention, for the sake of completeness, our auth_to_local rule 
> is set up to map u...@domain.com to domain\user:
> RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g
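As a small editorial illustration of the escaping seen in the report (not a claim about which YARN code path performs it), standard URL encoding turns the backslash into %5C:

{code:java}
String localUser = "domain\\hadoopuser";
String escaped = java.net.URLEncoder.encode(localUser, "UTF-8");
// escaped is "domain%5Chadoopuser", matching the path the localizer tried to use
{code}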



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4205) Add a service for monitoring application life time out

2016-09-29 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534277#comment-15534277
 ] 

Chris Nauroth commented on YARN-4205:
-

[~gtCarrera9], we are all clear to use JDK 8 features in trunk, so I think 
committing only to branch-2 is fine.

> Add a service for monitoring application life time out
> --
>
> Key: YARN-4205
> URL: https://issues.apache.org/jira/browse/YARN-4205
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: nijel
>Assignee: Rohith Sharma K S
> Fix For: 2.9.0
>
> Attachments: 0001-YARN-4205.patch, 0002-YARN-4205.patch, 
> 0003-YARN-4205.patch, 0004-YARN-4205.patch, 0005-YARN-4205.patch, 
> 0006-YARN-4205.patch, 0007-YARN-4205.1.patch, 0007-YARN-4205.2.patch, 
> 0007-YARN-4205.patch, YARN-4205-addendum.001.patch, YARN-4205_01.patch, 
> YARN-4205_02.patch, YARN-4205_03.patch
>
>
> This JIRA intends to provide a lifetime monitor service. 
> The service will monitor the applications where the life time is configured. 
> If the application is running beyond the lifetime, it will be killed. 
> The lifetime will be considered from the submit time.
> The thread monitoring interval is configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4205) Add a service for monitoring application life time out

2016-09-29 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534232#comment-15534232
 ] 

Chris Nauroth commented on YARN-4205:
-

+1 for the addendum patch.  [~gtCarrera9], thank you.

> Add a service for monitoring application life time out
> --
>
> Key: YARN-4205
> URL: https://issues.apache.org/jira/browse/YARN-4205
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: nijel
>Assignee: Rohith Sharma K S
> Fix For: 2.9.0
>
> Attachments: 0001-YARN-4205.patch, 0002-YARN-4205.patch, 
> 0003-YARN-4205.patch, 0004-YARN-4205.patch, 0005-YARN-4205.patch, 
> 0006-YARN-4205.patch, 0007-YARN-4205.1.patch, 0007-YARN-4205.2.patch, 
> 0007-YARN-4205.patch, YARN-4205-addendum.001.patch, YARN-4205_01.patch, 
> YARN-4205_02.patch, YARN-4205_03.patch
>
>
> This JIRA intends to provide a lifetime monitor service. 
> The service will monitor the applications where the life time is configured. 
> If the application is running beyond the lifetime, it will be killed. 
> The lifetime will be considered from the submit time.
> The thread monitoring interval is configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4205) Add a service for monitoring application life time out

2016-09-29 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534056#comment-15534056
 ] 

Chris Nauroth commented on YARN-4205:
-

I think this patch broke compilation on branch-2.

{code}
  private Map monitoredApps =
  new HashMap();
{code}

{code}
monitoredApps.putIfAbsent(appToMonitor, timeout);
{code}

[{{Map#putIfAbsent}}|https://docs.oracle.com/javase/8/docs/api/java/util/Map.html#putIfAbsent-K-V-]
 was added in JDK 1.8, but we want to be able to compile branch-2 for JDK 1.7.

Can someone please take a look?
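For reference, a JDK 1.7-compatible equivalent of that call (thread-safety considerations aside; shown only to illustrate the constraint):

{code:java}
if (!monitoredApps.containsKey(appToMonitor)) {
  monitoredApps.put(appToMonitor, timeout);
}
{code}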

> Add a service for monitoring application life time out
> --
>
> Key: YARN-4205
> URL: https://issues.apache.org/jira/browse/YARN-4205
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: scheduler
>Reporter: nijel
>Assignee: Rohith Sharma K S
> Fix For: 2.9.0
>
> Attachments: 0001-YARN-4205.patch, 0002-YARN-4205.patch, 
> 0003-YARN-4205.patch, 0004-YARN-4205.patch, 0005-YARN-4205.patch, 
> 0006-YARN-4205.patch, 0007-YARN-4205.1.patch, 0007-YARN-4205.2.patch, 
> 0007-YARN-4205.patch, YARN-4205_01.patch, YARN-4205_02.patch, 
> YARN-4205_03.patch
>
>
> This JIRA intends to provide a lifetime monitor service. 
> The service will monitor the applications where the life time is configured. 
> If the application is running beyond the lifetime, it will be killed. 
> The lifetime will be considered from the submit time.
> The thread monitoring interval is configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5551) Ignore deleted file mapping from memory computation when smaps is enabled

2016-08-25 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437256#comment-15437256
 ] 

Chris Nauroth commented on YARN-5551:
-

OK, I get it now.  Thanks, [~gopalv].  I'd be fine proceeding with the change.  
I'm not online until after Labor Day, so I can't do a full code review, test 
and commit.  If anyone else wants to do it, please don't wait for me.

> Ignore deleted file mapping from memory computation when smaps is enabled
> -
>
> Key: YARN-5551
> URL: https://issues.apache.org/jira/browse/YARN-5551
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Attachments: YARN-5551.branch-2.001.patch
>
>
> Currently deleted file mappings are also included in the memory computation 
> when SMAP is enabled. For example:
> {noformat}
> 7f612004a000-7f612004c000 rw-s  00:10 4201507513 
> /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-521969216_162_734673185
>  (deleted)
> Size:  8 kB
> Rss:   4 kB
> Pss:   2 kB
> Shared_Clean:  0 kB
> Shared_Dirty:  4 kB
> Private_Clean: 0 kB
> Private_Dirty: 0 kB
> Referenced:4 kB
> Anonymous: 0 kB
> AnonHugePages: 0 kB
> Swap:  0 kB
> KernelPageSize:4 kB
> MMUPageSize:   4 kB
> 7f6123f99000-7f6163f99000 rw-p  08:41 211419477  
> /grid/4/hadoop/yarn/local/usercache/root/appcache/application_1466700718395_1249/container_e19_1466700718395_1249_01_03/7389389356021597290.cache
>  (deleted)
> Size:1048576 kB
> Rss:  637292 kB
> Pss:  637292 kB
> Shared_Clean:  0 kB
> Shared_Dirty:  0 kB
> Private_Clean: 0 kB
> Private_Dirty:637292 kB
> Referenced:   637292 kB
> Anonymous:637292 kB
> AnonHugePages: 0 kB
> Swap:  0 kB
> KernelPageSize:4 kB
> {noformat}
> It would be good to exclude these from getSmapBasedRssMemorySize() 
> computation.  
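A hedged sketch of the proposed exclusion, assuming the smaps content is available as a list of lines; this is an illustration, not the ProcfsBasedProcessTree code:

{code:java}
boolean inDeletedMapping = false;
long privateDirtyKb = 0;
for (String line : smapsLines) {                      // assumed List<String>
  if (line.matches("^[0-9a-f]+-[0-9a-f]+\\s.*")) {    // a mapping header line
    inDeletedMapping = line.trim().endsWith("(deleted)");
  } else if (!inDeletedMapping && line.startsWith("Private_Dirty:")) {
    privateDirtyKb += Long.parseLong(line.replaceAll("\\D", ""));
  }
}
{code}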



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5551) Ignore deleted file mapping from memory computation when smaps is enabled

2016-08-25 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15437095#comment-15437095
 ] 

Chris Nauroth commented on YARN-5551:
-

My understanding agrees with Jason's last comment.  The mapping could last well 
past the deletion of the underlying file, maybe even for the whole lifetime of 
the process, so it's correct to include it in the accounting.

> Ignore deleted file mapping from memory computation when smaps is enabled
> -
>
> Key: YARN-5551
> URL: https://issues.apache.org/jira/browse/YARN-5551
> Project: Hadoop YARN
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Rajesh Balamohan
>Priority: Minor
> Attachments: YARN-5551.branch-2.001.patch
>
>
> Currently deleted file mappings are also included in the memory computation 
> when SMAP is enabled. For example:
> {noformat}
> 7f612004a000-7f612004c000 rw-s  00:10 4201507513 
> /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-521969216_162_734673185
>  (deleted)
> Size:  8 kB
> Rss:   4 kB
> Pss:   2 kB
> Shared_Clean:  0 kB
> Shared_Dirty:  4 kB
> Private_Clean: 0 kB
> Private_Dirty: 0 kB
> Referenced:4 kB
> Anonymous: 0 kB
> AnonHugePages: 0 kB
> Swap:  0 kB
> KernelPageSize:4 kB
> MMUPageSize:   4 kB
> 7f6123f99000-7f6163f99000 rw-p  08:41 211419477  
> /grid/4/hadoop/yarn/local/usercache/root/appcache/application_1466700718395_1249/container_e19_1466700718395_1249_01_03/7389389356021597290.cache
>  (deleted)
> Size:1048576 kB
> Rss:  637292 kB
> Pss:  637292 kB
> Shared_Clean:  0 kB
> Shared_Dirty:  0 kB
> Private_Clean: 0 kB
> Private_Dirty:637292 kB
> Referenced:   637292 kB
> Anonymous:637292 kB
> AnonHugePages: 0 kB
> Swap:  0 kB
> KernelPageSize:4 kB
> {noformat}
> It would be good to exclude these from getSmapBasedRssMemorySize() 
> computation.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5456) container-executor support for FreeBSD, NetBSD, and others if conf path is absolute

2016-08-02 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15404892#comment-15404892
 ] 

Chris Nauroth commented on YARN-5456:
-

[~aw], patch 01 looks good.  I verified this on OS X, Linux and FreeBSD.  It's 
cool to see the test passing on FreeBSD this time around!  My only other 
suggestion is to try deploying this change in a secured cluster for a bit of 
manual testing before we commit.

> container-executor support for FreeBSD, NetBSD, and others if conf path is 
> absolute
> ---
>
> Key: YARN-5456
> URL: https://issues.apache.org/jira/browse/YARN-5456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha2
>Reporter: Allen Wittenauer
>Assignee: Allen Wittenauer
> Attachments: YARN-5456.00.patch, YARN-5456.01.patch
>
>
> YARN-5121 fixed quite a few portability issues, but it also changed how it 
> determines its location to be very operating-system-specific for security 
> reasons.  We should add support for FreeBSD to unbreak its ports entry, NetBSD (the 
> sysctl options are just in a different order), and for operating systems that 
> do not have a defined method, an escape hatch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5456) container-executor support for FreeBSD, NetBSD, and others if conf path is absolute

2016-08-01 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402789#comment-15402789
 ] 

Chris Nauroth commented on YARN-5456:
-

OK, this plan sounds fine to me.  I think the only additional thing we need 
here is the check on the {{malloc}} call.

> container-executor support for FreeBSD, NetBSD, and others if conf path is 
> absolute
> ---
>
> Key: YARN-5456
> URL: https://issues.apache.org/jira/browse/YARN-5456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha2
>Reporter: Allen Wittenauer
>Assignee: Allen Wittenauer
> Attachments: YARN-5456.00.patch
>
>
> YARN-5121 fixed quite a few portability issues, but it also changed how it 
> determines its location to be very operating-system-specific for security 
> reasons.  We should add support for FreeBSD to unbreak its ports entry, NetBSD (the 
> sysctl options are just in a different order), and for operating systems that 
> do not have a defined method, an escape hatch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5456) container-executor support for FreeBSD, NetBSD, and others if conf path is absolute

2016-08-01 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402719#comment-15402719
 ] 

Chris Nauroth commented on YARN-5456:
-

[~aw], thank you for the patch.  I ran it on OS X, Linux and FreeBSD.  I think 
this will be ready to go after adding error checks on the {{malloc}} call and 
discussing a testing obstacle I'm hitting.

I'm running {{test-container-executor}}, and it passes everywhere except my 
FreeBSD VM.  In target/native-results/test-container-executor.stdout, I see 
this:

{code}
Testing delete_container()
Can't chmod /tmp/test-container-executor/local-1/usercache/cnauroth to add the 
sticky bit - Operation not permitted
Can't chmod /tmp/test-container-executor/local-2/usercache/cnauroth to add the 
sticky bit - Operation not permitted
Can't chmod /tmp/test-container-executor/local-3/usercache/cnauroth to add the 
sticky bit - Operation not permitted
Can't chmod /tmp/test-container-executor/local-4/usercache/cnauroth to add the 
sticky bit - Operation not permitted
Can't chmod /tmp/test-container-executor/local-5/usercache/cnauroth to add the 
sticky bit - Operation not permitted
FAIL: failed to initialize user cnauroth
{code}

That error comes from this code in container-executor.c:

{code}
int create_directory_for_user(const char* path) {
  // set 2750 permissions and group sticky bit
  mode_t permissions = S_IRWXU | S_IRGRP | S_IXGRP | S_ISGID;
...
  if (chmod(path, permissions) != 0) {
fprintf(LOGFILE, "Can't chmod %s to add the sticky bit - %s\n",
path, strerror(errno));
ret = -1;
{code}

I tried testing {{chmod}} to set the setgid bit, and sure enough it fails on 
FreeBSD.  I can set the setuid bit and the sticky bit.  The problem only 
happens when trying to set the setgid bit as a non-root user.

{code}
> chmod 4750 /tmp/test-container-executor/local-1/usercache/cnauroth

> chmod 2750 /tmp/test-container-executor/local-1/usercache/cnauroth
chmod: /tmp/test-container-executor/local-1/usercache/cnauroth: Operation not 
permitted

> chmod 1750 /tmp/test-container-executor/local-1/usercache/cnauroth
{code}

I don't see this behavior on any other OS.  I assume it's some kind of 
environmental configuration quirk, but I haven't been able to find any tips in 
documentation.  Have you seen this?  Does the test pass for you on FreeBSD?

> container-executor support for FreeBSD, NetBSD, and others if conf path is 
> absolute
> ---
>
> Key: YARN-5456
> URL: https://issues.apache.org/jira/browse/YARN-5456
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha2
>Reporter: Allen Wittenauer
>Assignee: Allen Wittenauer
> Attachments: YARN-5456.00.patch
>
>
> YARN-5121 fixed quite a few container-executor portability issues, but it 
> also changed how it determines its location to be very 
> operating-system-specific for security reasons.  We should add support for 
> FreeBSD to unbreak its ports entry, NetBSD (the sysctl options are just in a 
> different order), and for operating systems that do not have a defined 
> method, an escape hatch.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5121) fix some container-executor portability issues

2016-07-29 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400278#comment-15400278
 ] 

Chris Nauroth commented on YARN-5121:
-

Allen, sorry, I just spotted one more thing.  Would you please check for 
{{NULL}} returns from the {{malloc}} calls in {{__get_exec_readproc}} and the 
OS X implementation of {{get_executable}}?

> fix some container-executor portability issues
> --
>
> Key: YARN-5121
> URL: https://issues.apache.org/jira/browse/YARN-5121
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha1
>Reporter: Allen Wittenauer
>Assignee: Allen Wittenauer
>Priority: Blocker
> Attachments: YARN-5121.00.patch, YARN-5121.01.patch, 
> YARN-5121.02.patch, YARN-5121.03.patch, YARN-5121.04.patch, 
> YARN-5121.06.patch, YARN-5121.07.patch
>
>
> container-executor has some issues that are preventing it from even compiling 
> on the OS X jenkins instance.  Let's fix those.  While we're there, let's 
> also try to take care of some of the other portability problems that have 
> crept in over the years, since it used to work great on Solaris but now 
> doesn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5121) fix some container-executor portability issues

2016-07-29 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400252#comment-15400252
 ] 

Chris Nauroth commented on YARN-5121:
-

Thanks for the detailed explanation.  It's all clear to me now.  I expect this 
will be ready to commit after your next revision to fix the few remaining 
nitpicks.  That next revision can fix the one remaining compiler warning too.

[~chris.douglas], let us know if you have any more feedback.  If not, then I 
would likely +1 and commit soon.

bq. (This whole conversation is rather timely, given that Roger Faulkner just 
passed away recently.)

I did not know the name before, but I just read an "In Memoriam" article.  
Thank you, Roger.

> fix some container-executor portability issues
> --
>
> Key: YARN-5121
> URL: https://issues.apache.org/jira/browse/YARN-5121
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha1
>Reporter: Allen Wittenauer
>Assignee: Allen Wittenauer
>Priority: Blocker
> Attachments: YARN-5121.00.patch, YARN-5121.01.patch, 
> YARN-5121.02.patch, YARN-5121.03.patch, YARN-5121.04.patch, YARN-5121.06.patch
>
>
> container-executor has some issues that are preventing it from even compiling 
> on the OS X jenkins instance.  Let's fix those.  While we're there, let's 
> also try to take care of some of the other portability problems that have 
> crept in over the years, since it used to work great on Solaris but now 
> doesn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-5121) fix some container-executor portability issues

2016-07-29 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400171#comment-15400171
 ] 

Chris Nauroth commented on YARN-5121:
-

[~aw], thank you for this patch.  I have confirmed a successful full build and 
run of test-container-executor on OS X and Linux.

Just a few questions:

bq. For 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/compat/{fstatat|openat|unlinkat}.h:

I just want to double-check with you that the fchmodat.h and fdopendir.h 
implementations are not BSD-licensed, and that's why they're not listed in 
LICENSE.txt and instead have an Apache license header.  Is that correct?

{code}
  fprintf(stderr,"ret = %s\n", ret);
{code}

Chris D mentioned previously that this might have been a leftover from 
debugging.  Did you intend to keep it, or should we drop it?

{code}
char* get_executable() {
 return __get_exec_readproc("/proc/self/path/a.out");
}
{code}

Please check the indentation on the return statement.

Is "/proc/self/path/a.out" correct?  The /proc/self part makes sense to me, but 
the rest of it surprised me.  Is that a.out like the default gcc binary output 
path?  I have nearly zero experience with Solaris, so I trust your knowledge 
here.  :-)
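For what it's worth, a quick standalone {{readlink}} probe like the following 
(illustration only, not the patch's helper) would confirm on a Solaris box 
whether that symlink resolves to the running binary:

{code}
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Resolve the Solaris-style /proc/self/path/a.out symlink and print where
 * it points; if the path is right, it should be the running binary itself. */
int main(void) {
  char buf[PATH_MAX];
  ssize_t len = readlink("/proc/self/path/a.out", buf, sizeof(buf) - 1);
  if (len < 0) {
    fprintf(stderr, "readlink failed: %s\n", strerror(errno));
    return 1;
  }
  buf[len] = '\0';
  printf("executable: %s\n", buf);
  return 0;
}
{code}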


> fix some container-executor portability issues
> --
>
> Key: YARN-5121
> URL: https://issues.apache.org/jira/browse/YARN-5121
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 3.0.0-alpha1
>Reporter: Allen Wittenauer
>Assignee: Allen Wittenauer
>Priority: Blocker
> Attachments: YARN-5121.00.patch, YARN-5121.01.patch, 
> YARN-5121.02.patch, YARN-5121.03.patch, YARN-5121.04.patch, YARN-5121.06.patch
>
>
> container-executor has some issues that are preventing it from even compiling 
> on the OS X jenkins instance.  Let's fix those.  While we're there, let's 
> also try to take care of some of the other portability problems that have 
> crept in over the years, since it used to work great on Solaris but now 
> doesn't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4887) AM-RM protocol changes for identifying resource-requests explicitly

2016-05-21 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15295122#comment-15295122
 ] 

Chris Nauroth commented on YARN-4887:
-

Hello [~subru].  I think the YARN build would need configuration to exclude 
protobuf-generated sources, which do tend to generate a lot of Javadoc warnings 
that we can't do anything about.  For example, here is an exclusion from 
hadoop-hdfs-project/hadoop-hdfs/pom.xml:

{code}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <configuration>
    <excludePackageNames>org.apache.hadoop.hdfs.protocol.proto</excludePackageNames>
  </configuration>
</plugin>
{code}

I don't see a similar exclusion in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/pom.xml.

> AM-RM protocol changes for identifying resource-requests explicitly
> ---
>
> Key: YARN-4887
> URL: https://issues.apache.org/jira/browse/YARN-4887
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: applications, resourcemanager
>Reporter: Subru Krishnan
>Assignee: Subru Krishnan
> Attachments: YARN-4887-v1.patch, YARN-4887-v2.patch
>
>
> YARN-4879 proposes the addition of a simple delta allocate protocol. This 
> JIRA is to track the changes in AM-RM protocol to accomplish it. The crux is 
> the addition of ID field in ResourceRequest and Container. The detailed 
> proposal is in the parent JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-4681) ProcfsBasedProcessTree should not calculate private clean pages

2016-03-01 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174361#comment-15174361
 ] 

Chris Nauroth commented on YARN-4681:
-

[~je.ik], thank you for updating the patch.  I'm +1 for this change, pending a 
pre-commit test run from Jenkins.  I just clicked the Submit Patch button, so 
Jenkins should pick it up now.

However, I'm not confident enough to commit it immediately.  I'd like to see 
reviews from committers who spend more time in YARN than I do.  I'd also like to 
find out if anyone thinks it should be configurable whether it checks the Locked 
field or performs the old calculation.  I don't have a sense for how widely 
people depend on the current smaps checks.

> ProcfsBasedProcessTree should not calculate private clean pages
> ---
>
> Key: YARN-4681
> URL: https://issues.apache.org/jira/browse/YARN-4681
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Jan Lukavsky
>Assignee: Jan Lukavsky
> Attachments: YARN-4681.patch, YARN-4681.patch
>
>
> ProcfsBasedProcessTree in the NodeManager calculates the memory used by a 
> process tree by parsing {{/proc/<pid>/smaps}}, where it computes {{min(Pss, 
> Shared_Dirty) + Private_Dirty + Private_Clean}}. Because private clean pages 
> that are not {{mlocked}} can be reclaimed by the kernel, this should be changed 
> to calculating only {{Locked}} pages instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4681) ProcfsBasedProcessTree should not calculate private clean pages

2016-03-01 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-4681:

Assignee: Jan Lukavsky

> ProcfsBasedProcessTree should not calculate private clean pages
> ---
>
> Key: YARN-4681
> URL: https://issues.apache.org/jira/browse/YARN-4681
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Jan Lukavsky
>Assignee: Jan Lukavsky
> Attachments: YARN-4681.patch, YARN-4681.patch
>
>
> ProcfsBasedProcessTree in the NodeManager calculates the memory used by a 
> process tree by parsing {{/proc/<pid>/smaps}}, where it computes {{min(Pss, 
> Shared_Dirty) + Private_Dirty + Private_Clean}}. Because private clean pages 
> that are not {{mlocked}} can be reclaimed by the kernel, this should be changed 
> to calculating only {{Locked}} pages instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-4682) AMRM client to log when AMRM token updated

2016-02-12 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-4682:

Assignee: Prabhu Joseph

[~Prabhu Joseph], I added you as a contributor on the YARN project and assigned 
this issue to you.  Thanks for the patch!

[~ste...@apache.org], I added you to the Committers role in the YARN project, 
so you should have the rights to do this in the future.

> AMRM client to log when AMRM token updated
> --
>
> Key: YARN-4682
> URL: https://issues.apache.org/jira/browse/YARN-4682
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: client
>Affects Versions: 2.7.2
>Reporter: Steve Loughran
>Assignee: Prabhu Joseph
> Attachments: YARN-4682-002.patch, YARN-4682.patch, YARN-4682.patch.1
>
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> There's no information right now as to when the AMRM token gets updated; if 
> something has gone wrong with the update, you can't tell when it last went 
> through.
> fix: add a log statement.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4594) container-executor fails to remove directory tree when chmod required

2016-02-09 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139709#comment-15139709
 ] 

Chris Nauroth commented on YARN-4594:
-

I don't actually run it on Mac.  The impact though is that I can no longer do a 
full distro build of the source tree with {{-Pnative}}.  libhadoop.dylib is at 
least partially functional.

> container-executor fails to remove directory tree when chmod required
> -
>
> Key: YARN-4594
> URL: https://issues.apache.org/jira/browse/YARN-4594
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
> Fix For: 2.9.0
>
> Attachments: YARN-4594.001.patch, YARN-4594.002.patch, 
> YARN-4594.003.patch, YARN-4594.004.patch
>
>
> test-container-executor.c doesn't work:
> * It assumes that realpath(/bin/ls) will be /bin/ls, whereas it is actually 
> /usr/bin/ls on many systems.
> * The recursive delete logic in container-executor.c fails -- nftw does the 
> wrong thing when confronted with directories with the wrong mode (permission 
> bits), leading to an attempt to run rmdir on a non-empty directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4594) container-executor fails to remove directory tree when chmod required

2016-02-09 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139679#comment-15139679
 ] 

Chris Nauroth commented on YARN-4594:
-

This patch broke compilation on Mac OS X 10.9, where the SDK does not include a 
definition of {{AT_REMOVEDIR}} in fcntl.h or unistd.h.  If you have the 10.10 
SDK installed, then the header does have {{AT_REMOVEDIR}}.  (See 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/usr/include/sys/fcntl.h).
  I haven't yet figured out if there is an easy way Mac users can set it to 
cross-compile with the 10.10 SDK as a workaround.  For now, I'm just going to 
have to skip this part of the build on Mac.

> container-executor fails to remove directory tree when chmod required
> -
>
> Key: YARN-4594
> URL: https://issues.apache.org/jira/browse/YARN-4594
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
> Fix For: 2.9.0
>
> Attachments: YARN-4594.001.patch, YARN-4594.002.patch, 
> YARN-4594.003.patch, YARN-4594.004.patch
>
>
> test-container-executor.c doesn't work:
> * It assumes that realpath(/bin/ls) will be /bin/ls, whereas it is actually 
> /usr/bin/ls on many systems.
> * The recursive delete logic in container-executor.c fails -- nftw does the 
> wrong thing when confronted with directories with the wrong mode (permission 
> bits), leading to an attempt to run rmdir on a non-empty directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4594) container-executor fails to remove directory tree when chmod required

2016-02-09 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139827#comment-15139827
 ] 

Chris Nauroth commented on YARN-4594:
-

[~cmccabe], no worries, and no finger pointing intended.  I only meant to 
document what I had found here in case other Mac users stumble on the same 
issue.

FWIW, I'd prefer not to patch the Hadoop source at all and instead find some 
external way to target the 10.10 SDK, where the constant is defined.

> container-executor fails to remove directory tree when chmod required
> -
>
> Key: YARN-4594
> URL: https://issues.apache.org/jira/browse/YARN-4594
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
> Fix For: 2.9.0
>
> Attachments: YARN-4594.001.patch, YARN-4594.002.patch, 
> YARN-4594.003.patch, YARN-4594.004.patch
>
>
> test-container-executor.c doesn't work:
> * It assumes that realpath(/bin/ls) will be /bin/ls, whereas it is actually 
> /usr/bin/ls on many systems.
> * The recursive delete logic in container-executor.c fails -- nftw does the 
> wrong thing when confronted with directories with the wrong mode (permission 
> bits), leading to an attempt to run rmdir on a non-empty directory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4681) ProcfsBasedProcessTree should not calculate private clean pages

2016-02-09 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139318#comment-15139318
 ] 

Chris Nauroth commented on YARN-4681:
-

[~je.ik], thank you for posting a patch.  Have you had a chance to try testing 
this with the Spark use case that you described on the mailing list?

One important point that you brought up on the mailing list is that the Locked 
field in smaps doesn't seem to be universal across all tested kernel versions.

bq. If I do this on an older kernel (2.6.x), the Locked field is missing.

The current patch completely switches the calculation from using Private_Clean 
to Locked.  For a final version of the patch, we'd want to make sure that the 
change doesn't break anything for older kernels that don't show the Locked 
field.

You also discussed possibly even more aggressive strategies, like trying to 
anticipate that the kernel might free more memory by flushing file-backed dirty 
pages.  During YARN-1775, there was some discussion about supporting different 
kinds of configurability for this calculation.  That topic might warrant 
further discussion here.

> ProcfsBasedProcessTree should not calculate private clean pages
> ---
>
> Key: YARN-4681
> URL: https://issues.apache.org/jira/browse/YARN-4681
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Jan Lukavsky
> Attachments: YARN-4681.patch
>
>
 ProcfsBasedProcessTree in the NodeManager calculates the memory used by a 
 process tree by parsing {{/proc/<pid>/smaps}}, where it computes {{min(Pss, 
 Shared_Dirty) + Private_Dirty + Private_Clean}}. Because private clean pages 
 that are not {{mlocked}} can be reclaimed by the kernel, this should be changed 
 to calculating only {{Locked}} pages instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4681) ProcfsBasedProcessTree should not calculate private clean pages

2016-02-09 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15139292#comment-15139292
 ] 

Chris Nauroth commented on YARN-4681:
-

Linking to YARN-1775, which initially implemented the support for reading 
/proc/<pid>/smaps information.  Also notifying a few of the participants on 
that earlier discussion: [~rajesh.balamohan], [~ka...@cloudera.com] and 
[~vinodkv].

This issue relates to this discussion on the dev mailing list:

http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201602.mbox/%3CD2DF3D85.3AAF8%25cnauroth%40hortonworks.com%3E

> ProcfsBasedProcessTree should not calculate private clean pages
> ---
>
> Key: YARN-4681
> URL: https://issues.apache.org/jira/browse/YARN-4681
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.6.0, 2.7.0
>Reporter: Jan Lukavsky
> Attachments: YARN-4681.patch
>
>
 ProcfsBasedProcessTree in the NodeManager calculates the memory used by a 
 process tree by parsing {{/proc/<pid>/smaps}}, where it computes {{min(Pss, 
 Shared_Dirty) + Private_Dirty + Private_Clean}}. Because private clean pages 
 that are not {{mlocked}} can be reclaimed by the kernel, this should be changed 
 to calculating only {{Locked}} pages instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (YARN-3458) CPU resource monitoring in Windows

2015-12-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth reassigned YARN-3458:
---

Assignee: Chris Nauroth  (was: Inigo Goiri)

> CPU resource monitoring in Windows
> --
>
> Key: YARN-3458
> URL: https://issues.apache.org/jira/browse/YARN-3458
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.0
> Environment: Windows
>Reporter: Inigo Goiri
>Assignee: Chris Nauroth
>Priority: Minor
>  Labels: BB2015-05-TBR, containers, metrics, windows
> Attachments: YARN-3458-1.patch, YARN-3458-2.patch, YARN-3458-3.patch, 
> YARN-3458-4.patch, YARN-3458-5.patch, YARN-3458-6.patch, YARN-3458-7.patch, 
> YARN-3458-8.patch, YARN-3458-9.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of getCpuUsagePercent() for 
> WindowsBasedProcessTree is left as unavailable. Attached a proposal of how to 
> do it. I reused the CpuTimeTracker using 1 jiffy=1ms.
> This was left open by YARN-3122.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3458) CPU resource monitoring in Windows

2015-12-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-3458:

Assignee: Inigo Goiri  (was: Chris Nauroth)

> CPU resource monitoring in Windows
> --
>
> Key: YARN-3458
> URL: https://issues.apache.org/jira/browse/YARN-3458
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.0
> Environment: Windows
>Reporter: Inigo Goiri
>Assignee: Inigo Goiri
>Priority: Minor
>  Labels: BB2015-05-TBR, containers, metrics, windows
> Attachments: YARN-3458-1.patch, YARN-3458-2.patch, YARN-3458-3.patch, 
> YARN-3458-4.patch, YARN-3458-5.patch, YARN-3458-6.patch, YARN-3458-7.patch, 
> YARN-3458-8.patch, YARN-3458-9.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of getCpuUsagePercent() for 
> WindowsBasedProcessTree is left as unavailable. Attached a proposal of how to 
> do it. I reused the CpuTimeTracker using 1 jiffy=1ms.
> This was left open by YARN-3122.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3458) CPU resource monitoring in Windows

2015-12-16 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-3458:

Hadoop Flags: Reviewed

+1 for patch v9.  I'll wait a few days before committing, since I see other 
watchers on the issue.

> CPU resource monitoring in Windows
> --
>
> Key: YARN-3458
> URL: https://issues.apache.org/jira/browse/YARN-3458
> Project: Hadoop YARN
>  Issue Type: New Feature
>  Components: nodemanager
>Affects Versions: 2.7.0
> Environment: Windows
>Reporter: Inigo Goiri
>Assignee: Inigo Goiri
>Priority: Minor
>  Labels: BB2015-05-TBR, containers, metrics, windows
> Attachments: YARN-3458-1.patch, YARN-3458-2.patch, YARN-3458-3.patch, 
> YARN-3458-4.patch, YARN-3458-5.patch, YARN-3458-6.patch, YARN-3458-7.patch, 
> YARN-3458-8.patch, YARN-3458-9.patch
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> The current implementation of getCpuUsagePercent() for 
> WindowsBasedProcessTree is left as unavailable. Attached a proposal of how to 
> do it. I reused the CpuTimeTracker using 1 jiffy=1ms.
> This was left open by YARN-3122.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations

2015-12-08 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047165#comment-15047165
 ] 

Chris Nauroth commented on YARN-4248:
-

This patch introduced license warnings on the testing json files.  Here is an 
example from the latest pre-commit run on HADOOP-11505.

https://builds.apache.org/job/PreCommit-HADOOP-Build/8202/artifact/patchprocess/patch-asflicense-problems.txt

Would you please either revert or quickly correct the license warning?  Thank 
you.

> REST API for submit/update/delete Reservations
> --
>
> Key: YARN-4248
> URL: https://issues.apache.org/jira/browse/YARN-4248
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Fix For: 2.8.0
>
> Attachments: YARN-4248.2.patch, YARN-4248.3.patch, YARN-4248.5.patch, 
> YARN-4248.6.patch, YARN-4248.patch
>
>
> This JIRA tracks work to extend the RMWebService to support REST APIs to 
> submit/update/delete reservations. This will ease integration with external 
> tools that are not java-based.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations

2015-12-08 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047389#comment-15047389
 ] 

Chris Nauroth commented on YARN-4248:
-

Hi [~curino].  It looks like [~chris.douglas] just uploaded a patch to set up 
an exclusion of the json files from the license check.  +1 for this.  Thanks, 
Chris.

> REST API for submit/update/delete Reservations
> --
>
> Key: YARN-4248
> URL: https://issues.apache.org/jira/browse/YARN-4248
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Fix For: 2.8.0
>
> Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, 
> YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch
>
>
> This JIRA tracks work to extend the RMWebService to support REST APIs to 
> submit/update/delete reservations. This will ease integration with external 
> tools that are not java-based.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4248) REST API for submit/update/delete Reservations

2015-12-08 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047553#comment-15047553
 ] 

Chris Nauroth commented on YARN-4248:
-

bq. Not sure why it wasn't flagged by test-patch.

I decided to dig into this.  At the time that pre-commit ran for YARN-4248, 
there was an unrelated license warning present in HDFS, introduced by HDFS-9414.

https://builds.apache.org/job/PreCommit-YARN-Build/9872/artifact/patchprocess/patch-asflicense-problems.txt

Unfortunately, if there is a pre-existing license warning, then the {{mvn 
apache-rat:check}} build halts at that first failing module.  Since 
hadoop-hdfs-client builds before hadoop-yarn-server-resourcemanager, it masked 
the new license warnings introduced by this patch.  This is visible in the log 
linked below: scroll to the bottom and notice that the Apache Hadoop HDFS Client 
module failed, and all subsequent modules were skipped.

https://builds.apache.org/job/PreCommit-YARN-Build/9872/artifact/patchprocess/patch-asflicense-root.txt

Maybe we can do better when there are pre-existing license warnings, perhaps by 
using the {{--fail-at-end}} option to make sure we check all modules.  I filed 
YETUS-221.

> REST API for submit/update/delete Reservations
> --
>
> Key: YARN-4248
> URL: https://issues.apache.org/jira/browse/YARN-4248
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.8.0
>Reporter: Carlo Curino
>Assignee: Carlo Curino
> Fix For: 2.8.0
>
> Attachments: YARN-4248-asflicense.patch, YARN-4248.2.patch, 
> YARN-4248.3.patch, YARN-4248.5.patch, YARN-4248.6.patch, YARN-4248.patch
>
>
> This JIRA tracks work to extend the RMWebService to support REST APIs to 
> submit/update/delete reservations. This will ease integration with external 
> tools that are not java-based.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures

2015-10-01 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939990#comment-14939990
 ] 

Chris Nauroth commented on YARN-3514:
-

Hello [~vvasudev].

As per prior comments from [~leftnoteasy] and [~vinodkv], we suspect the 
current patch does not fully address all potential problems with use of Active 
Directory "DOMAIN\login" usernames in YARN.  I don't have bandwidth right now 
to hunt down those additional problems and fix them.

I think these are the options for handling this JIRA now:
# Finish the review of the fix that is already here and commit it.  Handle 
subsequent issues in separate JIRAs.
# Unassign it from me and see if someone else can pick it up, run with my 
current patch, look for more problems and then turn that into a more 
comprehensive patch.
# Continue to let this linger until I or someone else frees up time for more 
investigation.

> Active directory usernames like domain\login cause YARN failures
> 
>
> Key: YARN-3514
> URL: https://issues.apache.org/jira/browse/YARN-3514
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Affects Versions: 2.2.0
> Environment: CentOS6
>Reporter: john lilley
>Assignee: Chris Nauroth
>Priority: Minor
>  Labels: BB2015-05-TBR
> Attachments: YARN-3514.001.patch, YARN-3514.002.patch
>
>
> We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is 
> Kerberos-enabled and uses an external AD domain controller for the KDC.  We 
> are able to authenticate, browse HDFS, etc.  However, YARN fails during 
> localization because it seems to get confused by the presence of a \ 
> character in the local user name.
> Our AD authentication on the nodes goes through sssd and is configured to 
> map AD users onto the form domain\username.  For example, our test user has a 
> Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user 
> "domain\hadoopuser".  We have no problem validating that user with PAM, 
> logging in as that user, su-ing to that user, etc.
> However, when we attempt to run a YARN application master, the localization 
> step fails when setting up the local cache directory for the AM.  The error 
> that comes out of the RM logs:
> 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: 
> ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, 
> diagnostics='Application application_1429295486450_0001 failed 1 times due to 
> AM Container for appattempt_1429295486450_0001_01 exited with  exitCode: 
> -1000 due to: Application application_1429295486450_0001 initialization 
> failed (exitCode=255) with output: main : command provided 0
> main : user is DOMAIN\hadoopuser
> main : requested yarn user is domain\hadoopuser
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create 
> directory: 
> /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
> at 
> org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
> .Failing this attempt.. Failing the application.'
> However, when we look on the node launching the AM, we see this:
> [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
> [root@rpb-cdh-kerb-2 usercache]# ls -l
> drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
> There appears to be different treatment of the \ character in different 
> places.  Something creates the directory as "domain\hadoopuser" but something 
> else later attempts to use it as "domain%5Chadoopuser".  I’m not sure where 
> or why the URL escapement converts the \ to %5C or why this is not consistent.
> I should also mention, for the sake of completeness, our auth_to_local rule 
> is set up to map u...@domain.com to domain\user:
> RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3962) If we change node manager identity to run as virtual account, then resource localization service fails to start with incorrect permission

2015-08-06 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661004#comment-14661004
 ] 

Chris Nauroth commented on YARN-3962:
-

This looks good to me.  I agree with Xuan that it would be good to find a way 
to add unit tests.  Thank you, Madhumita!

 If we change node manager identity to run as virtual account, then resource 
 localization service fails to start with incorrect permission
 -

 Key: YARN-3962
 URL: https://issues.apache.org/jira/browse/YARN-3962
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.6.0
Reporter: madhumita chakraborty
 Attachments: YARN-3962-002.patch, Yarn-3962.001.patch


 For Azure HDInsight we need to change the node manager to run as a virtual 
 account instead of a user account. Otherwise, after an Azure reimage, it won't 
 be able to access the map output data of the running job on that node. But when 
 we changed the nodemanager to run as a virtual account, we got this error:
  2015-06-02 06:11:45,281 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Writing credentials to the nmPrivate file 
 c:/apps1/temp/hdfs/nm-local-dir/nmPrivate/container_1433128260970_0007_01_01.tokens.
  Credentials list: 
  2015-06-02 06:11:45,313 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Permissions incorrectly set for dir 
 c:/apps1/temp/hdfs/nm-local-dir/usercache, should be rwxr-xr-x, actual value 
 = rwxrwxr-x
  2015-06-02 06:11:45,313 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Attempting to initialize c:/apps1/temp/hdfs/nm-local-dir
  2015-06-02 06:11:45,375 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Permissions incorrectly set for dir 
 c:/apps1/temp/hdfs/nm-local-dir/usercache, should be rwxr-xr-x, actual value 
 = rwxrwxr-x
  2015-06-02 06:11:45,375 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Failed to setup local dir c:/apps1/temp/hdfs/nm-local-dir, which was marked 
 as good.
  org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Permissions 
 incorrectly set for dir c:/apps1/temp/hdfs/nm-local-dir/usercache, should be 
 rwxr-xr-x, actual value = rwxrwxr-x
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1400)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1367)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$900(ResourceLocalizationService.java:137)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1085)
  2015-06-02 06:11:45,375 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
  Localizer failed
  org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Failed to setup 
 local dir c:/apps1/temp/hdfs/nm-local-dir, which was marked as good.
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1372)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.access$900(ResourceLocalizationService.java:137)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1085)
  Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
 Permissions incorrectly set for dir 
 c:/apps1/temp/hdfs/nm-local-dir/usercache, should be rwxr-xr-x, actual value 
 = rwxrwxr-x
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.checkLocalDir(ResourceLocalizationService.java:1400)
  at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.getInitializedLocalDirs(ResourceLocalizationService.java:1367)
 Fix: When the node manager runs as a virtual account, the resource localization 
 service fails to come up. It checks that the permissions of usercache and 
 filecache are 755 and nmPrivate is 700. But on Windows, for a virtual account, 
 the owner and group are the same, so this permission check fails. So a check was 
 added that if the user is equal to the group, then umask validation 

[jira] [Commented] (YARN-3834) Scrub debug logging of tokens during resource localization.

2015-06-21 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595338#comment-14595338
 ] 

Chris Nauroth commented on YARN-3834:
-

Xuan, thank you for the code review and commit.

 Scrub debug logging of tokens during resource localization.
 ---

 Key: YARN-3834
 URL: https://issues.apache.org/jira/browse/YARN-3834
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Fix For: 2.8.0

 Attachments: YARN-3834.001.patch


 During resource localization, the NodeManager logs tokens at debug level to 
 aid troubleshooting.  This includes the full token representation.  Best 
 practice is to avoid logging anything secret, even at debug level.  We can 
 improve on this by changing the logging to use a scrubbed representation of 
 the token.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3834) Scrub debug logging of tokens during resource localization.

2015-06-19 Thread Chris Nauroth (JIRA)
Chris Nauroth created YARN-3834:
---

 Summary: Scrub debug logging of tokens during resource 
localization.
 Key: YARN-3834
 URL: https://issues.apache.org/jira/browse/YARN-3834
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Chris Nauroth
Assignee: Chris Nauroth


During resource localization, the NodeManager logs tokens at debug level to aid 
troubleshooting.  This includes the full token representation.  Best practice 
is to avoid logging anything secret, even at debug level.  We can improve on 
this by changing the logging to use a scrubbed representation of the token.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3834) Scrub debug logging of tokens during resource localization.

2015-06-19 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-3834:

Attachment: YARN-3834.001.patch

The attached patch changes the code to use {{Token#toString}}.  The 
{{toString}} method is already coded to be safe for logging, because it does 
not include any representation of the secret.  Thanks also to [~vicaya] for the 
suggestion to add logging of a fingerprint of the full representation, which is 
a one-way hash (non-reversible, therefore safe).

 Scrub debug logging of tokens during resource localization.
 ---

 Key: YARN-3834
 URL: https://issues.apache.org/jira/browse/YARN-3834
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Affects Versions: 2.7.1
Reporter: Chris Nauroth
Assignee: Chris Nauroth
 Attachments: YARN-3834.001.patch


 During resource localization, the NodeManager logs tokens at debug level to 
 aid troubleshooting.  This includes the full token representation.  Best 
 practice is to avoid logging anything secret, even at debug level.  We can 
 improve on this by changing the logging to use a scrubbed representation of 
 the token.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3786) Document yarn class path options

2015-06-08 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-3786:

Component/s: documentation
 Issue Type: Improvement  (was: Bug)

 Document yarn class path options
 

 Key: YARN-3786
 URL: https://issues.apache.org/jira/browse/YARN-3786
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: documentation
Reporter: Brahma Reddy Battula
Assignee: Brahma Reddy Battula
 Attachments: YARN-3786.patch


>  The --glob and --jar options are not documented.
 {code}
 $ yarn classpath --help
>  classpath [--glob|--jar <path>|-h|--help] :
   Prints the classpath needed to get the Hadoop jar and the required
   libraries.
   Options:
   --glob   expand wildcards
>  --jar <path> write classpath as manifest in jar named <path>
   -h, --help   print help
 {code}
 current document:
 {code}
 User Commands
 Commands useful for users of a hadoop cluster.
 classpath
 Usage: yarn classpath
 Prints the class path needed to get the Hadoop jar and the required libraries
 {code}
 http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#classpath



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3685) NodeManager unnecessarily knows about classpath-jars due to Windows limitations

2015-05-26 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560391#comment-14560391
 ] 

Chris Nauroth commented on YARN-3685:
-

bq. YARN_APPLICATION_CLASSPATH is essentially unused.

In that case, this is definitely worth revisiting as part of this issue.  
Perhaps it's not a problem anymore.  This had been used in the past, as seen in 
bug reports like YARN-1138.

 NodeManager unnecessarily knows about classpath-jars due to Windows 
 limitations
 ---

 Key: YARN-3685
 URL: https://issues.apache.org/jira/browse/YARN-3685
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Vinod Kumar Vavilapalli

 Found this while looking at cleaning up ContainerExecutor via YARN-3648, 
 making it a sub-task.
 YARN *should not* know about classpaths. Our original design was modeled around 
 this. But when we added Windows support, due to classpath issues, we ended 
 up breaking this abstraction via YARN-316. We should clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be

2015-05-26 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559939#comment-14559939
 ] 

Chris Nauroth commented on YARN-3626:
-

Thanks, Craig!  We could potentially stick the {{@Private}} annotation directly 
onto {{ApplicationConstants#CLASSPATH_PREPEND_DISTCACHE}}.  I'll let Vinod 
chime in on whether or not this was the intent of his feedback.

+1 from me, pending Jenkins run.

 On Windows localized resources are not moved to the front of the classpath 
 when they should be
 --

 Key: YARN-3626
 URL: https://issues.apache.org/jira/browse/YARN-3626
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
 Environment: Windows
Reporter: Craig Welch
Assignee: Craig Welch
 Fix For: 2.7.1

 Attachments: YARN-3626.0.patch, YARN-3626.11.patch, 
 YARN-3626.14.patch, YARN-3626.15.patch, YARN-3626.4.patch, YARN-3626.6.patch, 
 YARN-3626.9.patch


 In response to the mapreduce.job.user.classpath.first setting the classpath 
 is ordered differently so that localized resources will appear before system 
 classpath resources when tasks execute.  On Windows this does not work 
 because the localized resources are not linked into their final location when 
 the classpath jar is created.  To compensate for that localized jar resources 
 are added directly to the classpath generated for the jar rather than being 
 discovered from the localized directories.  Unfortunately, they are always 
 appended to the classpath, and so are never preferred over system resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3685) NodeManager unnecessarily knows about classpath-jars due to Windows limitations

2015-05-26 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559464#comment-14559464
 ] 

Chris Nauroth commented on YARN-3685:
-

bq. Which ones are these?

I was thinking of stuff like {{yarn.application.classpath}}, where values are 
defined in terms of things like the {{HADOOP_YARN_HOME}} and 
{{HADOOP_COMMON_HOME}} environment variables, and those values might not match 
the file system layout at the client side.

bq. Not clear this is true or not. Have to see the final solution/patch to 
realistically reason about this.

Let's say hypothetically a change for this goes into 2.8.0.  I was thinking 
that would make it impossible for a 2.7.0 client to work correctly with a 2.8.0 
NodeManager, because that client wouldn't take care of classpath bundling and 
instead expect the NodeManager to do it.

Brainstorming a bit, maybe we can figure out a way for a 2.8.0 NodeManager to 
detect if the client hasn't already taken care of classpath bundling, and if 
not, stick to the current logic.  Backwards-compatibility logic like this would 
go into branch-2, but could be dropped from trunk.

 NodeManager unnecessarily knows about classpath-jars due to Windows 
 limitations
 ---

 Key: YARN-3685
 URL: https://issues.apache.org/jira/browse/YARN-3685
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Reporter: Vinod Kumar Vavilapalli

 Found this while looking at cleaning up ContainerExecutor via YARN-3648, 
 making it a sub-task.
 YARN *should not* know about classpaths. Our original design was modeled around 
 this. But when we added Windows support, due to classpath issues, we ended 
 up breaking this abstraction via YARN-316. We should clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be

2015-05-25 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14558495#comment-14558495
 ] 

Chris Nauroth commented on YARN-3626:
-

Hi Craig.  This looks good to me.  I have just one minor nitpick.  I think the 
logic in {{ContainerLaunch}} for setting {{preferLocalizedJars}} could be 
simplified to this:

{code}
boolean preferLocalizedJars = Boolean.valueOf(classpathPrependDistCache);
{code}

{{Boolean#valueOf}} is null-safe.

Thanks!

 On Windows localized resources are not moved to the front of the classpath 
 when they should be
 --

 Key: YARN-3626
 URL: https://issues.apache.org/jira/browse/YARN-3626
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
 Environment: Windows
Reporter: Craig Welch
Assignee: Craig Welch
 Fix For: 2.7.1

 Attachments: YARN-3626.0.patch, YARN-3626.11.patch, 
 YARN-3626.14.patch, YARN-3626.4.patch, YARN-3626.6.patch, YARN-3626.9.patch


 In response to the mapreduce.job.user.classpath.first setting the classpath 
 is ordered differently so that localized resources will appear before system 
 classpath resources when tasks execute.  On Windows this does not work 
 because the localized resources are not linked into their final location when 
 the classpath jar is created.  To compensate for that localized jar resources 
 are added directly to the classpath generated for the jar rather than being 
 discovered from the localized directories.  Unfortunately, they are always 
 appended to the classpath, and so are never preferred over system resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3685) NodeManager unnecessarily knows about classpath-jars due to Windows limitations

2015-05-20 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14552680#comment-14552680
 ] 

Chris Nauroth commented on YARN-3685:
-

[~vinodkv], thanks for the notification.  I was not aware of this design goal 
at the time of YARN-316.

Perhaps it's possible to move the classpath jar generation to the MR client or 
AM.  It's not immediately obvious to me which of those 2 choices is better.  
We'd need to change the manifest to use relative paths in the Class-Path 
attribute instead of absolute paths.  (The client and AM are not aware of the 
exact layout of the NodeManager's {{yarn.nodemanager.local-dirs}}, so the 
client can't predict the absolute paths at time of container launch.)

There is one piece of logic that I don't see how to handle though.  Some 
classpath entries are defined in terms of environment variables.  These 
environment variables are expanded at the NodeManager via the container launch 
scripts.  This was true of Linux even before YARN-316, so in that sense, YARN 
did already have some classpath logic indirectly.  Environment variables cannot 
be used inside a manifest's Class-Path, so for Windows, NodeManager expands the 
environment variables before populating Class-Path.  It would be incorrect to 
do the environment variable expansion at the MR client, because it might be 
running with different configuration than the NodeManager.  I suppose if the AM 
did the expansion, then that would work in most cases, but it creates an 
assumption that the AM container is running with configuration that matches all 
NodeManagers in the cluster.  I don't believe that assumption exists today.

If we do move classpath handling out of the NodeManager, then it would be a 
backwards-incompatible change, and so it could not be shipped in the 2.x 
release line.

 NodeManager unnecessarily knows about classpath-jars due to Windows 
 limitations
 ---

 Key: YARN-3685
 URL: https://issues.apache.org/jira/browse/YARN-3685
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli

 Found this while looking at cleaning up ContainerExecutor via YARN-3648, 
 making it a sub-task.
 YARN *should not* know about classpaths. Our original design was modeled around 
 this. But when we added Windows support, due to classpath issues, we ended 
 up breaking this abstraction via YARN-316. We should clean this up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be

2015-05-15 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546365#comment-14546365
 ] 

Chris Nauroth commented on YARN-3626:
-

I don't fully understand the objection to the former patch that had been 
committed.

bq. The new configuration added is supposed to be per app, but it is now a 
server side configuration.

There was a new YARN configuration property for triggering this behavior, but 
the MR application would toggle on that YARN property only if the MR job 
submission had {{MAPREDUCE_JOB_USER_CLASSPATH_FIRST}} on.  From {{MRApps}}:

{code}
boolean userClassesTakesPrecedence = 
  conf.getBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, false);

if (userClassesTakesPrecedence) {
  conf.set(YarnConfiguration.YARN_APPLICATION_CLASSPATH_PREPEND_DISTCACHE,
true);
}
{code}

I thought this implemented per app behavior, because it could vary between MR 
app submission instances.  It would not be a requirement to put 
{{YARN_APPLICATION_CLASSPATH_PREPEND_DISTCACHE}} into the server configs and 
have the client and server share configs.

Is there a detail I'm missing?

 On Windows localized resources are not moved to the front of the classpath 
 when they should be
 --

 Key: YARN-3626
 URL: https://issues.apache.org/jira/browse/YARN-3626
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
 Environment: Windows
Reporter: Craig Welch
Assignee: Craig Welch
 Fix For: 2.7.1

 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, 
 YARN-3626.9.patch


 In response to the mapreduce.job.user.classpath.first setting the classpath 
 is ordered differently so that localized resources will appear before system 
 classpath resources when tasks execute.  On Windows this does not work 
 because the localized resources are not linked into their final location when 
 the classpath jar is created.  To compensate for that localized jar resources 
 are added directly to the classpath generated for the jar rather than being 
 discovered from the localized directories.  Unfortunately, they are always 
 appended to the classpath, and so are never preferred over system resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3626) On Windows localized resources are not moved to the front of the classpath when they should be

2015-05-15 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546400#comment-14546400
 ] 

Chris Nauroth commented on YARN-3626:
-

I see now.  Thanks for the clarification.  In that case, I agree with the new 
proposal.

 On Windows localized resources are not moved to the front of the classpath 
 when they should be
 --

 Key: YARN-3626
 URL: https://issues.apache.org/jira/browse/YARN-3626
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
 Environment: Windows
Reporter: Craig Welch
Assignee: Craig Welch
 Fix For: 2.7.1

 Attachments: YARN-3626.0.patch, YARN-3626.4.patch, YARN-3626.6.patch, 
 YARN-3626.9.patch


 In response to the mapreduce.job.user.classpath.first setting the classpath 
 is ordered differently so that localized resources will appear before system 
 classpath resources when tasks execute.  On Windows this does not work 
 because the localized resources are not linked into their final location when 
 the classpath jar is created.  To compensate for that localized jar resources 
 are added directly to the classpath generated for the jar rather than being 
 discovered from the localized directories.  Unfortunately, they are always 
 appended to the classpath, and so are never preferred over system resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3549) use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.

2015-05-07 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14532965#comment-14532965
 ] 

Chris Nauroth commented on YARN-3549:
-

bq. Can we not get RawLocalFileSystem to automatically use native fstat if the 
native library is available? That way all users can simply benefit from this 
seamlessly.

That's an interesting idea.  Looking at this more closely, 
{{ResourceLocalizationService#checkLocalDir}} really can't use the existing 
{{NativeIO.POSIX#getFstat}} method anyway.  That one is a passthrough to 
{{fstat}}, which operates on an open file descriptor.  Instead, 
{{ResourceLocalizationService#checkLocalDir}} really wants plain {{stat}} or 
maybe {{lstat}}, which operates on a path string.  Forcing this code path to 
open the file just for the sake of passing an fd to {{fstat}} isn't ideal.

Let's try it!  I filed HADOOP-11935 as a pre-requisite to do the Hadoop Common 
work.
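For anyone following along, here is a small standalone C sketch of the 
distinction (illustration only, not Hadoop code): {{fstat}} needs a descriptor 
that is already open, while plain {{stat}} works directly from a path string, 
which is all this ownership/permission check needs.

{code}
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
  const char *path = argc > 1 ? argv[1] : "/tmp";
  struct stat sb;

  /* stat() works from the path alone -- no open() required. */
  if (stat(path, &sb) == 0) {
    printf("stat:  mode=%o uid=%d gid=%d\n",
           (unsigned) (sb.st_mode & 07777), (int) sb.st_uid, (int) sb.st_gid);
  } else {
    fprintf(stderr, "stat %s failed: %s\n", path, strerror(errno));
  }

  /* fstat() needs a file descriptor, so the file has to be opened first
   * and closed again -- the extra step mentioned above. */
  int fd = open(path, O_RDONLY);
  if (fd >= 0) {
    if (fstat(fd, &sb) == 0) {
      printf("fstat: mode=%o uid=%d gid=%d\n",
             (unsigned) (sb.st_mode & 07777), (int) sb.st_uid, (int) sb.st_gid);
    }
    close(fd);
  } else {
    fprintf(stderr, "open %s failed: %s\n", path, strerror(errno));
  }
  return 0;
}
{code}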

 use JNI-based FileStatus implementation from 
 io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation 
 from RawLocalFileSystem in checkLocalDir.
 

 Key: YARN-3549
 URL: https://issues.apache.org/jira/browse/YARN-3549
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu

 Use JNI-based FileStatus implementation from 
 io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation 
 from RawLocalFileSystem in checkLocalDir.
 As discussed in YARN-3491, the shell-based getPermission implementation runs 
 the shell command ls -ld to get permissions, which takes 4 or 5 ms (very slow).
 We should switch to io.nativeio.NativeIO.POSIX#getFstat as the implementation in 
 RawLocalFileSystem to get rid of the shell-based implementation for FileStatus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3549) use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.

2015-05-05 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14528725#comment-14528725
 ] 

Chris Nauroth commented on YARN-3549:
-

Hi [~djp].  This will be a YARN code change in the localizer, so YARN is the 
appropriate project to track it.  The code change will involve calling a native 
fstat method provided in hadoop-common, but that code already exists, and I 
don't expect it will need any changes to support this.

 use JNI-based FileStatus implementation from 
 io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation 
 from RawLocalFileSystem in checkLocalDir.
 

 Key: YARN-3549
 URL: https://issues.apache.org/jira/browse/YARN-3549
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu

 Use JNI-based FileStatus implementation from 
 io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation 
 from RawLocalFileSystem in checkLocalDir.
 As discussed in YARN-3491, the shell-based getPermission implementation runs 
 the shell command ls -ld to get permissions, which takes 4 or 5 ms (very slow).
 We should switch to io.nativeio.NativeIO.POSIX#getFstat as the implementation in 
 RawLocalFileSystem to get rid of the shell-based implementation for FileStatus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures

2015-05-04 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14527907#comment-14527907
 ] 

Chris Nauroth commented on YARN-3514:
-

Looking at the original description, I see upper-case DOMAIN is getting 
translated to lower-case domain in this environment.  It's likely that this 
environment would get an ownership mismatch error even after getting past the 
current bug.

{code}
drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
{code}

Nice catch, Wangda.

Is it necessary to translate to lower-case, or can the domain portion of the 
name be left in upper-case to match the OS level?

bq. One possible solution is ignoring cases while compare user name, but that 
will be problematic when user De/de existed at the same time.

I've seen a few mentions online that Active Directory is not case-sensitive but 
is case-preserving.  That means it will preserve the case you used in 
usernames, but the case doesn't matter for comparisons.  I've also seen 
references that DNS has similar behavior with regards to case.

I can't find a definitive statement though that this is guaranteed behavior.  
I'd feel safer making this kind of change if we had a definitive reference.
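Just to make the trade-off concrete (a trivial, purely hypothetical illustration):

{code}
// Ignoring case would make the reported DOMAIN/domain mismatch go away:
boolean sameUser = "DOMAIN\\hadoopuser".equalsIgnoreCase("domain\\hadoopuser"); // true
// ...but the same comparison would also collapse two distinct users De and de:
boolean collision = "De".equalsIgnoreCase("de"); // also true
{code}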

 Active directory usernames like domain\login cause YARN failures
 

 Key: YARN-3514
 URL: https://issues.apache.org/jira/browse/YARN-3514
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.2.0
 Environment: CentOS6
Reporter: john lilley
Assignee: Chris Nauroth
Priority: Minor
 Attachments: YARN-3514.001.patch, YARN-3514.002.patch


 We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is 
 Kerberos-enabled and uses an external AD domain controller for the KDC.  We 
 are able to authenticate, browse HDFS, etc.  However, YARN fails during 
 localization because it seems to get confused by the presence of a \ 
 character in the local user name.
 Our AD authentication on the nodes goes through sssd, which is configured to 
 map AD users onto the form domain\username.  For example, our test user has a 
 Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user 
 domain\hadoopuser.  We have no problem validating that user with PAM, 
 logging in as that user, su-ing to that user, etc.
 However, when we attempt to run a YARN application master, the localization 
 step fails when setting up the local cache directory for the AM.  The error 
 that comes out of the RM logs:
 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: 
 ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, 
 diagnostics='Application application_1429295486450_0001 failed 1 times due to 
 AM Container for appattempt_1429295486450_0001_01 exited with  exitCode: 
 -1000 due to: Application application_1429295486450_0001 initialization 
 failed (exitCode=255) with output: main : command provided 0
 main : user is DOMAIN\hadoopuser
 main : requested yarn user is domain\hadoopuser
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create 
 directory: 
 /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
 at 
 org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
 .Failing this attempt.. Failing the application.'
 However, when we look on the node launching the AM, we see this:
 [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
 [root@rpb-cdh-kerb-2 usercache]# ls -l
 drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
 There appears to be different treatment of the \ character in different 
 places.  Something creates the directory as domain\hadoopuser but something 
 else later attempts to use it as domain%5Chadoopuser.  I’m not sure where 
 or why the URL escapement converts the \ to %5C or why this is not consistent.
 I should also mention, for the sake of completeness, our auth_to_local rule 
 is set up to map u...@domain.com to domain\user:
 RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3549) use JNI-based FileStatus implementation from io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation from RawLocalFileSystem in checkLocalDir.

2015-04-27 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14514842#comment-14514842
 ] 

Chris Nauroth commented on YARN-3549:
-

Hi [~zxu].  Since this is a proposal to call native code, I'd like to make sure 
test suites are passing on both Linux and Windows when it's ready.  The method 
is implemented for both Linux and Windows, so I do expect it would work fine, 
but I'd like to make sure.  If you don't have access to a Windows VM for 
testing, I'd be happy to volunteer to test on Windows for you when a patch is 
ready.  Thanks!

 use JNI-based FileStatus implementation from 
 io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation 
 from RawLocalFileSystem in checkLocalDir.
 

 Key: YARN-3549
 URL: https://issues.apache.org/jira/browse/YARN-3549
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: zhihai xu
Assignee: zhihai xu

 Use JNI-based FileStatus implementation from 
 io.nativeio.NativeIO.POSIX#getFstat instead of shell-based implementation 
 from RawLocalFileSystem in checkLocalDir.
 As discussed in YARN-3491, the shell-based getPermission implementation runs 
 the shell command ls -ld to get permissions, which takes 4 or 5 ms (very slow).
 We should switch to io.nativeio.NativeIO.POSIX#getFstat as the implementation in 
 RawLocalFileSystem to get rid of the shell-based FileStatus implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (YARN-3524) Mapreduce failed due to AM Container-Launch failure at NM on windows

2015-04-27 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth resolved YARN-3524.
-
Resolution: Not A Problem

Hello [~KaveenBigdata].  Nice debugging!

The native components for Hadoop on Windows are built using either Windows SDK 
7.1 or Visual Studio 2010.  Because of this, there is a runtime dependency on 
the C++ 2010 runtime dll, which is MSVCR100.dll.  You are correct that the fix 
in this case is to install the missing dll.  I believe this is the official 
download location:

https://www.microsoft.com/en-us/download/details.aspx?id=13523

Since this does not represent a bug in the Hadoop codebase, I'm resolving this 
issue as Not a Problem.

 Mapreduce failed due to AM Container-Launch failure at NM on windows
 

 Key: YARN-3524
 URL: https://issues.apache.org/jira/browse/YARN-3524
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.5.2
 Environment: Windows server 2012 and Windows-8
 Hadoop-2.5.2
 Java-1.7
Reporter: Kaveen Raajan

 I tried to run a Tez job on a Windows machine.
 I successfully built Tez-0.6.0 against Hadoop-2.5.2.
 Then I configured Tez-0.6.0 as described in http://tez.apache.org/install.html.
 But I hit the following error while running it:
 Note: I'm using a Hadoop High Availability setup.
 {code}
 Running OrderedWordCount
 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in [jar:file:/C:/Hadoop/
 share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBind
 er.class]
 SLF4J: Found binding in [jar:file:/C:/Tez/lib
 /slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
 explanation.
 SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
 15/04/15 10:47:57 INFO client.TezClient: Tez Client Version: [ 
 component=tez-api
 , version=0.6.0, revision=${buildNumber}, 
 SCM-URL=scm:git:https://git-wip-us.apa
 che.org/repos/asf/tez.git, buildTime=2015-04-15T01:13:02Z ]
 15/04/15 10:48:00 INFO client.TezClient: Submitting DAG application with id: 
 app
 lication_1429073725727_0005
 15/04/15 10:48:00 INFO Configuration.deprecation: fs.default.name is 
 deprecated.
  Instead, use fs.defaultFS
 15/04/15 10:48:00 INFO client.TezClientUtils: Using tez.lib.uris value from 
 conf
 iguration: hdfs://HACluster/apps/Tez/,hdfs://HACluster/apps/Tez/lib/
 15/04/15 10:48:01 INFO client.TezClient: Stage directory /tmp/app/tez/sta
 ging doesn't exist and is created
 15/04/15 10:48:01 INFO client.TezClient: Tez system stage directory 
 hdfs://HACluster
 /tmp/app/tez/staging/.tez/application_1429073725727_0005 doesn't ex
 ist and is created
 15/04/15 10:48:02 INFO client.TezClient: Submitting DAG to YARN, 
 applicationId=a
 pplication_1429073725727_0005, dagName=OrderedWordCount
 15/04/15 10:48:03 INFO impl.YarnClientImpl: Submitted application 
 application_14
 29073725727_0005
 15/04/15 10:48:03 INFO client.TezClient: The url to track the Tez AM: 
 http://MASTER_NN1:8088/proxy/application_1429073725727_0005/
 15/04/15 10:48:03 INFO client.DAGClientImpl: Waiting for DAG to start running
 15/04/15 10:48:09 INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
 OrderedWordCount failed with diagnostics: [Application 
 application_1429073725727
 _0005 failed 2 times due to AM Container for 
 appattempt_1429073725727_0005_0
 2 exited with  exitCode: -1073741515 due to: Exception from container-launch: 
 Ex
 itCodeException exitCode=-1073741515:
 ExitCodeException exitCode=-1073741515:
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
 at org.apache.hadoop.util.Shell.run(Shell.java:455)
 at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
 702)
 at 
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.la
 unchContainer(DefaultContainerExecutor.java:195)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.C
 ontainerLaunch.call(ContainerLaunch.java:300)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.C
 ontainerLaunch.call(ContainerLaunch.java:81)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
 java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
 .java:615)
 at java.lang.Thread.run(Thread.java:744)
 1 file(s) moved.
 Container exited with a non-zero exit code -1073741515
 .Failing this attempt.. Failing the application.]
 {code}
 Looking at the ResourceManager log:
 {code}
 2015-04-19 21:49:57,533 INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: 
 completedContainer 

[jira] [Updated] (YARN-3514) Active directory usernames like domain\login cause YARN failures

2015-04-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-3514:

 Component/s: (was: yarn)
  nodemanager
Target Version/s: 2.8.0
Assignee: Chris Nauroth

 Active directory usernames like domain\login cause YARN failures
 

 Key: YARN-3514
 URL: https://issues.apache.org/jira/browse/YARN-3514
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.2.0
 Environment: CentOS6
Reporter: john lilley
Assignee: Chris Nauroth
Priority: Minor
 Attachments: YARN-3514.001.patch


 We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is 
 Kerberos-enabled and uses an external AD domain controller for the KDC.  We 
 are able to authenticate, browse HDFS, etc.  However, YARN fails during 
 localization because it seems to get confused by the presence of a \ 
 character in the local user name.
 Our AD authentication on the nodes goes through sssd, which is configured to 
 map AD users onto the form domain\username.  For example, our test user has a 
 Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user 
 domain\hadoopuser.  We have no problem validating that user with PAM, 
 logging in as that user, su-ing to that user, etc.
 However, when we attempt to run a YARN application master, the localization 
 step fails when setting up the local cache directory for the AM.  The error 
 that comes out of the RM logs:
 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: 
 ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, 
 diagnostics='Application application_1429295486450_0001 failed 1 times due to 
 AM Container for appattempt_1429295486450_0001_01 exited with  exitCode: 
 -1000 due to: Application application_1429295486450_0001 initialization 
 failed (exitCode=255) with output: main : command provided 0
 main : user is DOMAIN\hadoopuser
 main : requested yarn user is domain\hadoopuser
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create 
 directory: 
 /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
 at 
 org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
 .Failing this attempt.. Failing the application.'
 However, when we look on the node launching the AM, we see this:
 [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
 [root@rpb-cdh-kerb-2 usercache]# ls -l
 drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
 There appears to be different treatment of the \ character in different 
 places.  Something creates the directory as domain\hadoopuser but something 
 else later attempts to use it as domain%5Chadoopuser.  I’m not sure where 
 or why the URL escapement converts the \ to %5C or why this is not consistent.
 I should also mention, for the sake of completeness, our auth_to_local rule 
 is set up to map u...@domain.com to domain\user:
 RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3514) Active directory usernames like domain\login cause YARN failures

2015-04-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-3514:

Attachment: YARN-3514.001.patch

I'm attaching a patch with the fix I described in my last comment.  I added a 
test that passes a file name containing a '\' character through localization.  
With the existing code using {{URI#getRawPath}}, the test fails as shown below. 
 (Note the incorrect URI-encoded path, similar to the reported symptom in the 
description.)  After switching to {{URI#getPath}}, the test passes as expected.

{code}
Failed tests: 
  TestContainerLocalizer.testLocalizerDiskCheckDoesNotUriEncodePath:265 
Argument(s) are different! Wanted:
containerLocalizer.checkDir(/my\File);
- at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer.testLocalizerDiskCheckDoesNotUriEncodePath(TestContainerLocalizer.java:265)
Actual invocation has different arguments:
containerLocalizer.checkDir(/my%5CFile);
- at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestContainerLocalizer.testLocalizerDiskCheckDoesNotUriEncodePath(TestContainerLocalizer.java:264)
{code}


 Active directory usernames like domain\login cause YARN failures
 

 Key: YARN-3514
 URL: https://issues.apache.org/jira/browse/YARN-3514
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.2.0
 Environment: CentOS6
Reporter: john lilley
Priority: Minor
 Attachments: YARN-3514.001.patch


 We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is 
 Kerberos-enabled and uses an external AD domain controller for the KDC.  We 
 are able to authenticate, browse HDFS, etc.  However, YARN fails during 
 localization because it seems to get confused by the presence of a \ 
 character in the local user name.
 Our AD authentication on the nodes goes through sssd, which is configured to 
 map AD users onto the form domain\username.  For example, our test user has a 
 Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user 
 domain\hadoopuser.  We have no problem validating that user with PAM, 
 logging in as that user, su-ing to that user, etc.
 However, when we attempt to run a YARN application master, the localization 
 step fails when setting up the local cache directory for the AM.  The error 
 that comes out of the RM logs:
 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: 
 ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, 
 diagnostics='Application application_1429295486450_0001 failed 1 times due to 
 AM Container for appattempt_1429295486450_0001_01 exited with  exitCode: 
 -1000 due to: Application application_1429295486450_0001 initialization 
 failed (exitCode=255) with output: main : command provided 0
 main : user is DOMAIN\hadoopuser
 main : requested yarn user is domain\hadoopuser
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create 
 directory: 
 /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
 at 
 org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
 .Failing this attempt.. Failing the application.'
 However, when we look on the node launching the AM, we see this:
 [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
 [root@rpb-cdh-kerb-2 usercache]# ls -l
 drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
 There appears to be different treatment of the \ character in different 
 places.  Something creates the directory as domain\hadoopuser but something 
 else later attempts to use it as domain%5Chadoopuser.  I’m not sure where 
 or why the URL escapement converts the \ to %5C or why this is not consistent.
 I should also mention, for the sake of completeness, our auth_to_local rule 
 is set up to map u...@domain.com to domain\user:
 RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3514) Active directory usernames like domain\login cause YARN failures

2015-04-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-3514:

Attachment: YARN-3514.002.patch

In the first patch, the new test passed for me locally but failed on Jenkins.  
I think this is because I was using a hard-coded destination path for the 
localized resource, and this might have caused a permissions violation on the 
Jenkins host.  Here is patch v002.  I changed the test so that the localized 
resource is relative to the user's filecache, which is in the proper test 
working directory.  I also added a second test to make sure that we don't 
accidentally URI-decode anything.

bq. I am very impressed with the short time it took to patch.

Thanks!  Before we declare victory though, can you check that your local file 
system allows the '\' character in file and directory names?  The patch here 
definitely fixes a bug, but testing the '\' character on your local file system 
will tell us whether or not the whole problem is resolved for your deployment.  
Even better would be if you have the capability to test with my patch applied.


 Active directory usernames like domain\login cause YARN failures
 

 Key: YARN-3514
 URL: https://issues.apache.org/jira/browse/YARN-3514
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.2.0
 Environment: CentOS6
Reporter: john lilley
Assignee: Chris Nauroth
Priority: Minor
 Attachments: YARN-3514.001.patch, YARN-3514.002.patch


 We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is 
 Kerberos-enabled and uses an external AD domain controller for the KDC.  We 
 are able to authenticate, browse HDFS, etc.  However, YARN fails during 
 localization because it seems to get confused by the presence of a \ 
 character in the local user name.
 Our AD authentication on the nodes goes through sssd, which is configured to 
 map AD users onto the form domain\username.  For example, our test user has a 
 Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user 
 domain\hadoopuser.  We have no problem validating that user with PAM, 
 logging in as that user, su-ing to that user, etc.
 However, when we attempt to run a YARN application master, the localization 
 step fails when setting up the local cache directory for the AM.  The error 
 that comes out of the RM logs:
 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: 
 ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, 
 diagnostics='Application application_1429295486450_0001 failed 1 times due to 
 AM Container for appattempt_1429295486450_0001_01 exited with  exitCode: 
 -1000 due to: Application application_1429295486450_0001 initialization 
 failed (exitCode=255) with output: main : command provided 0
 main : user is DOMAIN\hadoopuser
 main : requested yarn user is domain\hadoopuser
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create 
 directory: 
 /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
 at 
 org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
 .Failing this attempt.. Failing the application.'
 However, when we look on the node launching the AM, we see this:
 [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
 [root@rpb-cdh-kerb-2 usercache]# ls -l
 drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
 There appears to be different treatment of the \ character in different 
 places.  Something creates the directory as domain\hadoopuser but something 
 else later attempts to use it as domain%5Chadoopuser.  I’m not sure where 
 or why the URL escapement converts the \ to %5C or why this is not consistent.
 I should also mention, for the sake of completeness, our auth_to_local rule 
 is set up to map u...@domain.com to domain\user:
 RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures

2015-04-21 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14505711#comment-14505711
 ] 

Chris Nauroth commented on YARN-3514:
-

[~john.lil...@redpoint.net], thank you for the confirmation.

 Active directory usernames like domain\login cause YARN failures
 

 Key: YARN-3514
 URL: https://issues.apache.org/jira/browse/YARN-3514
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.2.0
 Environment: CentOS6
Reporter: john lilley
Assignee: Chris Nauroth
Priority: Minor
 Attachments: YARN-3514.001.patch, YARN-3514.002.patch


 We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is 
 Kerberos-enabled and uses an external AD domain controller for the KDC.  We 
 are able to authenticate, browse HDFS, etc.  However, YARN fails during 
 localization because it seems to get confused by the presence of a \ 
 character in the local user name.
 Our AD authentication on the nodes goes through sssd, which is configured to 
 map AD users onto the form domain\username.  For example, our test user has a 
 Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user 
 domain\hadoopuser.  We have no problem validating that user with PAM, 
 logging in as that user, su-ing to that user, etc.
 However, when we attempt to run a YARN application master, the localization 
 step fails when setting up the local cache directory for the AM.  The error 
 that comes out of the RM logs:
 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: 
 ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, 
 diagnostics='Application application_1429295486450_0001 failed 1 times due to 
 AM Container for appattempt_1429295486450_0001_01 exited with  exitCode: 
 -1000 due to: Application application_1429295486450_0001 initialization 
 failed (exitCode=255) with output: main : command provided 0
 main : user is DOMAIN\hadoopuser
 main : requested yarn user is domain\hadoopuser
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create 
 directory: 
 /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
 at 
 org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
 .Failing this attempt.. Failing the application.'
 However, when we look on the node launching the AM, we see this:
 [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
 [root@rpb-cdh-kerb-2 usercache]# ls -l
 drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
 There appears to be different treatment of the \ character in different 
 places.  Something creates the directory as domain\hadoopuser but something 
 else later attempts to use it as domain%5Chadoopuser.  I’m not sure where 
 or why the URL escapement converts the \ to %5C or why this is not consistent.
 I should also mention, for the sake of completeness, our auth_to_local rule 
 is set up to map u...@domain.com to domain\user:
 RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3514) Active directory usernames like domain\login cause YARN failures

2015-04-20 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14503666#comment-14503666
 ] 

Chris Nauroth commented on YARN-3514:
-

[~john.lil...@redpoint.net], thank you for the detailed bug report.

I believe the root cause is likely to be in container localization's URI 
parsing to construct the local download path.  The relevant code is in 
{{ContainerLocalizer#download}}:

{code}
  Callable<Path> download(Path path, LocalResource rsrc,
      UserGroupInformation ugi) throws IOException {
    DiskChecker.checkDir(new File(path.toUri().getRawPath()));
    return new FSDownload(lfs, ugi, conf, path, rsrc);
  }
{code}

We're taking a {{Path}} and converting it to URI form, but I don't think 
{{getRawPath}} is the correct call for us to access the path portion of the 
URI.  A possible fix would be to switch to {{getPath}}, which would actually 
decode back to the original form.

{code}
scala> new org.apache.hadoop.fs.Path("domain\\hadoopuser").toUri().getRawPath()
res4: java.lang.String = domain%5Chadoopuser

scala> new org.apache.hadoop.fs.Path("domain\\hadoopuser").toUri().getPath()
res5: java.lang.String = domain\hadoopuser
{code}
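So the change in {{ContainerLocalizer#download}} would be roughly this (a sketch only, not a reviewed patch):

{code}
  Callable<Path> download(Path path, LocalResource rsrc,
      UserGroupInformation ugi) throws IOException {
    // getPath() decodes the URI, so "domain%5Chadoopuser" comes back as
    // "domain\hadoopuser" and DiskChecker checks the directory that actually exists.
    DiskChecker.checkDir(new File(path.toUri().getPath()));
    return new FSDownload(lfs, ugi, conf, path, rsrc);
  }
{code}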


 Active directory usernames like domain\login cause YARN failures
 

 Key: YARN-3514
 URL: https://issues.apache.org/jira/browse/YARN-3514
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn
Affects Versions: 2.2.0
 Environment: CentOS6
Reporter: john lilley
Priority: Minor

 We have a 2.2.0 (Cloudera 5.3) cluster running on CentOS6 that is 
 Kerberos-enabled and uses an external AD domain controller for the KDC.  We 
 are able to authenticate, browse HDFS, etc.  However, YARN fails during 
 localization because it seems to get confused by the presence of a \ 
 character in the local user name.
 Our AD authentication on the nodes goes through sssd, which is configured to 
 map AD users onto the form domain\username.  For example, our test user has a 
 Kerberos principal of hadoopu...@domain.com and that maps onto a CentOS user 
 domain\hadoopuser.  We have no problem validating that user with PAM, 
 logging in as that user, su-ing to that user, etc.
 However, when we attempt to run a YARN application master, the localization 
 step fails when setting up the local cache directory for the AM.  The error 
 that comes out of the RM logs:
 2015-04-17 12:47:09 INFO net.redpoint.yarnapp.Client[0]: monitorApplication: 
 ApplicationReport: appId=1, state=FAILED, progress=0.0, finalStatus=FAILED, 
 diagnostics='Application application_1429295486450_0001 failed 1 times due to 
 AM Container for appattempt_1429295486450_0001_01 exited with  exitCode: 
 -1000 due to: Application application_1429295486450_0001 initialization 
 failed (exitCode=255) with output: main : command provided 0
 main : user is DOMAIN\hadoopuser
 main : requested yarn user is domain\hadoopuser
 org.apache.hadoop.util.DiskChecker$DiskErrorException: Cannot create 
 directory: 
 /data/yarn/nm/usercache/domain%5Chadoopuser/appcache/application_1429295486450_0001/filecache/10
 at 
 org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:105)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.download(ContainerLocalizer.java:199)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:241)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.main(ContainerLocalizer.java:347)
 .Failing this attempt.. Failing the application.'
 However, when we look on the node launching the AM, we see this:
 [root@rpb-cdh-kerb-2 ~]# cd /data/yarn/nm/usercache
 [root@rpb-cdh-kerb-2 usercache]# ls -l
 drwxr-s--- 4 DOMAIN\hadoopuser yarn 4096 Apr 17 12:10 domain\hadoopuser
 There appears to be different treatment of the \ character in different 
 places.  Something creates the directory as domain\hadoopuser but something 
 else later attempts to use it as domain%5Chadoopuser.  I’m not sure where 
 or why the URL escapement converts the \ to %5C or why this is not consistent.
 I should also mention, for the sake of completeness, our auth_to_local rule 
 is set up to map u...@domain.com to domain\user:
 RULE:[1:$1@$0](^.*@DOMAIN\.COM$)s/^(.*)@DOMAIN\.COM$/domain\\$1/g



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3336) FileSystem memory leak in DelegationTokenRenewer

2015-03-23 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376914#comment-14376914
 ] 

Chris Nauroth commented on YARN-3336:
-

[~zxu], I apologize, but I missed entering your name on the git commit message:

{code}
commit 6ca1f12024fd7cec7b01df0f039ca59f3f365dc1
Author: cnauroth cnaur...@apache.org
Date:   Mon Mar 23 10:45:50 2015 -0700

YARN-3336. FileSystem memory leak in DelegationTokenRenewer.
{code}

Unfortunately, this isn't something we can change, because it could mess up the 
git history.

You're still there in CHANGES.txt though, so you get the proper credit for the 
patch:

{code}
YARN-3336. FileSystem memory leak in DelegationTokenRenewer.
(Zhihai Xu via cnauroth)
{code}


 FileSystem memory leak in DelegationTokenRenewer
 

 Key: YARN-3336
 URL: https://issues.apache.org/jira/browse/YARN-3336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Fix For: 2.7.0

 Attachments: YARN-3336.000.patch, YARN-3336.001.patch, 
 YARN-3336.002.patch, YARN-3336.003.patch, YARN-3336.004.patch


 FileSystem memory leak in DelegationTokenRenewer.
 Every time DelegationTokenRenewer#obtainSystemTokensForUser is called, a new 
 FileSystem entry will be added to  FileSystem#CACHE which will never be 
 garbage collected.
 This is the implementation of obtainSystemTokensForUser:
 {code}
   protected Token<?>[] obtainSystemTokensForUser(String user,
       final Credentials credentials) throws IOException, InterruptedException {
     // Get new hdfs tokens on behalf of this user
     UserGroupInformation proxyUser =
         UserGroupInformation.createProxyUser(user,
             UserGroupInformation.getLoginUser());
     Token<?>[] newTokens =
         proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
           @Override
           public Token<?>[] run() throws Exception {
             return FileSystem.get(getConfig()).addDelegationTokens(
                 UserGroupInformation.getLoginUser().getUserName(), credentials);
           }
         });
     return newTokens;
   }
 {code}
 The memory leak happened when FileSystem.get(getConfig()) is called with a 
 new proxy user.
 Because createProxyUser will always create a new Subject.
 The calling sequence is 
 FileSystem.get(getConfig()) => FileSystem.get(getDefaultUri(conf), conf) => 
 FileSystem.CACHE.get(uri, conf) => FileSystem.CACHE.getInternal(uri, conf, key) 
 => FileSystem.CACHE.map.get(key) => createFileSystem(uri, conf)
 {code}
   public static UserGroupInformation createProxyUser(String user,
       UserGroupInformation realUser) {
     if (user == null || user.isEmpty()) {
       throw new IllegalArgumentException("Null user");
     }
     if (realUser == null) {
       throw new IllegalArgumentException("Null real user");
     }
     Subject subject = new Subject();
     Set<Principal> principals = subject.getPrincipals();
     principals.add(new User(user));
     principals.add(new RealUser(realUser));
     UserGroupInformation result = new UserGroupInformation(subject);
     result.setAuthenticationMethod(AuthenticationMethod.PROXY);
     return result;
   }
 {code}
 FileSystem#Cache#Key.equals will compare the ugi
 {code}
   Key(URI uri, Configuration conf, long unique) throws IOException {
     scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase();
     authority = uri.getAuthority() == null ? "" : uri.getAuthority().toLowerCase();
     this.unique = unique;
     this.ugi = UserGroupInformation.getCurrentUser();
   }

   public boolean equals(Object obj) {
     if (obj == this) {
       return true;
     }
     if (obj != null && obj instanceof Key) {
       Key that = (Key) obj;
       return isEqual(this.scheme, that.scheme)
           && isEqual(this.authority, that.authority)
           && isEqual(this.ugi, that.ugi)
           && (this.unique == that.unique);
     }
     return false;
   }
 {code}
 UserGroupInformation.equals will compare subject by reference.
 {code}
   public boolean equals(Object o) {
 if (o == this) {
   return true;
 } else if (o == null || getClass() != o.getClass()) {
   return false;
 } else {
   return subject == ((UserGroupInformation) o).subject;
 }
   }
 {code}
 So in this case, every time createProxyUser and FileSystem.get(getConfig()) 
 are called, a new FileSystem will be created and a new entry will be added to 
 FileSystem.CACHE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-3336) FileSystem memory leak in DelegationTokenRenewer

2015-03-20 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-3336:

Target Version/s: 2.7.0
Hadoop Flags: Reviewed

+1 for patch v004 pending Jenkins.

 FileSystem memory leak in DelegationTokenRenewer
 

 Key: YARN-3336
 URL: https://issues.apache.org/jira/browse/YARN-3336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3336.000.patch, YARN-3336.001.patch, 
 YARN-3336.002.patch, YARN-3336.003.patch, YARN-3336.004.patch


 FileSystem memory leak in DelegationTokenRenewer.
 Every time DelegationTokenRenewer#obtainSystemTokensForUser is called, a new 
 FileSystem entry will be added to  FileSystem#CACHE which will never be 
 garbage collected.
 This is the implementation of obtainSystemTokensForUser:
 {code}
   protected Token<?>[] obtainSystemTokensForUser(String user,
       final Credentials credentials) throws IOException, InterruptedException {
     // Get new hdfs tokens on behalf of this user
     UserGroupInformation proxyUser =
         UserGroupInformation.createProxyUser(user,
             UserGroupInformation.getLoginUser());
     Token<?>[] newTokens =
         proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
           @Override
           public Token<?>[] run() throws Exception {
             return FileSystem.get(getConfig()).addDelegationTokens(
                 UserGroupInformation.getLoginUser().getUserName(), credentials);
           }
         });
     return newTokens;
   }
 {code}
 The memory leak happened when FileSystem.get(getConfig()) is called with a 
 new proxy user.
 Because createProxyUser will always create a new Subject.
 The calling sequence is 
 FileSystem.get(getConfig()) => FileSystem.get(getDefaultUri(conf), conf) => 
 FileSystem.CACHE.get(uri, conf) => FileSystem.CACHE.getInternal(uri, conf, key) 
 => FileSystem.CACHE.map.get(key) => createFileSystem(uri, conf)
 {code}
   public static UserGroupInformation createProxyUser(String user,
       UserGroupInformation realUser) {
     if (user == null || user.isEmpty()) {
       throw new IllegalArgumentException("Null user");
     }
     if (realUser == null) {
       throw new IllegalArgumentException("Null real user");
     }
     Subject subject = new Subject();
     Set<Principal> principals = subject.getPrincipals();
     principals.add(new User(user));
     principals.add(new RealUser(realUser));
     UserGroupInformation result = new UserGroupInformation(subject);
     result.setAuthenticationMethod(AuthenticationMethod.PROXY);
     return result;
   }
 {code}
 FileSystem#Cache#Key.equals will compare the ugi
 {code}
   Key(URI uri, Configuration conf, long unique) throws IOException {
     scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase();
     authority = uri.getAuthority() == null ? "" : uri.getAuthority().toLowerCase();
     this.unique = unique;
     this.ugi = UserGroupInformation.getCurrentUser();
   }

   public boolean equals(Object obj) {
     if (obj == this) {
       return true;
     }
     if (obj != null && obj instanceof Key) {
       Key that = (Key) obj;
       return isEqual(this.scheme, that.scheme)
           && isEqual(this.authority, that.authority)
           && isEqual(this.ugi, that.ugi)
           && (this.unique == that.unique);
     }
     return false;
   }
 {code}
 UserGroupInformation.equals will compare subject by reference.
 {code}
   public boolean equals(Object o) {
 if (o == this) {
   return true;
 } else if (o == null || getClass() != o.getClass()) {
   return false;
 } else {
   return subject == ((UserGroupInformation) o).subject;
 }
   }
 {code}
 So in this case, every time createProxyUser and FileSystem.get(getConfig()) 
 are called, a new FileSystem will be created and a new entry will be added to 
 FileSystem.CACHE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3336) FileSystem memory leak in DelegationTokenRenewer

2015-03-20 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372127#comment-14372127
 ] 

Chris Nauroth commented on YARN-3336:
-

Hello, [~zxu].  Thank you for providing the new patch and adding the test.

I think we can avoid the changes in {{FileSystem}} by adding an instance 
counter to {{MyFS}}.  We can increment it in the constructor and decrement it 
in {{close}}.  Then, the test can get the value of the counter before making 
the calls to {{obtainSystemTokensForUser}} and assert that the counter has the 
same value after those calls.
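Something along these lines (a sketch only; the base class of {{MyFS}} and the test wiring here are assumptions, not the actual test code):

{code}
// Hypothetical sketch: count live MyFS instances instead of changing FileSystem.
public static class MyFS extends RawLocalFileSystem {
  static final AtomicInteger INSTANCE_COUNT = new AtomicInteger();

  public MyFS() {
    INSTANCE_COUNT.incrementAndGet();
  }

  @Override
  public void close() throws IOException {
    INSTANCE_COUNT.decrementAndGet();
    super.close();
  }
}

// In the test body, the count should come back to its starting value:
int before = MyFS.INSTANCE_COUNT.get();
renewer.obtainSystemTokensForUser("user1", new Credentials());
assertEquals(before, MyFS.INSTANCE_COUNT.get());
{code}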

 FileSystem memory leak in DelegationTokenRenewer
 

 Key: YARN-3336
 URL: https://issues.apache.org/jira/browse/YARN-3336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3336.000.patch, YARN-3336.001.patch, 
 YARN-3336.002.patch, YARN-3336.003.patch


 FileSystem memory leak in DelegationTokenRenewer.
 Every time DelegationTokenRenewer#obtainSystemTokensForUser is called, a new 
 FileSystem entry will be added to  FileSystem#CACHE which will never be 
 garbage collected.
 This is the implementation of obtainSystemTokensForUser:
 {code}
   protected Token<?>[] obtainSystemTokensForUser(String user,
       final Credentials credentials) throws IOException, InterruptedException {
     // Get new hdfs tokens on behalf of this user
     UserGroupInformation proxyUser =
         UserGroupInformation.createProxyUser(user,
             UserGroupInformation.getLoginUser());
     Token<?>[] newTokens =
         proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
           @Override
           public Token<?>[] run() throws Exception {
             return FileSystem.get(getConfig()).addDelegationTokens(
                 UserGroupInformation.getLoginUser().getUserName(), credentials);
           }
         });
     return newTokens;
   }
 {code}
 The memory leak happened when FileSystem.get(getConfig()) is called with a 
 new proxy user.
 Because createProxyUser will always create a new Subject.
 The calling sequence is 
 FileSystem.get(getConfig()) => FileSystem.get(getDefaultUri(conf), conf) => 
 FileSystem.CACHE.get(uri, conf) => FileSystem.CACHE.getInternal(uri, conf, key) 
 => FileSystem.CACHE.map.get(key) => createFileSystem(uri, conf)
 {code}
   public static UserGroupInformation createProxyUser(String user,
       UserGroupInformation realUser) {
     if (user == null || user.isEmpty()) {
       throw new IllegalArgumentException("Null user");
     }
     if (realUser == null) {
       throw new IllegalArgumentException("Null real user");
     }
     Subject subject = new Subject();
     Set<Principal> principals = subject.getPrincipals();
     principals.add(new User(user));
     principals.add(new RealUser(realUser));
     UserGroupInformation result = new UserGroupInformation(subject);
     result.setAuthenticationMethod(AuthenticationMethod.PROXY);
     return result;
   }
 {code}
 FileSystem#Cache#Key.equals will compare the ugi
 {code}
   Key(URI uri, Configuration conf, long unique) throws IOException {
     scheme = uri.getScheme() == null ? "" : uri.getScheme().toLowerCase();
     authority = uri.getAuthority() == null ? "" : uri.getAuthority().toLowerCase();
     this.unique = unique;
     this.ugi = UserGroupInformation.getCurrentUser();
   }

   public boolean equals(Object obj) {
     if (obj == this) {
       return true;
     }
     if (obj != null && obj instanceof Key) {
       Key that = (Key) obj;
       return isEqual(this.scheme, that.scheme)
           && isEqual(this.authority, that.authority)
           && isEqual(this.ugi, that.ugi)
           && (this.unique == that.unique);
     }
     return false;
   }
 {code}
 UserGroupInformation.equals will compare subject by reference.
 {code}
   public boolean equals(Object o) {
 if (o == this) {
   return true;
 } else if (o == null || getClass() != o.getClass()) {
   return false;
 } else {
   return subject == ((UserGroupInformation) o).subject;
 }
   }
 {code}
 So in this case, every time createProxyUser and FileSystem.get(getConfig()) 
 are called, a new FileSystem will be created and a new entry will be added to 
 FileSystem.CACHE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3336) FileSystem memory leak in DelegationTokenRenewer

2015-03-12 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14358146#comment-14358146
 ] 

Chris Nauroth commented on YARN-3336:
-

Hi [~zxu].  Nice catch.

I think the current version of the patch would change the owner of all obtained 
delegation tokens from the application submitter to the user running the 
ResourceManager daemon (i.e. the yarn user).  Instead, can we simply call 
{{close}} on the {{FileSystem}} after {{addDelegationTokens}}?  Closing a 
{{FileSystem}} also has the effect of removing it from the cache.  Since we 
already know that a new instance is getting created every time through this 
code path, I don't think closing the instance can impact any other threads.
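In other words, roughly this shape (a sketch based on the snippet in the description, not a reviewed patch):

{code}
// Hypothetical sketch: close the per-proxy-user FileSystem once the tokens
// are obtained; closing also removes its entry from FileSystem.CACHE.
Token<?>[] newTokens =
    proxyUser.doAs(new PrivilegedExceptionAction<Token<?>[]>() {
      @Override
      public Token<?>[] run() throws Exception {
        FileSystem fs = FileSystem.get(getConfig());
        try {
          return fs.addDelegationTokens(
              UserGroupInformation.getLoginUser().getUserName(), credentials);
        } finally {
          fs.close();
        }
      }
    });
{code}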

 FileSystem memory leak in DelegationTokenRenewer
 

 Key: YARN-3336
 URL: https://issues.apache.org/jira/browse/YARN-3336
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
 Attachments: YARN-3336.000.patch


 FileSystem memory leak in DelegationTokenRenewer.
 Every time DelegationTokenRenewer#obtainSystemTokensForUser is called, a new 
 FileSystem entry will be added to  FileSystem#CACHE which will never be 
 garbage collected.
 This is the implementation of obtainSystemTokensForUser:
 {code}
   protected Token?[] obtainSystemTokensForUser(String user,
   final Credentials credentials) throws IOException, InterruptedException 
 {
 // Get new hdfs tokens on behalf of this user
 UserGroupInformation proxyUser =
 UserGroupInformation.createProxyUser(user,
   UserGroupInformation.getLoginUser());
 Token?[] newTokens =
 proxyUser.doAs(new PrivilegedExceptionActionToken?[]() {
   @Override
   public Token?[] run() throws Exception {
 return FileSystem.get(getConfig()).addDelegationTokens(
   UserGroupInformation.getLoginUser().getUserName(), credentials);
   }
 });
 return newTokens;
   }
 {code}
 The memory leak happened when FileSystem.get(getConfig()) is called with a 
 new proxy user.
 Because createProxyUser will always create a new Subject.
 The calling sequence is 
 FileSystem.get(getConfig())=FileSystem.get(getDefaultUri(conf), 
 conf)=FileSystem.CACHE.get(uri, conf)=FileSystem.CACHE.getInternal(uri, 
 conf, key)=FileSystem.CACHE.map.get(key)=createFileSystem(uri, conf)
 {code}
 public static UserGroupInformation createProxyUser(String user,
   UserGroupInformation realUser) {
 if (user == null || user.isEmpty()) {
   throw new IllegalArgumentException(Null user);
 }
 if (realUser == null) {
   throw new IllegalArgumentException(Null real user);
 }
 Subject subject = new Subject();
 SetPrincipal principals = subject.getPrincipals();
 principals.add(new User(user));
 principals.add(new RealUser(realUser));
 UserGroupInformation result =new UserGroupInformation(subject);
 result.setAuthenticationMethod(AuthenticationMethod.PROXY);
 return result;
   }
 {code}
 FileSystem#Cache#Key.equals will compare the ugi
 {code}
   Key(URI uri, Configuration conf, long unique) throws IOException {
 scheme = uri.getScheme()==null?:uri.getScheme().toLowerCase();
 authority = 
 uri.getAuthority()==null?:uri.getAuthority().toLowerCase();
 this.unique = unique;
 this.ugi = UserGroupInformation.getCurrentUser();
   }
   public boolean equals(Object obj) {
 if (obj == this) {
   return true;
 }
 if (obj != null  obj instanceof Key) {
   Key that = (Key)obj;
   return isEqual(this.scheme, that.scheme)
   isEqual(this.authority, that.authority)
   isEqual(this.ugi, that.ugi)
   (this.unique == that.unique);
 }
 return false;
   }
 {code}
 UserGroupInformation.equals will compare subject by reference.
 {code}
   public boolean equals(Object o) {
 if (o == this) {
   return true;
 } else if (o == null || getClass() != o.getClass()) {
   return false;
 } else {
   return subject == ((UserGroupInformation) o).subject;
 }
   }
 {code}
 So in this case, every time createProxyUser and FileSystem.get(getConfig()) 
 are called, a new FileSystem will be created and a new entry will be added to 
 FileSystem.CACHE.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2465) Make YARN unit tests work when pseudo distributed cluster is running

2015-02-14 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14321263#comment-14321263
 ] 

Chris Nauroth commented on YARN-2465:
-

This patch looks good to me.  Ming, do you want to take Steve's suggestion to 
address the application history server too?

 Make YARN unit tests work when pseudo distributed cluster is running
 

 Key: YARN-2465
 URL: https://issues.apache.org/jira/browse/YARN-2465
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Ming Ma
Assignee: Ming Ma
 Attachments: YARN-2465.patch


 This is useful for development where you might have some pseudo distributed 
 cluster in the background and don't want to stop it to run unit test cases. 
 Most YARN test cases pass, except for some tests that use the localization 
 service, which try to bind to the default localization service port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2899) Run TestDockerContainerExecutorWithMocks on Linux only

2015-02-14 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-2899:

Component/s: test
 nodemanager

 Run TestDockerContainerExecutorWithMocks on Linux only
 --

 Key: YARN-2899
 URL: https://issues.apache.org/jira/browse/YARN-2899
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, test
Reporter: Ming Ma
Assignee: Ming Ma
Priority: Minor
 Fix For: 2.7.0

 Attachments: YARN-2899.patch


 It seems the test should strictly check for Linux; otherwise, it will fail 
 when the OS isn't Linux.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2796) deprecate sbin/*.sh

2015-02-11 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-2796:

 Component/s: scripts
Assignee: Allen Wittenauer
Hadoop Flags: Reviewed

+1 for the patch. Thank you, Allen.

The test warning from Jenkins isn't relevant, because Jenkins doesn't run tests 
on the shell code.

 deprecate sbin/*.sh
 ---

 Key: YARN-2796
 URL: https://issues.apache.org/jira/browse/YARN-2796
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: scripts
Reporter: Allen Wittenauer
Assignee: Allen Wittenauer
 Attachments: YARN-2796-00.patch


 We should mark all yarn sbin/*.sh commands (except for start and 
 stop) as deprecated in trunk so that they may be removed in a future release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3015) yarn classpath command should support same options as hadoop classpath.

2015-01-19 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14283028#comment-14283028
 ] 

Chris Nauroth commented on YARN-3015:
-

Hello [~varun_saxena].  This looks good for trunk.  Would you also provide a 
patch for branch-2?  Thank you.

 yarn classpath command should support same options as hadoop classpath.
 ---

 Key: YARN-3015
 URL: https://issues.apache.org/jira/browse/YARN-3015
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scripts
Reporter: Chris Nauroth
Assignee: Varun Saxena
Priority: Minor
 Attachments: YARN-3015.002.patch, YARN-3015.003.patch, 
 YARN-3015.004.patch, YARN-3015.005.patch


 HADOOP-10903 enhanced the {{hadoop classpath}} command to support optional 
 expansion of the wildcards and bundling the classpath into a jar file 
 containing a manifest with the Class-Path attribute. The other classpath 
 commands should do the same for consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3066) Hadoop leaves orphaned tasks running after job is killed

2015-01-15 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279202#comment-14279202
 ] 

Chris Nauroth commented on YARN-3066:
-

I'm not familiar with {{ssid}} on FreeBSD.  Does it have the same usage as 
Linux {{setsid}}?  If so, then perhaps an appropriate workaround is to copy 
that binary to {{setsid}} and make sure it's available on the {{PATH}}.  This 
might not require any YARN code changes.

bq. I propose to make Shell.isSetsidAvailable test more strict and fail to 
start if it is not found.

This would likely have to be considered backwards-incompatible, because 
applications would fail to start on existing systems that don't have 
{{setsid}}.  I suppose the new behavior could be hidden behind an opt-in 
configuration property.  Also, we need to keep in mind that 
{{Shell.isSetsidAvailable}} is always {{false}} on Windows.  (On Windows, we 
handle the issue of orphaned processes by using Windows API job objects instead 
of {{setsid}}.)

 Hadoop leaves orphaned tasks running after job is killed
 

 Key: YARN-3066
 URL: https://issues.apache.org/jira/browse/YARN-3066
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
 Environment: Hadoop 2.4.1 (probably all later too), FreeBSD-10.1
Reporter: Dmitry Sivachenko

 When spawning user task, node manager checks for setsid(1) utility and spawns 
 task program via it. See 
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java
  for instance:
 String exec = Shell.isSetsidAvailable ? "exec setsid " : "exec ";
 FreeBSD, unlike Linux, does not have setsid(1) utility.  So plain exec is 
 used to spawn user task.  If that task spawns other external programs (this 
 is common case if a task program is a shell script) and user kills job via 
 mapred job -kill Job, these child processes remain running.
 1) Why do you silently ignore the absence of setsid(1) and spawn task process 
 via exec: this is the guarantee to have orphaned processes when job is 
 prematurely killed.
 2) FreeBSD has a replacement third-party program called ssid (which does 
 almost the same as Linux's setsid).  It would be nice to detect which binary 
 is present during configure stage and put @SETSID@ macros into java file to 
 use the correct name.
 I propose to make Shell.isSetsidAvailable test more strict and fail to start 
 if it is not found:  at least we will know about the problem at start rather 
 than guess why there are orphaned tasks running forever.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-3015) yarn classpath command should support same options as hadoop classpath.

2015-01-07 Thread Chris Nauroth (JIRA)
Chris Nauroth created YARN-3015:
---

 Summary: yarn classpath command should support same options as 
hadoop classpath.
 Key: YARN-3015
 URL: https://issues.apache.org/jira/browse/YARN-3015
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scripts
Reporter: Chris Nauroth
Priority: Minor


HADOOP-10903 enhanced the {{hadoop classpath}} command to support optional 
expansion of the wildcards and bundling the classpath into a jar file 
containing a manifest with the Class-Path attribute. The other classpath 
commands should do the same for consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-3015) yarn classpath command should support same options as hadoop classpath.

2015-01-07 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268630#comment-14268630
 ] 

Chris Nauroth commented on YARN-3015:
-

Thanks to [~aw] for reporting it.

 yarn classpath command should support same options as hadoop classpath.
 ---

 Key: YARN-3015
 URL: https://issues.apache.org/jira/browse/YARN-3015
 Project: Hadoop YARN
  Issue Type: Bug
  Components: scripts
Reporter: Chris Nauroth
Priority: Minor

 HADOOP-10903 enhanced the {{hadoop classpath}} command to support optional 
 expansion of the wildcards and bundling the classpath into a jar file 
 containing a manifest with the Class-Path attribute. The other classpath 
 commands should do the same for consistency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2929) Adding separator ApplicationConstants.FILE_PATH_SEPARATOR for better Windows support

2014-12-22 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256390#comment-14256390
 ] 

Chris Nauroth commented on YARN-2929:
-

[~ozawa], thank you for providing the additional details.

Right now, the typical workflow is that the application submission context 
controls the Java classpath by setting the {{CLASSPATH}} environment variable.  
If we take the example of MapReduce, the relevant code is in {{YARNRunner}} and 
{{MRApps}}.  There is support in there for handling environment variables in 
cross-platform application submission.  However, even putting that aside, I 
believe there is no problem with using '/' for a Windows job submission if you 
use this technique.  The NodeManager ultimately translates the classpath 
through {{Path}} and bundles the whole classpath into a jar file manifest to be 
referenced by the running container.

I believe the only problem shown in the example is that the application 
submission is trying to set classpath by command-line argument.  Is it possible 
to switch to using the {{CLASSPATH}} environment variable technique, similar to 
how the MapReduce code does it?  If yes, then there is no need for the proposed 
patch.

There are also some other potential issues with trying to pass classpath on the 
command line in Windows.  It's very easy to hit the Windows maximum command 
line length limitation of 8191 characters.  NodeManager already has the logic 
to work around this by bundling the classpath into a jar file manifest, and 
you'd get that for free by using the environment variable technique.
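
For illustration, a rough sketch of the environment variable technique from a submission client, assuming YARN-1824-style cross-platform tokens; the classpath entries themselves are only examples:

{code}
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class ClasspathEnvExample {
  public static ContainerLaunchContext buildContext() {
    // Build CLASSPATH with the cross-platform expansion ($$()) and the
    // platform-neutral separator token instead of hard-coding ':' or ';'.
    StringBuilder classpath = new StringBuilder(Environment.PWD.$$());
    classpath.append(ApplicationConstants.CLASS_PATH_SEPARATOR)
        .append(Environment.PWD.$$()).append("/*");

    Map<String, String> env = new HashMap<String, String>();
    env.put(Environment.CLASSPATH.name(), classpath.toString());

    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    ctx.setEnvironment(env);
    // Local resources, launch commands, etc. would be set here as in any YARN client.
    return ctx;
  }
}
{code}

The NodeManager substitutes the separator and expansion tokens for its own platform at launch time, which is what keeps the submission cross-platform.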

 Adding separator ApplicationConstants.FILE_PATH_SEPARATOR for better Windows 
 support
 

 Key: YARN-2929
 URL: https://issues.apache.org/jira/browse/YARN-2929
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2929.001.patch


 Some frameworks like Spark are working to run jobs on Windows (SPARK-1825). 
 For better multi-platform support, we should introduce 
 ApplicationConstants.FILE_PATH_SEPARATOR to make file paths 
 platform-independent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2929) Adding separator ApplicationConstants.FILE_PATH_SEPARATOR for better Windows support

2014-12-11 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243482#comment-14243482
 ] 

Chris Nauroth commented on YARN-2929:
-

I apologize, but I'm still having trouble seeing the usefulness of this.  Using 
{{Path}} as I described earlier effectively changes any file path into valid 
URI syntax, using forward slashes instead of back slashes.  I expect this would 
then be a valid, usable path at the NodeManager regardless of its OS.  Even if 
the path originates from a shell script, I don't see how that would make a 
difference.

Do you have an example YARN application submission that would demonstrate the 
problem in more detail?  Alternatively, if you could point out a spot in 
Spark's YARN application submission code that demonstrates the problem, then I 
could look at that.  I am assuming here that a path originating from a shell 
script would get passed into a Spark Java process, where Spark code would have 
an opportunity to use the {{Path}} class like I described.  Please let me know 
if my assumption is wrong.

There isn't anything necessarily wrong with the patch posted.  It just looks to 
me at this point like it isn't required.  By minimizing token replacement rules 
like this, we'd reduce the number of special cases that YARN application 
writers would need to consider.

 Adding separator ApplicationConstants.FILE_PATH_SEPARATOR for better Windows 
 support
 

 Key: YARN-2929
 URL: https://issues.apache.org/jira/browse/YARN-2929
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2929.001.patch


 Some frameworks like Spark are working to run jobs on Windows (SPARK-1825). 
 For better multi-platform support, we should introduce 
 ApplicationConstants.FILE_PATH_SEPARATOR to make file paths 
 platform-independent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2929) Adding separator ApplicationConstants.FILE_PATH_SEPARATOR for better Windows support

2014-12-06 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14237043#comment-14237043
 ] 

Chris Nauroth commented on YARN-2929:
-

Hi, [~ozawa].

I would not expect this change to be necessary.  To achieve cross-platform 
application submissions, I believe you would just need to pass a given file 
system path through the {{org.apache.hadoop.fs.Path}} class.  In that class, we 
have implemented logic for handling Windows paths.  Part of that logic replaces 
all back slashes with forward slashes when running on Windows.  For example, 
{{new Path("C:\foo\bar").toString()}} yields the string {{C:/foo/bar}} when 
running on Windows.  The forward slash format works fine on both Linux and 
Windows.
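
As a quick illustration of that behavior (assuming the code runs on Windows; on Linux the backslash is not treated as a separator, so the output differs):

{code}
import org.apache.hadoop.fs.Path;

public class PathNormalizationExample {
  public static void main(String[] args) {
    // On Windows, Path converts back slashes to forward slashes,
    // so the resulting string is usable on either platform.
    Path p = new Path("C:\\foo\\bar");
    System.out.println(p.toString());   // prints C:/foo/bar on Windows
  }
}
{code}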

This is different from the classpath separator.  We didn't have any similar 
special handling for that, which is why we needed to implement YARN-1824.

I looked at SPARK-1825, and it seems there is a question about handling of 
SPARK_HOME.  I'd expect this path could differ between client and server.  
Regardless of the issue of path separator, the actual path is likely to be 
different, so it seems like the server side really needs to be responsible for 
injecting this.

I don't have any experience with Spark though, so please let me know if I'm 
missing something.  Thanks!

 Adding separator ApplicationConstants.FILE_PATH_SEPARATOR for better Windows 
 support
 

 Key: YARN-2929
 URL: https://issues.apache.org/jira/browse/YARN-2929
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
 Attachments: YARN-2929.001.patch


 Some frameworks like Spark are working to run jobs on Windows (SPARK-1825). 
 For better multi-platform support, we should introduce 
 ApplicationConstants.FILE_PATH_SEPARATOR to make file paths 
 platform-independent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.

2014-11-07 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202571#comment-14202571
 ] 

Chris Nauroth commented on YARN-2803:
-

Thanks again, Craig.  I re-verified that the tests pass in my environment with 
this version of the patch.  I agree with the argument to retain the current 
behavior of secure mode (such as it is).

Sorry to nitpick, but it looks like some lines are indented by 1 space instead 
of 2.  Would you mind fixing that?  I'll be +1 after that.

{code}
if (exec instanceof WindowsSecureContainerExecutor) {
 jarDir = nmPrivateClasspathJarDir;
} else {
 jarDir = pwd; 
}
{code}
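
For reference, the same lines with the conventional two-space indentation:

{code}
if (exec instanceof WindowsSecureContainerExecutor) {
  jarDir = nmPrivateClasspathJarDir;
} else {
  jarDir = pwd;
}
{code}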


 MR distributed cache not working correctly on Windows after NodeManager 
 privileged account changes.
 ---

 Key: YARN-2803
 URL: https://issues.apache.org/jira/browse/YARN-2803
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Craig Welch
Priority: Critical
 Attachments: YARN-2803.0.patch, YARN-2803.1.patch


 This problem is visible by running {{TestMRJobs#testDistributedCache}} or 
 {{TestUberAM#testDistributedCache}} on Windows.  Both tests fail.  Running 
 git bisect, I traced it to the YARN-2198 patch to remove the need to run 
 NodeManager as a privileged account.  The tests started failing when that 
 patch was committed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.

2014-11-07 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202602#comment-14202602
 ] 

Chris Nauroth commented on YARN-2803:
-

+1 for the v2 patch.  I'll commit this.

 MR distributed cache not working correctly on Windows after NodeManager 
 privileged account changes.
 ---

 Key: YARN-2803
 URL: https://issues.apache.org/jira/browse/YARN-2803
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Craig Welch
Priority: Critical
 Attachments: YARN-2803.0.patch, YARN-2803.1.patch, YARN-2803.2.patch


 This problem is visible by running {{TestMRJobs#testDistributedCache}} or 
 {{TestUberAM#testDistributedCache}} on Windows.  Both tests fail.  Running 
 git bisect, I traced it to the YARN-2198 patch to remove the need to run 
 NodeManager as a privileged account.  The tests started failing when that 
 patch was committed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.

2014-11-06 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-2803:

Assignee: Craig Welch
Hadoop Flags: Reviewed

+1 for the patch.  I verified that {{TestMRJobs}} and {{TestUberAM}} pass in my 
Windows environment.

I'm going to hold off on committing until tomorrow in case anyone else watching 
wants to comment regarding secure mode.  I do think we need to commit this, 
because without it, we have a regression in non-secure mode on Windows, which 
has been shipping for several releases already.  Secure mode is still under 
development as I understand it.

 MR distributed cache not working correctly on Windows after NodeManager 
 privileged account changes.
 ---

 Key: YARN-2803
 URL: https://issues.apache.org/jira/browse/YARN-2803
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Assignee: Craig Welch
Priority: Critical
 Attachments: YARN-2803.0.patch


 This problem is visible by running {{TestMRJobs#testDistributedCache}} or 
 {{TestUberAM#testDistributedCache}} on Windows.  Both tests fail.  Running 
 git bisect, I traced it to the YARN-2198 patch to remove the need to run 
 NodeManager as a privileged account.  The tests started failing when that 
 patch was committed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-11-06 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201674#comment-14201674
 ] 

Chris Nauroth commented on YARN-2198:
-

This patch caused {{TestWinUtils#testChmod}} to fail.  I submitted a patch on 
HADOOP-11280 to fix the test.

 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Fix For: 2.6.0

 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, 
 YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However this 
 executor requires the process launching the container to be LocalSystem or a 
 member of the local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates to the entire NM to run as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.

2014-11-03 Thread Chris Nauroth (JIRA)
Chris Nauroth created YARN-2803:
---

 Summary: MR distributed cache not working correctly on Windows 
after NodeManager privileged account changes.
 Key: YARN-2803
 URL: https://issues.apache.org/jira/browse/YARN-2803
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Priority: Critical


This problem is visible by running {{TestMRJobs#testDistributedCache}} or 
{{TestUberAM#testDistributedCache}} on Windows.  Both tests fail.  Running git 
bisect, I traced it to the YARN-2198 patch to remove the need to run 
NodeManager as a privileged account.  The tests started failing when that patch 
was committed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2803) MR distributed cache not working correctly on Windows after NodeManager privileged account changes.

2014-11-03 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195451#comment-14195451
 ] 

Chris Nauroth commented on YARN-2803:
-

Here is the stack trace from a failure.

{code}
testDistributedCache(org.apache.hadoop.mapreduce.v2.TestMRJobs)  Time elapsed: 16.844 sec   FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.mapreduce.v2.TestMRJobs._testDistributedCache(TestMRJobs.java:881)
at 
org.apache.hadoop.mapreduce.v2.TestMRJobs.testDistributedCache(TestMRJobs.java:891)
{code}

The task log shows the assertion failing when it tries to find 
job.jar/lib/lib2.jar.

{code}
2014-11-03 15:36:33,652 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error 
running child : java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertNotNull(Assert.java:621)
at org.junit.Assert.assertNotNull(Assert.java:631)
at 
org.apache.hadoop.mapreduce.v2.TestMRJobs$DistributedCacheChecker.setup(TestMRJobs.java:764)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:784)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:169)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1640)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
{code}


 MR distributed cache not working correctly on Windows after NodeManager 
 privileged account changes.
 ---

 Key: YARN-2803
 URL: https://issues.apache.org/jira/browse/YARN-2803
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Chris Nauroth
Priority: Critical

 This problem is visible by running {{TestMRJobs#testDistributedCache}} or 
 {{TestUberAM#testDistributedCache}} on Windows.  Both tests fail.  Running 
 git bisect, I traced it to the YARN-2198 patch to remove the need to run 
 NodeManager as a privileged account.  The tests started failing when that 
 patch was committed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor

2014-11-03 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14195453#comment-14195453
 ] 

Chris Nauroth commented on YARN-2198:
-

It appears that this patch has broken some MR distributed cache functionality 
on Windows, or at least caused a failure in 
{{TestMRJobs#testDistributedCache}}.  Please see YARN-2803 for more details.

 Remove the need to run NodeManager as privileged account for Windows Secure 
 Container Executor
 --

 Key: YARN-2198
 URL: https://issues.apache.org/jira/browse/YARN-2198
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Remus Rusanu
Assignee: Remus Rusanu
  Labels: security, windows
 Fix For: 2.6.0

 Attachments: .YARN-2198.delta.10.patch, YARN-2198.1.patch, 
 YARN-2198.11.patch, YARN-2198.12.patch, YARN-2198.13.patch, 
 YARN-2198.14.patch, YARN-2198.15.patch, YARN-2198.16.patch, 
 YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, 
 YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.delta.7.patch, 
 YARN-2198.separation.patch, YARN-2198.trunk.10.patch, 
 YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch, 
 YARN-2198.trunk.8.patch, YARN-2198.trunk.9.patch


 YARN-1972 introduces a Secure Windows Container Executor. However this 
 executor requires the process launching the container to be LocalSystem or a 
 member of the local Administrators group. Since the process in question is 
 the NodeManager, the requirement translates to the entire NM to run as a 
 privileged account, a very large surface area to review and protect.
 This proposal is to move the privileged operations into a dedicated NT 
 service. The NM can run as a low privilege account and communicate with the 
 privileged NT service when it needs to launch a container. This would reduce 
 the surface exposed to the high privileges. 
 There has to exist a secure, authenticated and authorized channel of 
 communication between the NM and the privileged NT service. Possible 
 alternatives are a new TCP endpoint, Java RPC etc. My proposal though would 
 be to use Windows LPC (Local Procedure Calls), which is a Windows platform 
 specific inter-process communication channel that satisfies all requirements 
 and is easy to deploy. The privileged NT service would register and listen on 
 an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop 
 with libwinutils which would host the LPC client code. The client would 
 connect to the LPC port (NtConnectPort) and send a message requesting a 
 container launch (NtRequestWaitReplyPort). LPC provides authentication and 
 the privileged NT service can use authorization API (AuthZ) to validate the 
 caller.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2677) registry punycoding of usernames doesn't fix all usernames to be DNS-valid

2014-10-30 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-2677:

Hadoop Flags: Reviewed

+1 for the patch.  I verified this on both Mac and Windows.  Thanks, Steve!

 registry punycoding of usernames doesn't fix all usernames to be DNS-valid
 --

 Key: YARN-2677
 URL: https://issues.apache.org/jira/browse/YARN-2677
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, resourcemanager
Affects Versions: 2.6.0
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: YARN-2677-001.patch, YARN-2677-002.patch


 The registry restricts entries to DNS-valid names only, to retain the future 
 option of exporting the registry over DNS.
 To handle complex usernames, it punycodes the username first, using Java's 
 {{java.net.IDN}} class.
 This turns out to only map high unicode to ASCII, and does nothing for 
 ASCII-but-invalid-hostname characters, so it stops users with DNS-illegal names 
 (e.g. with an underscore in them) from being able to register.
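
A quick illustration of that {{java.net.IDN}} behavior; the sample names are made up, and the outputs in the comments are what the default ToASCII mapping produces:

{code}
import java.net.IDN;

public class IdnExample {
  public static void main(String[] args) {
    // Non-ASCII labels are converted to punycode...
    System.out.println(IDN.toASCII("bücher"));     // xn--bcher-kva
    // ...but ASCII characters that are invalid in hostnames, such as the
    // underscore, pass through untouched because STD3 rules are off by default.
    System.out.println(IDN.toASCII("some_user"));  // some_user
  }
}
{code}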



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems

2014-10-22 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-2700:

Hadoop Flags: Reviewed

+1 for the patch, pending Jenkins.  Thanks for the fix, Steve.

 TestSecureRMRegistryOperations failing on windows: auth problems
 

 Key: YARN-2700
 URL: https://issues.apache.org/jira/browse/YARN-2700
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, resourcemanager
Affects Versions: 2.6.0
 Environment: Windows Server, Win7
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: YARN-2700-001.patch


 TestSecureRMRegistryOperations failing on windows: unable to create the root 
 /registry path with permissions problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2700) TestSecureRMRegistryOperations failing on windows: auth problems

2014-10-22 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14180086#comment-14180086
 ] 

Chris Nauroth commented on YARN-2700:
-

bq. ...pending Jenkins...

Never mind.  It looks like Jenkins and I had a race condition commenting.  :-)  
You have a full +1 from me now.

 TestSecureRMRegistryOperations failing on windows: auth problems
 

 Key: YARN-2700
 URL: https://issues.apache.org/jira/browse/YARN-2700
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, resourcemanager
Affects Versions: 2.6.0
 Environment: Windows Server, Win7
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: YARN-2700-001.patch


 TestSecureRMRegistryOperations failing on windows: unable to create the root 
 /registry path with permissions problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives

2014-10-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-2720:

 Component/s: nodemanager
Target Version/s: 2.6.0

 Windows: Wildcard classpath variables not expanded against resources 
 contained in archives
 --

 Key: YARN-2720
 URL: https://issues.apache.org/jira/browse/YARN-2720
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Craig Welch
Assignee: Craig Welch

 On Windows there are limitations to the length of command lines and 
 environment variables which prevent placing all classpath resources into 
 these elements.  Instead, a jar containing only a classpath manifest is 
 created to provide the classpath.  During this process wildcard references 
 are expanded by inspecting the filesystem.  Since archives are extracted to a 
 different location and linked into the final location after the classpath jar 
 is created, resources referred to via wildcards which exist in localized 
 archives  (.zip, tar.gz) are not added to the classpath manifest jar.  Since 
 these entries are removed from the final classpath for the container they are 
 not on the container's classpath as they should be.
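
A minimal sketch of the manifest-only jar technique using the standard {{java.util.jar}} API; this illustrates the idea, not the actual NodeManager implementation:

{code}
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ClasspathJarSketch {
  /** Writes a jar whose only payload is a manifest carrying the Class-Path attribute. */
  public static void writeClasspathJar(String jarPath, String[] classpathEntries)
      throws IOException {
    StringBuilder cp = new StringBuilder();
    for (String entry : classpathEntries) {
      if (cp.length() > 0) {
        cp.append(' ');   // Class-Path entries are space separated
      }
      cp.append(entry);
    }
    Manifest manifest = new Manifest();
    Attributes attrs = manifest.getMainAttributes();
    attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
    attrs.put(Attributes.Name.CLASS_PATH, cp.toString());
    try (JarOutputStream out =
        new JarOutputStream(new FileOutputStream(jarPath), manifest)) {
      // No entries beyond the manifest are required.
    }
  }
}
{code}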



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives

2014-10-21 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-2720:

Hadoop Flags: Reviewed

+1 for the patch, pending Jenkins run.  I've verified that this works in my 
environment with a few test runs.  Thank you for fixing this, Craig.

 Windows: Wildcard classpath variables not expanded against resources 
 contained in archives
 --

 Key: YARN-2720
 URL: https://issues.apache.org/jira/browse/YARN-2720
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Craig Welch
Assignee: Craig Welch
 Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch


 On Windows there are limitations to the length of command lines and 
 environment variables which prevent placing all classpath resources into 
 these elements.  Instead, a jar containing only a classpath manifest is 
 created to provide the classpath.  During this process wildcard references 
 are expanded by inspecting the filesystem.  Since archives are extracted to a 
 different location and linked into the final location after the classpath jar 
 is created, resources referred to via wildcards which exist in localized 
 archives  (.zip, tar.gz) are not added to the classpath manifest jar.  Since 
 these entries are removed from the final classpath for the container they are 
 not on the container's classpath as they should be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2720) Windows: Wildcard classpath variables not expanded against resources contained in archives

2014-10-21 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14178897#comment-14178897
 ] 

Chris Nauroth commented on YARN-2720:
-

The Findbugs warnings are unrelated.  I'll commit this.

 Windows: Wildcard classpath variables not expanded against resources 
 contained in archives
 --

 Key: YARN-2720
 URL: https://issues.apache.org/jira/browse/YARN-2720
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Craig Welch
Assignee: Craig Welch
 Attachments: YARN-2720.2.patch, YARN-2720.3.patch, YARN-2720.4.patch


 On Windows there are limitations to the length of command lines and 
 environment variables which prevent placing all classpath resources into 
 these elements.  Instead, a jar containing only a classpath manifest is 
 created to provide the classpath.  During this process wildcard references 
 are expanded by inspecting the filesystem.  Since archives are extracted to a 
 different location and linked into the final location after the classpath jar 
 is created, resources referred to via wildcards which exist in localized 
 archives  (.zip, tar.gz) are not added to the classpath manifest jar.  Since 
 these entries are removed from the final classpath for the container they are 
 not on the container's classpath as they should be.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (YARN-2692) ktutil test hanging on some machines/ktutil versions

2014-10-20 Thread Chris Nauroth (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Nauroth updated YARN-2692:

Hadoop Flags: Reviewed

+1 for the patch.  I agree that we're not really losing any test coverage by 
removing this.  {{TestSecureRegistry}} will make use of the same keytab file 
implicitly.

 ktutil test hanging on some machines/ktutil versions
 

 Key: YARN-2692
 URL: https://issues.apache.org/jira/browse/YARN-2692
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 2.6.0
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: YARN-2692-001.patch


 A couple of the registry security tests run native {{ktutil}}, primarily to 
 debug the keytab generation. [~cnauroth] reports that some versions of 
 {{ktutil}} hang. Fix: remove the tests. [YARN-2689]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows

2014-10-15 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172629#comment-14172629
 ] 

Chris Nauroth commented on YARN-2689:
-

Hi Steve.  Thanks for fixing this.  The patch looks good to me, and I verified 
that it fixes the tests on Windows.  However, I'm seeing a problem when running 
on Mac and Linux.  It's hanging while executing {{ktutil}}.  On my systems, 
{{ktutil}} is an interactive command, so the tests are starting up the child 
process, and then it's never exiting.  (See stack trace below.)  Some quick 
searching indicates that some installations of {{ktutil}} are non-interactive, 
but others are entirely interactive (MIT for example).

{code}
JUnit prio=10 tid=0x7f424488c000 nid=0x675 runnable [0x7f4239c02000]
   java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:236)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked 0xed550918 (a java.io.BufferedInputStream)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:282)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:324)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
- locked 0xed3df918 (a java.io.InputStreamReader)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:153)
at java.io.BufferedReader.read1(BufferedReader.java:204)
at java.io.BufferedReader.read(BufferedReader.java:278)
- locked 0xed3df918 (a java.io.InputStreamReader)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.parseExecResult(Shell.java:721)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:530)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:708)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:797)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:780)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{code}


 TestSecureRMRegistryOperations failing on windows
 -

 Key: YARN-2689
 URL: https://issues.apache.org/jira/browse/YARN-2689
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, resourcemanager
Affects Versions: 2.6.0
 Environment: Windows server, Java 7, ZK 3.4.6
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: YARN-2689-001.patch


 The micro ZK service used in the {{TestSecureRMRegistryOperations}} test 
 doesn't start on Windows: 
 {code}
 org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could 
 not configure server because SASL configuration did not allow the  ZooKeeper 
 server to authenticate itself properly: 
 javax.security.auth.login.LoginException: Unable to obtain password from user
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2689) TestSecureRMRegistryOperations failing on windows

2014-10-15 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14172676#comment-14172676
 ] 

Chris Nauroth commented on YARN-2689:
-

+1 for the patch.  Thanks again, Steve.

 TestSecureRMRegistryOperations failing on windows
 -

 Key: YARN-2689
 URL: https://issues.apache.org/jira/browse/YARN-2689
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api, resourcemanager
Affects Versions: 2.6.0
 Environment: Windows server, Java 7, ZK 3.4.6
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: YARN-2689-001.patch


 The micro ZK service used in the {{TestSecureRMRegistryOperations}} test 
 doesn't start on Windows: 
 {code}
 org.apache.hadoop.service.ServiceStateException: java.io.IOException: Could 
 not configure server because SASL configuration did not allow the  ZooKeeper 
 server to authenticate itself properly: 
 javax.security.auth.login.LoginException: Unable to obtain password from user
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2668) yarn-registry JAR won't link against ZK 3.4.5

2014-10-10 Thread Chris Nauroth (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14167333#comment-14167333
 ] 

Chris Nauroth commented on YARN-2668:
-

Thanks for catching that, Steve.  +1 for patch v2, pending Jenkins.

 yarn-registry JAR won't link against ZK 3.4.5
 -

 Key: YARN-2668
 URL: https://issues.apache.org/jira/browse/YARN-2668
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: client
Affects Versions: 2.6.0
Reporter: Steve Loughran
Assignee: Steve Loughran
 Attachments: YARN-2668-001.patch, YARN-2668-002.patch

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 It's been reported that the registry code doesn't link against ZK 3.4.5, as 
 the enable/disable SASL client property isn't there; it went in with 
 ZOOKEEPER-1657.
 Pulling in the constant and the {{isEnabled()}} check will ensure the registry 
 links, even though the ability for a client to disable SASL auth will be 
 lost.
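
A sketch of that shim, assuming the ZooKeeper 3.4.6 system property name zookeeper.sasl.client; the class and method names here are illustrative:

{code}
public final class RegistrySaslCompat {
  // Same property that ZOOKEEPER-1657 introduced; defining it locally avoids
  // a compile-time dependency on the ZK 3.4.6 constant.
  public static final String ENABLE_CLIENT_SASL_KEY = "zookeeper.sasl.client";
  public static final String ENABLE_CLIENT_SASL_DEFAULT = "true";

  private RegistrySaslCompat() {
  }

  /** @return true unless the client has explicitly disabled SASL. */
  public static boolean isClientSaslEnabled() {
    return Boolean.valueOf(
        System.getProperty(ENABLE_CLIENT_SASL_KEY, ENABLE_CLIENT_SASL_DEFAULT));
  }
}
{code}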



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

