[jira] [Commented] (YARN-3862) Decide which contents to retrieve and send back in response in TimelineReader
[ https://issues.apache.org/jira/browse/YARN-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994758#comment-14994758 ] Sangjin Lee commented on YARN-3862: ---

[~jrottinghuis] and I went over the patch in some more detail, and have a few high-level suggestions.

I don't think the qualifiers currently being created in TimelineEntityReader.constructFilterListBasedOnFields() are quite right. We already talked about breaking down and pushing the logic into the appropriate specific entity reader implementations. In addition, instead of trying to compute the byte arrays from raw ingredients like Separators, we should rely on the \*ColumnPrefix classes to give you the byte arrays. That would lead to more properly encapsulated (and correct) code. The ColumnPrefix classes already do something like the following (see ApplicationColumnPrefix.store() for example):

{code}
byte[] columnQualifier = ColumnHelper.getColumnQualifier(this.columnPrefixBytes, qualifier);
{code}

We could expose a new method on ColumnPrefix like

{code}
public interface ColumnPrefix {
  ...
  byte[] getColumnQualifierBytes(String qualifier);
  ...
}
{code}

and specific implementations can implement that method. That way, all the proper column prefix handling is managed and encapsulated by the ColumnPrefix classes. When we move the filter-list creation logic to the appropriate entity reader classes, those classes already know which column prefix they're dealing with, and they can simply call these methods to get the bytes back. That will make the implementation much cleaner.

Hope this helps. Let me know if you have any questions... Thanks!
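A minimal sketch of the idea above, assuming a simple prefix-separator-qualifier layout. The `ApplicationColumnPrefixSketch` class and the `'!'` separator byte are illustrative assumptions, not the actual ColumnHelper/Separator encoding in the Hadoop codebase:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical interface matching the proposal: the prefix class owns
// the byte-level qualifier construction so entity readers never touch
// separator logic directly.
interface ColumnPrefix {
  byte[] getColumnQualifierBytes(String qualifier);
}

class ApplicationColumnPrefixSketch implements ColumnPrefix {
  private static final byte SEPARATOR = (byte) '!'; // assumed separator byte
  private final byte[] columnPrefixBytes;

  ApplicationColumnPrefixSketch(String prefix) {
    this.columnPrefixBytes = prefix.getBytes(StandardCharsets.UTF_8);
  }

  @Override
  public byte[] getColumnQualifierBytes(String qualifier) {
    byte[] q = qualifier.getBytes(StandardCharsets.UTF_8);
    // prefix + separator + qualifier, concatenated into one array
    byte[] out = new byte[columnPrefixBytes.length + 1 + q.length];
    System.arraycopy(columnPrefixBytes, 0, out, 0, columnPrefixBytes.length);
    out[columnPrefixBytes.length] = SEPARATOR;
    System.arraycopy(q, 0, out, columnPrefixBytes.length + 1, q.length);
    return out;
  }
}
```

An entity reader building filters would then only call `getColumnQualifierBytes("cpu")` rather than assembling separator bytes itself.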
> Decide which contents to retrieve and send back in response in TimelineReader > - > > Key: YARN-3862 > URL: https://issues.apache.org/jira/browse/YARN-3862 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-3862-YARN-2928.wip.01.patch, > YARN-3862-YARN-2928.wip.02.patch > > > Currently, we will retrieve all the contents of the field if that field is > specified in the query API. In case of configs and metrics, this can become a > lot of data even though the user doesn't need it. So we need to provide a way > to query only a set of configs or metrics. > As a comma-separated list of configs/metrics to be returned will be quite > cumbersome to specify, we have to support either of the following options : > # Prefix match > # Regex > # Group the configs/metrics and query that group. > We also need a facility to specify a metric time window to return metrics in > that window. This may be useful in plotting graphs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4053) Change the way metric values are stored in HBase Storage
[ https://issues.apache.org/jira/browse/YARN-4053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994603#comment-14994603 ] Vrushali C commented on YARN-4053: --

Thanks [~varun_saxena] for the patch and [~djp], [~gtCarrera], [~Naganarasimha], [~sjlee0] and [~jrottinghuis] for the discussion so far!

[~jrottinghuis], [~sjlee0] and I had an offline discussion on this yesterday. We discussed at length along the following vectors:
- metric datatype: long, double, either/or, or both?
- metric type storage and retrieval: single values vs. time series
- metrics in the context of aggregation: how to indicate whether to aggregate or not
- operations on metrics: sum vs. average, min/max

To summarize the discussion:
- Our proposal is to proceed with supporting only longs for now. We went over several scenarios of how to store and query decimal numbers: as doubles or as numerator/denominator, how to use filters while scanning for such stored values, how aggregation would handle them, etc. We thought about which metrics would be stored as doubles and how the precision might affect aggregation. We finally concluded that we should start with storing longs only and make the code strictly accept longs (not even ints or shorts).
- For single value vs. time series, we suggest using a column prefix to distinguish them. For the read path, we can assume a value is a single value unless the client specifically requests a time series (clients would need to intend to read time series explicitly).
- Regarding indicating whether to aggregate or not, we suggest relying mostly on the flow run aggregation. For those use cases that need to access metrics from tables other than the flow run table (e.g. time-based aggregation), we need to explore ways to specify this information as input (config, etc.).
- So the current patch is along the lines of our proposal of using longs for metrics.
But we are considering a different approach of creating a "converter" type and implementation. For other non-metric columns, a "generic" converter that uses the GenericObjectMapper can be created and used implicitly. For the numeric (long) columns, a long converter would be used explicitly. We also need to revisit how it's done in FlowScanner (the current patch missed one of the places there, for example). We need to get at the instances of ColumnPrefix, ColumnFamily, etc. and use them to get the converter in the flow scanner. @Varun Would it be fine if I took over this jira to patch it with the above points? thanks Vrushali > Change the way metric values are stored in HBase Storage > > > Key: YARN-4053 > URL: https://issues.apache.org/jira/browse/YARN-4053 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: YARN-2928 >Reporter: Varun Saxena >Assignee: Varun Saxena > Attachments: YARN-4053-YARN-2928.01.patch, > YARN-4053-YARN-2928.02.patch > > > Currently the HBase implementation uses GenericObjectMapper to convert and store > values in the backend HBase storage. This converts everything into a string > representation (ASCII/UTF-8 encoded byte array). > While this is fine in most cases, it does not quite serve our use case for > metrics. > So we need to decide how we are going to encode and decode metric values and > store them in HBase. > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
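A minimal sketch of the converter idea described above. The interface and class names are hypothetical (not the eventual YARN API), and the generic converter is stood in by plain UTF-8 strings rather than the real GenericObjectMapper:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical converter abstraction: one implicit "generic" converter
// for non-metric columns, one explicit converter for long metric values.
interface ValueConverter<T> {
  byte[] encode(T value);
  T decode(byte[] bytes);
}

// Stand-in for the GenericObjectMapper-based converter (strings as UTF-8).
class GenericConverter implements ValueConverter<String> {
  public byte[] encode(String v) { return v.getBytes(StandardCharsets.UTF_8); }
  public String decode(byte[] b) { return new String(b, StandardCharsets.UTF_8); }
}

// Metric values as fixed-width 8-byte longs, so HBase comparators,
// filters, and coprocessor aggregation can work on the raw bytes.
class LongConverter implements ValueConverter<Long> {
  public byte[] encode(Long v) {
    return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
  }
  public Long decode(byte[] b) {
    return ByteBuffer.wrap(b).getLong();
  }
}
```

The point of the fixed-width long encoding is that numeric ordering is preserved for non-negative values and summation can happen without going through a string representation, which the GenericObjectMapper path does not give you.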
[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler
[ https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994799#comment-14994799 ] Wangda Tan commented on YARN-3980: --

Just noticed this ticket. [~sunilg], since this patch is almost ready to go, if you think it's fine, could you update YARN-4292 to base on this one? And could you take a look at this patch to see if there are any changes that need to be merged?

[~goiri], thanks for working on this, two comments:

{code}
public void setContainersUtilization(
    ResourceUtilization containersUtilization) {
  if (containersUtilization != null) {
    this.containersUtilization = containersUtilization;
  }
}
{code}

It seems SchedulerNode cannot update containersUtilization since it's always non-null. I think you should directly update the utilization when set. And I suggest updating SchedulerNode.containersUtilization/nodeUtilization to use violate.

And for naming, I suggest changing getContainersUtilization in RMNode/SchedulerNode/RMNodeStatusEvent to get*Aggregated*ContainersUtilization, which is more straightforward to me.

> Plumb resource-utilization info in node heartbeat through to the scheduler > -- > > Key: YARN-3980 > URL: https://issues.apache.org/jira/browse/YARN-3980 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Karthik Kambatla >Assignee: Inigo Goiri > Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, > YARN-3980-v2.patch, YARN-3980-v3.patch, YARN-3980-v4.patch > > > YARN-1012 and YARN-3534 collect resource utilization information for all > containers and the node respectively and send it to the RM on node heartbeat. > We should plumb it through to the scheduler so the scheduler can make use of > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
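A minimal sketch of the two review points (assign the field directly in the setter, and mark it volatile so scheduler threads observe heartbeat updates). The class name and the `long` field type are stand-ins; the real field would be a ResourceUtilization:

```java
// Hypothetical SchedulerNode fragment: the heartbeat-handling thread
// writes the utilization, scheduler threads read it. "volatile" gives
// visibility across threads without locking; the setter updates the
// field unconditionally instead of silently dropping values behind a
// null check.
class SchedulerNodeSketch {
  private volatile long aggregatedContainersUtilization;

  public void setAggregatedContainersUtilization(long utilization) {
    this.aggregatedContainersUtilization = utilization; // direct update
  }

  public long getAggregatedContainersUtilization() {
    return aggregatedContainersUtilization;
  }
}
```

Note that volatile only guarantees visibility of the latest write, not atomicity of compound read-modify-write operations; for a simple "replace on each heartbeat" pattern like this one, that is sufficient.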
[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler
[ https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994811#comment-14994811 ] Inigo Goiri commented on YARN-3980: --- OK, I'll try it for both. > Plumb resource-utilization info in node heartbeat through to the scheduler > -- > > Key: YARN-3980 > URL: https://issues.apache.org/jira/browse/YARN-3980 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Karthik Kambatla >Assignee: Inigo Goiri > Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, > YARN-3980-v2.patch, YARN-3980-v3.patch, YARN-3980-v4.patch > > > YARN-1012 and YARN-3534 collect resource utilization information for all > containers and the node respectively and send it to the RM on node heartbeat. > We should plumb it through to the scheduler so the scheduler can make use of > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4184) Remove update reservation state api from state store as its not used by ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Po updated YARN-4184: -- Attachment: YARN-4184.v1.patch > Remove update reservation state api from state store as its not used by > ReservationSystem > - > > Key: YARN-4184 > URL: https://issues.apache.org/jira/browse/YARN-4184 > Project: Hadoop YARN > Issue Type: Sub-task > Components: capacityscheduler, fairscheduler, resourcemanager >Reporter: Anubhav Dhoot >Assignee: Sean Po > Attachments: YARN-4184.v1.patch > > > ReservationSystem uses remove/add for updates and thus update api in state > store is not needed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler
[ https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994805#comment-14994805 ] Inigo Goiri commented on YARN-3980: --- How does the violate thing work? Do you have any example? I'll change the naming for getContainersUtilization. > Plumb resource-utilization info in node heartbeat through to the scheduler > -- > > Key: YARN-3980 > URL: https://issues.apache.org/jira/browse/YARN-3980 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Karthik Kambatla >Assignee: Inigo Goiri > Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, > YARN-3980-v2.patch, YARN-3980-v3.patch, YARN-3980-v4.patch > > > YARN-1012 and YARN-3534 collect resource utilization information for all > containers and the node respectively and send it to the RM on node heartbeat. > We should plumb it through to the scheduler so the scheduler can make use of > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3980) Plumb resource-utilization info in node heartbeat through to the scheduler
[ https://issues.apache.org/jira/browse/YARN-3980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994807#comment-14994807 ] Wangda Tan commented on YARN-3980: -- Sorry, that was a typo; I meant: volatile. > Plumb resource-utilization info in node heartbeat through to the scheduler > -- > > Key: YARN-3980 > URL: https://issues.apache.org/jira/browse/YARN-3980 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, scheduler >Affects Versions: 2.7.1 >Reporter: Karthik Kambatla >Assignee: Inigo Goiri > Attachments: YARN-3980-v0.patch, YARN-3980-v1.patch, > YARN-3980-v2.patch, YARN-3980-v3.patch, YARN-3980-v4.patch > > > YARN-1012 and YARN-3534 collect resource utilization information for all > containers and the node respectively and send it to the RM on node heartbeat. > We should plumb it through to the scheduler so the scheduler can make use of > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4184) Remove update reservation state api from state store as its not used by ReservationSystem
[ https://issues.apache.org/jira/browse/YARN-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14995004#comment-14995004 ] Hadoop QA commented on YARN-4184: -

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 5s {color} | {color:blue} docker + precommit patch detected. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 9s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s {color} | {color:green} trunk passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s {color} | {color:green} trunk passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 27s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s {color} | {color:green} the patch passed with JDK v1.8.0_60 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s {color} | {color:green} the patch passed with JDK v1.7.0_79 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 58m 5s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.8.0_60. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 59m 8s {color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK v1.7.0_79. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 128m 32s {color} | {color:black} {color} |
\\ \\
|| Reason || Tests ||
| JDK v1.8.0_60 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| | hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore |
| | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
| JDK v1.7.0_79 Failed junit tests | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
\\ \\
|| Subsystem || Report/Notes ||
| Docker | Client=1.7.1 Server=1.7.1 Image:test-patch-base-hadoop-date2015-11-07 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12771140/YARN-4184.v1.patch |
| JIRA Issue | YARN-4184 |
| Optional Tests | asflicense javac javadoc mvninstall unit findbugs checkstyle compile |
| uname | Linux 75b2139115a8 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool |
[jira] [Updated] (YARN-4320) TestJobHistoryEventHandler fails as AHS in MiniYarnCluster no longer binds to default port 8188
[ https://issues.apache.org/jira/browse/YARN-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-4320: -- Fix Version/s: 2.6.3 > TestJobHistoryEventHandler fails as AHS in MiniYarnCluster no longer binds to > default port 8188 > --- > > Key: YARN-4320 > URL: https://issues.apache.org/jira/browse/YARN-4320 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Varun Saxena >Assignee: Varun Saxena > Fix For: 3.0.0, 2.8.0, 2.7.2, 2.6.3 > > Attachments: YARN-4320.01.patch > > > {noformat} > Running org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 40.256 sec > <<< FAILURE! - in > org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler > testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler) > Time elapsed: 35.764 sec <<< ERROR! > java.lang.RuntimeException: Failed to connect to timeline server. Connection > retries limit exceeded. 
The posted timeline event may be missing > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245) > at com.sun.jersey.api.client.Client.handle(Client.java:648) > at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) > at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) > at > com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:474) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:320) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:320) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:305) > at > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForTimelineServer(JobHistoryEventHandler.java:1015) > at > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:586) > at > org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.handleEvent(TestJobHistoryEventHandler.java:719) > at > org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:507) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4320) TestJobHistoryEventHandler fails as AHS in MiniYarnCluster no longer binds to default port 8188
[ https://issues.apache.org/jira/browse/YARN-4320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993332#comment-14993332 ] Sangjin Lee commented on YARN-4320: --- Committed it to branch-2.6 too. > TestJobHistoryEventHandler fails as AHS in MiniYarnCluster no longer binds to > default port 8188 > --- > > Key: YARN-4320 > URL: https://issues.apache.org/jira/browse/YARN-4320 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.1 >Reporter: Varun Saxena >Assignee: Varun Saxena > Fix For: 3.0.0, 2.8.0, 2.7.2, 2.6.3 > > Attachments: YARN-4320.01.patch > > > {noformat} > Running org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler > Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 40.256 sec > <<< FAILURE! - in > org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler > testTimelineEventHandling(org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler) > Time elapsed: 35.764 sec <<< ERROR! > java.lang.RuntimeException: Failed to connect to timeline server. Connection > retries limit exceeded. 
The posted timeline event may be missing > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245) > at com.sun.jersey.api.client.Client.handle(Client.java:648) > at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) > at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) > at > com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:474) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:320) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:320) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:305) > at > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.processEventForTimelineServer(JobHistoryEventHandler.java:1015) > at > org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.handleEvent(JobHistoryEventHandler.java:586) > at > org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.handleEvent(TestJobHistoryEventHandler.java:719) > at > org.apache.hadoop.mapreduce.jobhistory.TestJobHistoryEventHandler.testTimelineEventHandling(TestJobHistoryEventHandler.java:507) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3840) Resource Manager web ui issue when sorting application by id (with application having id > 9999)
[ https://issues.apache.org/jira/browse/YARN-3840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993412#comment-14993412 ] Mohammad Shahid Khan commented on YARN-3840:

UT failure -- not related to current patch
findbugs -- not related to current patch

> Resource Manager web ui issue when sorting application by id (with > application having id > 9999) > > > Key: YARN-3840 > URL: https://issues.apache.org/jira/browse/YARN-3840 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.0 >Reporter: LINTE >Assignee: Mohammad Shahid Khan > Attachments: RMApps.png, YARN-3840-1.patch, YARN-3840-2.patch, > YARN-3840-3.patch, YARN-3840-4.patch, YARN-3840-5.patch, YARN-3840-6.patch, > yarn-3840-7.patch > > > On the WEBUI, the global main view page : > http://resourcemanager:8088/cluster/apps doesn't display applications over > 9999. > With command line it works (# yarn application -list). > Regards, > Alexandre -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4336) YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
[ https://issues.apache.org/jira/browse/YARN-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994018#comment-14994018 ] Greg Senia commented on YARN-4336: -- [~jlowe] just confirmed that HADOOP-12413 does seem to fix it. Just ran a test with the else-if block. As a safety measure I may still keep my tactical fix for the time being, but I'll move the regex so it compiles once, since if the LDAP storm from our jobs shows up again I'm going to have a bad day :) Confirmed it was never sent out to NSS/LDAP: appattempt_1446307555640_0052_01
> YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP > > > Key: YARN-4336 > URL: https://issues.apache.org/jira/browse/YARN-4336 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0, 2.4.1, 2.6.0, 2.7.0, 2.6.1, 2.7.1 > Environment: NSS w/ SSSD or Dell/Quest - VASD >Reporter: Greg Senia >Assignee: Greg Senia > Attachments: YARN-4336-tactical.txt > > > Hi folks, after performing some debugging for our Unix Engineering and Active Directory teams, it was discovered that on YARN container initialization a call is made via Hadoop Common AccessControlList.java:
> for (String group : ugi.getGroupNames()) {
>   if (groups.contains(group)) {
>     return true;
>   }
> }
> Unfortunately, the security check for access on "appattempt_X_X_X" will always return false, but it makes unnecessary calls to the Name Service Switch on Linux, which will call things like SSSD/Quest VASD, which then initiate LDAP calls looking for nonexistent userids, causing excessive load on LDAP.
> For now our tactical work around is as follows: > /** >* Checks if a user represented by the provided {@link UserGroupInformation} >* is a member of the Access Control List >* @param ugi UserGroupInformation to check if contained in the ACL >* @return true if ugi is member of the list >*/ > public final boolean isUserInList(UserGroupInformation ugi) { > if (allAllowed || users.contains(ugi.getShortUserName())) { > return true; > } else { > String patternString = "^appattempt_\\d+_\\d+_\\d+$"; > Pattern pattern = Pattern.compile(patternString); > Matcher matcher = pattern.matcher(ugi.getShortUserName()); > boolean matches = matcher.matches(); > if (matches) { > LOG.debug("Bailing !! AppAttempt Matches DONOT call UGI FOR > GROUPS!!");; > return false; > } > > > for(String group: ugi.getGroupNames()) { > if (groups.contains(group)) { > return true; > } > } > } > return false; > } > public boolean isUserAllowed(UserGroupInformation ugi) { > return isUserInList(ugi); > } > Example of VASD Debug log showing the lookups for one task attempt 32 of them: > One task: > Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain 
EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:56:45 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter >
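The "compile the regex once" change mentioned in the comment above can be sketched as hoisting the Pattern into a static final field, so each check reuses the compiled automaton instead of recompiling per container launch (compiled Patterns are immutable and thread-safe). The helper class and method names here are illustrative, not the actual patch:

```java
import java.util.regex.Pattern;

// Hypothetical helper: short-circuit group lookups for synthetic
// "appattempt_*" principals before any NSS/LDAP call is made.
class AppAttemptCheck {
  // Compiled once at class-load time; Pattern instances are thread-safe.
  private static final Pattern APP_ATTEMPT =
      Pattern.compile("^appattempt_\\d+_\\d+_\\d+$");

  static boolean isAppAttemptName(String shortUserName) {
    return APP_ATTEMPT.matcher(shortUserName).matches();
  }
}
```

With this in place, `isUserInList()` can bail out for app-attempt principals without ever reaching the `ugi.getGroupNames()` call that triggers the LDAP storm.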
[jira] [Commented] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
[ https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994272#comment-14994272 ] Varun Saxena commented on YARN-4330:

Patch does the following:
1. If the node resource monitoring interval or container monitoring interval is <= 0, we consider this as disabling monitoring. An interval <= 0 doesn't make much sense anyway. A resource calculator plugin (even the default one) won't be required if the interval is <= 0. I have made changes in the relevant classes to take care of this, and have set this config to 0 in MiniYARNCluster. The dummy plugin won't be required in this case.
2. In NodeManagerHardwareUtils, we take the memory and CPU from config if hardware detection is disabled, irrespective of whether a resource calculator plugin can be created or not. Moved the code around in the class to check the disable config first and return the value from config if it is set. In MiniYARNCluster I have explicitly set it to false; I don't think hardware detection is required for tests.
3. Catching UnsupportedOperationException and logging it at INFO, with no stack trace printed. For other exceptions, the stack trace will still be printed (keeping it consistent with previous behavior); a stack trace for other unexpected exceptions may be useful.
> MiniYARNCluster prints multiple Failed to instantiate default resource > calculator warning messages > --- > > Key: YARN-4330 > URL: https://issues.apache.org/jira/browse/YARN-4330 > Project: Hadoop YARN > Issue Type: Bug > Components: test, yarn >Affects Versions: 2.8.0 > Environment: OSX, JUnit >Reporter: Steve Loughran >Assignee: Varun Saxena >Priority: Blocker > Attachments: YARN-4330.01.patch > > > Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I > see multiple stack traces warning me that a resource calculator plugin could > not be created > {code} > (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - > java.lang.UnsupportedOperationException: Could not determine OS: Failed to > instantiate default resource calculator. > java.lang.UnsupportedOperationException: Could not determine OS > {code} > This is a minicluster. It doesn't need resource calculation. It certainly > doesn't need test logs being cluttered with even more stack traces which will > only generate false alarms about tests failing. > There needs to be a way to turn this off, and the minicluster should have it > that way by default. > Being ruthless and marking as a blocker, because its a fairly major > regression for anyone testing with the minicluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
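Point 1 of the patch description boils down to a guard like the following sketch: treat a non-positive monitoring interval as "monitoring disabled", so no resource calculator plugin gets instantiated and no warning is logged. The class and method names are illustrative, not the actual YARN code:

```java
// Hypothetical fragment of the monitor-startup logic: only create the
// resource calculator plugin when the configured interval is positive.
class MonitoringConfigSketch {
  static boolean isMonitoringEnabled(long intervalMs) {
    // <= 0 means disabled: skip plugin creation entirely, which is what
    // lets MiniYARNCluster set the interval to 0 and avoid the
    // "Failed to instantiate default resource calculator" warnings.
    return intervalMs > 0;
  }
}
```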
[jira] [Commented] (YARN-4337) Resolve all docs errors in *.apt.vm for YARN
[ https://issues.apache.org/jira/browse/YARN-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994275#comment-14994275 ] Chen He commented on YARN-4337: --- Thank you for the reply, [~adaniels]. I changed the affects version to 2.6.x. > Resolve all docs errors in *.apt.vm for YARN > > > Key: YARN-4337 > URL: https://issues.apache.org/jira/browse/YARN-4337 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.6.1 >Reporter: Chen He >Priority: Minor > Labels: documentation, newbie > > This is a newbie++ docs ticket. > Simple example, In WebServiceInfo.apt.vm > *** JSON response with single resource > HTTP Request: > GET > http://rmhost.domain:8088/ws/v1/cluster/app/application_1324057493980_0001 > Response Status Line: > HTTP/1.1 200 OK > Response Header: > +---+ > The URI > http://rmhost.domain:8088/ws/v1/cluster/app/application_1324057493980_0001 is > invalid. It should be "apps" instead of "app" in the URI. It may mislead > first time users to think that YARN REST API does not work. Similarly, we > should remove all similar typos or minor errors in *.apt.vm file for YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3842) NMProxy should retry on NMNotYetReadyException
[ https://issues.apache.org/jira/browse/YARN-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-3842: -- Target Version/s: 2.7.1, 2.6.3 (was: 2.7.1) Running into this in a couple of places, we should get this into 2.6.3. > NMProxy should retry on NMNotYetReadyException > -- > > Key: YARN-3842 > URL: https://issues.apache.org/jira/browse/YARN-3842 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.7.0 >Reporter: Karthik Kambatla >Assignee: Robert Kanter >Priority: Critical > Fix For: 2.7.1 > > Attachments: MAPREDUCE-6409.001.patch, MAPREDUCE-6409.002.patch, > YARN-3842.001.patch, YARN-3842.002.patch > > > Consider the following scenario: > 1. RM assigns a container on node N to an app A. > 2. Node N is restarted > 3. A tries to launch container on node N. > 3 could lead to an NMNotYetReadyException depending on whether NM N has > registered with the RM. In MR, this is considered a task attempt failure. A > few of these could lead to a task/job failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
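The retry behavior being requested (treat NMNotYetReadyException as transient instead of failing the task attempt) can be sketched as follows. The class and interface names are illustrative stand-ins, not the actual NMProxy code, and the stub exception stands in for the real YARN NMNotYetReadyException:

```java
// Illustrative retry loop: retry container launch while the NM has not yet
// registered with the RM, instead of surfacing a task attempt failure.
class NMProxyRetrySketch {
    /** Stand-in for the real org.apache.hadoop.yarn NMNotYetReadyException. */
    static class NMNotYetReadyExceptionStub extends RuntimeException {}

    interface ContainerLaunch { void launch(); }

    static boolean launchWithRetry(ContainerLaunch op, int maxRetries, long waitMs) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                op.launch();
                return true;                       // NM accepted the launch
            } catch (NMNotYetReadyExceptionStub e) {
                try {
                    Thread.sleep(waitMs);          // NM not ready yet; back off and retry
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;                              // give up after maxRetries
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated NM that becomes ready on the third call
        ContainerLaunch op = () -> {
            if (++calls[0] < 3) throw new NMNotYetReadyExceptionStub();
        };
        assert launchWithRetry(op, 5, 1);
        System.out.println("ok");
    }
}
```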
[jira] [Commented] (YARN-4337) Resolve all docs errors in *.apt.vm for YARN
[ https://issues.apache.org/jira/browse/YARN-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994283#comment-14994283 ] Daniel Templeton commented on YARN-4337: Yep, the error is there in 2.6.1, but it's fixed in 2.7.0 by YARN-3436. Any reason not to close this as a dup?
[jira] [Resolved] (YARN-4337) Resolve all docs errors in *.apt.vm for YARN
[ https://issues.apache.org/jira/browse/YARN-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He resolved YARN-4337. --- Resolution: Duplicate
[jira] [Updated] (YARN-4337) Resolve all docs errors in *.apt.vm for YARN
[ https://issues.apache.org/jira/browse/YARN-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-4337: -- Affects Version/s: (was: 2.7.1) 2.6.1
[jira] [Commented] (YARN-4337) Resolve all docs errors in *.apt.vm for YARN
[ https://issues.apache.org/jira/browse/YARN-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994279#comment-14994279 ] Chen He commented on YARN-4337: --- I think after branch-2.7, we are not using apt.vm files anymore. Those *.apt.vm files are in 2.6.x.
[jira] [Commented] (YARN-4241) Typo in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994278#comment-14994278 ] Anthony Rojas commented on YARN-4241: - Linking to YARN-3943 to add the "yarn.nodemanager.disk-health-checker.disk-utilization-watermark-low-per-disk-percentage" property in yarn-default.xml > Typo in yarn-default.xml > > > Key: YARN-4241 > URL: https://issues.apache.org/jira/browse/YARN-4241 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, yarn >Reporter: Anthony Rojas >Assignee: Anthony Rojas >Priority: Trivial > Labels: newbie > Attachments: YARN-4241.patch, YARN-4241.patch.1 > > > Typo in description section of yarn-default.xml, under the properties: > yarn.nodemanager.disk-health-checker.min-healthy-disks > yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage > yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb > The reference to yarn-nodemanager.local-dirs should be > yarn.nodemanager.local-dirs -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4241) Typo in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994297#comment-14994297 ] Anthony Rojas commented on YARN-4241: - Correction, please disregard this comment.
[jira] [Created] (YARN-4336) YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
Greg Senia created YARN-4336: Summary: YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP Key: YARN-4336 URL: https://issues.apache.org/jira/browse/YARN-4336 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.7.1, 2.6.1, 2.7.0, 2.6.0, 2.4.1, 2.4.0 Environment: NSS w/ SSSD or Dell/Quest - VASD Reporter: Greg Senia

Hi folks, after some debugging with our Unix Engineering and Active Directory teams, it was discovered that on YARN container initialization this call in Hadoop Common's AccessControlList.java:

{code}
for (String group : ugi.getGroupNames()) {
  if (groups.contains(group)) {
    return true;
  }
}
{code}

Unfortunately, the security check for access on "appattempt_X_X_X" will always return false, but it makes unnecessary calls to the name service switch (NSS) on Linux, which invokes SSSD/Quest VASD, which in turn issues LDAP lookups for nonexistent user IDs, causing excessive load on LDAP.

For now our tactical workaround is as follows:

{code}
/**
 * Checks if a user represented by the provided {@link UserGroupInformation}
 * is a member of the Access Control List
 * @param ugi UserGroupInformation to check if contained in the ACL
 * @return true if ugi is member of the list
 */
public final boolean isUserInList(UserGroupInformation ugi) {
  if (allAllowed || users.contains(ugi.getShortUserName())) {
    return true;
  } else {
    String patternString = "^appattempt_\\d+_\\d+_\\d+$";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(ugi.getShortUserName());
    if (matcher.matches()) {
      LOG.debug("Bailing !! AppAttempt Matches DONOT call UGI FOR GROUPS!!");
      return false;
    }
    for (String group : ugi.getGroupNames()) {
      if (groups.contains(group)) {
        return true;
      }
    }
  }
  return false;
}

public boolean isUserAllowed(UserGroupInformation ugi) {
  return isUserInList(ugi);
}
{code}

Example of VASD debug log showing the lookups for one task attempt (32 of them). One task:

{code}
Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
Oct 30 22:56:45 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
Oct 30 22:56:45 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
Oct 30 22:56:45 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
Oct 30 22:56:45 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
Oct 30 22:57:18 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter
{code}
[jira] [Updated] (YARN-4336) YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
[ https://issues.apache.org/jira/browse/YARN-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Senia updated YARN-4336: - Attachment: YARN-4336-tactical.txt tactical fix
[jira] [Commented] (YARN-4336) YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
[ https://issues.apache.org/jira/browse/YARN-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993836#comment-14993836 ] Jason Lowe commented on YARN-4336: -- I believe this is a duplicate of YARN-3452. We fixed it by reverting HADOOP-10650 in our internal build since we don't need the blacklisting functionality added by that feature, and that's what caused the excess lookups. IMHO the real fix is to have YARN not use bogus user names, but I don't know if that's going to be an easy change to make.
[jira] [Commented] (YARN-4219) New levelDB cache storage for timeline v1.5
[ https://issues.apache.org/jira/browse/YARN-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993878#comment-14993878 ] Jason Lowe commented on YARN-4219: -- Shouldn't we commit some form of YARN-3942 before committing this? I got the impression this was an underlying storage to be used by the EntityFileTimelineStore, or is there a use-case for this outside of the v1.5 core code in YARN-3942? Patch looks pretty good except for one thing I noticed: The pom file should not have hardcoded versions in it. It should omit the versions and leave it up to hadoop-project/pom.xml to define that. Otherwise we risk having different portions of code needing different versions of the same dependency. > New levelDB cache storage for timeline v1.5 > --- > > Key: YARN-4219 > URL: https://issues.apache.org/jira/browse/YARN-4219 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.8.0 >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-4219-trunk.001.patch, YARN-4219-trunk.002.patch, > YARN-4219-trunk.003.patch > > > We need to have an "offline" caching storage for timeline server v1.5 after > the changes in YARN-3942. The in memory timeline storage may run into OOM > issues when used as a cache storage for entity file timeline storage. We can > refactor the code and have a level db based caching storage for this use > case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
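Jason's point about dependency versions follows the standard Maven pattern: the parent's dependencyManagement section pins the version once, and child modules declare the dependency without a version. A sketch of the shape (the leveldbjni coordinates are chosen for illustration since this patch adds a levelDB store; treat the exact artifact and version as assumptions, not the actual patch contents):

```xml
<!-- hadoop-project/pom.xml: version pinned once in dependencyManagement -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.fusesource.leveldbjni</groupId>
      <artifactId>leveldbjni-all</artifactId>
      <version>1.8</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<!-- module pom.xml: no <version> element; it is inherited from the parent -->
<dependency>
  <groupId>org.fusesource.leveldbjni</groupId>
  <artifactId>leveldbjni-all</artifactId>
</dependency>
```

This keeps every module on the same version of a shared dependency and makes upgrades a one-line change in the parent POM.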
[jira] [Commented] (YARN-4336) YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
[ https://issues.apache.org/jira/browse/YARN-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993899#comment-14993899 ] Greg Senia commented on YARN-4336: -- [~jlowe] I dumped stack traces below and it seems to match what was done in HADOOP-10650. Do you see an issue with my workaround for now in my own env until HWX can provide a final solution? Seems like this could also be related: https://issues.apache.org/jira/browse/HADOOP-12413

Stack trace:

{code}
2015-11-06 11:25:52,313 DEBUG ipc.Server (Server.java:processOneRpc(1762)) - got #-33
2015-11-06 11:25:52,313 DEBUG security.SaslRpcServer (SaslRpcServer.java:create(174)) - Created SASL server with mechanism = DIGEST-MD5
2015-11-06 11:25:52,314 DEBUG ipc.Server (Server.java:doSaslReply(1424)) - Sending sasl message state: NEGOTIATE auths { method: "TOKEN" mechanism: "DIGEST-MD5" protocol: "" serverId: "default" challenge: "realm=\"default\",nonce=\"389ZufpXfkC6CKunYceHayMBI3KM7v3keu9nPC/b\",qop=\"auth\",charset=utf-8,algorithm=md5-sess" } auths { method: "KERBEROS" mechanism: "GSSAPI" protocol: "nm" serverId: "xhadoopm5d.example.com" }
2015-11-06 11:25:52,314 DEBUG ipc.Server (Server.java:processResponse(972)) - Socket Reader #1 for port 8040: responding to null from 157.121.72.167:64599 Call#-33 Retry#-1
2015-11-06 11:25:52,314 DEBUG ipc.Server (Server.java:processResponse(991)) - Socket Reader #1 for port 8040: responding to null from 157.121.72.167:64599 Call#-33 Retry#-1 Wrote 212 bytes.
2015-11-06 11:25:52,343 DEBUG ipc.Server (Server.java:processOneRpc(1762)) - got #-33
2015-11-06 11:25:52,343 DEBUG ipc.Server (Server.java:processSaslToken(1393)) - Have read input token of size 246 for processing by saslServer.evaluateResponse()
2015-11-06 11:25:52,344 DEBUG security.SaslRpcServer (SaslRpcServer.java:handle(308)) - SASL server DIGEST-MD5 callback: setting password for client: testing (auth:SIMPLE)
2015-11-06 11:25:52,344 DEBUG security.SaslRpcServer (SaslRpcServer.java:handle(325)) - SASL server DIGEST-MD5 callback: setting canonicalized client ID: testing
2015-11-06 11:25:52,345 DEBUG ipc.Server (Server.java:buildSaslResponse(1410)) - Will send SUCCESS token of size 40 from saslServer.
2015-11-06 11:25:52,345 DEBUG ipc.Server (Server.java:saslProcess(1298)) - SASL server context established. Negotiated QoP is auth
2015-11-06 11:25:52,345 DEBUG ipc.Server (Server.java:saslProcess(1303)) - SASL server successfully authenticated client: testing (auth:SIMPLE)
2015-11-06 11:25:52,345 INFO ipc.Server (Server.java:saslProcess(1306)) - Auth successful for testing (auth:SIMPLE)
2015-11-06 11:25:52,345 DEBUG ipc.Server (Server.java:doSaslReply(1424)) - Sending sasl message state: SUCCESS token: "rspauth=9bfdf3e61c489664e885d7043b352c24"
2015-11-06 11:25:52,345 DEBUG ipc.Server (Server.java:processResponse(972)) - Socket Reader #1 for port 8040: responding to null from 157.121.72.167:64599 Call#-33 Retry#-1
2015-11-06 11:25:52,346 DEBUG ipc.Server (Server.java:processResponse(991)) - Socket Reader #1 for port 8040: responding to null from 157.121.72.167:64599 Call#-33 Retry#-1 Wrote 64 bytes.
2015-11-06 11:25:52,357 DEBUG ipc.Server (Server.java:processOneRpc(1762)) - got #-3
2015-11-06 11:25:52,357 DEBUG security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1488)) - java.lang.Thread.getStackTrace(Thread.java:1589)
2015-11-06 11:25:52,357 DEBUG security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1488)) - org.apache.hadoop.security.UserGroupInformation.getGroupNames(UserGroupInformation.java:1487)
2015-11-06 11:25:52,357 DEBUG security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1488)) - org.apache.hadoop.security.authorize.AccessControlList.isUserInList(AccessControlList.java:252)
2015-11-06 11:25:52,357 DEBUG security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1488)) - org.apache.hadoop.security.authorize.AccessControlList.isUserAllowed(AccessControlList.java:262)
2015-11-06 11:25:52,357 DEBUG security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1488)) - org.apache.hadoop.security.authorize.ServiceAuthorizationManager.authorize(ServiceAuthorizationManager.java:110)
2015-11-06 11:25:52,357 DEBUG security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1488)) - org.apache.hadoop.ipc.Server.authorize(Server.java:2507)
2015-11-06 11:25:52,358 DEBUG security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1488)) - org.apache.hadoop.ipc.Server.access$3300(Server.java:135)
2015-11-06 11:25:52,358 DEBUG security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1488)) - org.apache.hadoop.ipc.Server$Connection.authorizeConnection(Server.java:1923)
2015-11-06 11:25:52,358 DEBUG security.UserGroupInformation (UserGroupInformation.java:getGroupNames(1488)) -
{code}
[jira] [Commented] (YARN-3452) Bogus token usernames cause many invalid group lookups
[ https://issues.apache.org/jira/browse/YARN-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993927#comment-14993927 ] Allen Wittenauer commented on YARN-3452: I wonder what happens when they *do* resolve... > Bogus token usernames cause many invalid group lookups > -- > > Key: YARN-3452 > URL: https://issues.apache.org/jira/browse/YARN-3452 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Reporter: Jason Lowe > > YARN uses a number of bogus usernames for tokens, like application attempt > IDs for NM tokens or even the hardcoded "testing" for the container localizer > token. These tokens cause the RPC layer to do group lookups on these bogus > usernames which will never succeed but can take a long time to perform. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4336) YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
[ https://issues.apache.org/jira/browse/YARN-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993932#comment-14993932 ] Jason Lowe commented on YARN-4336: -- bq. Seems like this could also be related... https://issues.apache.org/jira/browse/HADOOP-12413
Nice find! I totally missed that when it went by. I'll pull that fix into the 2.6 and 2.7 lines. I think that could eliminate the bogus lookups in practice when the reverse ACL isn't being used.
bq. Do you see an issue with my workaround for now in my own env until HWX can provide a final solution?
It will work. Nit: it's pricey to compile the pattern every time; you could just compile it once. Or, as I mentioned above, I think pulling HADOOP-12413 into your build could also eliminate the bogus lookups (assuming you don't use the reverse ACL feature).
EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with >
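Jason's nit about compiling the pattern once can be sketched as follows. This is an illustrative helper, not code from the attached patch; the class and method names are made up:

```java
import java.util.regex.Pattern;

/**
 * Illustrative sketch (not from the attached patch): hoisting the regex
 * into a static final field compiles it once per class load instead of
 * on every isUserInList() call. Pattern instances are immutable and
 * thread-safe, so sharing one across threads is safe.
 */
public final class AppAttemptNames {

  private static final Pattern APP_ATTEMPT =
      Pattern.compile("^appattempt_\\d+_\\d+_\\d+$");

  private AppAttemptNames() {
  }

  /** True if the short user name looks like an application attempt ID. */
  public static boolean isAppAttempt(String shortUserName) {
    return APP_ATTEMPT.matcher(shortUserName).matches();
  }
}
```

With a helper like this, the early bail-out in the workaround reduces to a single check such as {{if (AppAttemptNames.isAppAttempt(ugi.getShortUserName())) return false;}} before the group-lookup loop.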
[jira] [Commented] (YARN-3452) Bogus token usernames cause many invalid group lookups
[ https://issues.apache.org/jira/browse/YARN-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993941#comment-14993941 ] Jason Lowe commented on YARN-3452: -- Yeah, probably nothing good. [~gss2002] pointed out HADOOP-12413 which I think will also remove the bogus lookups in practice when users aren't using the reverse ACL feature that was added in HADOOP-10650. I'll pull that into 2.6 and 2.7, since I think most users won't be using that new feature. We'll still need to stop using the bogus usernames for those that are using that reverse-ACL feature or if someone else tries to do something with the ugi assuming it actually was a valid user. > Bogus token usernames cause many invalid group lookups > -- > > Key: YARN-3452 > URL: https://issues.apache.org/jira/browse/YARN-3452 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Reporter: Jason Lowe > > YARN uses a number of bogus usernames for tokens, like application attempt > IDs for NM tokens or even the hardcoded "testing" for the container localizer > token. These tokens cause the RPC layer to do group lookups on these bogus > usernames which will never succeed but can take a long time to perform. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4326) Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to default port 8188
[ https://issues.apache.org/jira/browse/YARN-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sangjin Lee updated YARN-4326: -- Fix Version/s: 2.7.3 2.6.3 Pulled the fix to branch-2.7 and branch-2.6. > Fix TestDistributedShell timeout as AHS in MiniYarnCluster no longer binds to > default port 8188 > --- > > Key: YARN-4326 > URL: https://issues.apache.org/jira/browse/YARN-4326 > Project: Hadoop YARN > Issue Type: Bug >Reporter: MENG DING >Assignee: MENG DING > Fix For: 2.8.0, 2.6.3, 2.7.3 > > Attachments: YARN-4326.patch > > > The timeout originates in ApplicationMaster, where it fails to connect to > timeline server, and retry exceeds limits: > {code} > 2015-11-02 21:57:38,066 INFO [main] impl.TimelineClientImpl > (TimelineClientImpl.java:serviceInit(299)) - Timeline service address: > http://mdinglin02:0/ws/v1/timeline/ > 2015-11-02 21:57:38,099 INFO [main] impl.TimelineClientImpl > (TimelineClientImpl.java:logException(213)) - Exception caught by > TimelineClientConnectionRetry, will try 30 more time(s). > ... > ... > java.lang.RuntimeException: Failed to connect to timeline server. Connection > retries limit exceeded. 
The posted timeline event may be missing > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:206) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:245) > at com.sun.jersey.api.client.Client.handle(Client.java:648) > at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670) > at > com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74) > at > com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:477) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:326) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$1.run(TimelineClientImpl.java:323) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:323) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:308) > at > org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.publishApplicationAttemptEvent(ApplicationMaster.java:1184) > at > org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.run(ApplicationMaster.java:571) > at > org.apache.hadoop.yarn.applications.distributedshell.ApplicationMaster.main(ApplicationMaster.java:302) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
[ https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14993711#comment-14993711 ] Steve Loughran commented on YARN-4330: -- +1 for downgrading the stack trace to DEBUG level; anything at INFO/WARN should include the calculator plugin conf value in case that is the problem. And another +1 for having a way to turn this off for minicluster tests. Having a dummy plugin would be more generally useful, and would avoid yet another config option. > MiniYARNCluster prints multiple Failed to instantiate default resource > calculator warning messages > --- > > Key: YARN-4330 > URL: https://issues.apache.org/jira/browse/YARN-4330 > Project: Hadoop YARN > Issue Type: Bug > Components: test, yarn >Affects Versions: 2.8.0 > Environment: OSX, JUnit >Reporter: Steve Loughran >Assignee: Varun Saxena >Priority: Blocker > > Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I > see multiple stack traces warning me that a resource calculator plugin could > not be created > {code} > (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - > java.lang.UnsupportedOperationException: Could not determine OS: Failed to > instantiate default resource calculator. > java.lang.UnsupportedOperationException: Could not determine OS > {code} > This is a minicluster. It doesn't need resource calculation. It certainly > doesn't need test logs being cluttered with even more stack traces which will > only generate false alarms about tests failing. > There needs to be a way to turn this off, and the minicluster should have it > that way by default. > Being ruthless and marking as a blocker, because its a fairly major > regression for anyone testing with the minicluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
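The dummy-plugin idea is essentially a null object: a calculator that reports "unavailable" instead of throwing on unsupported platforms. A minimal sketch, assuming a simplified hypothetical interface (Hadoop's real ResourceCalculatorPlugin has a larger surface; the names below are illustrative only):

```java
/**
 * Null-object sketch of the dummy resource calculator idea.
 * This interface is a simplified stand-in, not Hadoop's actual
 * ResourceCalculatorPlugin API.
 */
interface ResourceCalculator {
  long getPhysicalMemorySize();
  int getNumProcessors();
}

/** Reports -1 ("unknown") everywhere instead of throwing. */
final class NoOpResourceCalculator implements ResourceCalculator {
  @Override
  public long getPhysicalMemorySize() {
    // Sentinel meaning "unavailable"; callers must treat -1 as unknown.
    return -1L;
  }

  @Override
  public int getNumProcessors() {
    return -1;
  }
}
```

A minicluster could then be wired to the no-op implementation by default, so no stack trace is logged per container and no extra config option is needed.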
[jira] [Commented] (YARN-3452) Bogus token usernames cause many invalid group lookups
[ https://issues.apache.org/jira/browse/YARN-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994017#comment-14994017 ] Greg Senia commented on YARN-3452: -- [~jlowe] just confirmed that HADOOP-12413 does seem to fix it. I just ran a test with the else-if block. As a safety measure I'll keep my tactical fix for the time being, but I'll move the regex so it compiles only once, because if the LDAP storm from our jobs shows up again I'm going to have a bad day :) Confirmed that appattempt_1446307555640_0052_01 was never sent out to NSS/LDAP. > Bogus token usernames cause many invalid group lookups > -- > > Key: YARN-3452 > URL: https://issues.apache.org/jira/browse/YARN-3452 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Reporter: Jason Lowe > > YARN uses a number of bogus usernames for tokens, like application attempt > IDs for NM tokens or even the hardcoded "testing" for the container localizer > token. These tokens cause the RPC layer to do group lookups on these bogus > usernames which will never succeed but can take a long time to perform. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4219) New levelDB cache storage for timeline v1.5
[ https://issues.apache.org/jira/browse/YARN-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994141#comment-14994141 ] Li Lu commented on YARN-4219: - Yes, this is a caching storage for ATS v1.5 design. We can prioritize any "up-level" storage that uses this caching storage. I'll fix the maven problem soon. > New levelDB cache storage for timeline v1.5 > --- > > Key: YARN-4219 > URL: https://issues.apache.org/jira/browse/YARN-4219 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 2.8.0 >Reporter: Li Lu >Assignee: Li Lu > Attachments: YARN-4219-trunk.001.patch, YARN-4219-trunk.002.patch, > YARN-4219-trunk.003.patch > > > We need to have an "offline" caching storage for timeline server v1.5 after > the changes in YARN-3942. The in memory timeline storage may run into OOM > issues when used as a cache storage for entity file timeline storage. We can > refactor the code and have a level db based caching storage for this use > case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
[ https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-4330: --- Attachment: YARN-4330.01.patch > MiniYARNCluster prints multiple Failed to instantiate default resource > calculator warning messages > --- > > Key: YARN-4330 > URL: https://issues.apache.org/jira/browse/YARN-4330 > Project: Hadoop YARN > Issue Type: Bug > Components: test, yarn >Affects Versions: 2.8.0 > Environment: OSX, JUnit >Reporter: Steve Loughran >Assignee: Varun Saxena >Priority: Blocker > Attachments: YARN-4330.01.patch > > > Whenever I try to start a MiniYARNCluster on Branch-2 (commit #0b61cca), I > see multiple stack traces warning me that a resource calculator plugin could > not be created > {code} > (ResourceCalculatorPlugin.java:getResourceCalculatorPlugin(184)) - > java.lang.UnsupportedOperationException: Could not determine OS: Failed to > instantiate default resource calculator. > java.lang.UnsupportedOperationException: Could not determine OS > {code} > This is a minicluster. It doesn't need resource calculation. It certainly > doesn't need test logs being cluttered with even more stack traces which will > only generate false alarms about tests failing. > There needs to be a way to turn this off, and the minicluster should have it > that way by default. > Being ruthless and marking as a blocker, because its a fairly major > regression for anyone testing with the minicluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-4337) Resolve all docs errors in *.apt.vm for YARN
[ https://issues.apache.org/jira/browse/YARN-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994137#comment-14994137 ] Daniel Templeton commented on YARN-4337: I'm looking at the WebServicesIntro.md file, and I don't see that error: {code} JSON response with single resource HTTP Request: GET http://rmhost.domain:8088/ws/v1/cluster/apps/application\_1324057493980\_0001 Response Status Line: HTTP/1.1 200 OK Response Header: {code} Are you looking at the current version of the docs? > Resolve all docs errors in *.apt.vm for YARN > > > Key: YARN-4337 > URL: https://issues.apache.org/jira/browse/YARN-4337 > Project: Hadoop YARN > Issue Type: Bug > Components: webapp >Affects Versions: 2.7.1 >Reporter: Chen He >Priority: Minor > Labels: documentation, newbie > > This is a newbie++ docs ticket. > Simple example, In WebServiceInfo.apt.vm > *** JSON response with single resource > HTTP Request: > GET > http://rmhost.domain:8088/ws/v1/cluster/app/application_1324057493980_0001 > Response Status Line: > HTTP/1.1 200 OK > Response Header: > +---+ > The URI > http://rmhost.domain:8088/ws/v1/cluster/app/application_1324057493980_0001 is > invalid. It should be "apps" instead of "app" in the URI. It may mislead > first time users to think that YARN REST API does not work. Similarly, we > should remove all similar typos or minor errors in *.apt.vm file for YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994140#comment-14994140 ] Xuan Gong commented on YARN-2556: - Thanks for the work. [~sjlee0] and [~lichangleo] Could you give us some instruction on how to run this performance tool ? Maybe add a document on related ats md and at least give us an example command to run this tool. > Tool to measure the performance of the timeline server > -- > > Key: YARN-2556 > URL: https://issues.apache.org/jira/browse/YARN-2556 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Jonathan Eagles >Assignee: Chang Li > Labels: BB2015-05-TBR > Fix For: 2.8.0 > > Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, > YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, > YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, > YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, > YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, > YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, > YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch > > > We need to be able to understand the capacity model for the timeline server > to give users the tools they need to deploy a timeline server with the > correct capacity. > I propose we create a mapreduce job that can measure timeline server write > and read performance. Transactions per second, I/O for both read and write > would be a good start. > This could be done as an example or test job that could be tied into gridmix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-4337) Resolve all docs errors in *.apt.vm for YARN
Chen He created YARN-4337: - Summary: Resolve all docs errors in *.apt.vm for YARN Key: YARN-4337 URL: https://issues.apache.org/jira/browse/YARN-4337 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.7.1 Reporter: Chen He Priority: Minor This is a newbie++ docs ticket. Simple example, In WebServiceInfo.apt.vm *** JSON response with single resource HTTP Request: GET http://rmhost.domain:8088/ws/v1/cluster/app/application_1324057493980_0001 Response Status Line: HTTP/1.1 200 OK Response Header: +---+ The URI http://rmhost.domain:8088/ws/v1/cluster/app/application_1324057493980_0001 is invalid. It should be "apps" instead of "app" in the URI. It may mislead first time users to think that YARN REST API does not work. Similarly, we should remove all similar typos or minor errors in *.apt.vm file for YARN. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4336) YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
[ https://issues.apache.org/jira/browse/YARN-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Senia updated YARN-4336: - Attachment: tactical_defense.patch > YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP > > > Key: YARN-4336 > URL: https://issues.apache.org/jira/browse/YARN-4336 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0, 2.4.1, 2.6.0, 2.7.0, 2.6.1, 2.7.1 > Environment: NSS w/ SSSD or Dell/Quest - VASD >Reporter: Greg Senia >Assignee: Greg Senia > Attachments: YARN-4336-tactical.txt, tactical_defense.patch > > > Hi folks after performing some debug for our Unix Engineering and Active > Directory teams it was discovered that on YARN Container Initialization a > call via Hadoop Common AccessControlList.java: > for(String group: ugi.getGroupNames()) { > if (groups.contains(group)) { > return true; > } > } > Unfortunately with the security call to check access on > "appattempt_X_X_X" will always return false but will make > unnecessary calls to NameSwitch service on linux which will call things like > SSSD/Quest VASD which will then initiate LDAP calls looking for non existent > userid's causing excessive load on LDAP. > For now our tactical work around is as follows: > /** >* Checks if a user represented by the provided {@link UserGroupInformation} >* is a member of the Access Control List >* @param ugi UserGroupInformation to check if contained in the ACL >* @return true if ugi is member of the list >*/ > public final boolean isUserInList(UserGroupInformation ugi) { > if (allAllowed || users.contains(ugi.getShortUserName())) { > return true; > } else { > String patternString = "^appattempt_\\d+_\\d+_\\d+$"; > Pattern pattern = Pattern.compile(patternString); > Matcher matcher = pattern.matcher(ugi.getShortUserName()); > boolean matches = matcher.matches(); > if (matches) { > LOG.debug("Bailing !! 
AppAttempt Matches DONOT call UGI FOR > GROUPS!!");; > return false; > } > > > for(String group: ugi.getGroupNames()) { > if (groups.contains(group)) { > return true; > } > } > } > return false; > } > public boolean isUserAllowed(UserGroupInformation ugi) { > return isUserInList(ugi); > } > Example of VASD Debug log showing the lookups for one task attempt 32 of them: > One task: > Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > 
base=<>, scope= > Oct 30 22:56:45 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:45 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:45 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with >
[jira] [Commented] (YARN-4330) MiniYARNCluster prints multiple Failed to instantiate default resource calculator warning messages
[ https://issues.apache.org/jira/browse/YARN-4330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994386#comment-14994386 ] Hadoop QA commented on YARN-4330: -

(x) *-1 overall*

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 5s | docker + precommit patch detected. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| +1 | mvninstall | 3m 19s | trunk passed |
| +1 | compile | 0m 49s | trunk passed with JDK v1.8.0_60 |
| +1 | compile | 0m 46s | trunk passed with JDK v1.7.0_79 |
| +1 | checkstyle | 0m 25s | trunk passed |
| +1 | mvneclipse | 0m 39s | trunk passed |
| -1 | findbugs | 1m 17s | hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common in trunk has 3 extant Findbugs warnings. |
| +1 | javadoc | 0m 58s | trunk passed with JDK v1.8.0_60 |
| +1 | javadoc | 1m 10s | trunk passed with JDK v1.7.0_79 |
| +1 | mvninstall | 1m 7s | the patch passed |
| +1 | compile | 0m 46s | the patch passed with JDK v1.8.0_60 |
| +1 | javac | 0m 46s | the patch passed |
| +1 | compile | 0m 47s | the patch passed with JDK v1.7.0_79 |
| +1 | javac | 0m 47s | the patch passed |
| -1 | checkstyle | 0m 23s | Patch generated 1 new checkstyle issues in hadoop-yarn-project/hadoop-yarn (total was 29, now 29). |
| +1 | mvneclipse | 0m 39s | the patch passed |
| +1 | whitespace | 0m 0s | Patch has no whitespace issues. |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | findbugs | 3m 0s | the patch passed |
| +1 | javadoc | 0m 58s | the patch passed with JDK v1.8.0_60 |
| +1 | javadoc | 1m 9s | the patch passed with JDK v1.7.0_79 |
| +1 | unit | 1m 46s | hadoop-yarn-common in the patch passed with JDK v1.8.0_60. |
| +1 | unit | 8m 23s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.8.0_60. |
| -1 | unit | 5m 8s | hadoop-yarn-server-tests in the patch failed with JDK v1.8.0_60. |
| +1 | unit | 2m 2s | hadoop-yarn-common in the patch passed with JDK v1.7.0_79. |
| +1 | unit | 8m 51s | hadoop-yarn-server-nodemanager in the patch passed with JDK v1.7.0_79. |
| -1 | unit | 5m 10s | hadoop-yarn-server-tests in the patch failed with JDK v1.7.0_79. |
| +1 | asflicense | 0m 23s | Patch does not generate ASF License warnings. |
[jira] [Updated] (YARN-3452) Bogus token usernames cause many invalid group lookups
[ https://issues.apache.org/jira/browse/YARN-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Senia updated YARN-3452: - Attachment: tactical_defense.patch > Bogus token usernames cause many invalid group lookups > -- > > Key: YARN-3452 > URL: https://issues.apache.org/jira/browse/YARN-3452 > Project: Hadoop YARN > Issue Type: Bug > Components: security >Reporter: Jason Lowe > Attachments: tactical_defense.patch > > > YARN uses a number of bogus usernames for tokens, like application attempt > IDs for NM tokens or even the hardcoded "testing" for the container localizer > token. These tokens cause the RPC layer to do group lookups on these bogus > usernames which will never succeed but can take a long time to perform. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4336) YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
[ https://issues.apache.org/jira/browse/YARN-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Senia updated YARN-4336: - Attachment: (was: YARN-4336-tactical.txt) > YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP > > > Key: YARN-4336 > URL: https://issues.apache.org/jira/browse/YARN-4336 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0, 2.4.1, 2.6.0, 2.7.0, 2.6.1, 2.7.1 > Environment: NSS w/ SSSD or Dell/Quest - VASD >Reporter: Greg Senia >Assignee: Greg Senia > Attachments: tactical_defense.patch > > > Hi folks after performing some debug for our Unix Engineering and Active > Directory teams it was discovered that on YARN Container Initialization a > call via Hadoop Common AccessControlList.java: > for(String group: ugi.getGroupNames()) { > if (groups.contains(group)) { > return true; > } > } > Unfortunately with the security call to check access on > "appattempt_X_X_X" will always return false but will make > unnecessary calls to NameSwitch service on linux which will call things like > SSSD/Quest VASD which will then initiate LDAP calls looking for non existent > userid's causing excessive load on LDAP. > For now our tactical work around is as follows: > /** >* Checks if a user represented by the provided {@link UserGroupInformation} >* is a member of the Access Control List >* @param ugi UserGroupInformation to check if contained in the ACL >* @return true if ugi is member of the list >*/ > public final boolean isUserInList(UserGroupInformation ugi) { > if (allAllowed || users.contains(ugi.getShortUserName())) { > return true; > } else { > String patternString = "^appattempt_\\d+_\\d+_\\d+$"; > Pattern pattern = Pattern.compile(patternString); > Matcher matcher = pattern.matcher(ugi.getShortUserName()); > boolean matches = matcher.matches(); > if (matches) { > LOG.debug("Bailing !! 
AppAttempt Matches DONOT call UGI FOR > GROUPS!!");; > return false; > } > > > for(String group: ugi.getGroupNames()) { > if (groups.contains(group)) { > return true; > } > } > } > return false; > } > public boolean isUserAllowed(UserGroupInformation ugi) { > return isUserInList(ugi); > } > Example of VASD Debug log showing the lookups for one task attempt 32 of them: > One task: > Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > base=<>, scope= > Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with > filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, > 
base=<>, scope= > Oct 30 22:56:45 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:45 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching > GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter > (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01)) > Oct 30 22:56:45 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching > with >
[jira] [Updated] (YARN-4241) Typo in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-4241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Rojas updated YARN-4241:
    Attachment: YARN-4241.002.patch

> Typo in yarn-default.xml
>
>                 Key: YARN-4241
>                 URL: https://issues.apache.org/jira/browse/YARN-4241
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: documentation, yarn
>            Reporter: Anthony Rojas
>            Assignee: Anthony Rojas
>            Priority: Trivial
>              Labels: newbie
>         Attachments: YARN-4241.002.patch, YARN-4241.patch, YARN-4241.patch.1
>
> Typo in the description section of yarn-default.xml, under the properties:
> yarn.nodemanager.disk-health-checker.min-healthy-disks
> yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
> yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb
> The reference to yarn-nodemanager.local-dirs should be yarn.nodemanager.local-dirs

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
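For context, a hypothetical, paraphrased sketch of how the corrected cross-reference might read; the exact description wording lives in yarn-default.xml and the attached patches, not here:

```xml
<!-- Hypothetical, paraphrased sketch; not the actual yarn-default.xml text. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
  <description>
    The minimum fraction of disks that must be healthy for the NodeManager
    to launch new containers. This applies to the directories configured via
    yarn.nodemanager.local-dirs (the corrected spelling; the typo read
    "yarn-nodemanager.local-dirs").
  </description>
</property>
```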
[jira] [Updated] (YARN-4336) YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
[ https://issues.apache.org/jira/browse/YARN-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Senia updated YARN-4336:
    Affects Version/s:     (was: 2.4.1)
                           (was: 2.4.0)

> YARN NodeManager - Container Initialization - Excessive load on NSS/LDAP
>
>                 Key: YARN-4336
>                 URL: https://issues.apache.org/jira/browse/YARN-4336
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 2.7.0, 2.6.1, 2.7.1
>         Environment: NSS w/ SSSD or Dell/Quest - VASD
>            Reporter: Greg Senia
>            Assignee: Greg Senia
>         Attachments: tactical_defense.patch
>
> Hi folks, after some debugging for our Unix Engineering and Active Directory teams, it was discovered that on YARN container initialization a call is made via Hadoop Common's AccessControlList.java:
>
>     for (String group : ugi.getGroupNames()) {
>       if (groups.contains(group)) {
>         return true;
>       }
>     }
>
> Unfortunately, the security call checking access for "appattempt_X_X_X" will always return false, but it makes unnecessary calls to the NameSwitch service on Linux, which invokes things like SSSD/Quest VASD, which in turn issue LDAP lookups for non-existent userids, causing excessive load on LDAP.
> For now, our tactical workaround is as follows:
>
>     /**
>      * Checks if a user represented by the provided {@link UserGroupInformation}
>      * is a member of the Access Control List
>      * @param ugi UserGroupInformation to check if contained in the ACL
>      * @return true if ugi is member of the list
>      */
>     public final boolean isUserInList(UserGroupInformation ugi) {
>       if (allAllowed || users.contains(ugi.getShortUserName())) {
>         return true;
>       } else {
>         String patternString = "^appattempt_\\d+_\\d+_\\d+$";
>         Pattern pattern = Pattern.compile(patternString);
>         Matcher matcher = pattern.matcher(ugi.getShortUserName());
>         if (matcher.matches()) {
>           LOG.debug("Bailing!! AppAttempt matches; do NOT call UGI for groups!");
>           return false;
>         }
>         for (String group : ugi.getGroupNames()) {
>           if (groups.contains(group)) {
>             return true;
>           }
>         }
>       }
>       return false;
>     }
>
>     public boolean isUserAllowed(UserGroupInformation ugi) {
>       return isUserInList(ugi);
>     }
>
> Example of a VASD debug log showing the lookups for one task attempt (32 of them in total). One task:
> Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
> Oct 30 22:55:43 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
> Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
> Oct 30 22:55:43 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
> Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
> Oct 30 22:56:15 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
> Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
> Oct 30 22:56:15 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with filter=<(&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))>, base=<>, scope=
> Oct 30 22:56:45 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
> Oct 30 22:56:45 xhadoopm5d vasd[20741]: _vasug_user_namesearch_gc: searching GC for host service domain EXNSD.EXA.EXAMPLE.COM with filter (&(objectCategory=Person)(samaccountname=appattempt_1446145939879_0022_01))
> Oct 30 22:56:45 xhadoopm5d vasd[20741]: libvas_attrs_find_uri: Searching with
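A minimal, self-contained sketch of the idea behind the workaround quoted above. The class and method names here are hypothetical (the real change lives in Hadoop Common's AccessControlList), and a Supplier stands in for ugi.getGroupNames() so the sketch can demonstrate that the expensive NSS/SSSD/VASD/LDAP lookup is never triggered for application attempt IDs:

```java
import java.util.Set;
import java.util.function.Supplier;
import java.util.regex.Pattern;

// Hypothetical, standalone model of the tactical workaround: "users" and
// "groups" play the role of the ACL contents, and groupLookup stands in for
// ugi.getGroupNames(), the call that cascades into NSS -> SSSD/VASD -> LDAP.
public class AppAttemptAclCheck {

    // ApplicationAttemptId string form: appattempt_<clusterTimestamp>_<appId>_<attemptNumber>
    private static final Pattern APP_ATTEMPT =
        Pattern.compile("^appattempt_\\d+_\\d+_\\d+$");

    public static boolean isUserInList(String shortUserName, Set<String> users,
                                       Set<String> groups,
                                       Supplier<String[]> groupLookup) {
        if (users.contains(shortUserName)) {
            return true;
        }
        if (APP_ATTEMPT.matcher(shortUserName).matches()) {
            // An application attempt ID can never be a real principal, so bail
            // out before the group resolution that would hit LDAP.
            return false;
        }
        for (String group : groupLookup.get()) {
            if (groups.contains(group)) {
                return true;
            }
        }
        return false;
    }
}
```

The key point is that the group-resolution call is only reached for names that cannot be application attempt IDs, so no LDAP query is ever issued for them.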
[jira] [Comment Edited] (YARN-2556) Tool to measure the performance of the timeline server
[ https://issues.apache.org/jira/browse/YARN-2556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994140#comment-14994140 ] Xuan Gong edited comment on YARN-2556 at 11/6/15 9:57 PM:

Thanks for the work, [~sjlee0] and [~lichangleo]. Could you give us some instructions on how to run this performance tool? Maybe add a document to the related ATS docs, and at least give us an example command to run this tool.

was (Author: xgong):
Thanks for the work. [~sjlee0] and [~lichangleo] Could you give us some instruction on how to run this performance tool ? Maybe add a document on related ats md and at least give us an example command to run this tool.

> Tool to measure the performance of the timeline server
>
>                 Key: YARN-2556
>                 URL: https://issues.apache.org/jira/browse/YARN-2556
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Jonathan Eagles
>            Assignee: Chang Li
>              Labels: BB2015-05-TBR
>             Fix For: 2.8.0
>
>         Attachments: YARN-2556-WIP.patch, YARN-2556-WIP.patch, YARN-2556.1.patch, YARN-2556.10.patch, YARN-2556.11.patch, YARN-2556.12.patch, YARN-2556.13.patch, YARN-2556.13.whitespacefix.patch, YARN-2556.14.patch, YARN-2556.14.whitespacefix.patch, YARN-2556.15.patch, YARN-2556.2.patch, YARN-2556.3.patch, YARN-2556.4.patch, YARN-2556.5.patch, YARN-2556.6.patch, YARN-2556.7.patch, YARN-2556.8.patch, YARN-2556.9.patch, YARN-2556.patch, yarn2556.patch, yarn2556.patch, yarn2556_wip.patch
>
> We need to be able to understand the capacity model for the timeline server to give users the tools they need to deploy a timeline server with the correct capacity.
> I propose we create a mapreduce job that can measure timeline server write and read performance. Transactions per second and I/O for both read and write would be a good start.
> This could be done as an example or test job that could be tied into gridmix.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)