[jira] [Commented] (YARN-2879) Compatibility validation between YARN 2.2/2.4 and 2.6
[ https://issues.apache.org/jira/browse/YARN-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218405#comment-14218405 ] Zhijie Shen commented on YARN-2879: --- In the following scenarios: 1. Either insecure or secure; 2. MR 2.2 with new shuffle on NM; 3. Submitting via old client. We will see the following console exception: {code} Console Log: 14/11/17 14:56:19 INFO mapreduce.Job: Job job_1416264695865_0003 completed successfully java.lang.IllegalArgumentException: No enum constant org.apache.hadoop.mapreduce.JobCounter.MB_MILLIS_REDUCES at java.lang.Enum.valueOf(Enum.java:236) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.valueOf(FrameworkCounterGroup.java:148) at org.apache.hadoop.mapreduce.counters.FrameworkCounterGroup.findCounter(FrameworkCounterGroup.java:182) at org.apache.hadoop.mapreduce.counters.AbstractCounters.findCounter(AbstractCounters.java:154) at org.apache.hadoop.mapreduce.TypeConverter.fromYarn(TypeConverter.java:240) at org.apache.hadoop.mapred.ClientServiceDelegate.getJobCounters(ClientServiceDelegate.java:370) at org.apache.hadoop.mapred.YARNRunner.getJobCounters(YARNRunner.java:511) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:756) at org.apache.hadoop.mapreduce.Job$7.run(Job.java:753) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:753) at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1361) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) {code} The problem was supposed to be fixed by MAPREDUCE-5831; however, it seems that we haven't covered all the problematic code paths. Will file another Jira for it. > Compatibility validation between YARN 2.2/2.4 and 2.6 > - > > Key: YARN-2879 > URL: https://issues.apache.org/jira/browse/YARN-2879 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Zhijie Shen > Assignee: Zhijie Shen > > Recently, I did some simple backward compatibility experiments. Basically, > I've taken the following 2 steps: > 1. Deploy the application (MR and DistributedShell) that is compiled against > *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is > submitted via *new* Hadoop (2.6) client. > 2. Deploy the application (MR and DistributedShell) that is compiled against > *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is > submitted via *old* Hadoop (2.2/2.4) client that comes with the app. > I've tried these 2 steps on both insecure and secure clusters. 
Here's a short > summary: > || || || DS 2.2 || DS 2.4 || MR 2.2 + Shuffle and RT 2.2 || MR 2.2 + Shuffle > and RT 2.6 || MR 2.4 + Shuffle and RT 2.4 || MR 2.4 + Shuffle and RT 2.6 || > | Insecure | New Client | OK | OK | Client Incompatible | Client Incompatible > | OK | OK | > | Insecure | Old Client | OK | OK | AM Incompatible | Client Incompatible | > OK | OK | > | Secure | New Client | OK | OK | Client Incompatible | Client Incompatible | > OK | OK | > | Secure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK > | OK | > Note that I've tried to run NM with both old and new versions of the shuffle > handler plus the runtime libs. > In general, the compatibility looks good overall. There are a few issues > related to MR, but they do not seem to be YARN issues. I'll post the > individual problems in follow-up comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
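The IllegalArgumentException in the console log above is the classic cross-version enum problem: Enum.valueOf throws for any constant name the loaded enum class does not contain, and MB_MILLIS_REDUCES did not exist in the 2.2 JobCounter. A minimal, self-contained sketch of a lookup that tolerates counters from a newer release — note OldJobCounter is a hypothetical stand-in for org.apache.hadoop.mapreduce.JobCounter as of 2.2, and this is an illustration of the failure mode, not the actual MAPREDUCE-5831 patch:

```java
// Hypothetical stand-in for the 2.2-era JobCounter enum, i.e. before
// MB_MILLIS_REDUCES was added in a later release.
enum OldJobCounter { NUM_KILLED_MAPS, MILLIS_MAPS, MILLIS_REDUCES }

public class SafeCounterLookup {
    // Enum.valueOf throws IllegalArgumentException for unknown names; a
    // cross-version-safe lookup catches it and returns null instead, so an
    // old client can skip counters it does not know about.
    public static OldJobCounter findOrNull(String name) {
        try {
            return OldJobCounter.valueOf(name);
        } catch (IllegalArgumentException e) {
            return null; // counter from a newer release: ignore it
        }
    }

    public static void main(String[] args) {
        System.out.println(findOrNull("MILLIS_MAPS"));       // known in 2.2: MILLIS_MAPS
        System.out.println(findOrNull("MB_MILLIS_REDUCES")); // added later: null
    }
}
```

Any fix along these lines has to cover every call path that converts counters (FrameworkCounterGroup.valueOf, findCounter, TypeConverter.fromYarn), which is exactly the gap described above.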
[jira] [Updated] (YARN-2879) Compatibility validation between YARN 2.2/2.4 and 2.6
[ https://issues.apache.org/jira/browse/YARN-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2879: -- Description: Recently, I did some simple backward compatibility experiments. Basically, I've taken the following 2 steps: 1. Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is submitted via *new* Hadoop (2.6) client. 2. Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is submitted via *old* Hadoop (2.2/2.4) client that comes with the app. I've tried these 2 steps on both insecure and secure clusters. Here's a short summary: || || || DS 2.2 || DS 2.4 || MR 2.2 + Shuffle and RT 2.2 || MR 2.2 + Shuffle and RT 2.6 || MR 2.4 + Shuffle and RT 2.4 || MR 2.4 + Shuffle and RT 2.6 || | Insecure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | Insecure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | | Secure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | Secure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | Note that I've tried to run NM with both old and new versions of the shuffle handler plus the runtime libs. In general, the compatibility looks good overall. There are a few issues related to MR, but they do not seem to be YARN issues. I'll post the individual problems in follow-up comments. was: Recently, I did some simple backward compatibility experiments. Basically, I've taken the following 2 steps: 1. Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is submitted via *new* Hadoop (2.6) client. 2. 
Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is submitted via *old* Hadoop (2.2/2.4) client that comes with the app. I've tried these 2 steps on both insecure and secure clusters. Here's a short summary: || || || DS 2.2 || DS 2.4 || MR 2.2 + Shuffle 2.2 || MR 2.2 + Shuffle 2.6 || MR 2.4 + Shuffle 2.4 || MR 2.4 + Shuffle 2.6 || | Insecure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | Insecure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | | Secure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | Secure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | Note that I've tried to run NM with both old and new versions of the shuffle handler plus the runtime libs. In general, the compatibility looks good overall. There are a few issues related to MR, but they do not seem to be YARN issues. I'll post the individual problems in follow-up comments. > Compatibility validation between YARN 2.2/2.4 and 2.6 > - > > Key: YARN-2879 > URL: https://issues.apache.org/jira/browse/YARN-2879 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Zhijie Shen > Assignee: Zhijie Shen > > Recently, I did some simple backward compatibility experiments. Basically, > I've taken the following 2 steps: > 1. Deploy the application (MR and DistributedShell) that is compiled against > *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is > submitted via *new* Hadoop (2.6) client. > 2. Deploy the application (MR and DistributedShell) that is compiled against > *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is > submitted via *old* Hadoop (2.2/2.4) client that comes with the app. > I've tried these 2 steps on both insecure and secure clusters. 
Here's a short > summary: > || || || DS 2.2 || DS 2.4 || MR 2.2 + Shuffle and RT 2.2 || MR 2.2 + Shuffle > and RT 2.6 || MR 2.4 + Shuffle and RT 2.4 || MR 2.4 + Shuffle and RT 2.6 || > | Insecure | New Client | OK | OK | Client Incompatible | Client Incompatible > | OK | OK | > | Insecure | Old Client | OK | OK | AM Incompatible | Client Incompatible | > OK | OK | > | Secure | New Client | OK | OK | Client Incompatible | Client Incompatible | > OK | OK | > | Secure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK > | OK | > Note that I've tried to run NM with both old and new versions of the shuffle > handler plus the runtime libs. > In general, the compatibility looks good overall. There are a few issues > related to MR, but they do not seem to be YARN issues. I'll post the > individual problems in follow-up comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2879) Compatibility validation between YARN 2.2/2.4 and 2.6
[ https://issues.apache.org/jira/browse/YARN-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2879: -- Description: Recently, I did some simple backward compatibility experiments. Basically, I've taken the following 2 steps: 1. Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is submitted via *new* Hadoop (2.6) client. 2. Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is submitted via *old* Hadoop (2.2/2.4) client that comes with the app. I've tried these 2 steps on both insecure and secure clusters. Here's a short summary: || || || DS 2.2 || DS 2.4 || MR 2.2 + Shuffle 2.2 || MR 2.2 + Shuffle 2.6 || MR 2.4 + Shuffle 2.4 || MR 2.4 + Shuffle 2.6 || | Insecure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | Insecure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | | Secure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | Secure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | Note that I've tried to run NM with both old and new versions of the shuffle handler plus the runtime libs. In general, the compatibility looks good overall. There are a few issues related to MR, but they do not seem to be YARN issues. I'll post the individual problems in follow-up comments. was: Recently, I did some simple backward compatibility experiments. Basically, I've taken the following 2 steps: 1. Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is submitted via *new* Hadoop (2.6) client. 2. Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). 
The application is submitted via *old* Hadoop (2.2/2.4) client that comes with the app. I've tried these 2 steps on both insecure and secure clusters. Here's a short summary: || || || DS 2.2 || DS 2.4 || MR 2.2 + Shuffle 2.2 || MR 2.2 + Shuffle 2.6 || MR 2.4 + Shuffle 2.4 || MR 2.4 + Shuffle 2.6 || | Insecure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | Insecure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | | secure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | secure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | Note that I've tried to run NM with both old and new shuffle handler versions. In general, the compatibility looks good overall. There are a few issues related to MR, but they do not seem to be YARN issues. I'll post the individual problems in follow-up comments. > Compatibility validation between YARN 2.2/2.4 and 2.6 > - > > Key: YARN-2879 > URL: https://issues.apache.org/jira/browse/YARN-2879 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Zhijie Shen > Assignee: Zhijie Shen > > Recently, I did some simple backward compatibility experiments. Basically, > I've taken the following 2 steps: > 1. Deploy the application (MR and DistributedShell) that is compiled against > *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is > submitted via *new* Hadoop (2.6) client. > 2. Deploy the application (MR and DistributedShell) that is compiled against > *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is > submitted via *old* Hadoop (2.2/2.4) client that comes with the app. > I've tried these 2 steps on both insecure and secure clusters. 
Here's a short > summary: > || || || DS 2.2 || DS 2.4 || MR 2.2 + Shuffle 2.2 || MR 2.2 + Shuffle 2.6 || > MR 2.4 + Shuffle 2.4 || MR 2.4 + Shuffle 2.6 || > | Insecure | New Client | OK | OK | Client Incompatible | Client Incompatible > | OK | OK | > | Insecure | Old Client | OK | OK | AM Incompatible | Client Incompatible | > OK | OK | > | Secure | New Client | OK | OK | Client Incompatible | Client Incompatible | > OK | OK | > | Secure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK > | OK | > Note that I've tried to run NM with both old and new versions of the shuffle > handler plus the runtime libs. > In general, the compatibility looks good overall. There are a few issues > related to MR, but they do not seem to be YARN issues. I'll post the > individual problems in follow-up comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-2879) Compatibility validation between YARN 2.2/2.4 and 2.6
[ https://issues.apache.org/jira/browse/YARN-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218388#comment-14218388 ] Zhijie Shen edited comment on YARN-2879 at 11/19/14 7:50 PM: - a. In the following scenarios: 1. Either insecure or secure; 2. MR 2.2 with either old or new shuffle handler on NM; 3. Submitting via new client. We will see the following console exception: {code} 14/11/17 23:47:45 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/zjshen/.staging/zjshen/.staging/job_1416270549965_0014 java.lang.NoSuchMethodError: org.apache.hadoop.http.HttpConfig.getSchemePrefix()Ljava/lang/String; at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:428) at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:302) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:430) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.run(RunJar.java:221) at org.apache.hadoop.util.RunJar.main(RunJar.java:136) {code} b. In the following scenarios: 1. Either insecure or secure; 2. MR 2.2 with old shuffle on NM; 3. Submitting via old client. We will see the following exception in the AM Log: {code} 2014-11-17 15:09:06,157 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1416264695865_0007_01 2014-11-17 15:09:06,436 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.NoSuchMethodError: org.apache.hadoop.http.HttpConfig.setPolicy(Lorg/apache/hadoop/http/HttpConfig$Policy;)V at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1364) 2014-11-17 15:09:06,439 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. {code} The two exceptions are actually the same problem, but using the old client prevents it happening during app submission. Will file a separate Jira for it. was (Author: zjshen): a. In the following scenarios: 1. Either insecure or secure; 2. MR 2.2 with either old or new shuffle handler on NM; 3. Submitting via new client. 
We will see the following console exception: {code} 14/11/17 23:47:45 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/zjshen/.staging/zjshen/.staging/job_1416270549965_0014 java.lang.NoSuchMethodError: org.apache.hadoop.http.HttpConfig.getSchemePrefix()Ljava/lang/String; at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:428) at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:302) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:430) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.
[jira] [Commented] (YARN-2879) Compatibility validation between YARN 2.2/2.4 and 2.6
[ https://issues.apache.org/jira/browse/YARN-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218388#comment-14218388 ] Zhijie Shen commented on YARN-2879: --- a. In the following scenarios: 1. Either insecure or secure; 2. MR 2.2 with either old or new shuffle handler on NM; 3. Submitting via new client. We will see the following console exception: {code} 14/11/17 23:47:45 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/zjshen/.staging/zjshen/.staging/job_1416270549965_0014 java.lang.NoSuchMethodError: org.apache.hadoop.http.HttpConfig.getSchemePrefix()Ljava/lang/String; at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:428) at org.apache.hadoop.mapred.YARNRunner.submitJob(YARNRunner.java:302) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:430) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268) at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286) at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306) at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71) at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.run(RunJar.java:221) at org.apache.hadoop.util.RunJar.main(RunJar.java:136) {code} b. In the following scenarios: 1. Either insecure or secure; 2. MR 2.2 with old shuffle on NM; 3. Submitting via old client. We will see the following exception in the AM Log: {code} 2014-11-17 15:09:06,157 INFO [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Created MRAppMaster for application appattempt_1416264695865_0007_01 2014-11-17 15:09:06,436 FATAL [main] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster java.lang.NoSuchMethodError: org.apache.hadoop.http.HttpConfig.setPolicy(Lorg/apache/hadoop/http/HttpConfig$Policy;)V at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1364) 2014-11-17 15:09:06,439 INFO [Thread-1] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a signal. Signaling RMCommunicator and JobHistoryEventHandler. {code} The two exceptions are actually the same problem, but using the old client prevents it happening during app submission. Will file a separate Jira for it. > Compatibility validation between YARN 2.2/2.4 and 2.6 > - > > Key: YARN-2879 > URL: https://issues.apache.org/jira/browse/YARN-2879 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Zhijie Shen > Assignee: Zhijie Shen > > Recently, I did some simple backward compatibility experiments. Basically, > I've taken the following 2 steps: > 1. 
Deploy the application (MR and DistributedShell) that is compiled against > *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is > submitted via *new* Hadoop (2.6) client. > 2. Deploy the application (MR and DistributedShell) that is compiled against > *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is > submitted via *old* Hadoop (2.2/2.4) client that comes with the app. > I've tried these 2 steps on both insecure and secure clusters. Here's a short > summary: > || || || DS 2.2 || DS 2.4 || MR 2.2 + Shuffle 2.2 || MR 2.2 + Shuffle 2.6 || > MR 2.4 + Shuffle 2.4 || MR 2.4 + Shuffle 2.6 || > | Insecure | New Client | OK | OK | Client Incompatible | Client Incompatible > | OK | OK | > | Insecure | Old Client | OK | OK | AM Incompatible | Client Incompatible | > OK | OK | > | Secure | New Client | OK | OK | Client Incompatible | Client Incompatible | > OK | OK | > | Secure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK > | OK | > Note that I've tried to run NM with both old and new shuffle handler versions. > In general, the compatibility looks good overall. There are a few issues > related to MR, but they do not seem to be YARN issues. I'll post the > individual problems in follow-up comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
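The NoSuchMethodError traces above (HttpConfig.getSchemePrefix on the client side, HttpConfig.setPolicy in the AM) are binary-compatibility failures: the 2.2 MR code was compiled against one version of HttpConfig and linked at runtime against another that lacks the method, so the JVM fails at link time. One generic way to cope with such gaps is to probe for the method reflectively before depending on it. This is an illustrative sketch of that pattern, not the actual Hadoop fix; the probe against java.lang.String merely mimics asking an old class for a method added in a later release:

```java
public class MethodProbe {
    // Returns true when the named zero-argument public method exists on the
    // class. Reflection avoids a hard link-time dependency, so a missing
    // method yields false instead of a NoSuchMethodError.
    public static boolean hasMethod(String className, String methodName) {
        try {
            Class.forName(className).getMethod(methodName);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // String#length exists in every JDK; the second probe stands in for
        // querying a pre-2.4 HttpConfig for a method it never had.
        System.out.println(hasMethod("java.lang.String", "length"));          // true
        System.out.println(hasMethod("java.lang.String", "getSchemePrefix")); // false
    }
}
```

A caller can then branch on the probe result and fall back to older behavior, which is how a single binary can run against both old and new versions of a dependency.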
[jira] [Created] (YARN-2879) Compatibility validation between YARN 2.2/2.4 and 2.6
Zhijie Shen created YARN-2879: - Summary: Compatibility validation between YARN 2.2/2.4 and 2.6 Key: YARN-2879 URL: https://issues.apache.org/jira/browse/YARN-2879 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Recently, I did some simple backward compatibility experiments. Bascially, I've taken the following 2 steps: 1. Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is submitted via *new* Hadoop (2.6) client. 2. Deploy the application (MR and DistributedShell) that is compiled against *old* YARN API (2.2/2.4) on *new* YARN cluster (2.6). The application is submitted via *old* Hadoop (2.2/2.4) client that comes with the app. I've tried these 2 steps on both insecure and secure cluster. Here's a short summary: || || || DS 2.2 || DS 2.4 || MR 2.2 + Shuffle 2.2 || MR 2.2 + Shuffle 2.6 || MR 2.4 + Shuffle 2.4 || MR 2.4 + Shuffle 2.6 || | Insecure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | Insecure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | | secure | New Client | OK | OK | Client Incompatible | Client Incompatible | OK | OK | | secure | Old Client | OK | OK | AM Incompatible | Client Incompatible | OK | OK | Note that I've tried to run NM with both old and new shuffle handler version. In general, the compatibility looks good overall. There're a few issues that are related to MR, but they seem to be not the YARN issue. I'll post the individual problem in the follow-up comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2522) AHSClient may be not necessary
[ https://issues.apache.org/jira/browse/YARN-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2522: -- Target Version/s: 2.7.0 > AHSClient may be not necessary > -- > > Key: YARN-2522 > URL: https://issues.apache.org/jira/browse/YARN-2522 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver > Reporter: Zhijie Shen > Assignee: Zhijie Shen > > Per discussion in > [YARN-2033|https://issues.apache.org/jira/browse/YARN-2033?focusedCommentId=14126073&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14126073], > it may not be necessary to have a separate AHSClient. The methods can be > incorporated into TimelineClient. APPLICATION_HISTORY_ENABLED would also be useless > then. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2870) Update examples in document of Timeline Server
[ https://issues.apache.org/jira/browse/YARN-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217074#comment-14217074 ] Zhijie Shen commented on YARN-2870: --- It would be better to completely update the document (YARN-2854). Anyway, the patch is ready now; let's commit it. Thanks for the contribution, [~iwasakims]! > Update examples in document of Timeline Server > -- > > Key: YARN-2870 > URL: https://issues.apache.org/jira/browse/YARN-2870 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, timelineserver > Reporter: Masatake Iwasaki > Assignee: Masatake Iwasaki > Priority: Trivial > Attachments: YARN-2870.1.patch > > > YARN-1982 renamed historyserver to timelineserver, but the deprecated name > still appears in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2870) Update examples in document of Timeline Server
[ https://issues.apache.org/jira/browse/YARN-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2870: -- Assignee: Masatake Iwasaki > Update examples in document of Timeline Server > -- > > Key: YARN-2870 > URL: https://issues.apache.org/jira/browse/YARN-2870 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, timelineserver > Reporter: Masatake Iwasaki > Assignee: Masatake Iwasaki > Priority: Trivial > Attachments: YARN-2870.1.patch > > > YARN-1982 renamed historyserver to timelineserver, but the deprecated name > still appears in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217049#comment-14217049 ] Zhijie Shen commented on YARN-2375: --- bq. Do you mean that we should not check for TIMELINE_SERVICE_ENABLED flag in the Application Master and rather have it work same way that it was doing before and only check that flag while sending data to timeline server? I think the logic could be: when TIMELINE_SERVICE_ENABLED == true, read the domain env var and construct the timeline client. Only if the timeline client is not null will the AM send the data to the timeline server where it should. > Allow enabling/disabling timeline server per framework > -- > > Key: YARN-2375 > URL: https://issues.apache.org/jira/browse/YARN-2375 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Jonathan Eagles > Assignee: Mit Desai > Attachments: YARN-2375.patch, YARN-2375.patch > > > This JIRA is to remove the ats enabled flag check within the > TimelineClientImpl. Example where this fails is below. > While running secure timeline server with ats flag set to disabled on > resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
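The logic proposed in the comment above — construct the client only when the flag is on, and publish only through a non-null client — can be sketched as follows. This is a minimal illustration, assuming a simplified stand-in for the real org.apache.hadoop.yarn.client.api.TimelineClient and a plain map in place of a YARN Configuration:

```java
import java.util.HashMap;
import java.util.Map;

public class TimelineGuard {
    static final String TIMELINE_SERVICE_ENABLED = "yarn.timeline-service.enabled";

    // Simplified stand-in for the real TimelineClient API.
    interface TimelineClient {
        void putEntity(String entity);
    }

    // Construct the client only when the service is enabled; otherwise return
    // null, so the AM still runs but never talks to the timeline server.
    static TimelineClient createClient(Map<String, String> conf) {
        boolean enabled = Boolean.parseBoolean(
                conf.getOrDefault(TIMELINE_SERVICE_ENABLED, "false"));
        return enabled ? entity -> System.out.println("published " + entity) : null;
    }

    // Every publish is gated on the client being non-null, which is what keeps
    // a disabled timeline service from turning into an NPE later.
    static boolean publish(TimelineClient client, String entity) {
        if (client == null) {
            return false; // service disabled: silently skip
        }
        client.putEntity(entity);
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(TIMELINE_SERVICE_ENABLED, "true");
        System.out.println(publish(createClient(conf), "DS_APP_ATTEMPT")); // true
        conf.put(TIMELINE_SERVICE_ENABLED, "false");
        System.out.println(publish(createClient(conf), "DS_APP_ATTEMPT")); // false
    }
}
```

The point of the design is that the enabled check happens exactly once, at client construction time, rather than being scattered through (or hidden inside) TimelineClientImpl.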
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216973#comment-14216973 ] Zhijie Shen commented on YARN-2375: --- [~mitdesai], thanks for the patch. Two suggestions: 1. We should still let DS work when the timeline service is disabled, and we just need to prevent sending the timeline data to the timeline server while the DS app is running. 2. In JobHistoryEventHandler we need to check both the global config and the MR-specific config to decide whether we emit MR history events. > Allow enabling/disabling timeline server per framework > -- > > Key: YARN-2375 > URL: https://issues.apache.org/jira/browse/YARN-2375 > Project: Hadoop YARN > Issue Type: Sub-task > Reporter: Jonathan Eagles > Assignee: Mit Desai > Attachments: YARN-2375.patch, YARN-2375.patch > > > This JIRA is to remove the ats enabled flag check within the > TimelineClientImpl. Example where this fails is below. > While running secure timeline server with ats flag set to disabled on > resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
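Suggestion 2 above — gating MR history emission on both the global YARN flag and a per-framework MR flag — reduces to a simple conjunction. A hedged sketch, where the global property name follows yarn-default.xml and the MR-specific property name is an illustrative assumption, not necessarily the key the final patch uses:

```java
import java.util.HashMap;
import java.util.Map;

public class TimelineEmitPolicy {
    // Global YARN flag (as in yarn-default.xml); the MR-specific flag name
    // below is an assumption made for this sketch.
    static final String GLOBAL_FLAG = "yarn.timeline-service.enabled";
    static final String MR_FLAG = "mapreduce.job.emit-timeline-data";

    // MR history events go to the timeline server only when BOTH flags are on;
    // either one being off silently disables emission without failing the job.
    static boolean shouldEmit(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(GLOBAL_FLAG, "false"))
            && Boolean.parseBoolean(conf.getOrDefault(MR_FLAG, "false"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(GLOBAL_FLAG, "true");
        System.out.println(shouldEmit(conf)); // false: MR flag still off
        conf.put(MR_FLAG, "true");
        System.out.println(shouldEmit(conf)); // true: both flags on
    }
}
```

This mirrors the review point: the per-framework flag decides whether data is sent, while a disabled global service must never crash the application.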
[jira] [Commented] (YARN-2165) Timelineserver should validate that yarn.timeline-service.ttl-ms is greater than zero
[ https://issues.apache.org/jira/browse/YARN-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216914#comment-14216914 ] Zhijie Shen commented on YARN-2165: --- [~vasanthkumar], thanks for your contribution! Some comments about the patch. 1. TIMELINE_SERVICE_CLIENT_MAX_RETRIES can be -1 for endless retry. It's good to make it clear in yarn-default.xml too. 2. Instead of {{" property value should be positive and non-zero"}}, can we simply say {{" property value should be greater than zero}}? 3. You can use {{com.google.common.base.Preconditions.checkArgument}}. 4. Multiple lines are longer than 80 chars. 5. TIMELINE_SERVICE_LEVELDB_READ_CACHE_SIZE can be zero. 6. TIMELINE_SERVICE_LEVELDB_START_TIME_READ_CACHE_SIZE and TIMELINE_SERVICE_LEVELDB_START_TIME_WRITE_CACHE_SIZE seems to be > 0 because LRUMap requires this. However, ideally we should be able to disable cache completely. Let's deal with it separately. > Timelineserver should validate that yarn.timeline-service.ttl-ms is greater > than zero > - > > Key: YARN-2165 > URL: https://issues.apache.org/jira/browse/YARN-2165 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Karam Singh >Assignee: Vasanth kumar RJ > Attachments: YARN-2165.1.patch, YARN-2165.2.patch, YARN-2165.patch > > > Timelineserver should validate that yarn.timeline-service.ttl-ms is greater > than zero > Currently if set yarn.timeline-service.ttl-ms=0 > Or yarn.timeline-service.ttl-ms=-86400 > Timeline server start successfully with complaining > {code} > 2014-06-15 14:52:16,562 INFO timeline.LeveldbTimelineStore > (LeveldbTimelineStore.java:(247)) - Starting deletion thread with ttl > -60480 and cycle interval 30 > {code} > At starting timelinserver should that yarn.timeline-service-ttl-ms > 0 > otherwise specially for -ive value discard oldvalues timestamp will be set > future value. 
Which may lead to inconsistency in behavior > {code} > public void run() { > while (true) { > long timestamp = System.currentTimeMillis() - ttl; > try { > discardOldEntities(timestamp); > Thread.sleep(ttlInterval); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
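The validation suggested in the review comments above (a strictly-positive TTL, rejected at startup) can be sketched in plain Java. This is an illustrative stand-in for Guava's {{Preconditions.checkArgument}} mentioned in the comment; the class and method names are made up, not the actual YARN code.

```java
public class TtlValidation {

    // Reject a non-positive TTL at startup instead of letting the deletion
    // thread run with a timestamp in the future.
    static long validateTtlMs(long ttlMs) {
        if (ttlMs <= 0) {
            throw new IllegalArgumentException(
                "yarn.timeline-service.ttl-ms property value should be greater than zero: "
                    + ttlMs);
        }
        return ttlMs;
    }

    public static void main(String[] args) {
        System.out.println(validateTtlMs(604800000L)); // a valid 7-day TTL
        try {
            validateTtlMs(-86400L); // the bad value from the report above
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With Guava on the classpath, the body of {{validateTtlMs}} collapses to a single {{Preconditions.checkArgument(ttlMs > 0, ...)}} call, which is what the review comment proposes.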
[jira] [Resolved] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen resolved YARN-2838. --- Resolution: Not a Problem Close the ticket and work on separate jiras. > Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0, 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2870) Update examples in document of Timeline Server
[ https://issues.apache.org/jira/browse/YARN-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2870: -- Component/s: timelineserver > Update examples in document of Timeline Server > -- > > Key: YARN-2870 > URL: https://issues.apache.org/jira/browse/YARN-2870 > Project: Hadoop YARN > Issue Type: Bug > Components: documentation, timelineserver >Reporter: Masatake Iwasaki >Priority: Trivial > Attachments: YARN-2870.1.patch > > > YARN-1982 renamed historyserver to timelineserver but there is still > deprecated name in docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2867) TimelineClient DT methods should check if the timeline service is enabled or not
[ https://issues.apache.org/jira/browse/YARN-2867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen resolved YARN-2867. --- Resolution: Invalid Per discussion on [YARN-2375|https://issues.apache.org/jira/browse/YARN-2375?focusedCommentId=14213002&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14213002], close this Jira as invalid > TimelineClient DT methods should check if the timeline service is enabled or > not > > > Key: YARN-2867 > URL: https://issues.apache.org/jira/browse/YARN-2867 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.6.0 >Reporter: Zhijie Shen > > DT related methods doesn't check if isEnabled == true. On the other side, the > internal stuff is only inited when isEnabled == true. NPE happens if users > call these methods when the timeline service config is not set to enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213002#comment-14213002 ] Zhijie Shen commented on YARN-2375: --- [~jeagles], thanks for the clarification. bq. I am proposing to retain the flag. However, the responsibility of checking whether the ats is enabled needs to be outside of the TimelineClientImpl. It makes sense to me. If we make this change, YARN-2867 is no longer necessary. I will go ahead and close it. > Allow enabling/disabling timeline server per framework > -- > > Key: YARN-2375 > URL: https://issues.apache.org/jira/browse/YARN-2375 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Mit Desai > > This JIRA is to remove the ats enabled flag check within the > TimelineClientImpl. Example where this fails is below. > While running secure timeline server with ats flag set to disabled on > resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2862) RM might not start if the machine was hard shutdown and FileSystemRMStateStore was used
[ https://issues.apache.org/jira/browse/YARN-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212989#comment-14212989 ] Zhijie Shen commented on YARN-2862: --- It is likely that the assumption we made in [YARN-1776|https://issues.apache.org/jira/browse/YARN-1776?focusedCommentId=13942201&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13942201] is not fully correct. When updating a state file, we (1) write the new file to .new, (2) delete the existing one, and (3) rename the .new to the existing file name. If crash happens before (2), we use .new to recover the state file when loading the state (see FileSystemRMStateStore#checkAndResumeUpdateOperation). According to the description here, RM can crash when (1) is in progress, and leave a corrupted .new file. It seems that we have to do additional validation to check if .new file is corrupted or not, or just simply ignore it . > RM might not start if the machine was hard shutdown and > FileSystemRMStateStore was used > --- > > Key: YARN-2862 > URL: https://issues.apache.org/jira/browse/YARN-2862 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Ming Ma > > This might be a known issue. Given FileSystemRMStateStore isn't used for HA > scenario, it might not be that important, unless there is something we need > to fix at RM layer to make it more tolerant to RMStore issue. > When RM was hard shutdown, OS might not get a chance to persist blocks. Some > of the stored application data end up with size zero after reboot. And RM > didn't like that. > {noformat} > ls -al > /var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351 > total 156 > drwxr-xr-x.2 x y 4096 Nov 13 16:45 . > drwxr-xr-x. 1524 x y 151552 Nov 13 16:45 .. 
> -rw-r--r--.1 x y 0 Nov 13 16:45 > appattempt_1412702189634_324351_01 > -rw-r--r--.1 x y 0 Nov 13 16:45 > .appattempt_1412702189634_324351_01.crc > -rw-r--r--.1 x y 0 Nov 13 16:45 application_1412702189634_324351 > -rw-r--r--.1 x y 0 Nov 13 16:45 .application_1412702189634_324351.crc > {noformat} > When RM starts up > {noformat} > 2014-11-13 16:55:25,844 WARN org.apache.hadoop.fs.FSInputChecker: Problem > opening checksum file: > file:/var/log/hadoop/rmstore/FSRMStateRoot/RMAppRoot/application_1412702189634_324351/application_1412702189634_324351. > Ignoring exception: > java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:197) > at java.io.DataInputStream.readFully(DataInputStream.java:169) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:146) > at > org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:792) > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.readFile(FileSystemRMStateStore.java:501) > ... 
> 2014-11-13 17:40:48,876 ERROR > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to > load/recover state > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ApplicationState.getAppId(RMStateStore.java:184) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:306) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:425) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1027) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:484) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:834) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
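The update sequence described in the comment above -- (1) write the new content to {{.new}}, (2) delete the existing file, (3) rename {{.new}} into place -- and the proposed recovery rule (ignore a corrupted {{.new}} rather than trust it) can be sketched as follows. This is a hypothetical illustration, not the actual FileSystemRMStateStore code; a real store would verify a checksum rather than just a non-zero length.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class StateFileUpdate {

    static Path newPath(Path file) {
        return file.resolveSibling(file.getFileName() + ".new");
    }

    // Step (1) may be interrupted by a hard shutdown, leaving a truncated
    // .new file; steps (2) and (3) then never run.
    static void update(Path file, byte[] data) throws IOException {
        Path tmp = newPath(file);
        Files.write(tmp, data);      // (1) write the new content
        Files.deleteIfExists(file);  // (2) delete the existing file
        Files.move(tmp, file);       // (3) rename .new to the final name
    }

    // Recovery: prefer the real file; fall back to .new only when it looks
    // complete, and ignore an empty leftover from a crash during step (1).
    static byte[] recover(Path file) throws IOException {
        Path tmp = newPath(file);
        if (Files.exists(file)) {
            return Files.readAllBytes(file);
        }
        if (Files.exists(tmp) && Files.size(tmp) > 0) {
            return Files.readAllBytes(tmp);
        }
        return null; // nothing recoverable
    }

    static String roundTrip(String content) throws IOException {
        Path f = Files.createTempDirectory("rmstate").resolve("application_0001");
        update(f, content.getBytes());
        return new String(recover(f));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("state-v1"));
    }
}
```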
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212932#comment-14212932 ] Zhijie Shen commented on YARN-2375: --- Filed YARN-2867 > Allow enabling/disabling timeline server per framework > -- > > Key: YARN-2375 > URL: https://issues.apache.org/jira/browse/YARN-2375 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Mit Desai > > This JIRA is to remove the ats enabled flag check within the > TimelineClientImpl. Example where this fails is below. > While running secure timeline server with ats flag set to disabled on > resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2867) TimelineClient DT methods should check if the timeline service is enabled or not
Zhijie Shen created YARN-2867: - Summary: TimelineClient DT methods should check if the timeline service is enabled or not Key: YARN-2867 URL: https://issues.apache.org/jira/browse/YARN-2867 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Zhijie Shen DT related methods doesn't check if isEnabled == true. On the other side, the internal stuff is only inited when isEnabled == true. NPE happens if users call these methods when the timeline service config is not set to enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2375) Allow enabling/disabling timeline server per framework
[ https://issues.apache.org/jira/browse/YARN-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212928#comment-14212928 ] Zhijie Shen commented on YARN-2375: --- bq. While running secure timeline server with ats flag set to disabled on resource manager, Timeline delegation token renewer throws an NPE. This is a bug. DT-related API methods don't check if isEnabled == true. On the other side, the internal stuff is only initialized when isEnabled == true. This is why the NPE happens. I will file a separate Jira for it. As to removing the global flag, I'm not sure if we should do that. Nowadays, we still don't assume the timeline server is always up, unlike the other components in a YARN cluster: RM and NM. Then, if the timeline server is not set up but the YARN cluster assumes it is up, it will result in problems. For example, app submission fails at getting the timeline DT in a secure cluster. Therefore, this config should be kept to serve as the flag indicating whether we have set up the timeline server for the YARN cluster, until we promote it to be an always-on daemon like the RM and NM. Thoughts? > Allow enabling/disabling timeline server per framework > -- > > Key: YARN-2375 > URL: https://issues.apache.org/jira/browse/YARN-2375 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Jonathan Eagles >Assignee: Mit Desai > > This JIRA is to remove the ats enabled flag check within the > TimelineClientImpl. Example where this fails is below. > While running secure timeline server with ats flag set to disabled on > resource manager, Timeline delegation token renewer throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
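The division of responsibility discussed in this thread -- the caller checks the "timeline service enabled" flag before invoking DT methods, so the client never dereferences internals that were skipped during init -- can be sketched like this. The stub class and method names are illustrative; they are not the real TimelineClientImpl API.

```java
public class EnabledGuard {

    // Minimal stand-in for a client whose internals are only initialized
    // when the service is enabled.
    static class TimelineClientStub {
        private final boolean enabled;
        private final Object delegationTokenService; // null when disabled

        TimelineClientStub(boolean enabled) {
            this.enabled = enabled;
            this.delegationTokenService = enabled ? new Object() : null;
        }

        boolean isEnabled() { return enabled; }

        String getDelegationToken() {
            // Without a caller-side guard this would NPE when disabled.
            return "token-" + delegationTokenService.hashCode();
        }
    }

    // The caller, not the client, decides whether to ask for a token.
    static String fetchTokenIfEnabled(TimelineClientStub client) {
        return client.isEnabled() ? client.getDelegationToken() : null;
    }

    public static void main(String[] args) {
        System.out.println(fetchTokenIfEnabled(new TimelineClientStub(false))); // null, no NPE
    }
}
```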
[jira] [Commented] (YARN-2166) Timelineserver should validate that yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms is greater than zero when level db is for timeline store
[ https://issues.apache.org/jira/browse/YARN-2166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212600#comment-14212600 ] Zhijie Shen commented on YARN-2166: --- See the comments on [YARN-2165|https://issues.apache.org/jira/browse/YARN-2165?focusedCommentId=14212595&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14212595]. How about having one pass to do the sanity check for all numeric configs? > Timelineserver should validate that > yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms is greater than > zero when level db is for timeline store > - > > Key: YARN-2166 > URL: https://issues.apache.org/jira/browse/YARN-2166 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Karam Singh > > Timelineserver should validate that > yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms is greater than > zero when leveldb is used for the timeline store > otherwise, if we start the timeline server with > yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms=-5000 > the timeline server starts, but the Thread.sleep call in EntityDeletionThread.run keeps > throwing an uncaught exception for the negative value > {code} > 2014-06-16 10:22:03,537 ERROR yarn.YarnUncaughtExceptionHandler > (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread > Thread[Thread-4,5,main] threw an Exception. > java.lang.IllegalArgumentException: timeout value is negative > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore$EntityDeletionThread.run(LeveldbTimelineStore.java:257) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2165) Timelineserver should validate that yarn.timeline-service.ttl-ms is greater than zero
[ https://issues.apache.org/jira/browse/YARN-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212595#comment-14212595 ] Zhijie Shen commented on YARN-2165: --- bq. should the check be (<= 0) instead of (< 0) ? Since 0 ttl and ttlinterval have no real meanings. Agree. To be more general, it's better to do the sanity check for all the numeric configurations while initializing the timeline server, making sure a valid number has been set. Here's the current list. {code} Time to live for timeline store data in milliseconds. yarn.timeline-service.ttl-ms 60480 Length of time to wait between deletion cycles of leveldb timeline store in milliseconds. yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms 30 Size of read cache for uncompressed blocks for leveldb timeline store in bytes. yarn.timeline-service.leveldb-timeline-store.read-cache-size 104857600 Size of cache for recently read entity start times for leveldb timeline store in number of entities. yarn.timeline-service.leveldb-timeline-store.start-time-read-cache-size 1 Size of cache for recently written entity start times for leveldb timeline store in number of entities. yarn.timeline-service.leveldb-timeline-store.start-time-write-cache-size 1 Handler thread count to serve the client RPC requests. yarn.timeline-service.handler-thread-count 10 Default maximum number of retries for the timeline service client. yarn.timeline-service.client.max-retries 30 Default retry time interval for the timeline service client. 
yarn.timeline-service.client.retry-interval-ms 1000 {code} > Timelineserver should validate that yarn.timeline-service.ttl-ms is greater > than zero > - > > Key: YARN-2165 > URL: https://issues.apache.org/jira/browse/YARN-2165 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Karam Singh > Attachments: YARN-2165.patch > > > Timelineserver should validate that yarn.timeline-service.ttl-ms is greater > than zero > Currently if we set yarn.timeline-service.ttl-ms=0 > or yarn.timeline-service.ttl-ms=-86400 > the timeline server starts successfully without complaining > {code} > 2014-06-15 14:52:16,562 INFO timeline.LeveldbTimelineStore > (LeveldbTimelineStore.java:(247)) - Starting deletion thread with ttl > -60480 and cycle interval 30 > {code} > At startup the timeline server should validate that yarn.timeline-service.ttl-ms > 0; > otherwise, especially for a negative value, the discard-old-entities timestamp will be set > to a future value, which may lead to inconsistency in behavior > {code} > public void run() { > while (true) { > long timestamp = System.currentTimeMillis() - ttl; > try { > discardOldEntities(timestamp); > Thread.sleep(ttlInterval); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
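The "one pass over all numeric configs" idea from the comment above could look roughly like this: each key carries its minimum legal value (-1 for the "endless retry" key, 0 for a cache that may be disabled, 1 for strictly positive intervals), and startup rejects anything below the minimum. The key list mirrors the comment; the checking helper itself is hypothetical, not YARN code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NumericConfigCheck {

    // key -> minimum legal value
    static final Map<String, Long> MINIMUMS = new LinkedHashMap<>();
    static {
        MINIMUMS.put("yarn.timeline-service.ttl-ms", 1L);
        MINIMUMS.put("yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms", 1L);
        MINIMUMS.put("yarn.timeline-service.leveldb-timeline-store.read-cache-size", 0L);
        MINIMUMS.put("yarn.timeline-service.client.max-retries", -1L); // -1 = retry forever
    }

    // One pass at startup: reject any configured value below its minimum.
    static void checkOne(String key, long value) {
        Long min = MINIMUMS.get(key);
        if (min != null && value < min) {
            throw new IllegalArgumentException(key
                + " property value should be at least " + min + ", but was " + value);
        }
    }

    public static void main(String[] args) {
        checkOne("yarn.timeline-service.client.max-retries", -1L); // legal: endless retry
        try {
            checkOne("yarn.timeline-service.ttl-ms", -86400L);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected at startup: " + e.getMessage());
        }
    }
}
```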
[jira] [Updated] (YARN-2861) Timeline DT secret manager should not reuse the RM's configs.
[ https://issues.apache.org/jira/browse/YARN-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2861: -- Attachment: YARN-2861.1.patch Straightforward change: creating separate set of configs for the timeline DT > Timeline DT secret manager should not reuse the RM's configs. > - > > Key: YARN-2861 > URL: https://issues.apache.org/jira/browse/YARN-2861 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2861.1.patch > > > This is the configs for RM DT secret manager. We should create separate ones > for timeline DT only. > {code} > @Override > protected void serviceInit(Configuration conf) throws Exception { > long secretKeyInterval = > conf.getLong(YarnConfiguration.DELEGATION_KEY_UPDATE_INTERVAL_KEY, > YarnConfiguration.DELEGATION_KEY_UPDATE_INTERVAL_DEFAULT); > long tokenMaxLifetime = > conf.getLong(YarnConfiguration.DELEGATION_TOKEN_MAX_LIFETIME_KEY, > YarnConfiguration.DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT); > long tokenRenewInterval = > conf.getLong(YarnConfiguration.DELEGATION_TOKEN_RENEW_INTERVAL_KEY, > YarnConfiguration.DELEGATION_TOKEN_RENEW_INTERVAL_DEFAULT); > secretManager = new > TimelineDelegationTokenSecretManager(secretKeyInterval, > tokenMaxLifetime, tokenRenewInterval, > 360); > secretManager.startThreads(); > serviceAddr = TimelineUtils.getTimelineTokenServiceAddress(getConfig()); > super.init(conf); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2861) Timeline DT secret manager should not reuse the RM's configs.
Zhijie Shen created YARN-2861: - Summary: Timeline DT secret manager should not reuse the RM's configs. Key: YARN-2861 URL: https://issues.apache.org/jira/browse/YARN-2861 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen This is the configs for RM DT secret manager. We should create separate ones for timeline DT only. {code} @Override protected void serviceInit(Configuration conf) throws Exception { long secretKeyInterval = conf.getLong(YarnConfiguration.DELEGATION_KEY_UPDATE_INTERVAL_KEY, YarnConfiguration.DELEGATION_KEY_UPDATE_INTERVAL_DEFAULT); long tokenMaxLifetime = conf.getLong(YarnConfiguration.DELEGATION_TOKEN_MAX_LIFETIME_KEY, YarnConfiguration.DELEGATION_TOKEN_MAX_LIFETIME_DEFAULT); long tokenRenewInterval = conf.getLong(YarnConfiguration.DELEGATION_TOKEN_RENEW_INTERVAL_KEY, YarnConfiguration.DELEGATION_TOKEN_RENEW_INTERVAL_DEFAULT); secretManager = new TimelineDelegationTokenSecretManager(secretKeyInterval, tokenMaxLifetime, tokenRenewInterval, 360); secretManager.startThreads(); serviceAddr = TimelineUtils.getTimelineTokenServiceAddress(getConfig()); super.init(conf); } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
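The direction proposed in YARN-2861 -- timeline-specific config keys with their own defaults, instead of reusing the RM's delegation-token keys -- might be sketched as below. The key name and default here are hypothetical stand-ins; the actual keys are defined in the attached patch, not here.

```java
public class TimelineDtConfigs {

    // Hypothetical timeline-specific key replacing the RM's
    // DELEGATION_KEY_UPDATE_INTERVAL_KEY for the timeline DT secret manager.
    static final String KEY_UPDATE_INTERVAL =
        "yarn.timeline-service.delegation.key-update-interval";
    static final long KEY_UPDATE_INTERVAL_DEFAULT = 24L * 60 * 60 * 1000; // one day

    // A configured value wins; otherwise the timeline-specific default
    // applies, independent of whatever the RM's secret manager uses.
    static long resolveKeyUpdateInterval(Long configured) {
        return configured != null ? configured : KEY_UPDATE_INTERVAL_DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(KEY_UPDATE_INTERVAL + " = "
            + resolveKeyUpdateInterval(null)); // falls back to the default
    }
}
```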
[jira] [Commented] (YARN-2766) ApplicationHistoryManager is expected to return a sorted list of apps/attempts/containers
[ https://issues.apache.org/jira/browse/YARN-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211337#comment-14211337 ] Zhijie Shen commented on YARN-2766: --- +1. Will commit the patch. > ApplicationHistoryManager is expected to return a sorted list of > apps/attempts/containers > -- > > Key: YARN-2766 > URL: https://issues.apache.org/jira/browse/YARN-2766 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Robert Kanter >Assignee: Robert Kanter > Attachments: YARN-2766.patch, YARN-2766.patch, YARN-2766.patch, > YARN-2766.patch > > > {{TestApplicationHistoryClientService.testContainers}} and > {{TestApplicationHistoryClientService.testApplicationAttempts}} both fail > because the test assertions are assuming a returned Collection is in a > certain order. The collection comes from a HashMap, so the order is not > guaranteed, plus, according to [this > page|http://docs.oracle.com/javase/8/docs/technotes/guides/collections/changes8.html], > there are situations where the iteration order of a HashMap will be > different between Java 7 and 8. > We should fix the test code to not assume a specific ordering. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
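The test fix discussed in YARN-2766 boils down to not asserting on the iteration order of a HashMap-backed collection (which can differ between Java 7 and 8) and instead comparing contents order-insensitively. A minimal sketch, with made-up container ids:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OrderInsensitiveAssert {

    // Compare two id arrays ignoring order, the way the fixed tests compare
    // a returned collection against the expected reports.
    static boolean sameIds(String[] actual, String[] expected) {
        Set<String> a = new HashSet<>(Arrays.asList(actual));
        Set<String> e = new HashSet<>(Arrays.asList(expected));
        return a.equals(e);
    }

    public static void main(String[] args) {
        // Hash order may yield either sequence; the set comparison accepts both.
        System.out.println(sameIds(
            new String[] {"container_02", "container_01"},
            new String[] {"container_01", "container_02"})); // true
    }
}
```

Sorting both sides before an ordered comparison works equally well when duplicates matter.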
[jira] [Commented] (YARN-2859) ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster
[ https://issues.apache.org/jira/browse/YARN-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14211328#comment-14211328 ] Zhijie Shen commented on YARN-2859: --- Binding the default port is not right for MiniYARNCluster. Will fix the problem. > ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster > -- > > Key: YARN-2859 > URL: https://issues.apache.org/jira/browse/YARN-2859 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Hitesh Shah >Priority: Critical > > In mini cluster, a random port should be used. > Also, the config is not updated to the host that the process got bound to. > {code} > 2014-11-13 13:07:01,905 INFO [main] server.MiniYARNCluster > (MiniYARNCluster.java:serviceStart(722)) - MiniYARN ApplicationHistoryServer > address: localhost:10200 > 2014-11-13 13:07:01,905 INFO [main] server.MiniYARNCluster > (MiniYARNCluster.java:serviceStart(724)) - MiniYARN ApplicationHistoryServer > web address: 0.0.0.0:8188 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
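The usual fix for the mini-cluster problem above is to bind to port 0 so the OS assigns a free ephemeral port, then read the actual port back and publish it into the config. A small sketch, using a plain ServerSocket as a stand-in for the real web/RPC server:

```java
import java.io.IOException;
import java.net.ServerSocket;

public class EphemeralPortBind {

    // Bind to port 0 (OS-assigned) and return the port that was actually
    // bound -- this is the value a mini cluster should write back into its
    // configuration instead of the default 8188.
    static int bindEphemeral() throws IOException {
        try (ServerSocket s = new ServerSocket(0)) {
            return s.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bound to ephemeral port " + bindEphemeral());
    }
}
```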
[jira] [Assigned] (YARN-2859) ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster
[ https://issues.apache.org/jira/browse/YARN-2859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2859: - Assignee: Zhijie Shen > ApplicationHistoryServer binds to default port 8188 in MiniYARNCluster > -- > > Key: YARN-2859 > URL: https://issues.apache.org/jira/browse/YARN-2859 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Hitesh Shah >Assignee: Zhijie Shen >Priority: Critical > > In mini cluster, a random port should be used. > Also, the config is not updated to the host that the process got bound to. > {code} > 2014-11-13 13:07:01,905 INFO [main] server.MiniYARNCluster > (MiniYARNCluster.java:serviceStart(722)) - MiniYARN ApplicationHistoryServer > address: localhost:10200 > 2014-11-13 13:07:01,905 INFO [main] server.MiniYARNCluster > (MiniYARNCluster.java:serviceStart(724)) - MiniYARN ApplicationHistoryServer > web address: 0.0.0.0:8188 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210199#comment-14210199 ] Zhijie Shen edited comment on YARN-2838 at 11/13/14 7:34 PM: - bq. 1. Whatever the CLI command user executes is historyserver or timelineserver it looks like ApplicationHistoryServer only run. So can we modify the name of the class ApplicationHistoryServer to TimelineHistoryServer (or any other suitable name as it seems like any command user runs ApplicationHistoryServer is started) Yes, not just the main entry point class, but the whole sub-module needs to be refactored somehow to reflect the generalized concept (YARN-2043). bq. 2. Instead of the "Starting the History Server anyway..." deprecated msg, can we have "Starting the Timeline History Server anyway...". bq. 3. Based on start or stop, deprecated message should get modified to "Starting the Timeline History Server anyway..." or "Stopping the Timeline History Server anyway..." See the comment before. bq. But any way we need to fix this issue also right ? so already any jira is raised or shall i work as part of this jira ? See YARN-2522. We can work on this issue there. bq. And also please inform if this issue needs to be split into multiple jiras (apart from documentation which you have already raised) would like to split and work on them. If you agree, we can close this Jira and work on separate Jiras, each focusing on an individual issue. bq. As already i have started looking into these issues, was also planning to work on documentation. If you don't mind can you assign the issue (YARN-2854) to me ? No problem, assigned it to you. > Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0, 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2838: -- Affects Version/s: 2.6.0 > Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0, 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14210199#comment-14210199 ] Zhijie Shen commented on YARN-2838: --- bq. 1. Whatever the CLI command user executes is historyserver or timelineserver it looks like ApplicationHistoryServer only run. So can we modify the name of the class ApplicationHistoryServer to TimelineHistoryServer (or any other suitable name as it seems like any command user runs ApplicationHistoryServer is started) Yes, not just the main entry point class, but the whole sub-module needs to be refactored somehow to reflect the generalized concept (YARN-2043). bq. 2. Instead of the "Starting the History Server anyway..." deprecated msg, can we have "Starting the Timeline History Server anyway...". bq. 3. Based on start or stop, deprecated message should get modified to "Starting the Timeline History Server anyway..." or "Stopping the Timeline History Server anyway..." See the comment before. bq. But any way we need to fix this issue also right ? so already any jira is raised or shall i work as part of this jira ? See YARN-2522. We can work on this issue there. bq. And also please inform if this issue needs to be split into multiple jiras (apart from documentation which you have already raised) would like to split and work on them. If you agree, we can close this Jira and work on separate Jiras, each focusing on an individual issue. bq. As already i have started looking into these issues, was also planning to work on documentation. If you don't mind can you assign the issue (YARN-2854) to me ? No problem, assigned it to you. 
> Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2854) The document about timeline service and generic service needs to be updated
[ https://issues.apache.org/jira/browse/YARN-2854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2854: -- Assignee: Naganarasimha G R (was: Zhijie Shen) > The document about timeline service and generic service needs to be updated > --- > > Key: YARN-2854 > URL: https://issues.apache.org/jira/browse/YARN-2854 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Naganarasimha G R >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2794) Fix log msgs about distributing system-credentials
[ https://issues.apache.org/jira/browse/YARN-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208455#comment-14208455 ] Zhijie Shen commented on YARN-2794: --- Please ignore the previous comment. I missed that the existing code already has {{if (LOG.isDebugEnabled()) {}}. +1 Will commit this patch. > Fix log msgs about distributing system-credentials > --- > > Key: YARN-2794 > URL: https://issues.apache.org/jira/browse/YARN-2794 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2794.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2794) Fix log msgs about distributing system-credentials
[ https://issues.apache.org/jira/browse/YARN-2794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14208431#comment-14208431 ] Zhijie Shen commented on YARN-2794: --- Put this code in {{if (LOG.isDebugEnabled()) {}}? {code} + for (Map.Entry entry : map.entrySet()) { +LOG.debug("Retrieved credentials form RM for " + entry.getKey() + ": " ++ entry.getValue().getAllTokens()); + } {code} > Fix log msgs about distributing system-credentials > --- > > Key: YARN-2794 > URL: https://issues.apache.org/jira/browse/YARN-2794 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Jian He >Assignee: Jian He > Attachments: YARN-2794.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
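The pattern at issue in the two comments above -- wrapping per-entry debug logging in an {{isDebugEnabled()}} check so the loop and the string concatenation over every credential entry are skipped when debug logging is off -- can be sketched as below. The tiny Log class is a stand-in for commons-logging, and the method names are illustrative only.

```java
public class DebugGuard {

    // Stand-in for a commons-logging Log; counts how many debug messages
    // were actually built.
    static class Log {
        final boolean debugEnabled;
        int built = 0;
        Log(boolean debugEnabled) { this.debugEnabled = debugEnabled; }
        boolean isDebugEnabled() { return debugEnabled; }
        void debug(String msg) { built++; }
    }

    // The guarded pattern: when debug is off, neither the loop nor the
    // message concatenation runs.
    static int logSample(boolean debugEnabled, int entries) {
        Log log = new Log(debugEnabled);
        if (log.isDebugEnabled()) {
            for (int i = 0; i < entries; i++) {
                log.debug("Retrieved credentials from RM for app_" + i);
            }
        }
        return log.built;
    }

    public static void main(String[] args) {
        System.out.println(logSample(false, 3)); // guard skips all work
        System.out.println(logSample(true, 3));  // all entries logged
    }
}
```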
[jira] [Comment Edited] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206671#comment-14206671 ] Zhijie Shen edited comment on YARN-2838 at 11/12/14 12:44 AM: -- [~Naganarasimha], sorry for not responding you immediately as being busy on finalizing 2.6. A quick scan through your issue document. Here's my clarification: 1. While the entry point of the this sub-module is still called ApplicationHistoryServer, it is actually generalized to be TimelineServer right now (definitely we need to refactor the code at some point). The baseline service provided the the timeline server is to allow the cluster and its apps to store their history information, metrics and so on by complying with the defined timeline data model. Later on, users and admins can query this information to do the analysis. 2. Application history (or we prefer to call it generic history service) is now a built-in service in the timeline server to record the generic history information of YARN apps. It was on a separate store (on FS), but after YARN-2033, it has been moved to the timeline store too, as a payload. We replace the old storage layer, but keep the existing interfaces (web UI, services, CLI) not changed to be the analog of what RM provides for running apps. We still didn't integrate TimelineClient and AHSClient, the latter of which is RPC interface of getting generic history information via RPC interface. APPLICATION_HISTORY_ENABLED is the only remaining old config to control whether we also want to pull the app info from the generic history service inside the timeline server. You may want to take a look at YARN-2033 to get more context about the change. Moreover, as a number of limitation of the old history store, we're no longer going to support it. 3. The document is definitely staled. I'll file separate document Jira, however, it's too late for 2.6. 
Let's target 2.7 for an up-to-date document about the timeline service and its built-in generic history service (YARN-2854). Does it sound good? was (Author: zjshen): [~Naganarasimha], sorry for not responding to you immediately; I've been busy finalizing 2.6. I did a quick scan through your issue document. Here are my clarifications: 1. While the entry point of this sub-module is still called ApplicationHistoryServer, it has actually been generalized into the TimelineServer now (we definitely need to refactor the code at some point). The baseline service provided by the timeline server allows the cluster and its apps to store their history information, metrics and so on, complying with the defined timeline data model. Later on, users and admins can query this information for analysis. 2. Application history (or, as we prefer to call it, the generic history service) is now a built-in service in the timeline server that records the generic history information of YARN apps. It used to have a separate store (on FS), but after YARN-2033 it has been moved into the timeline store too, as a payload. We replaced the old storage layer but kept the existing interfaces (web UI, services, CLI) unchanged, as the analog of what the RM provides for running apps. We still haven't integrated TimelineClient and AHSClient, the latter of which is the RPC interface for getting generic history information. APPLICATION_HISTORY_ENABLED is the only remaining old config; it controls whether we also want to pull the app info from the generic history service inside the timeline server. You may want to take a look at YARN-2033 to get more context about the change. Moreover, because of a number of limitations of the old history store, we're no longer going to support it. 3. The document is definitely stale. I'll file a separate documentation Jira; however, it's too late for 2.6. Let's target 2.7 for an up-to-date document about the timeline service and its built-in generic history service. Does it sound good? 
> Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2854) The document about timeline service and generic service needs to be updated
Zhijie Shen created YARN-2854: - Summary: The document about timeline service and generic service needs to be updated Key: YARN-2854 URL: https://issues.apache.org/jira/browse/YARN-2854 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2838) Issues with TimeLineServer (Application History)
[ https://issues.apache.org/jira/browse/YARN-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206671#comment-14206671 ] Zhijie Shen commented on YARN-2838: --- [~Naganarasimha], sorry for not responding to you immediately; I've been busy finalizing 2.6. I did a quick scan through your issue document. Here are my clarifications: 1. While the entry point of this sub-module is still called ApplicationHistoryServer, it has actually been generalized into the TimelineServer now (we definitely need to refactor the code at some point). The baseline service provided by the timeline server allows the cluster and its apps to store their history information, metrics and so on, complying with the defined timeline data model. Later on, users and admins can query this information for analysis. 2. Application history (or, as we prefer to call it, the generic history service) is now a built-in service in the timeline server that records the generic history information of YARN apps. It used to have a separate store (on FS), but after YARN-2033 it has been moved into the timeline store too, as a payload. We replaced the old storage layer but kept the existing interfaces (web UI, services, CLI) unchanged, as the analog of what the RM provides for running apps. We still haven't integrated TimelineClient and AHSClient, the latter of which is the RPC interface for getting generic history information. APPLICATION_HISTORY_ENABLED is the only remaining old config; it controls whether we also want to pull the app info from the generic history service inside the timeline server. You may want to take a look at YARN-2033 to get more context about the change. Moreover, because of a number of limitations of the old history store, we're no longer going to support it. 3. The document is definitely stale. I'll file a separate documentation Jira; however, it's too late for 2.6. 
Let's target 2.7 for an up-to-date document about timeline service and its built-in generic history service. Does it sound good? > Issues with TimeLineServer (Application History) > > > Key: YARN-2838 > URL: https://issues.apache.org/jira/browse/YARN-2838 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.5.1 >Reporter: Naganarasimha G R >Assignee: Naganarasimha G R > Attachments: IssuesInTimelineServer.pdf > > > Few issues in usage of Timeline server for generic application history access -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
[ https://issues.apache.org/jira/browse/YARN-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205768#comment-14205768 ] Zhijie Shen commented on YARN-2837: --- Tested the patch on a single-node secure cluster: 1. Start and restart the timeline server, and the DT information is recovered properly. 2. The DT generated before the timeline server restart can be renewed properly afterwards. One other issue I've observed while testing: in the first few seconds after the http server is started, the MR job, which tries to emit the timeline data, gets a number of 404 errors. I guess the server is not fully ready before it takes incoming requests. > Timeline server needs to recover the timeline DT when restarting > > > Key: YARN-2837 > URL: https://issues.apache.org/jira/browse/YARN-2837 > Project: Hadoop YARN > Issue Type: New Feature > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2837.1.patch > > > Timeline server needs to recover the stateful information when restarting as > RM/NM/JHS does now. So far the stateful information only includes the > timeline DT. Without recovery, the timeline DT of the existing YARN apps is > no longer valid, and cannot be renewed any more after the timeline server is > restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
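The startup race noted in the comment above (404s until the web app is fully wired up) suggests a client-side retry with backoff. Below is a hypothetical Java sketch of that idea; the `Call` interface and numeric status codes are illustrative stand-ins, not the real TimelineClient API.

```java
// Hypothetical sketch: retry a request that returns 404 while the server
// is still starting up, instead of failing on the first response.
public class RetryOn404 {
    interface Call {
        int invoke(); // returns an HTTP-like status code
    }

    static int callWithRetry(Call call, int maxRetries, long sleepMs) {
        int status = call.invoke();
        for (int attempt = 0; attempt < maxRetries && status == 404; attempt++) {
            try {
                Thread.sleep(sleepMs); // back off before retrying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // give up if interrupted
            }
            status = call.invoke();
        }
        return status; // last observed status (404 if retries exhausted)
    }
}
```

With a cap on retries, a genuinely missing resource still surfaces as 404 instead of looping forever.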
[jira] [Updated] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
[ https://issues.apache.org/jira/browse/YARN-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2837: -- Attachment: (was: YARN-2834.1.patch) > Timeline server needs to recover the timeline DT when restarting > > > Key: YARN-2837 > URL: https://issues.apache.org/jira/browse/YARN-2837 > Project: Hadoop YARN > Issue Type: New Feature > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2837.1.patch > > > Timeline server needs to recover the stateful information when restarting as > RM/NM/JHS does now. So far the stateful information only includes the > timeline DT. Without recovery, the timeline DT of the existing YARN apps is > not long valid, and cannot be renewed any more after the timeline server is > restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
[ https://issues.apache.org/jira/browse/YARN-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2837: -- Attachment: YARN-2837.1.patch > Timeline server needs to recover the timeline DT when restarting > > > Key: YARN-2837 > URL: https://issues.apache.org/jira/browse/YARN-2837 > Project: Hadoop YARN > Issue Type: New Feature > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2837.1.patch > > > Timeline server needs to recover the stateful information when restarting as > RM/NM/JHS does now. So far the stateful information only includes the > timeline DT. Without recovery, the timeline DT of the existing YARN apps is > not long valid, and cannot be renewed any more after the timeline server is > restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
[ https://issues.apache.org/jira/browse/YARN-2837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2837: -- Attachment: YARN-2834.1.patch Created a patch to implement the timeline state store. I chose the leveldb impl because: 1. The timeline server already uses leveldb. 2. It provides atomic operations and isolates us from the system-dependent FS. 3. It is less heavyweight and complex than using HDFS (in particular in secure mode). 4. The operations are easy to implement. > Timeline server needs to recover the timeline DT when restarting > > > Key: YARN-2837 > URL: https://issues.apache.org/jira/browse/YARN-2837 > Project: Hadoop YARN > Issue Type: New Feature > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2834.1.patch > > > Timeline server needs to recover the stateful information when restarting as > RM/NM/JHS does now. So far the stateful information only includes the > timeline DT. Without recovery, the timeline DT of the existing YARN apps is > no longer valid, and cannot be renewed any more after the timeline server is > restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2837) Timeline server needs to recover the timeline DT when restarting
Zhijie Shen created YARN-2837: - Summary: Timeline server needs to recover the timeline DT when restarting Key: YARN-2837 URL: https://issues.apache.org/jira/browse/YARN-2837 Project: Hadoop YARN Issue Type: New Feature Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Timeline server needs to recover the stateful information when restarting as RM/NM/JHS does now. So far the stateful information only includes the timeline DT. Without recovery, the timeline DT of the existing YARN apps is not long valid, and cannot be renewed any more after the timeline server is restarted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2834) Resource manager crashed with Null Pointer Exception
[ https://issues.apache.org/jira/browse/YARN-2834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14204132#comment-14204132 ] Zhijie Shen commented on YARN-2834: --- bq. Anyways, treating renewal failures is broken today. I am okay ignoring renewal failures during recovery in this ticket. But let's file a blocker for handling them correctly in 2.7. Thanks for your comments. +1 for this proposal. > Resource manager crashed with Null Pointer Exception > > > Key: YARN-2834 > URL: https://issues.apache.org/jira/browse/YARN-2834 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Jian He >Priority: Critical > Attachments: YARN-2834.1.patch > > > Resource manager failed after restart. > {noformat} > 2014-11-09 04:12:53,013 INFO capacity.CapacityScheduler > (CapacityScheduler.java:initializeQueues(467)) - Initialized root queue root: > numChildQueue= 2, capacity=1.0, absoluteCapacity=1.0, > usedResources=usedCapacity=0.0, numApps=0, numContainers=0 > 2014-11-09 04:12:53,013 INFO capacity.CapacityScheduler > (CapacityScheduler.java:initializeQueueMappings(436)) - Initialized queue > mappings, override: false > 2014-11-09 04:12:53,013 INFO capacity.CapacityScheduler > (CapacityScheduler.java:initScheduler(305)) - Initialized CapacityScheduler > with calculator=class > org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, > minimumAllocation=<>, maximumAllocation=< vCores:32>>, asynchronousScheduling=false, asyncScheduleInterval=5ms > 2014-11-09 04:12:53,015 INFO service.AbstractService > (AbstractService.java:noteFailure(272)) - Service ResourceManager failed in > state STARTED; cause: java.lang.NullPointerException > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:734) > at > 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1089) > at > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:114) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1041) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1005) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:757) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recoverAppAttempts(RMAppImpl.java:821) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.access$1900(RMAppImpl.java:101) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:843) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:826) > at > org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) > at > org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) > at > org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at > 
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at > org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:701) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:312) > at > org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:413) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1207) > at > org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:590) > at > org.apache.hadoop.service.Abstra
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203215#comment-14203215 ] Zhijie Shen commented on YARN-2505: --- Committed to trunk, branch-2 and branch-2.6. Thanks Craig for the patch, and Wangda and Xuan for the review. > Support get/add/remove/change labels in RM REST API > --- > > Key: YARN-2505 > URL: https://issues.apache.org/jira/browse/YARN-2505 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Craig Welch > Fix For: 2.6.0 > > Attachments: YARN-2505.1.patch, YARN-2505.11.patch, > YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, > YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, > YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.19.patch, > YARN-2505.20.patch, YARN-2505.21.patch, YARN-2505.21.patch, > YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, > YARN-2505.7.patch, YARN-2505.8.patch, YARN-2505.9.patch, YARN-2505.9.patch, > YARN-2505.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203195#comment-14203195 ] Zhijie Shen commented on YARN-2505: --- Kick the Jenkins again > Support get/add/remove/change labels in RM REST API > --- > > Key: YARN-2505 > URL: https://issues.apache.org/jira/browse/YARN-2505 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Craig Welch > Attachments: YARN-2505.1.patch, YARN-2505.11.patch, > YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, > YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, > YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.19.patch, > YARN-2505.20.patch, YARN-2505.21.patch, YARN-2505.21.patch, > YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, > YARN-2505.7.patch, YARN-2505.8.patch, YARN-2505.9.patch, YARN-2505.9.patch, > YARN-2505.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203116#comment-14203116 ] Zhijie Shen commented on YARN-2505: --- +1 for the latest patch. I'll commit it once jenkins +1 too > Support get/add/remove/change labels in RM REST API > --- > > Key: YARN-2505 > URL: https://issues.apache.org/jira/browse/YARN-2505 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Craig Welch > Attachments: YARN-2505.1.patch, YARN-2505.11.patch, > YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, > YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, > YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.19.patch, > YARN-2505.20.patch, YARN-2505.21.patch, YARN-2505.3.patch, YARN-2505.4.patch, > YARN-2505.5.patch, YARN-2505.6.patch, YARN-2505.7.patch, YARN-2505.8.patch, > YARN-2505.9.patch, YARN-2505.9.patch, YARN-2505.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2505) Support get/add/remove/change labels in RM REST API
[ https://issues.apache.org/jira/browse/YARN-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202992#comment-14202992 ] Zhijie Shen commented on YARN-2505: --- I just have one concern about ConverterUtils#toNodeId(). Its behavior changes when the nodeId string argument is invalid. This may affect NodeCLI and AggregatedLogsBlock when someone passes an invalid nodeId string, or when the webapp generates a URL with an invalid nodeId string. BTW, while ConverterUtils is marked \@Private, it lives in yarn-common. I'm not sure whether other components have already made use of these actually useful "APIs". Any thoughts? > Support get/add/remove/change labels in RM REST API > --- > > Key: YARN-2505 > URL: https://issues.apache.org/jira/browse/YARN-2505 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Craig Welch > Attachments: YARN-2505.1.patch, YARN-2505.11.patch, > YARN-2505.12.patch, YARN-2505.13.patch, YARN-2505.14.patch, > YARN-2505.15.patch, YARN-2505.16.patch, YARN-2505.16.patch, > YARN-2505.16.patch, YARN-2505.18.patch, YARN-2505.19.patch, > YARN-2505.3.patch, YARN-2505.4.patch, YARN-2505.5.patch, YARN-2505.6.patch, > YARN-2505.7.patch, YARN-2505.8.patch, YARN-2505.9.patch, YARN-2505.9.patch, > YARN-2505.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
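The behavior change discussed in the comment above is strict validation of a "host:port" node id. The sketch below is an illustrative stand-in for that kind of check, not the real ConverterUtils#toNodeId from yarn-common; the class and method names are hypothetical.

```java
// Illustrative sketch of strict "host:port" NodeId validation: fail fast
// with IllegalArgumentException on a malformed id, which is exactly the
// case callers like a CLI or web app must now handle.
public class NodeIdCheck {
    static String[] toNodeId(String nodeIdStr) {
        int sep = nodeIdStr.lastIndexOf(':');
        if (sep <= 0 || sep == nodeIdStr.length() - 1) {
            // No separator, empty host, or empty port.
            throw new IllegalArgumentException(
                "Invalid NodeId [" + nodeIdStr + "]. Expected host:port");
        }
        String host = nodeIdStr.substring(0, sep);
        String port = nodeIdStr.substring(sep + 1);
        if (!port.chars().allMatch(Character::isDigit)) {
            throw new IllegalArgumentException(
                "Invalid NodeId [" + nodeIdStr + "]. Port must be numeric");
        }
        return new String[] { host, port };
    }
}
```

A caller such as a log-view page would catch the exception and render an error rather than crash on a garbage node id from the URL.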
[jira] [Commented] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
[ https://issues.apache.org/jira/browse/YARN-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201602#comment-14201602 ] Zhijie Shen commented on YARN-2819: --- I've done the following experiments locally: 1. Run timeline server 2.5 to generate the old timeline data without domainId field in leveldb. 2. Run timeline server 2.6 (current trunk actually), and try to update the old entity and relate to it. I can reproduce the same NPE as is mentioned in the description. 3. Run timeline server 2.6 with the attached patch, and try to update the old entity and relate to it. The problem is gone. > NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 > -- > > Key: YARN-2819 > URL: https://issues.apache.org/jira/browse/YARN-2819 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Gopal V >Assignee: Zhijie Shen >Priority: Critical > Labels: Upgrade > Attachments: YARN-2819.1.patch > > > {code} > Caused by: java.lang.NullPointerException > at java.lang.String.(String.java:554) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) > at > org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) > {code} > triggered by > {code} > entity.getRelatedEntities(); > ... > } else { > byte[] domainIdBytes = db.get(createDomainIdKey( > relatedEntityId, relatedEntityType, > relatedEntityStartTime)); > // This is the existing entity > String domainId = new String(domainIdBytes); > if (!domainId.equals(entity.getDomainId())) { > {code} > The new String(domainIdBytes); throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
[ https://issues.apache.org/jira/browse/YARN-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2819: -- Attachment: YARN-2819.1.patch Created a patch to make the leveldb store compatible with existing data. Basically, we're going to treat an entity without a domain Id as one having the DEFAULT domain. > NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 > -- > > Key: YARN-2819 > URL: https://issues.apache.org/jira/browse/YARN-2819 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Gopal V >Assignee: Zhijie Shen >Priority: Critical > Labels: Upgrade > Attachments: YARN-2819.1.patch > > > {code} > Caused by: java.lang.NullPointerException > at java.lang.String.<init>(String.java:554) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) > at > org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) > {code} > triggered by > {code} > entity.getRelatedEntities(); > ... > } else { > byte[] domainIdBytes = db.get(createDomainIdKey( > relatedEntityId, relatedEntityType, > relatedEntityStartTime)); > // This is the existing entity > String domainId = new String(domainIdBytes); > if (!domainId.equals(entity.getDomainId())) { > {code} > The new String(domainIdBytes); throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
[ https://issues.apache.org/jira/browse/YARN-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200687#comment-14200687 ] Zhijie Shen commented on YARN-2819: --- The NPE happens because the data-integrity logic assumes that no entity has a null domainId. However, if leveldb already contains timeline data generated by a pre-2.6 timeline server, that assumption is broken: previously, entities didn't carry domain information at all. Will work on a fix to be compatible with the existing store. > NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 > -- > > Key: YARN-2819 > URL: https://issues.apache.org/jira/browse/YARN-2819 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Gopal V >Assignee: Zhijie Shen > Labels: Upgrade > > {code} > Caused by: java.lang.NullPointerException > at java.lang.String.<init>(String.java:554) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) > at > org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) > {code} > triggered by > {code} > entity.getRelatedEntities(); > ... > } else { > byte[] domainIdBytes = db.get(createDomainIdKey( > relatedEntityId, relatedEntityType, > relatedEntityStartTime)); > // This is the existing entity > String domainId = new String(domainIdBytes); > if (!domainId.equals(entity.getDomainId())) { > {code} > The new String(domainIdBytes); throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
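The compatibility behavior described above (treat an entity written before the domain feature as belonging to a default domain) can be sketched as follows. This is an assumed simplification, not the actual YARN-2819 patch; the class name and the DEFAULT constant value are hypothetical stand-ins.

```java
// Minimal sketch: entities written by a pre-2.6 store have no domain id
// recorded, so the leveldb lookup returns null. Fall back to a default
// domain instead of calling new String(null), which throws the NPE above.
public class DomainIdCompat {
    // Stand-in for the server's default domain id constant.
    static final String DEFAULT_DOMAIN_ID = "DEFAULT";

    static String readDomainId(byte[] domainIdBytes) {
        if (domainIdBytes == null) {
            return DEFAULT_DOMAIN_ID; // entity predates the domain feature
        }
        return new String(domainIdBytes, java.nio.charset.StandardCharsets.UTF_8);
    }
}
```

The domain comparison at the put path then works uniformly for old and new entities, since it always gets a non-null domain id back.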
[jira] [Updated] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
[ https://issues.apache.org/jira/browse/YARN-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2819: -- Priority: Critical (was: Major) > NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 > -- > > Key: YARN-2819 > URL: https://issues.apache.org/jira/browse/YARN-2819 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Gopal V >Assignee: Zhijie Shen >Priority: Critical > Labels: Upgrade > > {code} > Caused by: java.lang.NullPointerException > at java.lang.String.(String.java:554) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) > at > org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) > {code} > triggered by > {code} > entity.getRelatedEntities(); > ... > } else { > byte[] domainIdBytes = db.get(createDomainIdKey( > relatedEntityId, relatedEntityType, > relatedEntityStartTime)); > // This is the existing entity > String domainId = new String(domainIdBytes); > if (!domainId.equals(entity.getDomainId())) { > {code} > The new String(domainIdBytes); throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2819) NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6
[ https://issues.apache.org/jira/browse/YARN-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-2819: - Assignee: Zhijie Shen > NPE in ATS Timeline Domains when upgrading from 2.4 to 2.6 > -- > > Key: YARN-2819 > URL: https://issues.apache.org/jira/browse/YARN-2819 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Affects Versions: 2.6.0 >Reporter: Gopal V >Assignee: Zhijie Shen > Labels: Upgrade > > {code} > Caused by: java.lang.NullPointerException > at java.lang.String.(String.java:554) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:873) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.put(LeveldbTimelineStore.java:1014) > at > org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:330) > at > org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:260) > {code} > triggered by > {code} > entity.getRelatedEntities(); > ... > } else { > byte[] domainIdBytes = db.get(createDomainIdKey( > relatedEntityId, relatedEntityType, > relatedEntityStartTime)); > // This is the existing entity > String domainId = new String(domainIdBytes); > if (!domainId.equals(entity.getDomainId())) { > {code} > The new String(domainIdBytes); throws an NPE. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2818: -- Attachment: YARN-2818.2.patch Remove one more unnecessary method. > Remove the logic to inject entity owner as the primary filter > - > > Key: YARN-2818 > URL: https://issues.apache.org/jira/browse/YARN-2818 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > Attachments: YARN-2818.1.patch, YARN-2818.2.patch > > > In 2.5, we inject owner info as a primary filter to support entity-level > acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, > there's no need to inject owner info. There're two motivations: > 1. For leveldb timeline store, the primary filter is expensive. When we have > a primary filter, we need to make a complete copy of the entity on the logic > index table. > 2. Owner info is incomplete. Say we want to put E1 (owner = "tester", > relatedEntity = "E2"). If E2 doesn't exist before, leveldb timeline store > will create an empty E2 without owner info (at the db point of view, it > doesn't know owner is a "special" primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2818) Remove the logic to inject entity owner as the primary filter
[ https://issues.apache.org/jira/browse/YARN-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2818: -- Attachment: YARN-2818.1.patch Put up a patch to remove this logic. The change should be mostly compatible: the 2.6 server can still read the data created by 2.5, but takes the owner as a normal primary filter; the 2.5 server can also read the 2.6 data. The only drawback is that no owner info is available for entity-level acl control. However, as I've mentioned in the description, the owner info would be incomplete anyway, so there's a bug either way. > Remove the logic to inject entity owner as the primary filter > - > > Key: YARN-2818 > URL: https://issues.apache.org/jira/browse/YARN-2818 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > Attachments: YARN-2818.1.patch > > > In 2.5, we inject owner info as a primary filter to support entity-level > acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, > there's no need to inject owner info. There're two motivations: > 1. For leveldb timeline store, the primary filter is expensive. When we have > a primary filter, we need to make a complete copy of the entity on the logic > index table. > 2. Owner info is incomplete. Say we want to put E1 (owner = "tester", > relatedEntity = "E2"). If E2 doesn't exist before, leveldb timeline store > will create an empty E2 without owner info (at the db point of view, it > doesn't know owner is a "special" primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2818) Remove the logic to inject entity owner as the primary filter
Zhijie Shen created YARN-2818: - Summary: Remove the logic to inject entity owner as the primary filter Key: YARN-2818 URL: https://issues.apache.org/jira/browse/YARN-2818 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical In 2.5, we inject owner info as a primary filter to support entity-level acls. Since 2.6, we have a different acls solution (YARN-2102). Therefore, there's no need to inject owner info. There are two motivations: 1. For the leveldb timeline store, the primary filter is expensive: when we have a primary filter, we need to make a complete copy of the entity in the logical index table. 2. Owner info is incomplete. Say we want to put E1 (owner = "tester", relatedEntity = "E2"). If E2 doesn't exist yet, the leveldb timeline store will create an empty E2 without owner info (from the db's point of view, it doesn't know owner is a "special" primary filter). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
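The leveldb cost in motivation 1 above can be sketched in miniature. This is a hypothetical illustration, not the actual LeveldbTimelineStore code: each (primary filter, value) pair indexes a complete copy of the entity, so injecting the owner as an extra filter duplicates every entity once more.

```java
import java.util.HashMap;
import java.util.Map;

class PrimaryFilterIndexSketch {
    // key: filterName + "=" + filterValue + "/" + entityId -> full entity copy
    final Map<String, Map<String, Object>> index = new HashMap<>();

    void put(String entityId, Map<String, Object> entity,
             Map<String, String> primaryFilters) {
        for (Map.Entry<String, String> f : primaryFilters.entrySet()) {
            // One complete copy of the entity per primary filter value.
            index.put(f.getKey() + "=" + f.getValue() + "/" + entityId,
                      new HashMap<>(entity));
        }
    }
}
```

With an injected "owner" filter, every put pays for one extra full copy on top of the filters the caller asked for.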
[jira] [Updated] (YARN-2813) NPE from MemoryTimelineStore.getDomains
[ https://issues.apache.org/jira/browse/YARN-2813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2813: -- Attachment: YARN-2813.1.patch Upload a patch to fix the npe > NPE from MemoryTimelineStore.getDomains > --- > > Key: YARN-2813 > URL: https://issues.apache.org/jira/browse/YARN-2813 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2813.1.patch > > > {code} > 2014-11-04 20:50:05,146 WARN > org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR > javax.ws.rs.WebApplicationException: java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getDomains(TimelineWebServices.java:356) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) > at > com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) > at > com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > 
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) > at > com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) > at > com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) > at > com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) > at > com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) > at > com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) > at > com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) > at > com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572) > at > 
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:269) > at > org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:542) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1204) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > at > org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) >
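The patch itself isn't reproduced in this digest, but the defensive pattern for this kind of NPE can be sketched. This is a hypothetical illustration, not the actual MemoryTimelineStore code: return an empty collection instead of dereferencing a map entry that may be missing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MemoryDomainStoreSketch {
    private final Map<String, List<String>> domainsByOwner = new HashMap<>();

    void put(String owner, String domainId) {
        domainsByOwner.computeIfAbsent(owner, k -> new ArrayList<>()).add(domainId);
    }

    // Guard against a missing owner entry instead of returning null
    // (or letting the caller dereference null).
    List<String> getDomains(String owner) {
        List<String> domains = domainsByOwner.get(owner);
        return domains == null ? Collections.emptyList() : domains;
    }
}
```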
[jira] [Created] (YARN-2813) NPE from MemoryTimelineStore.getDomains
Zhijie Shen created YARN-2813: - Summary: NPE from MemoryTimelineStore.getDomains Key: YARN-2813 URL: https://issues.apache.org/jira/browse/YARN-2813 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen {code} 2014-11-04 20:50:05,146 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR javax.ws.rs.WebApplicationException: java.lang.NullPointerException at org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.getDomains(TimelineWebServices.java:356) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60) at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185) at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at 
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416) at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834) at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795) at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572) at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:269) at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:542) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1204) at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle
[jira] [Updated] (YARN-2812) TestApplicationHistoryServer is likely to fail on less powerful machine
[ https://issues.apache.org/jira/browse/YARN-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2812: -- Attachment: YARN-2812.1.patch Upload a patch to fix the issue. > TestApplicationHistoryServer is likely to fail on less powerful machine > --- > > Key: YARN-2812 > URL: https://issues.apache.org/jira/browse/YARN-2812 > Project: Hadoop YARN > Issue Type: Test > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > Attachments: YARN-2812.1.patch > > > {code:title=testFilteOverrides} > java.lang.Exception: test timed out after 5 milliseconds > at java.net.Inet4AddressImpl.getHostByAddr(Native Method) > at java.net.InetAddress$1.getHostByAddr(InetAddress.java:898) > at java.net.InetAddress.getHostFromNameService(InetAddress.java:583) > at java.net.InetAddress.getHostName(InetAddress.java:525) > at java.net.InetAddress.getHostName(InetAddress.java:497) > at > java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) > at > java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) > at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) > at > org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) > at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.serviceStart(ApplicationHistoryClientService.java:87) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) > at > 
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceStart(ApplicationHistoryServer.java:111) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testFilteOverrides(TestApplicationHistoryServer.java:104) > {code} > {code:title=testStartStopServer, testLaunch} > org.apache.hadoop.service.ServiceStateException: > org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock > /grid/0/jenkins/workspace/UT-hadoop-champlain-chunks/workspace/UT-hadoop-champlain-chunks/commonarea/hdp-BUILDS/hadoop-2.6.0.2.2.0.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/build/test/yarn/timeline/leveldb-timeline-store.ldb/LOCK: > already held by process > at > org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) > at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) > at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:219) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:99) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testStartStopServer(TestApplicationHistoryServer.java:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2812) TestApplicationHistoryServer is likely to fail on less powerful machine
[ https://issues.apache.org/jira/browse/YARN-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199075#comment-14199075 ] Zhijie Shen commented on YARN-2812: --- The root causes of the test failures are: 1. testFilteOverrides actually starts and stops the server 4 times, but its timeout allowance is similar to that of the other cases, which do it only once. That seems too short for a slow machine. 2. When testFilteOverrides times out, the lock on the leveldb dir is not released, so the other two cases, which access the same dir (by default), hit the lock exception. Will fix the test failures. > TestApplicationHistoryServer is likely to fail on less powerful machine > --- > > Key: YARN-2812 > URL: https://issues.apache.org/jira/browse/YARN-2812 > Project: Hadoop YARN > Issue Type: Test > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen > > {code:title=testFilteOverrides} > java.lang.Exception: test timed out after 5 milliseconds > at java.net.Inet4AddressImpl.getHostByAddr(Native Method) > at java.net.InetAddress$1.getHostByAddr(InetAddress.java:898) > at java.net.InetAddress.getHostFromNameService(InetAddress.java:583) > at java.net.InetAddress.getHostName(InetAddress.java:525) > at java.net.InetAddress.getHostName(InetAddress.java:497) > at > java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) > at > java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) > at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) > at > org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) > at > org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) > at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) 
> at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.serviceStart(ApplicationHistoryClientService.java:87) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceStart(ApplicationHistoryServer.java:111) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testFilteOverrides(TestApplicationHistoryServer.java:104) > {code} > {code:title=testStartStopServer, testLaunch} > org.apache.hadoop.service.ServiceStateException: > org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock > /grid/0/jenkins/workspace/UT-hadoop-champlain-chunks/workspace/UT-hadoop-champlain-chunks/commonarea/hdp-BUILDS/hadoop-2.6.0.2.2.0.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/build/test/yarn/timeline/leveldb-timeline-store.ldb/LOCK: > already held by process > at > org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) > at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) > at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) > at > org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:219) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) > at > org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:99) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > 
org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testStartStopServer(TestApplicationHistoryServer.java:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
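The lock contention described in the comment above can be reproduced in miniature. This is a hedged sketch, not the actual test code: like leveldb's LOCK file, a lock held on a file in a shared directory makes a second open of the same directory fail, while a separate directory per test avoids the clash entirely.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class StoreLockSketch {
    static Path tempDir(String prefix) {
        try {
            return Files.createTempDirectory(prefix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Try to take an exclusive lock on dir/LOCK, the way leveldb does on open.
    // The channel is intentionally left open so a successful lock stays held.
    static boolean tryLock(Path dir) {
        try {
            Files.createDirectories(dir);
            FileChannel ch = FileChannel.open(dir.resolve("LOCK"),
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            return ch.tryLock() != null;   // null: held by another process
        } catch (OverlappingFileLockException e) {
            return false;                  // already held within this JVM
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Pointing each test case at its own directory (instead of the shared default) is what removes the "already held by process" failure.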
[jira] [Created] (YARN-2812) TestApplicationHistoryServer is likely to fail on less powerful machine
Zhijie Shen created YARN-2812: - Summary: TestApplicationHistoryServer is likely to fail on less powerful machine Key: YARN-2812 URL: https://issues.apache.org/jira/browse/YARN-2812 Project: Hadoop YARN Issue Type: Test Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen {code:title=testFilteOverrides} java.lang.Exception: test timed out after 5 milliseconds at java.net.Inet4AddressImpl.getHostByAddr(Native Method) at java.net.InetAddress$1.getHostByAddr(InetAddress.java:898) at java.net.InetAddress.getHostFromNameService(InetAddress.java:583) at java.net.InetAddress.getHostName(InetAddress.java:525) at java.net.InetAddress.getHostName(InetAddress.java:497) at java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82) at java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56) at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.createServer(RpcServerFactoryPBImpl.java:169) at org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl.getServer(RpcServerFactoryPBImpl.java:132) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.getServer(HadoopYarnProtoRPC.java:65) at org.apache.hadoop.yarn.ipc.YarnRPC.getServer(YarnRPC.java:54) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.serviceStart(ApplicationHistoryClientService.java:87) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceStart(ApplicationHistoryServer.java:111) at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testFilteOverrides(TestApplicationHistoryServer.java:104) {code} 
{code:title=testStartStopServer, testLaunch} org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /grid/0/jenkins/workspace/UT-hadoop-champlain-chunks/workspace/UT-hadoop-champlain-chunks/commonarea/hdp-BUILDS/hadoop-2.6.0.2.2.0.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/build/test/yarn/timeline/leveldb-timeline-store.ldb/LOCK: already held by process at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200) at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218) at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168) at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:219) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:99) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.applicationhistoryservice.TestApplicationHistoryServer.testStartStopServer(TestApplicationHistoryServer.java:48) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2767) RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode
[ https://issues.apache.org/jira/browse/YARN-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198846#comment-14198846 ] Zhijie Shen commented on YARN-2767: --- +1 will commit the patch. > RM web services - add test case to ensure the http static user cannot kill or > submit apps in secure mode > > > Key: YARN-2767 > URL: https://issues.apache.org/jira/browse/YARN-2767 > Project: Hadoop YARN > Issue Type: Test > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2767.0.patch, apache-yarn-2767.1.patch, > apache-yarn-2767.2.patch, apache-yarn-2767.3.patch > > > We should add a test to ensure that the http static user used to access the > RM web interface can't submit or kill apps if the cluster is running in > secure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2506) TimelineClient should NOT be in yarn-common project
[ https://issues.apache.org/jira/browse/YARN-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197442#comment-14197442 ] Zhijie Shen commented on YARN-2506: --- Sure, I can take care of it. > TimelineClient should NOT be in yarn-common project > --- > > Key: YARN-2506 > URL: https://issues.apache.org/jira/browse/YARN-2506 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Vinod Kumar Vavilapalli >Assignee: Zhijie Shen >Priority: Critical > > YARN-2298 incorrectly moved TimelineClient to yarn-common project. It doesn't > belong there, we should move it back to yarn-client module. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2752) ContainerExecutor always appends "nice -n" in command on branch-2
[ https://issues.apache.org/jira/browse/YARN-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2752: -- Summary: ContainerExecutor always appends "nice -n" in command on branch-2 (was: TestContainerExecutor.testRunCommandNoPriority fails in branch-2) > ContainerExecutor always appends "nice -n" in command on branch-2 > > > Key: YARN-2752 > URL: https://issues.apache.org/jira/browse/YARN-2752 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Xuan Gong >Assignee: Xuan Gong >Priority: Critical > Fix For: 2.6.0 > > Attachments: YARN-2752.1-branch-2.patch, YARN-2752.2-branch-2.patch > > > TestContainerExecutor.testRunCommandNoPriority fails in branch-2. But it > passed in trunk. > The function code ContainerExecutor.getRunCommand() in trunk is different > from that in branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2752) TestContainerExecutor.testRunCommandNoPriority fails in branch-2
[ https://issues.apache.org/jira/browse/YARN-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197143#comment-14197143 ] Zhijie Shen commented on YARN-2752: --- +1. Jenkins doesn't run because the patch only applies to branch-2. I've verified it locally: it compiles and fixes the test failure. Will commit the patch. > TestContainerExecutor.testRunCommandNoPriority fails in branch-2 > > > Key: YARN-2752 > URL: https://issues.apache.org/jira/browse/YARN-2752 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Xuan Gong >Assignee: Xuan Gong >Priority: Critical > Attachments: YARN-2752.1-branch-2.patch, YARN-2752.2-branch-2.patch > > > TestContainerExecutor.testRunCommandNoPriority fails in branch-2. But it > passed in trunk. > The function code ContainerExecutor.getRunCommand() in trunk is different > from that in branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
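The branch-2 bug named in the YARN-2752 summary (always appending "nice -n") suggests the following shape. This is an illustrative sketch, not the actual ContainerExecutor.getRunCommand() code: the nice prefix should be added only when a priority adjustment is actually configured.

```java
import java.util.ArrayList;
import java.util.List;

class RunCommandSketch {
    // priorityAdjustment == null means "not configured": no nice prefix.
    // Appending "nice -n" unconditionally is the bug pattern being fixed.
    static List<String> getRunCommand(String command, Integer priorityAdjustment) {
        List<String> cmd = new ArrayList<>();
        if (priorityAdjustment != null) {
            cmd.add("nice");
            cmd.add("-n");
            cmd.add(priorityAdjustment.toString());
        }
        cmd.add("bash");
        cmd.add(command);
        return cmd;
    }
}
```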
[jira] [Commented] (YARN-2804) Timeline server .out log have JAXB binding exceptions and warnings.
[ https://issues.apache.org/jira/browse/YARN-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197087#comment-14197087 ] Zhijie Shen commented on YARN-2804: --- In case folks want to know the .out output afterwards, I posted it here: {code} core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited file size (blocks, -f) unlimited max locked memory (kbytes, -l) unlimited max memory size (kbytes, -m) unlimited open files (-n) 256 pipe size(512 bytes, -p) 1 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 709 virtual memory (kbytes, -v) unlimited Nov 04, 2014 2:32:55 PM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information. Nov 04, 2014 2:32:55 PM com.google.inject.servlet.InternalServletModule$BackwardsCompatibleServletContextProvider get WARNING: You are attempting to use a deprecated API (specifically, attempting to @Inject ServletContext inside an eagerly created singleton. While we allow this for backwards compatibility, be warned that this MAY have unexpected behavior if you have more than one injector (with ServletModule) running in the same JVM. Please consult the Guice documentation at http://code.google.com/p/google-guice/wiki/Servlets for more information. 
Nov 04, 2014 2:32:55 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register INFO: Registering org.apache.hadoop.yarn.webapp.YarnJacksonJaxbJsonProvider as a provider class Nov 04, 2014 2:32:55 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register INFO: Registering org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices as a root resource class Nov 04, 2014 2:32:55 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register INFO: Registering org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices as a root resource class Nov 04, 2014 2:32:55 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class Nov 04, 2014 2:32:55 PM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM' Nov 04, 2014 2:32:56 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton" Nov 04, 2014 2:32:56 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider INFO: Binding org.apache.hadoop.yarn.webapp.YarnJacksonJaxbJsonProvider to GuiceManagedComponentProvider with the scope "Singleton" Nov 04, 2014 2:32:56 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider INFO: Binding org.apache.hadoop.yarn.server.applicationhistoryservice.webapp.AHSWebServices to GuiceManagedComponentProvider with the scope "Singleton" Nov 04, 2014 2:32:56 PM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider INFO: Binding org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices to GuiceManagedComponentProvider with the scope "Singleton" {code} It WON'T increase 
with the number of RESTful requests. > Timeline server .out log have JAXB binding exceptions and warnings. > --- > > Key: YARN-2804 > URL: https://issues.apache.org/jira/browse/YARN-2804 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > Attachments: YARN-2804.1.patch, YARN-2804.2.patch > > > Unlike other daemon, timeline server binds JacksonJaxbJsonProvider to resolve > the resources. However, there are noises in .out log: > {code} > SEVERE: Failed to generate the schema for the JAX-B elements > com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of > IllegalAnnotationExceptions > java.util.Map is an interface, and JAXB can't handle interfaces. > this problem is related to the following location: > at java.util.Map > at public java.util.Map > org.apache.hadoop
[jira] [Updated] (YARN-2804) Timeline server .out log have JAXB binding exceptions and warnings.
[ https://issues.apache.org/jira/browse/YARN-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2804: -- Attachment: YARN-2804.2.patch Thanks for the comments. I've moved the logic to setters, and validated it on my local cluster too, and it still suppressed all the exceptions and warning logs in .out file. In addition, I added a test case to verify that the changed POJO setters/getters are working properly. > Timeline server .out log have JAXB binding exceptions and warnings. > --- > > Key: YARN-2804 > URL: https://issues.apache.org/jira/browse/YARN-2804 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > Attachments: YARN-2804.1.patch, YARN-2804.2.patch > > > Unlike other daemon, timeline server binds JacksonJaxbJsonProvider to resolve > the resources. However, there are noises in .out log: > {code} > SEVERE: Failed to generate the schema for the JAX-B elements > com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of > IllegalAnnotationExceptions > java.util.Map is an interface, and JAXB can't handle interfaces. > this problem is related to the following location: > at java.util.Map > at public java.util.Map > org.apache.hadoop.yarn.api.records.timeline.TimelineEvent.getEventInfo() > at org.apache.hadoop.yarn.api.records.timeline.TimelineEvent > at public java.util.List > org.apache.hadoop.yarn.api.records.timeline.TimelineEntity.getEvents() > at org.apache.hadoop.yarn.api.records.timeline.TimelineEntity > at public java.util.List > org.apache.hadoop.yarn.api.records.timeline.TimelineEntities.getEntities() > at org.apache.hadoop.yarn.api.records.timeline.TimelineEntities > java.util.Map does not have a no-arg default constructor. 
> this problem is related to the following location: > at java.util.Map > at public java.util.Map > org.apache.hadoop.yarn.api.records.timeline.TimelineEvent.getEventInfo() > at org.apache.hadoop.yarn.api.records.timeline.TimelineEvent > at public java.util.List > org.apache.hadoop.yarn.api.records.timeline.TimelineEntity.getEvents() > at org.apache.hadoop.yarn.api.records.timeline.TimelineEntity > at public java.util.List > org.apache.hadoop.yarn.api.records.timeline.TimelineEntities.getEntities() > at org.apache.hadoop.yarn.api.records.timeline.TimelineEntities > at > com.sun.xml.bind.v2.runtime.IllegalAnnotationsException$Builder.check(IllegalAnnotationsException.java:106) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.getTypeInfoSet(JAXBContextImpl.java:489) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:319) > at > com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170) > at > com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at javax.xml.bind.ContextFinder.newInstance(ContextFinder.java:248) > at javax.xml.bind.ContextFinder.newInstance(ContextFinder.java:235) > at javax.xml.bind.ContextFinder.find(ContextFinder.java:432) > at javax.xml.bind.JAXBContext.newInstance(JAXBContext.java:637) > at javax.xml.bind.JAXBContext.newInstance(JAXBContext.java:584) > at > com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.buildModelAndSchemas(WadlGeneratorJAXBGrammarGenerator.java:412) > at > com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.createExternalGrammar(WadlGeneratorJAXBGrammarGenerator.java:352) > at 
com.sun.jersey.server.wadl.WadlBuilder.generate(WadlBuilder.java:115) > at > com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:104) > at > com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:120) > at > com.sun.jersey.server.impl.wadl.WadlMethodFactory$WadlOptionsMethodDispatcher.dispatch(WadlMethodFactory.java:98) > at > com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) > at > com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) > at > com.sun.jersey.server.impl.uri.rules.RightHandPathRule.acc
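Both IllegalAnnotationExceptions in the trace above come from JAXB's inability to bind the java.util.Map interface, which has neither a concrete type nor a no-arg constructor. A minimal sketch of the direction the patch takes, per the update comment ("moved the logic to setters"): the bound property exposes a concrete HashMap, while the setter still accepts any Map for API compatibility. Class and field names here are illustrative, not copied from the patch, and in the real classes the getter would additionally carry a JAXB annotation such as @XmlElement:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative JAXB-friendly POJO: expose a concrete HashMap (which has a
// no-arg constructor) rather than the Map interface, and keep a public
// no-arg constructor on the POJO itself.
public class ExampleEvent {

  private HashMap<String, Object> eventInfo = new HashMap<String, Object>();

  public ExampleEvent() {
    // JAXB requires a public no-arg constructor.
  }

  // Return type is the concrete HashMap, not Map, so the JAXB provider never
  // has to instantiate an interface when unmarshalling.
  public HashMap<String, Object> getEventInfo() {
    return eventInfo;
  }

  // The setter still accepts any Map for Java API compatibility, copying the
  // contents into a concrete type -- the "logic moved to setters".
  public void setEventInfo(Map<String, Object> eventInfo) {
    this.eventInfo = eventInfo == null
        ? new HashMap<String, Object>()
        : new HashMap<String, Object>(eventInfo);
  }
}
```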
[jira] [Commented] (YARN-2767) RM web services - add test case to ensure the http static user cannot kill or submit apps in secure mode
[ https://issues.apache.org/jira/browse/YARN-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14196717#comment-14196717 ] Zhijie Shen commented on YARN-2767: --- Sorry for not raising it earlier, but I just noticed a nit: the test is using another class's name for its working directory, which may cause conflicts if the two test cases run simultaneously. {code} + private static final File testRootDir = new File("target", +TestRMWebServicesDelegationTokenAuthentication.class.getName() + "-root"); {code} > RM web services - add test case to ensure the http static user cannot kill or > submit apps in secure mode > > > Key: YARN-2767 > URL: https://issues.apache.org/jira/browse/YARN-2767 > Project: Hadoop YARN > Issue Type: Test > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2767.0.patch, apache-yarn-2767.1.patch, > apache-yarn-2767.2.patch > > > We should add a test to ensure that the http static user used to access the > RM web interface can't submit or kill apps if the cluster is running in > secure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
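The fix for the nit above is mechanical: derive the scratch directory from the test's own class, so two concurrently running test classes can never collide on the same path. A minimal sketch of the pattern (the demo class name here is made up; in the actual patch it would be the enclosing test class):

```java
import java.io.File;

public class TestDirDemo {
  // Derive the scratch dir from this class's own name so that concurrently
  // running test classes cannot collide on the same path.
  static final File TEST_ROOT_DIR =
      new File("target", TestDirDemo.class.getName() + "-root");

  public static void main(String[] args) {
    System.out.println(TEST_ROOT_DIR.getPath());
  }
}
```

The same pattern applies anywhere a test creates on-disk state: keying the path off `getClass().getName()` of the owning test, not a neighboring one, keeps parallel surefire forks isolated.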
[jira] [Updated] (YARN-2804) Timeline server .out log have JAXB binding exceptions and warnings.
[ https://issues.apache.org/jira/browse/YARN-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2804: -- Attachment: YARN-2804.1.patch In the patch, I made a compromise when changing TimelineEntity and TimelineEvent, to keep the Java API compatible as well as to satisfy JAXB. For the put-domain response, I changed it to return an empty TimelinePutResponse instead of using a Jersey Response. After these changes, the exceptions and the warnings are gone from .out. > Timeline server .out log have JAXB binding exceptions and warnings. > --- > > Key: YARN-2804 > URL: https://issues.apache.org/jira/browse/YARN-2804 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > Attachments: YARN-2804.1.patch > > > Unlike other daemon, timeline server binds JacksonJaxbJsonProvider to resolve > the resources. However, there are noises in .out log: > {code} > SEVERE: Failed to generate the schema for the JAX-B elements > com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of > IllegalAnnotationExceptions > java.util.Map is an interface, and JAXB can't handle interfaces. > this problem is related to the following location: > at java.util.Map > at public java.util.Map > org.apache.hadoop.yarn.api.records.timeline.TimelineEvent.getEventInfo() > at org.apache.hadoop.yarn.api.records.timeline.TimelineEvent > at public java.util.List > org.apache.hadoop.yarn.api.records.timeline.TimelineEntity.getEvents() > at org.apache.hadoop.yarn.api.records.timeline.TimelineEntity > at public java.util.List > org.apache.hadoop.yarn.api.records.timeline.TimelineEntities.getEntities() > at org.apache.hadoop.yarn.api.records.timeline.TimelineEntities > java.util.Map does not have a no-arg default constructor. 
[jira] [Commented] (YARN-2804) Timeline server .out log have JAXB binding exceptions and warnings.
[ https://issues.apache.org/jira/browse/YARN-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195475#comment-14195475 ] Zhijie Shen commented on YARN-2804: --- If the map interface issue is resolved, another issue that didn't occur before will show up too: {code} java.lang.IllegalAccessException: Class com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator$8 can not access a member of class javax.ws.rs.core.Response with modifiers "protected" at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:65) at java.lang.Class.newInstance0(Class.java:349) at java.lang.Class.newInstance(Class.java:308) at com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator$8.resolve(WadlGeneratorJAXBGrammarGenerator.java:467) at com.sun.jersey.server.wadl.WadlGenerator$ExternalGrammarDefinition.resolve(WadlGenerator.java:181) at com.sun.jersey.server.wadl.ApplicationDescription.resolve(ApplicationDescription.java:81) at com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.attachTypes(WadlGeneratorJAXBGrammarGenerator.java:518) at com.sun.jersey.server.wadl.WadlBuilder.generate(WadlBuilder.java:124) at com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:104) at com.sun.jersey.server.impl.wadl.WadlResource.getWadl(WadlResource.java:89) {code} This needs to be fixed as well to completely avoid the excessive logging, though it would not be necessary if we upgraded Jersey (See [here|https://java.net/projects/jersey/lists/users/archive/2011-10/message/117]) > Timeline server .out log have JAXB binding exceptions and warnings. > --- > > Key: YARN-2804 > URL: https://issues.apache.org/jira/browse/YARN-2804 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > > Unlike other daemon, timeline server binds JacksonJaxbJsonProvider to resolve > the resources. 
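Short of upgrading Jersey, the most direct way to silence both this IllegalAccessException and the earlier JAXB schema errors is to turn WADL generation off entirely, since every frame in these traces sits under the WADL machinery. A sketch using Jersey 1.x's documented feature switch (whether YARN's web apps wire Jersey through init parameters this way is an assumption, not something taken from the patch):

```java
import java.util.HashMap;
import java.util.Map;

public class DisableWadlDemo {
  // Jersey 1.x property name (ResourceConfig.FEATURE_DISABLE_WADL). When set
  // to true as a servlet init parameter or ResourceConfig feature, the
  // WadlGeneratorJAXBGrammarGenerator is never invoked, so neither the JAXB
  // schema errors nor the reflection failure on javax.ws.rs.core.Response
  // can occur.
  static final String DISABLE_WADL = "com.sun.jersey.config.feature.DisableWADL";

  public static void main(String[] args) {
    // Illustrative init-parameter map, as it might be handed to the servlet.
    Map<String, String> initParams = new HashMap<String, String>();
    initParams.put(DISABLE_WADL, "true");
    System.out.println(initParams);
  }
}
```

The trade-off is losing the auto-generated WADL (and OPTIONS responses derived from it), which is usually acceptable for a daemon-internal REST endpoint.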
[jira] [Created] (YARN-2804) Timeline server .out log have JAXB binding exceptions and warnings.
Zhijie Shen created YARN-2804: - Summary: Timeline server .out log have JAXB binding exceptions and warnings. Key: YARN-2804 URL: https://issues.apache.org/jira/browse/YARN-2804 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Critical Unlike other daemon, timeline server binds JacksonJaxbJsonProvider to resolve the resources. However, there are noises in .out log: {code} SEVERE: Failed to generate the schema for the JAX-B elements com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions java.util.Map is an interface, and JAXB can't handle interfaces. this problem is related to the following location: at java.util.Map at public java.util.Map org.apache.hadoop.yarn.api.records.timeline.TimelineEvent.getEventInfo() at org.apache.hadoop.yarn.api.records.timeline.TimelineEvent at public java.util.List org.apache.hadoop.yarn.api.records.timeline.TimelineEntity.getEvents() at org.apache.hadoop.yarn.api.records.timeline.TimelineEntity at public java.util.List org.apache.hadoop.yarn.api.records.timeline.TimelineEntities.getEntities() at org.apache.hadoop.yarn.api.records.timeline.TimelineEntities java.util.Map does not have a no-arg default constructor. 
this problem is related to the following location: at java.util.Map at public java.util.Map org.apache.hadoop.yarn.api.records.timeline.TimelineEvent.getEventInfo() at org.apache.hadoop.yarn.api.records.timeline.TimelineEvent at public java.util.List org.apache.hadoop.yarn.api.records.timeline.TimelineEntity.getEvents() at org.apache.hadoop.yarn.api.records.timeline.TimelineEntity at public java.util.List org.apache.hadoop.yarn.api.records.timeline.TimelineEntities.getEntities() at org.apache.hadoop.yarn.api.records.timeline.TimelineEntities at com.sun.xml.bind.v2.runtime.IllegalAnnotationsException$Builder.check(IllegalAnnotationsException.java:106) at com.sun.xml.bind.v2.runtime.JAXBContextImpl.getTypeInfoSet(JAXBContextImpl.java:489) at com.sun.xml.bind.v2.runtime.JAXBContextImpl.(JAXBContextImpl.java:319) at com.sun.xml.bind.v2.runtime.JAXBContextImpl$JAXBContextBuilder.build(JAXBContextImpl.java:1170) at com.sun.xml.bind.v2.ContextFactory.createContext(ContextFactory.java:145) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at javax.xml.bind.ContextFinder.newInstance(ContextFinder.java:248) at javax.xml.bind.ContextFinder.newInstance(ContextFinder.java:235) at javax.xml.bind.ContextFinder.find(ContextFinder.java:432) at javax.xml.bind.JAXBContext.newInstance(JAXBContext.java:637) at javax.xml.bind.JAXBContext.newInstance(JAXBContext.java:584) at com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.buildModelAndSchemas(WadlGeneratorJAXBGrammarGenerator.java:412) at com.sun.jersey.server.wadl.generators.WadlGeneratorJAXBGrammarGenerator.createExternalGrammar(WadlGeneratorJAXBGrammarGenerator.java:352) at com.sun.jersey.server.wadl.WadlBuilder.generate(WadlBuilder.java:115) at 
com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:104) at com.sun.jersey.server.impl.wadl.WadlApplicationContextImpl.getApplication(WadlApplicationContextImpl.java:120) at com.sun.jersey.server.impl.wadl.WadlMethodFactory$WadlOptionsMethodDispatcher.dispatch(WadlMethodFactory.java:98) at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288) at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108) at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147) at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469) at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349) at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339) at
[jira] [Comment Edited] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
[ https://issues.apache.org/jira/browse/YARN-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194982#comment-14194982 ] Zhijie Shen edited comment on YARN-2798 at 11/3/14 8:03 PM: I don't have a quick setup for RM HA and a secure cluster, but the mapping rule is applied everywhere in this cluster, so I think it should work fine. In fact, this issue is not an HA-related problem. However, in general, if we want DT renewal to work across RMs, we have to run those RMs under the same operating system user name. Otherwise, if the DT renewer is set to the yarn user of RM1, and RM2 is run as yarn', RM2 can no longer renew the DT. This applies not just to the timeline DT, but to all the DTs that we assign the RM to renew. Correct me if I'm wrong. was (Author: zjshen): I don't have a quick setup for RM HA and secure cluster, but the mapping rule is applied every where in this cluster, I think it should work fine. > YarnClient doesn't need to translate Kerberos name of timeline DT renewer > - > > Key: YARN-2798 > URL: https://issues.apache.org/jira/browse/YARN-2798 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Arpit Gupta >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2798.1.patch, YARN-2798.2.patch > > > Now YarnClient will automatically get a timeline DT when submitting an app in > a secure mode. It will try to parse the yarn-site.xml/core-site.xml to get > the RM daemon operating system user. However, the RM principal and > auth_to_local may not be properly presented to the client, and the client > cannot translate the principal to the daemon user properly. On the other > hand, AbstractDelegationTokenIdentifier will do this translation when create > the token. However, since the client has already translated the full > principal into a short user name (which may not be correct), the server can > no longer apply the translation any more, where RM principal and > auth_to_local are always correct. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
[ https://issues.apache.org/jira/browse/YARN-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194982#comment-14194982 ] Zhijie Shen commented on YARN-2798: --- I don't have a quick setup for RM HA and secure cluster, but the mapping rule is applied every where in this cluster, I think it should work fine. > YarnClient doesn't need to translate Kerberos name of timeline DT renewer > - > > Key: YARN-2798 > URL: https://issues.apache.org/jira/browse/YARN-2798 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Arpit Gupta >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2798.1.patch, YARN-2798.2.patch > > > Now YarnClient will automatically get a timeline DT when submitting an app in > a secure mode. It will try to parse the yarn-site.xml/core-site.xml to get > the RM daemon operating system user. However, the RM principal and > auth_to_local may not be properly presented to the client, and the client > cannot translate the principal to the daemon user properly. On the other > hand, AbstractDelegationTokenIdentifier will do this translation when create > the token. However, since the client has already translated the full > principal into a short user name (which may not be correct), the server can > no longer apply the translation any more, where RM principal and > auth_to_local are always correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
[ https://issues.apache.org/jira/browse/YARN-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194225#comment-14194225 ] Zhijie Shen commented on YARN-2798: --- Test failures are not related. > YarnClient doesn't need to translate Kerberos name of timeline DT renewer > - > > Key: YARN-2798 > URL: https://issues.apache.org/jira/browse/YARN-2798 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Arpit Gupta >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2798.1.patch, YARN-2798.2.patch > > > Now YarnClient will automatically get a timeline DT when submitting an app in > a secure mode. It will try to parse the yarn-site.xml/core-site.xml to get > the RM daemon operating system user. However, the RM principal and > auth_to_local may not be properly presented to the client, and the client > cannot translate the principal to the daemon user properly. On the other > hand, AbstractDelegationTokenIdentifier will do this translation when create > the token. However, since the client has already translated the full > principal into a short user name (which may not be correct), the server can > no longer apply the translation any more, where RM principal and > auth_to_local are always correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
[ https://issues.apache.org/jira/browse/YARN-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2798: -- Attachment: YARN-2798.2.patch bq. I don't understand why you are using timelineHost to resolve the renewer to be the ResourceManager. Good catch! We should use rmHost. In addition, it's not necessary to parse the RM principal every time we request a timeline DT, so I moved the logic of constructing the renewer to serviceInit. > YarnClient doesn't need to translate Kerberos name of timeline DT renewer > - > > Key: YARN-2798 > URL: https://issues.apache.org/jira/browse/YARN-2798 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Arpit Gupta >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2798.1.patch, YARN-2798.2.patch > > > Now YarnClient will automatically get a timeline DT when submitting an app in > a secure mode. It will try to parse the yarn-site.xml/core-site.xml to get > the RM daemon operating system user. However, the RM principal and > auth_to_local may not be properly presented to the client, and the client > cannot translate the principal to the daemon user properly. On the other > hand, AbstractDelegationTokenIdentifier will do this translation when create > the token. However, since the client has already translated the full > principal into a short user name (which may not be correct), the server can > no longer apply the translation any more, where RM principal and > auth_to_local are always correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
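The flow discussed here -- resolve the renewer once at serviceInit by substituting _HOST in the RM principal with the concrete RM host, rather than re-parsing on every timeline-DT request -- can be sketched with a stand-in for Hadoop's SecurityUtil.getServerPrincipal. The helper below mimics only the _HOST substitution, and the principal and hostname values are made up for illustration:

```java
public class RenewerDemo {
  // Minimal stand-in for SecurityUtil.getServerPrincipal: replace the _HOST
  // placeholder in a Kerberos service principal with a concrete hostname.
  // The real utility also handles principals with no placeholder.
  static String resolveRenewer(String rmPrincipal, String rmHost) {
    return rmPrincipal.replace("_HOST", rmHost);
  }

  public static void main(String[] args) {
    // In the patch this happens once in serviceInit, not per token request,
    // and it is keyed off rmHost -- not timelineHost -- per the review comment.
    System.out.println(resolveRenewer("rm/_HOST@EXAMPLE.COM", "rm1.example.com"));
  }
}
```

Caching the resolved renewer at init time also means a misconfigured RM principal fails fast at service startup instead of on the first app submission.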
[jira] [Comment Edited] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
[ https://issues.apache.org/jira/browse/YARN-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194071#comment-14194071 ] Zhijie Shen edited comment on YARN-2798 at 11/2/14 11:50 PM: - Created a patch to remove the translation logic from the client; at the client side we just need to ensure _HOST is mapped to the right timeline server. Added test cases to verify the behavior of both client-side and server-side DT creation. Please note that, to make this work, the core-site.xml presented to the timeline server should have a proper auth_to_local configuration. was (Author: zjshen): Created patch to remove the translation logic from the client, and at the client side we just need to ensure _HOST is going to be mapped to the right timeline server. Add the test cases to verify the responsibility at both the client and server-side DT creating. Please note that to make this work, core-site.xml and yarn-site.xml that are presented to the timeline server should have proper auth_to_local and rm principal configurations. > YarnClient doesn't need to translate Kerberos name of timeline DT renewer > - > > Key: YARN-2798 > URL: https://issues.apache.org/jira/browse/YARN-2798 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Arpit Gupta >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2798.1.patch > > > Now YarnClient will automatically get a timeline DT when submitting an app in > a secure mode. It will try to parse the yarn-site.xml/core-site.xml to get > the RM daemon operating system user. However, the RM principal and > auth_to_local may not be properly presented to the client, and the client > cannot translate the principal to the daemon user properly. On the other > hand, AbstractDelegationTokenIdentifier will do this translation when create > the token. 
However, since the client has already translated the full > principal into a short user name (which may not be correct), the server can > no longer apply the translation any more, where RM principal and > auth_to_local are always correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2785) TestContainerResourceUsage fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194083#comment-14194083 ] Zhijie Shen commented on YARN-2785: --- Commit the patch to trunk, branch-2 and branch-2.6. Thanks Varun! > TestContainerResourceUsage fails intermittently > --- > > Key: YARN-2785 > URL: https://issues.apache.org/jira/browse/YARN-2785 > Project: Hadoop YARN > Issue Type: Test >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: apache-yarn-2785.0.patch, apache-yarn-2785.1.patch, > apache-yarn-2785.2.patch > > > TestContainerResourceUsage fails sometimes due to the timeout values being > low. > From the test failures - > {noformat} > -- > Running > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage > Tests run: 4, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 71.264 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResource > testUsageWithMultipleContainersAndRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage) > Time elapsed: 60.032 sec <<< ERROR! > java.lang.Exception: test timed out after 6 milliseconds > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:209) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:198) > at > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithMultipleContainersAndRMRestart(TestContainerResourceUsage.java: > testUsageWithOneAttemptAndOneContainer(org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage) > Time elapsed: 3.375 sec <<< FAILURE! 
> java.lang.AssertionError: While app is running, memory seconds should be >0 > but is 0 > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithOneAttemptAndOneContainer(TestContainerResourceUsage.java:108) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
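MockRM.waitForState in the trace above is a poll-and-sleep loop, and raising the headroom on such loops (and the enclosing @Test timeouts) is the essence of the fix. A self-contained sketch of the pattern, with illustrative interval and timeout values rather than the ones in the patch:

```java
import java.util.function.BooleanSupplier;

public class WaitForStateDemo {
  // Poll a condition until it holds or a deadline passes -- the same shape as
  // MockRM.waitForState: check, sleep, re-check, give up after the timeout.
  static boolean waitFor(BooleanSupplier condition, long timeoutMs,
      long intervalMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.getAsBoolean()) {
      if (System.currentTimeMillis() >= deadline) {
        return false;  // caller's assertion (or @Test timeout) then fails
      }
      Thread.sleep(intervalMs);
    }
    return true;
  }

  public static void main(String[] args) throws InterruptedException {
    // A condition that already holds returns immediately.
    System.out.println(waitFor(() -> true, 5000L, 100L));
    // A condition that never holds returns false once the deadline passes.
    System.out.println(waitFor(() -> false, 200L, 50L));
  }
}
```

The key tuning point is that the outer @Test timeout must comfortably exceed the sum of all such inner waits on a loaded CI machine, which is why the intermittent failures showed up with the original, tighter values.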
[jira] [Commented] (YARN-2785) TestContainerResourceUsage fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194074#comment-14194074 ] Zhijie Shen commented on YARN-2785: --- +1 will commit the patch > TestContainerResourceUsage fails intermittently > --- > > Key: YARN-2785 > URL: https://issues.apache.org/jira/browse/YARN-2785 > Project: Hadoop YARN > Issue Type: Test >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2785.0.patch, apache-yarn-2785.1.patch, > apache-yarn-2785.2.patch > > > TestContainerResourceUsage fails sometimes due to the timeout values being > low. > From the test failures - > {noformat} > -- > Running > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage > Tests run: 4, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 71.264 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResource > testUsageWithMultipleContainersAndRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage) > Time elapsed: 60.032 sec <<< ERROR! > java.lang.Exception: test timed out after 6 milliseconds > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:209) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:198) > at > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithMultipleContainersAndRMRestart(TestContainerResourceUsage.java: > testUsageWithOneAttemptAndOneContainer(org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage) > Time elapsed: 3.375 sec <<< FAILURE! 
> java.lang.AssertionError: While app is running, memory seconds should be >0 > but is 0 > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithOneAttemptAndOneContainer(TestContainerResourceUsage.java:108) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
[ https://issues.apache.org/jira/browse/YARN-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2798: -- Attachment: YARN-2798.1.patch Created a patch to remove the translation logic from the client; at the client side we just need to ensure _HOST is mapped to the right timeline server. Added test cases to verify the behavior of both client-side and server-side DT creation. Please note that, to make this work, the core-site.xml and yarn-site.xml presented to the timeline server should have proper auth_to_local and RM principal configurations. > YarnClient doesn't need to translate Kerberos name of timeline DT renewer > - > > Key: YARN-2798 > URL: https://issues.apache.org/jira/browse/YARN-2798 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Arpit Gupta >Assignee: Zhijie Shen >Priority: Blocker > Attachments: YARN-2798.1.patch > > > Now YarnClient will automatically get a timeline DT when submitting an app in > a secure mode. It will try to parse the yarn-site.xml/core-site.xml to get > the RM daemon operating system user. However, the RM principal and > auth_to_local may not be properly presented to the client, and the client > cannot translate the principal to the daemon user properly. On the other > hand, AbstractDelegationTokenIdentifier will do this translation when create > the token. However, since the client has already translated the full > principal into a short user name (which may not be correct), the server can > no longer apply the translation any more, where RM principal and > auth_to_local are always correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
[ https://issues.apache.org/jira/browse/YARN-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193993#comment-14193993 ] Zhijie Shen commented on YARN-2798: --- Report the issue on behalf of [~arpitgupta] > YarnClient doesn't need to translate Kerberos name of timeline DT renewer > - > > Key: YARN-2798 > URL: https://issues.apache.org/jira/browse/YARN-2798 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Arpit Gupta >Assignee: Zhijie Shen >Priority: Blocker > > Now YarnClient will automatically get a timeline DT when submitting an app in > a secure mode. It will try to parse the yarn-site.xml/core-site.xml to get > the RM daemon operating system user. However, the RM principal and > auth_to_local may not be properly presented to the client, and the client > cannot translate the principal to the daemon user properly. On the other > hand, AbstractDelegationTokenIdentifier will do this translation when create > the token. However, since the client has already translated the full > principal into a short user name (which may not be correct), the server can > no longer apply the translation any more, where RM principal and > auth_to_local are always correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
[ https://issues.apache.org/jira/browse/YARN-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2798: -- Reporter: Arpit Gupta (was: Zhijie Shen) > YarnClient doesn't need to translate Kerberos name of timeline DT renewer > - > > Key: YARN-2798 > URL: https://issues.apache.org/jira/browse/YARN-2798 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Arpit Gupta >Assignee: Zhijie Shen >Priority: Blocker > > Now YarnClient will automatically get a timeline DT when submitting an app in > a secure mode. It will try to parse the yarn-site.xml/core-site.xml to get > the RM daemon operating system user. However, the RM principal and > auth_to_local may not be properly presented to the client, and the client > cannot translate the principal to the daemon user properly. On the other > hand, AbstractDelegationTokenIdentifier will do this translation when create > the token. However, since the client has already translated the full > principal into a short user name (which may not be correct), the server can > no longer apply the translation any more, where RM principal and > auth_to_local are always correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
[ https://issues.apache.org/jira/browse/YARN-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2798: -- Component/s: timelineserver > YarnClient doesn't need to translate Kerberos name of timeline DT renewer > - > > Key: YARN-2798 > URL: https://issues.apache.org/jira/browse/YARN-2798 > Project: Hadoop YARN > Issue Type: Bug > Components: timelineserver >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Blocker > > Now YarnClient will automatically get a timeline DT when submitting an app in > a secure mode. It will try to parse the yarn-site.xml/core-site.xml to get > the RM daemon operating system user. However, the RM principal and > auth_to_local may not be properly presented to the client, and the client > cannot translate the principal to the daemon user properly. On the other > hand, AbstractDelegationTokenIdentifier will do this translation when create > the token. However, since the client has already translated the full > principal into a short user name (which may not be correct), the server can > no longer apply the translation any more, where RM principal and > auth_to_local are always correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2798) YarnClient doesn't need to translate Kerberos name of timeline DT renewer
Zhijie Shen created YARN-2798: - Summary: YarnClient doesn't need to translate Kerberos name of timeline DT renewer Key: YARN-2798 URL: https://issues.apache.org/jira/browse/YARN-2798 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Priority: Blocker Now YarnClient will automatically get a timeline DT when submitting an app in secure mode. It will try to parse the yarn-site.xml/core-site.xml to get the RM daemon operating system user. However, the RM principal and auth_to_local may not be properly presented to the client, so the client cannot translate the principal to the daemon user properly. On the other hand, AbstractDelegationTokenIdentifier will do this translation when creating the token. However, since the client has already translated the full principal into a short user name (which may not be correct), the server, where the RM principal and auth_to_local are always correct, can no longer apply the translation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2752) TestContainerExecutor.testRunCommandNoPriority fails in branch-2
[ https://issues.apache.org/jira/browse/YARN-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193494#comment-14193494 ] Zhijie Shen commented on YARN-2752: --- The fix makes sense. Currently the branch-2 code always adds the "nice -n" argument, whether the priority is the default 0 or a user-customized value. One suggestion: maybe it's better to apply the diff between trunk and branch-2 here. It prevents a merge failure if we modify this code on trunk and cherry-pick it to branch-2 in the future. > TestContainerExecutor.testRunCommandNoPriority fails in branch-2 > > > Key: YARN-2752 > URL: https://issues.apache.org/jira/browse/YARN-2752 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Xuan Gong >Assignee: Xuan Gong > Attachments: YARN-2752.1-branch-2.patch > > > TestContainerExecutor.testRunCommandNoPriority fails in branch-2. But it > passed in trunk. > The function code ContainerExecutor.getRunCommand() in trunk is different > from that in branch-2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
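The behavior difference described in this comment can be sketched roughly as follows. This is a hypothetical helper, not ContainerExecutor.getRunCommand()'s actual Java code: the idea is to prepend "nice -n" only when a priority has actually been configured, instead of always adding it.

```python
# Hypothetical sketch of launch-command construction: a priority of None
# means "not configured", so no "nice -n" prefix is emitted in that case.

def get_run_command(command, priority=None):
    """Build the container launch command as a list of argv tokens."""
    parts = []
    if priority is not None:
        # Only an explicitly configured priority adds the nice prefix.
        parts += ["nice", "-n", str(priority)]
    parts += command
    return parts

assert get_run_command(["bash", "launch.sh"]) == ["bash", "launch.sh"]
assert get_run_command(["bash", "launch.sh"], priority=5) == \
    ["nice", "-n", "5", "bash", "launch.sh"]
```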
[jira] [Commented] (YARN-2785) TestContainerResourceUsage fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193449#comment-14193449 ] Zhijie Shen commented on YARN-2785: --- Why don't we need to prolong the timeout for testUsageWithOneAttemptAndOneContainer too? {code} - @Test (timeout = 60000) + @Test (timeout = 120000) public void testUsageWithMultipleContainersAndRMRestart() throws Exception { {code} > TestContainerResourceUsage fails intermittently > --- > > Key: YARN-2785 > URL: https://issues.apache.org/jira/browse/YARN-2785 > Project: Hadoop YARN > Issue Type: Test >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2785.0.patch, apache-yarn-2785.1.patch > > > TestContainerResourceUsage fails sometimes due to the timeout values being > low. > From the test failures - > {noformat} > -- > Running > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage > Tests run: 4, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 71.264 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResource > testUsageWithMultipleContainersAndRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage) > Time elapsed: 60.032 sec <<< ERROR! > java.lang.Exception: test timed out after 60000 milliseconds > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:209) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:198) > at > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithMultipleContainersAndRMRestart(TestContainerResourceUsage.java: > testUsageWithOneAttemptAndOneContainer(org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage) > Time elapsed: 3.375 sec <<< FAILURE! 
> java.lang.AssertionError: While app is running, memory seconds should be >0 > but is 0 > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithOneAttemptAndOneContainer(TestContainerResourceUsage.java:108) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2783) TestApplicationClientProtocolOnHA fails on trunk intermittently
[ https://issues.apache.org/jira/browse/YARN-2783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2783: -- Summary: TestApplicationClientProtocolOnHA fails on trunk intermittently (was: TestApplicationClientProtocolOnHA) > TestApplicationClientProtocolOnHA fails on trunk intermittently > --- > > Key: YARN-2783 > URL: https://issues.apache.org/jira/browse/YARN-2783 > Project: Hadoop YARN > Issue Type: Test >Reporter: Zhijie Shen > > {code} > Running org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA > Tests run: 17, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 147.881 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA > testGetContainersOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) > Time elapsed: 12.928 sec <<< ERROR! > java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 > to asf905.gq1.ygridcore.net:28032 failed on connection exception: > java.net.ConnectException: Connection refused; For more details see: > http://wiki.apache.org/hadoop/ConnectionRefused > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) > at > org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) > at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) > at > org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) > at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) > at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) > at org.apache.hadoop.ipc.Client.call(Client.java:1438) > at org.apache.hadoop.ipc.Client.call(Client.java:1399) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) > at 
com.sun.proxy.$Proxy17.getContainers(Unknown Source) > at > org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getContainers(ApplicationClientProtocolPBClientImpl.java:400) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) > at com.sun.proxy.$Proxy18.getContainers(Unknown Source) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getContainers(YarnClientImpl.java:639) > at > org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetContainersOnHA(TestApplicationClientProtocolOnHA.java:154) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2785) TestContainerResourceUsage fails intermittently
[ https://issues.apache.org/jira/browse/YARN-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192892#comment-14192892 ] Zhijie Shen commented on YARN-2785: --- Adding more time should be a solution for slow computers. My question is whether all test cases in TestContainerResourceUsage are subject to timeout? And it seems that both testUsageWithOneAttemptAndOneContainer and testUsageWithMultipleContainersAndRMRestart need to sleep to let metrics accumulate. Hence, should the fix be applied to all the test cases here? > TestContainerResourceUsage fails intermittently > --- > > Key: YARN-2785 > URL: https://issues.apache.org/jira/browse/YARN-2785 > Project: Hadoop YARN > Issue Type: Test >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2785.0.patch > > > TestContainerResourceUsage fails sometimes due to the timeout values being > low. > From the test failures - > {noformat} > -- > Running > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage > Tests run: 4, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 71.264 sec > <<< FAILURE! - in > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResource > testUsageWithMultipleContainersAndRMRestart(org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage) > Time elapsed: 60.032 sec <<< ERROR! > java.lang.Exception: test timed out after 60000 milliseconds > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:209) > at > org.apache.hadoop.yarn.server.resourcemanager.MockRM.waitForState(MockRM.java:198) > at > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithMultipleContainersAndRMRestart(TestContainerResourceUsage.java: > testUsageWithOneAttemptAndOneContainer(org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage) > Time elapsed: 3.375 sec <<< FAILURE! 
> java.lang.AssertionError: While app is running, memory seconds should be >0 > but is 0 > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.assertTrue(Assert.java:41) > at > org.apache.hadoop.yarn.server.resourcemanager.TestContainerResourceUsage.testUsageWithOneAttemptAndOneContainer(TestContainerResourceUsage.java:108) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
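The flakiness discussed in this thread comes from fixed test timeouts racing against state polling in MockRM.waitForState. The pattern these tests rely on can be sketched as deadline-based polling; this is an illustrative Python sketch, not MockRM's actual implementation, so slow machines consume the full time budget rather than a fixed number of sleep iterations.

```python
import time

# Illustrative sketch of deadline-based state polling: poll until the
# expected state is observed or the deadline passes.

def wait_for_state(get_state, expected, timeout_sec=60.0, interval_sec=0.1):
    """Return True if get_state() reaches 'expected' before the deadline."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        if get_state() == expected:
            return True
        time.sleep(interval_sec)
    return get_state() == expected  # one last check at the deadline

# Simulated app whose state advances on each observation.
states = iter(["NEW", "ACCEPTED", "RUNNING"])
current = {"s": "NEW"}

def get_state():
    current["s"] = next(states, current["s"])
    return current["s"]

assert wait_for_state(get_state, "RUNNING", timeout_sec=5.0, interval_sec=0.01)
```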
[jira] [Updated] (YARN-2711) TestDefaultContainerExecutor#testContainerLaunchError fails on Windows
[ https://issues.apache.org/jira/browse/YARN-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2711: -- Issue Type: Test (was: Bug) > TestDefaultContainerExecutor#testContainerLaunchError fails on Windows > -- > > Key: YARN-2711 > URL: https://issues.apache.org/jira/browse/YARN-2711 > Project: Hadoop YARN > Issue Type: Test >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Fix For: 2.6.0 > > Attachments: apache-yarn-2711.0.patch > > > The testContainerLaunchError test fails on Windows with the following error - > {noformat} > java.io.FileNotFoundException: File file:/bin/echo does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) > at org.apache.hadoop.fs.FilterFs.getFileStatus(FilterFs.java:120) > at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1117) > at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1113) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1113) > at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2019) > at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1978) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:145) > at > org.apache.hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor.testContainerLaunchError(TestDefaultContainerExecutor.java:289) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2711) TestDefaultContainerExecutor#testContainerLaunchError fails on Windows
[ https://issues.apache.org/jira/browse/YARN-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192829#comment-14192829 ] Zhijie Shen commented on YARN-2711: --- Junping is offline and has a network issue with the git repository. I'll go ahead and commit the patch. > TestDefaultContainerExecutor#testContainerLaunchError fails on Windows > -- > > Key: YARN-2711 > URL: https://issues.apache.org/jira/browse/YARN-2711 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2711.0.patch > > > The testContainerLaunchError test fails on Windows with the following error - > {noformat} > java.io.FileNotFoundException: File file:/bin/echo does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:524) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:737) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:514) > at > org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) > at org.apache.hadoop.fs.FilterFs.getFileStatus(FilterFs.java:120) > at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1117) > at org.apache.hadoop.fs.FileContext$14.next(FileContext.java:1113) > at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) > at org.apache.hadoop.fs.FileContext.getFileStatus(FileContext.java:1113) > at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2019) > at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1978) > at > org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:145) > at > org.apache.hadoop.yarn.server.nodemanager.TestDefaultContainerExecutor.testContainerLaunchError(TestDefaultContainerExecutor.java:289) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2767) RM web services - add test case to ensure the http static user can kill or submit apps in secure mode
[ https://issues.apache.org/jira/browse/YARN-2767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192761#comment-14192761 ] Zhijie Shen commented on YARN-2767: --- [~vvasudev], thanks for the patch. The test cases look good. Just some minor comments on code refactoring: 1. Use Assert.fail()? {code} assertTrue("Couldn't create MiniKDC", false); {code} 2. miniKDCStarted is not necessary. {code} miniKDCStarted = true; {code} 3. This getter seems unnecessary. Maybe we can refactor the setUp() code. {code} private static MiniKdc getKdc() { return testMiniKDC; } {code} > RM web services - add test case to ensure the http static user can kill or > submit apps in secure mode > - > > Key: YARN-2767 > URL: https://issues.apache.org/jira/browse/YARN-2767 > Project: Hadoop YARN > Issue Type: Test > Components: resourcemanager >Reporter: Varun Vasudev >Assignee: Varun Vasudev > Attachments: apache-yarn-2767.0.patch, apache-yarn-2767.1.patch > > > We should add a test to ensure that the http static user used to access the > RM web interface can't submit or kill apps if the cluster is running in > secure mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2770) Timeline delegation tokens need to be automatically renewed by the RM
[ https://issues.apache.org/jira/browse/YARN-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191308#comment-14191308 ] Zhijie Shen commented on YARN-2770: --- The two test failures are not related, and happen on other Jiras, too. Filed two tickets for them: YARN-2782 and YARN-2783. > Timeline delegation tokens need to be automatically renewed by the RM > - > > Key: YARN-2770 > URL: https://issues.apache.org/jira/browse/YARN-2770 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.5.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > Attachments: YARN-2770.1.patch, YARN-2770.2.patch > > > YarnClient will automatically grab a timeline DT for the application and pass > it to the app AM. Now the timeline DT renew is still dummy. If an app is > running for more than 24h (default DT expiry time), the app AM is no longer > able to use the expired DT to communicate with the timeline server. Since RM > will cache the credentials of each app, and renew the DTs for the running > app. We should provider renew hooks similar to what HDFS DT has for RM, and > set RM user as the renewer when grabbing the timeline DT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2783) TestApplicationClientProtocolOnHA
Zhijie Shen created YARN-2783: - Summary: TestApplicationClientProtocolOnHA Key: YARN-2783 URL: https://issues.apache.org/jira/browse/YARN-2783 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen {code} Running org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA Tests run: 17, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 147.881 sec <<< FAILURE! - in org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA testGetContainersOnHA(org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA) Time elapsed: 12.928 sec <<< ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy17.getContainers(Unknown Source) at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getContainers(ApplicationClientProtocolPBClientImpl.java:400) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy18.getContainers(Unknown Source) at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getContainers(YarnClientImpl.java:639) at org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA.testGetContainersOnHA(TestApplicationClientProtocolOnHA.java:154) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2782) TestResourceTrackerOnHA fails on trunk
Zhijie Shen created YARN-2782: - Summary: TestResourceTrackerOnHA fails on trunk Key: YARN-2782 URL: https://issues.apache.org/jira/browse/YARN-2782 Project: Hadoop YARN Issue Type: Test Reporter: Zhijie Shen {code} Running org.apache.hadoop.yarn.client.TestResourceTrackerOnHA Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 12.684 sec <<< FAILURE! - in org.apache.hadoop.yarn.client.TestResourceTrackerOnHA testResourceTrackerOnHA(org.apache.hadoop.yarn.client.TestResourceTrackerOnHA) Time elapsed: 12.518 sec <<< ERROR! java.net.ConnectException: Call From asf905.gq1.ygridcore.net/67.195.81.149 to asf905.gq1.ygridcore.net:28031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521) at org.apache.hadoop.ipc.Client.call(Client.java:1438) at org.apache.hadoop.ipc.Client.call(Client.java:1399) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230) at com.sun.proxy.$Proxy87.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) at com.sun.proxy.$Proxy88.registerNodeManager(Unknown Source) at org.apache.hadoop.yarn.client.TestResourceTrackerOnHA.testResourceTrackerOnHA(TestResourceTrackerOnHA.java:64) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2779) SystemMetricsPublisher can use Kerberos directly instead of timeline DT
[ https://issues.apache.org/jira/browse/YARN-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191177#comment-14191177 ] Zhijie Shen commented on YARN-2779: --- I've verified it in a secure cluster, and SystemMetricsPublisher works fine with kerberos directly. > SystemMetricsPublisher can use Kerberos directly instead of timeline DT > --- > > Key: YARN-2779 > URL: https://issues.apache.org/jira/browse/YARN-2779 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager, timelineserver >Affects Versions: 2.6.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > Attachments: YARN-2779.1.patch > > > SystemMetricsPublisher is going to grab a timeline DT in secure mode as well. > The timeline DT will expiry after 24h. No DT renewer will handle renewing > work for SystemMetricsPublisher, but this has to been handled by itself. In > addition, SystemMetricsPublisher should cancel the timeline DT when it is > stopped, too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
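The trade-off behind YARN-2779 can be sketched with toy numbers (hypothetical classes and values, not Hadoop's actual token code): an unrenewed delegation token becomes unusable after the expiry interval, whereas a daemon that authenticates with Kerberos directly can simply re-obtain credentials, so there is nothing to renew or to cancel on shutdown.

```python
# Toy sketch of the delegation-token lifecycle problem described above.
# The class and constant are illustrative, not Hadoop's implementation.

DT_EXPIRY_SEC = 24 * 60 * 60  # the default 24h expiry mentioned in the issue

class DelegationToken:
    def __init__(self, issued_at):
        self.expires_at = issued_at + DT_EXPIRY_SEC

    def is_valid(self, now):
        return now < self.expires_at

token = DelegationToken(issued_at=0)
assert token.is_valid(now=12 * 60 * 60)      # fine within the first 24h
assert not token.is_valid(now=25 * 60 * 60)  # expired: no renewer ever ran

# With direct Kerberos auth, the daemon re-authenticates itself as
# needed, so no 24h token exists to renew or cancel when it stops.
```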
[jira] [Updated] (YARN-2770) Timeline delegation tokens need to be automatically renewed by the RM
[ https://issues.apache.org/jira/browse/YARN-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2770: -- Attachment: YARN-2770.2.patch > Timeline delegation tokens need to be automatically renewed by the RM > - > > Key: YARN-2770 > URL: https://issues.apache.org/jira/browse/YARN-2770 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.5.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > Attachments: YARN-2770.1.patch, YARN-2770.2.patch > > > YarnClient will automatically grab a timeline DT for the application and pass > it to the app AM. Now the timeline DT renew is still dummy. If an app is > running for more than 24h (default DT expiry time), the app AM is no longer > able to use the expired DT to communicate with the timeline server. Since RM > will cache the credentials of each app, and renew the DTs for the running > app. We should provider renew hooks similar to what HDFS DT has for RM, and > set RM user as the renewer when grabbing the timeline DT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2770) Timeline delegation tokens need to be automatically renewed by the RM
[ https://issues.apache.org/jira/browse/YARN-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14191174#comment-14191174 ] Zhijie Shen commented on YARN-2770: --- bq. SecurityUtil#getServerPrincipal may be useful. bq. Let's make sure the renewer name mangling imitates MR JobClient, it is easy to get this wrong. I think we should use HadoopKerberosName#getShortName (AbstractDelegationTokenSecretManager is using it as well) and RM_Principal (which should be there in secure mode) to get the RM daemon user, and HadoopKerberosName will automatically handle auth_to_local if we need to map the auth name to the real operating system name. bq. It'll be great to also test separately that renewal can work fine when https is enabled. I've verified it will work with SSL. BTW, SystemMetricsPublisher works fine with SSL too. To make it work, we must make sure RM has seen the proper configuration for SSL and the truststore. bq. the same DelegationTokenAuthenticatedURL is instantiated multiple times, is it possible to store it as a variable ? It's probably okay to reuse DelegationTokenAuthenticatedURL. However, I'd like to construct one for each request to avoid possible resource sharing and prevent introducing potential bugs. Actually, the Jersey client also constructs a new URL for each request. It won't be a big overhead, as it doesn't construct anything deeply. bq. similarly for the timeline client instantiation. I'm not sure, but I guess you're talking about TokenRenewer. Actually I'm following the way RMDelegationTokenIdentifier does it. If we don't construct the client per call, we need to make it a service, with separate stages for init/start and stop. It may complicate the change. Please let me know if you want this change. bq. We may replace the token after renew is really succeeded. 
According to the design of DelegationTokenAuthenticatedURL, I need to put the DT into the current DelegationTokenAuthenticatedURL.Token, which will be fetched internally to do the corresponding operations. So to renew a given DT, I need to set the DT there. However, if it is already cached there, the client can skip the set step. Other than that, I've addressed the remaining comments. Thanks Jian and Vinod! > Timeline delegation tokens need to be automatically renewed by the RM > - > > Key: YARN-2770 > URL: https://issues.apache.org/jira/browse/YARN-2770 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Affects Versions: 2.5.0 >Reporter: Zhijie Shen >Assignee: Zhijie Shen >Priority: Critical > Attachments: YARN-2770.1.patch > > > YarnClient will automatically grab a timeline DT for the application and pass > it to the app AM. Now the timeline DT renew is still dummy. If an app is > running for more than 24h (default DT expiry time), the app AM is no longer > able to use the expired DT to communicate with the timeline server. Since RM > will cache the credentials of each app, and renew the DTs for the running > app. We should provider renew hooks similar to what HDFS DT has for RM, and > set RM user as the renewer when grabbing the timeline DT. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
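The review point about replacing the token only after renewal really succeeded can be sketched as follows. This is a hypothetical holder class in Python, not DelegationTokenAuthenticatedURL's real API: the cached token is left untouched if the renew call fails, and updated only on success.

```python
# Hedged sketch of "replace the token after renew really succeeded":
# a failed renew must not clobber the token the client already holds.

class TokenHolder:
    """Hypothetical stand-in for a client-side cached token slot."""
    def __init__(self, token=None):
        self.token = token

def renew_and_cache(holder, token, do_renew):
    """Renew 'token'; update the cached token only if renewal succeeds."""
    try:
        expiry = do_renew(token)  # e.g. an RPC/HTTP call; may raise
    except OSError:
        return None               # cached token left untouched on failure
    holder.token = token          # replace only after renew succeeded
    return expiry

holder = TokenHolder("DT-old")

def unreachable(_token):
    raise OSError("timeline server unreachable")

assert renew_and_cache(holder, "DT-new", unreachable) is None
assert holder.token == "DT-old"   # old token survives the failed renew
assert renew_and_cache(holder, "DT-new", lambda t: 1234) == 1234
assert holder.token == "DT-new"   # replaced only after success
```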
[jira] [Updated] (YARN-2779) SystemMetricsPublisher can use Kerberos directly instead of timeline DT
[ https://issues.apache.org/jira/browse/YARN-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2779: --

Attachment: YARN-2779.1.patch

Uploaded a patch that removes the code for getting the timeline DT in SystemMetricsPublisher.

> SystemMetricsPublisher can use Kerberos directly instead of timeline DT
> -----------------------------------------------------------------------
>
>                 Key: YARN-2779
>                 URL: https://issues.apache.org/jira/browse/YARN-2779
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, timelineserver
>    Affects Versions: 2.6.0
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>            Priority: Critical
>         Attachments: YARN-2779.1.patch
>
> SystemMetricsPublisher is going to grab a timeline DT in secure mode as well. The timeline DT will expire after 24h. No DT renewer will handle the renewal work for SystemMetricsPublisher; it has to handle this itself. In addition, SystemMetricsPublisher should cancel the timeline DT when it is stopped, too.
[jira] [Updated] (YARN-2771) DistributedShell's DSConstants are badly named
[ https://issues.apache.org/jira/browse/YARN-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2771: --

Attachment: YARN-2771.3.patch

Fixed the test failure.

> DistributedShell's DSConstants are badly named
> ----------------------------------------------
>
>                 Key: YARN-2771
>                 URL: https://issues.apache.org/jira/browse/YARN-2771
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: applications/distributed-shell
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Zhijie Shen
>         Attachments: YARN-2771.1.patch, YARN-2771.2.patch, YARN-2771.3.patch
>
> I'd rather have underscores (DISTRIBUTED_SHELL_TIMELINE_DOMAIN instead of DISTRIBUTEDSHELLTIMELINEDOMAIN).
> DISTRIBUTEDSHELLTIMELINEDOMAIN is added in this release; can we rename it to DISTRIBUTED_SHELL_TIMELINE_DOMAIN?
> For the old envs, we can just add new envs that point to the old ones and deprecate the old ones.
[jira] [Updated] (YARN-2779) SystemMetricsPublisher can use Kerberos directly instead of timeline DT
[ https://issues.apache.org/jira/browse/YARN-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2779: --

Summary: SystemMetricsPublisher can use Kerberos directly instead of timeline DT (was: SystemMetricsPublisher needs to renew and cancel timeline DT too)
[jira] [Commented] (YARN-2779) SystemMetricsPublisher needs to renew and cancel timeline DT too
[ https://issues.apache.org/jira/browse/YARN-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14190955#comment-14190955 ] Zhijie Shen commented on YARN-2779: ---

[~vinodkv], in the current code base, we're letting SystemMetricsPublisher grab a timeline DT to talk to the timeline server in secure mode. That's why we need this JIRA to add the renew and cancel work. But thinking about this issue again, it should be okay to let the RM talk to the timeline server with Kerberos directly. As this is only a single process, it will not add too much workload to the Kerberos server. So let's instead remove the DT-fetching logic and let the RM use Kerberos directly.
[jira] [Reopened] (YARN-2779) SystemMetricsPublisher needs to renew and cancel timeline DT too
[ https://issues.apache.org/jira/browse/YARN-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reopened YARN-2779: ---
[jira] [Created] (YARN-2779) SystemMetricsPublisher needs to renew and cancel timeline DT too
Zhijie Shen created YARN-2779:
-

            Summary: SystemMetricsPublisher needs to renew and cancel timeline DT too
                Key: YARN-2779
                URL: https://issues.apache.org/jira/browse/YARN-2779
            Project: Hadoop YARN
         Issue Type: Bug
         Components: resourcemanager, timelineserver
   Affects Versions: 2.6.0
           Reporter: Zhijie Shen
           Assignee: Zhijie Shen
           Priority: Critical

SystemMetricsPublisher is going to grab a timeline DT in secure mode as well. The timeline DT will expire after 24h. No DT renewer will handle the renewal work for SystemMetricsPublisher; it has to handle this itself. In addition, SystemMetricsPublisher should cancel the timeline DT when it is stopped, too.