[jira] [Resolved] (YARN-10205) NodeManager stateful restart feature did not work as expected - information only (Resolved)
[ https://issues.apache.org/jira/browse/YARN-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anil Sadineni resolved YARN-10205.
----------------------------------
    Resolution: Not A Problem

> NodeManager stateful restart feature did not work as expected - information only (Resolved)
> -------------------------------------------------------------------------------------------
>
>                 Key: YARN-10205
>                 URL: https://issues.apache.org/jira/browse/YARN-10205
>             Project: Hadoop YARN
>          Issue Type: Test
>          Components: graceful, nodemanager, rolling upgrade, yarn
>            Reporter: Anil Sadineni
>            Priority: Major
>
> *TL;DR* This is an information-only Jira on the stateful restart feature of the node manager. The unexpected behavior of this feature was, in this case, due to the systemd process configuration. Please read below for more details.
>
> Stateful restart of the Node Manager (YARN-1336) was introduced in Hadoop 2.6. This feature worked as expected in Hadoop 2.6 for us. Recently we upgraded our clusters from 2.6 to 2.9.2 along with some OS upgrades, and this feature was broken after the upgrade. One of the initial suspects was LinuxContainerExecutor, as we started using it in this upgrade.
> yarn-site.xml has all the required configurations to enable this feature:
> {{yarn.nodemanager.recovery.enabled: 'true'}}
> {{yarn.nodemanager.recovery.dir: ''}}
> {{yarn.nodemanager.recovery.supervised: 'true'}}
> {{yarn.nodemanager.address: '0.0.0.0:8041'}}
>
> While containers were running and the NM was restarted, the exception below was constantly observed in the Node Manager logs:
> {quote}
> 2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_43
> java.io.IOException: Timeout while waiting for exit code from container_e37_1583181000856_0008_01_43
>     at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_18
> java.io.IOException: Timeout while waiting for exit code from container_e37_1583181000856_0008_01_18
>     at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
>     at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> 2020-03-05 17:45:18,242 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
> 2020-03-05 17:45:18,243 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
> {quote}
> After some digging into what was causing the exit file to be missing, we identified at the OS level that running container processes were going down as soon as the NM went down. The process tree looked perfectly fine, as the container-executor forks child processes as expected. We dug deeper into various parts of the code to see if anything could have caused the failure.
> One question was whether we broke anything in our internal repo after we forked 2.9.2 from open source. We started looking into code in different areas, like the NM shutdown
[jira] [Created] (YARN-10205) NodeManager stateful restart feature did not work as expected - information only (Resolved)
Anil Sadineni created YARN-10205:
------------------------------------

             Summary: NodeManager stateful restart feature did not work as expected - information only (Resolved)
                 Key: YARN-10205
                 URL: https://issues.apache.org/jira/browse/YARN-10205
             Project: Hadoop YARN
          Issue Type: Test
          Components: graceful, nodemanager, rolling upgrade, yarn
            Reporter: Anil Sadineni

*TL;DR* This is an information-only Jira on the stateful restart feature of the node manager. The unexpected behavior of this feature was, in this case, due to the systemd process configuration. Please read below for more details.

Stateful restart of the Node Manager (YARN-1336) was introduced in Hadoop 2.6. This feature worked as expected in Hadoop 2.6 for us. Recently we upgraded our clusters from 2.6 to 2.9.2 along with some OS upgrades, and this feature was broken after the upgrade. One of the initial suspects was LinuxContainerExecutor, as we started using it in this upgrade.

yarn-site.xml has all the required configurations to enable this feature:
{{yarn.nodemanager.recovery.enabled: 'true'}}
{{yarn.nodemanager.recovery.dir: ''}}
{{yarn.nodemanager.recovery.supervised: 'true'}}
{{yarn.nodemanager.address: '0.0.0.0:8041'}}

While containers were running and the NM was restarted, the exception below was constantly observed in the Node Manager logs:

{quote}
2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_43
java.io.IOException: Timeout while waiting for exit code from container_e37_1583181000856_0008_01_43
    at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-03-05 17:45:18,241 ERROR org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Unable to recover container container_e37_1583181000856_0008_01_18
java.io.IOException: Timeout while waiting for exit code from container_e37_1583181000856_0008_01_18
    at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor.reacquireContainer(ContainerExecutor.java:274)
    at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.reacquireContainer(LinuxContainerExecutor.java:631)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:84)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch.call(RecoveredContainerLaunch.java:47)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2020-03-05 17:45:18,242 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
2020-03-05 17:45:18,243 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.RecoveredContainerLaunch: Recovered container exited with a non-zero exit code 154
{quote}

After some digging into what was causing the exit file to be missing, we identified at the OS level that running container processes were going down as soon as the NM went down. The process tree looked perfectly fine, as the container-executor forks child processes as expected. We dug deeper into various parts of the code to see if anything could have caused the failure. One question was whether we broke anything in our internal repo after we forked 2.9.2 from open source. We started looking into code in different areas, like the NM shutdown hook and cleanup process, the NM state store on container launch, NM aux services, container-executor, shell launch and cleanup hooks, etc. Things looked fine, as expected. It was identified that the hadoop-nodemanager systemd service was configured to use the default KillMode, which is control-group: [https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=] This is causing systemd to kill every process left in the unit's control group, including the running containers, when the NM service stops.
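The root cause described above suggests a fix at the systemd level rather than in YARN. A minimal sketch of a drop-in override follows; the unit name hadoop-nodemanager.service and the drop-in path are assumptions for illustration and will differ per deployment:

```ini
# /etc/systemd/system/hadoop-nodemanager.service.d/killmode.conf
# Hypothetical drop-in; unit name and path are deployment-specific.
[Service]
# The default, KillMode=control-group, makes systemd signal every process
# remaining in the unit's cgroup on stop -- including live YARN containers,
# which defeats stateful NM restart. KillMode=process signals only the main
# NM process, so container processes survive for later reacquisition.
KillMode=process
```

After adding such a drop-in, a `systemctl daemon-reload` and a service restart would be needed for it to take effect.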
[jira] [Commented] (YARN-10077) Region in ats-hbase table 'prod.timelineservice.entity' failing to split
[ https://issues.apache.org/jira/browse/YARN-10077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17064461#comment-17064461 ]

Anil Sadineni commented on YARN-10077:
--------------------------------------

We are having the same issue with both the entity and application tables. In our case there are many clusters, and the load from each cluster is so large that regions for the application table are also getting overloaded without being split under this policy. I believe we need another, more flexible region split policy, similar to DelimitedKeyPrefixRegionSplitPolicy but with the flexibility to specify the ordinal of the delimiter. Here is the Javadoc for the custom region split policy that I have in mind:

{quote}
/**
 * A custom RegionSplitPolicy implementing a SplitPolicy that groups
 * rows by a prefix of the row key, given a delimiter and its nth ordinal.
 * This ensures that a region is not split "inside" a prefix of a row key,
 * i.e. rows can be co-located in a region by their prefix.
 *
 * As an example, if you have row keys delimited with !, like
 *   userid!clusterid!flowname!flowid
 *
 * - using prefix delimiter ! and nth ordinal 2, this split policy ensures that all rows
 *   starting with the same userid and clusterid belong to the same region (userid!clusterid);
 * - using prefix delimiter ! and nth ordinal 3, this split policy ensures that all rows
 *   starting with the same userid, clusterid and flowname belong to the same region
 *   (userid!clusterid!flowname).
 */
{quote}

I will open an HBase Jira for this new policy. Please let me know your thoughts/feedback on this.
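The heart of the proposed policy is choosing a split point that never lands inside a shared prefix. A hedged, self-contained sketch of just that prefix computation (class and method names here are illustrative, not HBase API; a real policy would apply this truncation to the split point HBase proposes):

```java
public class DelimitedPrefixSketch {

    /**
     * Returns the prefix of rowKey up to (but not including) the nth
     * occurrence of delimiter; if the key contains fewer than n delimiters,
     * the whole key is returned. A split policy built on this would truncate
     * the proposed split point so all rows sharing the prefix stay together.
     */
    static String prefixUpToNthDelimiter(String rowKey, char delimiter, int n) {
        int seen = 0;
        for (int i = 0; i < rowKey.length(); i++) {
            if (rowKey.charAt(i) == delimiter && ++seen == n) {
                return rowKey.substring(0, i);
            }
        }
        return rowKey; // fewer than n delimiters: keep the full key
    }

    public static void main(String[] args) {
        String key = "userid!clusterid!flowname!flowid";
        System.out.println(prefixUpToNthDelimiter(key, '!', 2)); // userid!clusterid
        System.out.println(prefixUpToNthDelimiter(key, '!', 3)); // userid!clusterid!flowname
    }
}
```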
> Region in ats-hbase table 'prod.timelineservice.entity' failing to split
> ------------------------------------------------------------------------
>
>                 Key: YARN-10077
>                 URL: https://issues.apache.org/jira/browse/YARN-10077
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: ATSv2
>    Affects Versions: 3.3.0
>            Reporter: Prabhu Joseph
>            Priority: Major
>
> The entity table grows very large very quickly, and the table fails to split when most of the entity rows belong to one user.
> # Need to set an optimal TTL value for the info and config column families.
> # Need to increase the prefix length for KeyPrefixRegionSplitPolicy.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9924) TimelineSchemaCreator breaks after protobuffer upgrade
[ https://issues.apache.org/jira/browse/YARN-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anil Sadineni updated YARN-9924:
--------------------------------
    Description:
While creating the schema using HBaseTimelineSchemaCreator, the issue below was observed. It looks like this broke with the protocol buffers upgrade (HADOOP-16557). I ran TimelineSchemaCreator against HBase version 1.4.8, which is packaged with protocol buffers 2.5.

2019-10-21 12:08:25,013 ERROR storage.HBaseTimelineSchemaCreator: Error in creating hbase tables:
org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(RpcRetryingCaller.java:248)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:221)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:388)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:362)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:142)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.hbase.util.ByteStringer.<clinit>(ByteStringer.java:44)
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:1053)
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildScanRequest(RequestConverter.java:496)
    at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:402)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:274)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:219)
    ... 7 more
2019-10-21 12:08:25,014 WARN storage.HBaseTimelineSchemaCreator: Schema creation finished with the following exceptions
2019-10-21 12:08:25,014 WARN storage.HBaseTimelineSchemaCreator: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;

    was:
While creating the schema using HBaseTimelineSchemaCreator, the issue below was observed. It looks like this broke with the protocol buffers upgrade (HADOOP-16557).

2019-10-21 12:08:25,013 ERROR storage.HBaseTimelineSchemaCreator: Error in creating hbase tables:
org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(RpcRetryingCaller.java:248)
    at
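A VerifyError like this typically means two incompatible protobuf variants ended up on the same classpath, and the class that wins is not the one the code was compiled against. A small diagnostic sketch (the class name `WhichJarSketch` is illustrative, not part of Hadoop) that reports where the JVM actually loaded a class from, which can pin down the offending jar:

```java
import java.security.CodeSource;

public class WhichJarSketch {

    /**
     * Reports where the JVM loaded the named class from.
     * A null CodeSource means it came from the bootstrap classloader.
     */
    static String locationOf(String className) throws ClassNotFoundException {
        Class<?> c = Class.forName(className);
        CodeSource src = c.getProtectionDomain().getCodeSource();
        return c.getName() + " -> "
                + (src == null ? "bootstrap classloader" : src.getLocation().toString());
    }

    public static void main(String[] args) throws Exception {
        // On a real cluster one would pass e.g. com.google.protobuf.ByteString
        // with the Hadoop/HBase classpath to see which jar wins.
        String target = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(locationOf(target));
    }
}
```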
[jira] [Updated] (YARN-9924) TimelineSchemaCreator breaks after protobuffer upgrade
[ https://issues.apache.org/jira/browse/YARN-9924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anil Sadineni updated YARN-9924:
--------------------------------
    Parent: YARN-9802
    Issue Type: Sub-task  (was: Task)

> TimelineSchemaCreator breaks after protobuffer upgrade
> ------------------------------------------------------
>
>                 Key: YARN-9924
>                 URL: https://issues.apache.org/jira/browse/YARN-9924
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Anil Sadineni
>            Priority: Major
>
> While creating the schema using HBaseTimelineSchemaCreator, the issue below was observed. It looks like this broke with the protocol buffers upgrade (HADOOP-16557).
> 2019-10-21 12:08:25,013 ERROR storage.HBaseTimelineSchemaCreator: Error in creating hbase tables:
> org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(RpcRetryingCaller.java:248)
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:221)
>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:388)
>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:362)
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:142)
>     at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>     at org.apache.hadoop.hbase.util.ByteStringer.<clinit>(ByteStringer.java:44)
>     at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:1053)
>     at org.apache.hadoop.hbase.protobuf.RequestConverter.buildScanRequest(RequestConverter.java:496)
>     at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:402)
>     at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:274)
>     at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
>     at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:219)
>     ... 7 more
> 2019-10-21 12:08:25,014 WARN storage.HBaseTimelineSchemaCreator: Schema creation finished with the following exceptions
> 2019-10-21 12:08:25,014 WARN storage.HBaseTimelineSchemaCreator: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;
[jira] [Created] (YARN-9924) TimelineSchemaCreator breaks after protobuffer upgrade
Anil Sadineni created YARN-9924:
-----------------------------------

             Summary: TimelineSchemaCreator breaks after protobuffer upgrade
                 Key: YARN-9924
                 URL: https://issues.apache.org/jira/browse/YARN-9924
             Project: Hadoop YARN
          Issue Type: Task
          Components: timelineserver
            Reporter: Anil Sadineni

While creating the schema using HBaseTimelineSchemaCreator, the issue below was observed. It looks like this broke with the protocol buffers upgrade (HADOOP-16557).

2019-10-21 12:08:25,013 ERROR storage.HBaseTimelineSchemaCreator: Error in creating hbase tables:
org.apache.hadoop.hbase.DoNotRetryIOException: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.translateException(RpcRetryingCaller.java:248)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:221)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:388)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:362)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:142)
    at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.hadoop.hbase.util.ByteStringer.<clinit>(ByteStringer.java:44)
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildRegionSpecifier(RequestConverter.java:1053)
    at org.apache.hadoop.hbase.protobuf.RequestConverter.buildScanRequest(RequestConverter.java:496)
    at org.apache.hadoop.hbase.client.ScannerCallable.openScanner(ScannerCallable.java:402)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:274)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:62)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:219)
    ... 7 more
2019-10-21 12:08:25,014 WARN storage.HBaseTimelineSchemaCreator: Schema creation finished with the following exceptions
2019-10-21 12:08:25,014 WARN storage.HBaseTimelineSchemaCreator: java.lang.VerifyError: class com.google.protobuf.LiteralByteString overrides final method toString.(Ljava/lang/String;)Ljava/lang/String;
[jira] [Created] (YARN-9666) Make async/sync writes to timeline service configurable
Anil Sadineni created YARN-9666:
-----------------------------------

             Summary: Make async/sync writes to timeline service configurable
                 Key: YARN-9666
                 URL: https://issues.apache.org/jira/browse/YARN-9666
             Project: Hadoop YARN
          Issue Type: Sub-task
          Components: ATSv2, timelineclient
            Reporter: Anil Sadineni

Jira to introduce a configuration option to control whether writes of events to the timeline service are synchronous or asynchronous.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
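A hedged sketch of what such a sync/async toggle could look like on the writer side; the class and method names are illustrative, not the actual YARN timeline client API. A flag (which in practice would come from yarn-site.xml) decides whether a write happens on the caller's thread or is handed off to a background executor:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncToggleWriterSketch {
    private final boolean async;                    // would be read from configuration
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    private final Queue<String> store = new ConcurrentLinkedQueue<>();

    AsyncToggleWriterSketch(boolean async) { this.async = async; }

    /** Write one entity, either on the caller's thread (sync) or in the background (async). */
    void putEntity(String entity) {
        if (async) {
            pool.submit(() -> store.add(entity));   // fire-and-forget; caller returns immediately
        } else {
            store.add(entity);                      // caller blocks until the write completes
        }
    }

    /** Wait for any pending async writes, then return how many entities were stored. */
    int flushAndCount() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return store.size();
    }
}
```

The design trade-off the Jira implies: sync writes give the caller a durability guarantee at the cost of latency, while async writes keep callers fast but can lose or delay events if the process dies before the background write lands.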
[jira] [Comment Edited] (YARN-9335) [atsv2] Restrict the number of elements held in NM timeline collector when backend is unreachable for async calls
[ https://issues.apache.org/jira/browse/YARN-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795512#comment-16795512 ]

Anil Sadineni edited comment on YARN-9335 at 3/19/19 12:31 AM:
---------------------------------------------------------------

[~abmodi] I observed that a small correction is needed in the yarn-default.xml file: the queue capacity key name has 'writer' repeated twice.
{quote}
The setting that decides the capacity of the queue to hold asynchronous timeline entities.
yarn.timeline-service.writer.writer.async.queue.capacity
100
{quote}

was (Author: sadineni):
[~abmodi] I observed that a small correction is needed in the yarn-default.xml file: the queue capacity key name has 'writer' repeated twice.
{{The setting that decides the capacity of the queue to hold asynchronous timeline entities.}}
{{yarn.timeline-service.-writer-.writer.async.queue.capacity}}
{{100}}

> [atsv2] Restrict the number of elements held in NM timeline collector when
> backend is unreachable for async calls
> --------------------------------------------------------------------------
>
>                 Key: YARN-9335
>                 URL: https://issues.apache.org/jira/browse/YARN-9335
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vrushali C
>            Assignee: Abhishek Modi
>            Priority: Major
>         Attachments: YARN-9335.001.patch, YARN-9335.002.patch
>
> For ATSv2, if the backend is unreachable, the number/size of data held in the timeline collector's memory increases significantly. This is not good for the NM memory.
> Filing this Jira to set a limit on how much should be retained by the timeline collector in memory in case the backend is not reachable.
[jira] [Comment Edited] (YARN-9335) [atsv2] Restrict the number of elements held in NM timeline collector when backend is unreachable for async calls
[ https://issues.apache.org/jira/browse/YARN-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795512#comment-16795512 ]

Anil Sadineni edited comment on YARN-9335 at 3/19/19 12:28 AM:
---------------------------------------------------------------

[~abmodi] I observed that a small correction is needed in the yarn-default.xml file: the queue capacity key name has 'writer' repeated twice.
{{The setting that decides the capacity of the queue to hold asynchronous timeline entities.}}
{{yarn.timeline-service.-writer-.writer.async.queue.capacity}}
{{100}}

was (Author: sadineni):
[~abmodi] I observed that a small correction is needed in the yarn-default.xml file: the queue capacity key name has 'writer' repeated twice.
{quote}
The setting that decides the capacity of the queue to hold asynchronous timeline entities.
yarn.timeline-service.-writer.-writer.async.queue.capacity
100
{quote}

> [atsv2] Restrict the number of elements held in NM timeline collector when
> backend is unreachable for async calls
> --------------------------------------------------------------------------
>
>                 Key: YARN-9335
>                 URL: https://issues.apache.org/jira/browse/YARN-9335
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vrushali C
>            Assignee: Abhishek Modi
>            Priority: Major
>         Attachments: YARN-9335.001.patch, YARN-9335.002.patch
>
> For ATSv2, if the backend is unreachable, the number/size of data held in the timeline collector's memory increases significantly. This is not good for the NM memory.
> Filing this Jira to set a limit on how much should be retained by the timeline collector in memory in case the backend is not reachable.
[jira] [Commented] (YARN-9335) [atsv2] Restrict the number of elements held in NM timeline collector when backend is unreachable for async calls
[ https://issues.apache.org/jira/browse/YARN-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16795512#comment-16795512 ]

Anil Sadineni commented on YARN-9335:
-------------------------------------

[~abmodi] I observed that a small correction is needed in the yarn-default.xml file: the queue capacity key name has 'writer' repeated twice.
{quote}
The setting that decides the capacity of the queue to hold asynchronous timeline entities.
yarn.timeline-service.-writer.-writer.async.queue.capacity
100
{quote}

> [atsv2] Restrict the number of elements held in NM timeline collector when
> backend is unreachable for async calls
> --------------------------------------------------------------------------
>
>                 Key: YARN-9335
>                 URL: https://issues.apache.org/jira/browse/YARN-9335
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vrushali C
>            Assignee: Abhishek Modi
>            Priority: Major
>         Attachments: YARN-9335.001.patch, YARN-9335.002.patch
>
> For ATSv2, if the backend is unreachable, the number/size of data held in the timeline collector's memory increases significantly. This is not good for the NM memory.
> Filing this Jira to set a limit on how much should be retained by the timeline collector in memory in case the backend is not reachable.
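The limit YARN-9335 asks for can be sketched as a bounded, non-blocking buffer; the names below are illustrative and the actual TimelineCollector internals differ. The idea: when the queue is full because the backend is unreachable, new entities are dropped and counted rather than letting NM memory grow without bound.

```java
import java.util.concurrent.ArrayBlockingQueue;

public class BoundedEntityBufferSketch {
    private final ArrayBlockingQueue<String> queue;
    private long dropped = 0;

    BoundedEntityBufferSketch(int capacity) {
        // capacity would correspond to yarn.timeline-service.writer.async.queue.capacity
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns true if the entity was buffered, false if it was dropped because the buffer is full. */
    boolean offer(String entity) {
        boolean accepted = queue.offer(entity); // non-blocking; fails immediately when full
        if (!accepted) {
            dropped++;                          // shed load instead of growing NM memory
        }
        return accepted;
    }

    long droppedCount() { return dropped; }

    int size() { return queue.size(); }
}
```

Dropping newest-first is only one shedding policy; evicting the oldest entries or blocking with a timeout are alternatives with different staleness trade-offs.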