[jira] [Commented] (AMBARI-24166) Metric Collector goes down after HDFS restart post EU

Hudson (JIRA) Fri, 22 Jun 2018 07:53:39 -0700


    [ https://issues.apache.org/jira/browse/AMBARI-24166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520448#comment-16520448 ]


Hudson commented on AMBARI-24166:
---------------------------------

FAILURE: Integrated in Jenkins build Ambari-trunk-Commit #9509 (See [https://builds.apache.org/job/Ambari-trunk-Commit/9509/])
AMBARI-24166. Metric Collector goes down after HDFS restart post EU (github: [https://gitbox.apache.org/repos/asf?p=ambari.git&a=commit&h=00274d4e25bbffbd21d4eb8fdecf47b70c68c7da])
* (edit) ambari-server/src/main/java/org/apache/ambari/server/agent/RecoveryConfigHelper.java
* (edit) ambari-server/src/test/java/org/apache/ambari/server/agent/TestHeartbeatHandler.java
* (edit) ambari-agent/src/main/python/ambari_agent/RecoveryManager.py
* (edit) ambari-server/src/test/java/org/apache/ambari/server/agent/stomp/HostLevelParamsHolderTest.java
* (edit) ambari-server/src/test/java/org/apache/ambari/server/configuration/RecoveryConfigHelperTest.java
* (edit) ambari-agent/src/test/python/ambari_agent/TestActionQueue.py
* (edit) ambari-agent/src/main/python/ambari_agent/listeners/ConfigurationEventListener.py
* (edit) ambari-agent/src/main/python/ambari_agent/listeners/HostLevelParamsEventListener.py
* (edit) ambari-server/src/main/java/org/apache/ambari/server/agent/RecoveryConfig.java
* (edit) ambari-agent/src/test/python/ambari_agent/TestRecoveryManager.py
* (edit) ambari-agent/src/main/python/ambari_agent/InitializerModule.py


> Metric Collector goes down after HDFS restart post EU
> -----------------------------------------------------
>
>                 Key: AMBARI-24166
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24166
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Andrew Onischuk
>            Assignee: Andrew Onischuk
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0
>
>         Attachments: AMBARI-24166.patch, AMBARI-24166.patch, 
> AMBARI-24166.patch, AMBARI-24166.patch, AMBARI-24166.patch, 
> AMBARI-24166.patch, AMBARI-24166.patch, AMBARI-24166.patch
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> **STR**
>   1. Deploy a cluster with Ambari version 2.6.1.5-3 and HDP version 2.6.1.0-129
>   2. Upgrade Ambari to target version 2.7.0.0-709
>   3. Upgrade AMS and SmartSense (keeping them stopped)
>   4. Perform EU to HDP-3.0 and let it complete
>   5. Restart HDFS
>   6. Observe the state of the Metrics Collectors (AMS is configured in distributed mode)
> **Result**
> Both Metrics Collectors are down (auto start is enabled for Metrics Collector).
> From the logs:
>
>     2018-06-13 16:45:05,620 ERROR org.apache.ambari.metrics.core.timeline.discovery.TimelineMetricMetadataManager: TimelineMetricMetadataKey is null for : [-8, 31, -72, 32, 88, -8, -51, -88, -104, 12, -123, 99, 55, -90, 45, -12, 115, 0, -6, 13]
>     2018-06-13 16:45:05,622 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
>     java.lang.NullPointerException
>             at org.apache.ambari.metrics.core.timeline.aggregators.TimelineMetricReadHelper.getTimelineMetricCommonsFromResultSet(TimelineMetricReadHelper.java:116)
>             at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getLastTimelineMetricFromResultSet(PhoenixHBaseAccessor.java:446)
>             at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getLatestMetricRecords(PhoenixHBaseAccessor.java:1134)
>             at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getMetricRecords(PhoenixHBaseAccessor.java:953)
>             at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.getTimelineMetrics(HBaseTimelineMetricsService.java:288)
>             at org.apache.ambari.metrics.webapp.TimelineWebServices.getTimelineMetrics(TimelineWebServices.java:261)
>             at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>             at java.lang.reflect.Method.invoke(Method.java:498)
>     
>     2018-06-13 16:45:07,887 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=ctr-e138-1518143905142-361872-01-000005.hwx.site:2181,ctr-e138-1518143905142-361872-01-000006.hwx.site:2181,ctr-e138-1518143905142-361872-01-000003.hwx.site:2181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$13/572967831@60474c94
>     2018-06-13 16:45:07,889 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
>     2018-06-13 16:45:07,891 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
>     2018-06-13 16:45:07,891 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181, initiating session
>     2018-06-13 16:45:07,894 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181, sessionid = 0x363f94c8d6d0059, negotiated timeout = 90000
>     2018-06-13 16:45:11,938 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=6, retries=6, started=4153 ms ago, cancelled=false, msg=Call to ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320, details=row 'SYSTEM.CATALOG' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ctr-e138-1518143905142-361872-01-000007.hwx.site,61320,1528896330963, seqNum=-1
>     2018-06-13 16:45:15,954 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8169 ms ago, cancelled=false, msg=Call to ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320, details=row 'SYSTEM.CATALOG' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ctr-e138-1518143905142-361872-01-000007.hwx.site,61320,1528896330963, seqNum=-1
>     
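
When checking whether a collector log shows the same symptoms as the excerpt above, a small script can count the relevant messages. The following is only a minimal sketch and is not part of the committed fix: the `PATTERNS` table and the `summarize` helper are hypothetical names, and the patterns are taken solely from the messages quoted in this report.

```python
import re
from collections import Counter

# Hypothetical pattern table; each regex is derived from a message
# that appears in the log excerpt quoted in this report.
PATTERNS = {
    "null_metadata_key": re.compile(r"TimelineMetricMetadataKey is null"),
    "null_pointer_exception": re.compile(r"java\.lang\.NullPointerException"),
    "hbase_connection_refused": re.compile(r"Connection refused"),
    "hbase_failed_server": re.compile(r"failed servers list"),
}


def summarize(log_text):
    """Return a Counter with the number of lines matching each pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Passing the text of a collector log to `summarize` returns per-symptom line counts. Non-zero counts for both the metadata-key error and the HBase connection failures would suggest the log matches the sequence reported here.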



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
