[ https://issues.apache.org/jira/browse/HIVE-18263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287929#comment-16287929 ]

Ashutosh Chauhan commented on HIVE-18263:
-----------------------------------------

HIVE-15102 looks related and had some discussion on the topic.

> Ptest execution are multiple times slower sometimes due to dying executor slaves
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-18263
>                 URL: https://issues.apache.org/jira/browse/HIVE-18263
>             Project: Hive
>          Issue Type: Bug
>          Components: Testing Infrastructure
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>         Attachments: HIVE-18263.0.patch
>
>
> The PreCommit-HIVE-Build job has been seen running very long from time to
> time. Usually it should take about 1.5 hours, but in some cases it took over
> 4-5 hours.
> Looking at the logs of one such execution, I saw that some commands sent to
> the test-executing slaves returned 255. Here this typically means that the
> return code of the remote call is unknown, because hiveptest-server can't
> reach these slaves anymore.
> The hiveptest-server logs show that some slaves were killed while running the
> job normally, and here is why:
> * Hive's ptest-server checks the status of slaves every 60 minutes. It also
>   keeps track of slaves that were terminated.
> ** If such a check finds that a slave that was already killed (the
>    [mTerminatedHosts map|https://github.com/apache/hive/blob/master/testutils/ptest2/src/main/java/org/apache/hive/ptest/execution/context/CloudExecutionContextProvider.java#L93]
>    contains its IP) is still running, it will try to terminate it again.
> * The server also maintains a file on its local FS that contains the IPs of
>   hosts that were used before. (This is probably for resilience reasons.)
> ** This file is read when the Tomcat server starts, and if any of the IPs in
>    the file are seen as running slaves, ptest terminates these first so it
>    can begin with a fresh start.
> ** The IPs of these terminated instances already make their way into
>    {{mTerminatedHosts}} upon initialization...
> * The cloud provider may reuse some older IPs, so it is not too rare that the
>   same IP that belonged to a terminated host is assigned to a new one.
> This is problematic: Hive ptest's slave caretaker thread kicks in every 60
> minutes and might see a running host that has the same IP as an old slave
> that was terminated at startup. Because that IP is in {{mTerminatedHosts}},
> it will conclude that this is a host it already tried to terminate 60
> minutes ago and that it should be terminated again.
> We have to fix this by making sure that when a new slave is created, we check
> the contents of {{mTerminatedHosts}} and remove its IP if it is there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
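
The fix proposed at the end of the issue description could look roughly like the sketch below. It is only an illustration of the idea, not the code in HIVE-18263.0.patch: the class `TerminatedHostTracker` and its methods are hypothetical, and the map's value type is an assumption. Only the field name `mTerminatedHosts` comes from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical stand-in for the terminated-host bookkeeping described in the
 * issue. This is NOT the real CloudExecutionContextProvider.
 */
class TerminatedHostTracker {
    // IP -> time (ms) at which termination was requested.
    // The value type is an assumption made for this sketch.
    private final Map<String, Long> mTerminatedHosts = new ConcurrentHashMap<>();

    /** Record that the host with this IP was terminated. */
    void markTerminated(String ip) {
        mTerminatedHosts.put(ip, System.currentTimeMillis());
    }

    /**
     * Called when a new slave is created. If the cloud provider reused the IP
     * of an old, terminated host, forget the stale entry so the periodic
     * check does not kill the new slave.
     */
    void onSlaveCreated(String ip) {
        mTerminatedHosts.remove(ip);
    }

    /** What the periodic check would ask about a host it sees running. */
    boolean shouldTerminate(String ip) {
        return mTerminatedHosts.containsKey(ip);
    }
}
```

With this in place, a slave that receives a recycled IP is no longer treated as a leftover host: `shouldTerminate` returns true after `markTerminated("10.0.0.5")`, and false again once `onSlaveCreated("10.0.0.5")` has run.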