Unable to receive offers / long delays when starting or restarting.

Rodrick Brown Mon, 01 Feb 2016 14:56:21 -0800

My cluster consists of 9 slave servers, split roughly in half between two
primary applications (Spark | Scala Microservices):


  * Spark (servers 1, 2, 3, 4, 8), attributes: "rack:spark"
  * Long-running Microservices (servers 5, 6, 7, 9), attributes: "rack:ms"

  

The Spark jobs run in coarse-grained mode, and the majority of them are
short-lived: they are started via Chronos, run for about 10-15 minutes, and
shut down. About 45 jobs start every 15 minutes.

  

We do lots of deploys daily, mostly to the "rack:ms" nodes, where the
long-running services are started via Marathon and run until we need to
deploy a new release of code.

  

Recently I started noticing that jobs take a very long time to start or
restart, as if they're not receiving valid offers.

The cluster has the resources listed below. I always have more than enough
idle resources available to bring services up or down, yet in one case a
service took almost 10 minutes to restart.

  

  

| | CPUs | Mem |
|---|---|---|
| Total | 120 | 456.8 GB |
| Used | 53.6 | 140.5 GB |
| Offered | 0 | 0 B |
| Idle | 66.4 | 316.3 GB |
  

How can I combat this delay? I'm not using roles; could this be the problem?

Chronos jobs always seem to run fine, but they require far fewer resources
than my long-running Scala services.

Here is a sample job definition in Marathon.

  

{  
   "id": "production/index-service",  
   "cmd": "env && /opt/orchard/production/index-server/bin/run_jar.sh",  
   "cpus": 1.0,  
   "mem": 4096,  
   "disk": 1000,  
   "user": "orchard",  
   "instances": 2,  
   "constraints": [  
     [  
       "hostname","UNIQUE"  
     ],  
     [  
       "rack", "LIKE", "ms"  
     ]  
   ],  
   "requirePorts": true,  
   "labels": {  
     "ENV": "production",  
     "HAPROXY_GROUP": "microservice"  
   },  
   "ports": [  
     31703,  
     31803,  
     31903  
   ],  
   "maxLaunchDelaySeconds": 3,  
   "backoffFactor": 1.20,  
   "healthChecks": [  
     {  
       "gracePeriodSeconds": 3,  
       "intervalSeconds": 5,  
       "maxConsecutiveFailures": 3,  
       "protocol": "TCP",  
       "portIndex": 1,  
       "timeoutSeconds": 5  
     }  
   ],  
   "upgradeStrategy": {  
     "minimumHealthCapacity": 0.5,  
     "maximumOverCapacity": 0.2  
   }  
}  
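
As a rough illustration of what the two constraints in the definition above
imply, the sketch below filters a list of agents the way "hostname UNIQUE"
and "rack LIKE ms" are generally understood to work. It is not Marathon's
code. The hostnames (server1..server9), the Agent class, and the
eligible_agents helper are hypothetical, and it assumes the LIKE regular
expression must match the whole attribute value.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Hypothetical stand-in for a Mesos agent and its attributes."""
    hostname: str
    attributes: dict = field(default_factory=dict)


def eligible_agents(agents, used_hostnames, rack_pattern="ms"):
    """Return the agents that pass both constraints from the app definition."""
    result = []
    for agent in agents:
        # ["rack", "LIKE", "ms"]: assumed to require a full regex match.
        rack = agent.attributes.get("rack", "")
        if re.fullmatch(rack_pattern, rack) is None:
            continue
        # ["hostname", "UNIQUE"]: at most one instance of the app per host.
        if agent.hostname in used_hostnames:
            continue
        result.append(agent)
    return result


# Rack layout as described in this post; hostnames are made up.
spark_servers = {1, 2, 3, 4, 8}
agents = [
    Agent(f"server{i}", {"rack": "spark" if i in spark_servers else "ms"})
    for i in range(1, 10)
]

# Assume the two existing instances sit on server5 and server6 during a redeploy.
candidates = eligible_agents(agents, used_hostnames={"server5", "server6"})
print([a.hostname for a in candidates])  # ['server7', 'server9']
```

Under these assumptions, only 4 of the 9 servers can ever host this app, and
while two instances are running only 2 remain as candidates for a
replacement, regardless of how much capacity is idle cluster-wide.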

  

Any advice is appreciated, thanks.


