Hello again. I am not sure this has been resolved yet, because I am still unable to get Marathon deployments to start.

I have deleted the /marathon/ node from ZooKeeper, and the Marathon web UI is accessible again. When I try to add a new task to deploy, there appear to be available resources, but the deployment is still stuck in a 'Waiting' status. While deploying I am watching the mesos-master.WARNING, mesos-master.INFO, and mesos-master.ERROR log files, but nothing ever shows up that would indicate a problem, or even an attempt. Where am I going wrong?
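For reference, here is roughly how I have been checking whether Marathon is actually registered with the master and what resources it holds (a sketch only; it assumes the default master port 5050 and that jq is installed):

$ curl -s http://master.ourdomain.com:5050/master/state.json \
    | jq '.frameworks[] | {name, id, active, resources}'

Is there anywhere else I should be looking for offer activity, for example the master's /metrics/snapshot counters?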
Thanks,
June Taylor
System Administrator, Minnesota Population Center
University of Minnesota

On Sat, Apr 9, 2016 at 6:07 AM, Pradeep Chhetri <pradeep.chhetr...@gmail.com> wrote:

> Hi Greg & June,
>
> Looking at the command above, I can say that you are running Spark in
> client mode, because you are invoking pyspark-shell.
>
> One simple way to distinguish the two modes: in cluster mode it is
> mandatory to start the MesosClusterDispatcher in your Mesos cluster, which
> is the Spark framework scheduler.
>
> As everyone said above, I suspect the reason you are observing orphaned
> tasks is that the scheduler is getting killed before its tasks finish.
>
> I would suggest that June run Spark in cluster mode (
> http://spark.apache.org/docs/latest/running-on-mesos.html#cluster-mode).
>
> Also, as Radek suggested above, run Spark in coarse-grained mode (the
> default), which will save you much of the JVM startup time.
>
> Keep us informed how it goes.
>
> On Sat, Apr 9, 2016 at 12:28 AM, Rad Gruchalski <ra...@gruchalski.com> wrote:
>
>> Greg,
>>
>> All you need to do is tell Spark that the master is mesos://…, as in the
>> example from June. It's all nicely documented here:
>>
>> http://spark.apache.org/docs/latest/running-on-mesos.html
>>
>> I'd suggest running in coarse-grained mode, as fine-grained is a bit
>> choppy.
>>
>> Best regards,
>> Radek Gruchalski
>> ra...@gruchalski.com
>> de.linkedin.com/in/radgruchalski/
>>
>> On Saturday, 9 April 2016 at 00:48, Greg Mann wrote:
>>
>> Unfortunately I'm not able to glean much from that command, but perhaps
>> someone out there with more Spark experience can? I do know that there
>> are a couple of ways to launch Spark jobs on a cluster: you can run them
>> in client mode, where the Spark driver runs locally on your machine and
>> exits when it's finished, or in cluster mode, where the Spark driver runs
>> persistently on the cluster as a Mesos framework. How exactly are you
>> launching these tasks on the Mesos cluster?
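>>
>> For reference, the difference is visible on the cluster side: cluster
>> mode requires the dispatcher to be running first. Per the
>> running-on-mesos docs linked above, it looks roughly like this (the
>> hostnames and example jar here are placeholders):
>>
>> $ ./sbin/start-mesos-dispatcher.sh --master mesos://master.ourdomain.com:5050
>> $ ./bin/spark-submit --deploy-mode cluster \
>>     --master mesos://dispatcher-host:7077 \
>>     --class org.apache.spark.examples.SparkPi \
>>     lib/spark-examples-1.6.0-hadoop2.6.0.jar
>>
>> If you never started a dispatcher, you are in client mode.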
>> On Fri, Apr 8, 2016 at 5:41 AM, June Taylor <j...@umn.edu> wrote:
>>
>> Greg,
>>
>> I'm on the ops side and fairly new to Spark/Mesos, so I'm not quite sure
>> I understand your question. Here's how the task shows up in a process
>> listing (line breaks added for readability):
>>
>> /usr/lib/jvm/java-8-oracle/bin/java
>>   -cp /path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/conf/:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/path/to/spark/spark-installations/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar
>>   -Xms10G -Xmx10G
>>   org.apache.spark.deploy.SparkSubmit
>>   --master mesos://master.ourdomain.com:5050
>>   --conf spark.driver.memory=10G
>>   --executor-memory 100G
>>   --total-executor-cores 90
>>   pyspark-shell
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Thu, Apr 7, 2016 at 3:37 PM, Greg Mann <g...@mesosphere.io> wrote:
>>
>> Hi June,
>>
>> Are these Spark tasks being run in cluster mode or client mode? If it's
>> client mode, then perhaps your local Spark scheduler is tearing itself
>> down before the executors exit, thus leaving them orphaned.
>>
>> I'd love to see master/agent logs from the time the tasks become
>> orphaned, if you have them available.
>>
>> Cheers,
>> Greg
>>
>> On Thu, Apr 7, 2016 at 1:08 PM, June Taylor <j...@umn.edu> wrote:
>>
>> Just a quick update... I was only able to get the orphans cleared by
>> stopping mesos-slave, deleting the contents of the scratch directory, and
>> then restarting mesos-slave.
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Thu, Apr 7, 2016 at 12:01 PM, Vinod Kone <vinodk...@apache.org> wrote:
>>
>> A task/executor is called "orphaned" if the corresponding scheduler
>> doesn't register with Mesos. Is your framework scheduler still running,
>> or gone for good? The resources should be cleaned up once the agent (and
>> consequently the master) has realized that the executor exited.
>>
>> Can you paste the master and agent logs for one of the orphaned
>> tasks/executors (grep the logs for the task/executor ID)?
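>>
>> For example, something like this (assuming your packages log to
>> /var/log/mesos; substitute your actual task or executor ID):
>>
>> $ grep $TASK_ID /var/log/mesos/mesos-master.INFO
>> $ grep $TASK_ID /var/log/mesos/mesos-slave.INFO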
>> On Thu, Apr 7, 2016 at 9:00 AM, haosdent <haosd...@gmail.com> wrote:
>>
>> Hmm, sorry for not expressing my idea clearly. I meant kill those orphan
>> tasks here.
>>
>> On Thu, Apr 7, 2016 at 11:57 PM, June Taylor <j...@umn.edu> wrote:
>>
>> Forgive my ignorance, but are you literally saying I should just SIGKILL
>> these instances? How will that clean up the Mesos orphans?
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Thu, Apr 7, 2016 at 10:44 AM, haosdent <haosd...@gmail.com> wrote:
>>
>> Suppose your --work_dir is /tmp/mesos. Then you could run
>>
>> $ find /tmp/mesos -name $YOUR_EXECUTOR_ID
>>
>> to get a list of folders, and then use lsof on them.
>>
>> As an example, my executor ID is "test" here:
>>
>> $ find /tmp/mesos/ -name 'test'
>> /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test
>>
>> Then I run lsof against that directory (note that I append runs/latest):
>>
>> $ lsof /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0002/executors/test/runs/latest/
>>
>> which shows the pid list:
>>
>> COMMAND     PID   USER     FD   TYPE DEVICE SIZE/OFF NODE       NAME
>> mesos-exe   21811 haosdent cwd  DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>> sleep       21847 haosdent cwd  DIR  8,3    6        3221463220 /tmp/mesos/0/slaves/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-S0/frameworks/138ee255-c8ef-4caa-8ff2-c0c02f70b4f5-0003/executors/test/runs/efecb119-1019-4629-91ab-fec7724a0f11
>>
>> Kill all of them.
>>
>> On Thu, Apr 7, 2016 at 11:23 PM, June Taylor <j...@umn.edu> wrote:
>>
>> I do have the executor ID. Can you advise how to kill it?
>>
>> I have one master and three slaves. Each slave has one of these orphans.
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Thu, Apr 7, 2016 at 10:14 AM, haosdent <haosd...@gmail.com> wrote:
>>
>> > Going to this slave I can find an executor within the mesos working
>> > directory which matches this framework ID
>>
>> The quickest way here is to use kill on the slave, if you can find the
>> pid of the mesos-executor. You can use lsof/fuser, or dig through the
>> logs, to find the executor pid.
>>
>> However, it still seems weird given your feedback. Do you have multiple
>> masters, and did a failover happen on your master? If so, the slave may
>> have failed to connect to the new master and the tasks became orphans.
>>
>> On Thu, Apr 7, 2016 at 11:06 PM, June Taylor <j...@umn.edu> wrote:
>>
>> Here is one of three orphaned tasks (first two octets of the IP removed):
>>
>> "orphan_tasks": [
>>     {
>>         "executor_id": "",
>>         "name": "Task 1",
>>         "framework_id": "14cddded-e692-4838-9893-6e04a81481d8-0006",
>>         "state": "TASK_RUNNING",
>>         "statuses": [
>>             {
>>                 "timestamp": 1459887295.05554,
>>                 "state": "TASK_RUNNING",
>>                 "container_status": {
>>                     "network_infos": [
>>                         {
>>                             "ip_addresses": [
>>                                 { "ip_address": "xxx.xxx.163.205" }
>>                             ],
>>                             "ip_address": "xxx.xxx.163.205"
>>                         }
>>                     ]
>>                 }
>>             }
>>         ],
>>         "slave_id": "182cf09f-0843-4736-82f1-d913089d7df4-S83",
>>         "id": "1",
>>         "resources": {
>>             "mem": 112640.0,
>>             "disk": 0.0,
>>             "cpus": 30.0
>>         }
>>     }
>> ]
>>
>> Going to this slave, I can find an executor within the Mesos working
>> directory which matches this framework ID. Reviewing the stdout messages
>> within indicates the program has finished its work, but it is still
>> holding these resources open.
>>
>> This framework ID is not shown as Active in the main Mesos web UI, but it
>> does show up in the slave's web UI.
>>
>> The resources consumed count toward the Idle pool and have resulted in
>> zero available resources for other offers.
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> On Thu, Apr 7, 2016 at 9:46 AM, haosdent <haosd...@gmail.com> wrote:
>>
>> > pyspark executors hanging around and consuming resources marked as
>> > Idle in mesos Web UI
>>
>> Do you have any logs about this?
>>
>> > is there an API call I can make to kill these orphans?
>>
>> As far as I know, the Mesos agent will try to clean up orphan containers
>> when it restarts. But I am not sure the orphans I mean here are the same
>> as yours.
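>>
>> If the scheduler really is gone for good, I think you could also tear the
>> framework down explicitly through the master endpoint, which should make
>> the master kill its tasks and release its resources. A sketch, assuming
>> the master runs on the default port 5050 (on older Mesos releases the
>> endpoint is /master/shutdown rather than /master/teardown):
>>
>> $ curl -X POST http://master:5050/master/teardown \
>>     -d 'frameworkId=14cddded-e692-4838-9893-6e04a81481d8-0006'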
>> On Thu, Apr 7, 2016 at 10:21 PM, June Taylor <j...@umn.edu> wrote:
>>
>> Greetings Mesos users!
>>
>> I am debugging an issue with pyspark executors hanging around and
>> consuming resources that are marked as Idle in the Mesos web UI. These
>> tasks also show up under the orphan_tasks key in `mesos state`.
>>
>> First, I'm wondering how to clear them out: is there an API call I can
>> make to kill these orphans? Secondly, how did this happen at all?
>>
>> Thanks,
>> June Taylor
>> System Administrator, Minnesota Population Center
>> University of Minnesota
>>
>> --
>> Best Regards,
>> Haosdent Huang
>
> --
> Regards,
> Pradeep Chhetri