[ https://issues.apache.org/jira/browse/AURORA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693643#comment-15693643 ]
Kostiantyn Bokhan edited comment on AURORA-1830 at 11/24/16 4:30 PM: --------------------------------------------------------------------- The problem may be related to the DC/OS Mesos configuration. I'm trying to integrate Aurora with DC/OS in order to provide GPU batch scheduling. Mesos agents are executed with the following options: {code} mesos-agent[2270]: kages/mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://zk-1.zk:2181,zk-2.zk:2181,zk-3.zk:2181,zk-4.zk:2181,zk-5.zk:2181/mesos" --modules_dir="/opt/mesosphere/etc/mesos-slave-modules" --network_cni_config_dir="/opt/mesosphere/etc/dcos/network/cni" --network_cni_plugins_dir="/opt/mesosphere/active/cni/" --nvidia_gpu_devices="[ 0, 1 ]" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resources="[{"name": "ports", "ranges": {"range": [{"begin": 1025, "end": 2180}, {"begin": 2182, "end": 3887}, {"begin": 3889, "end": 5049}, {"begin": 5052, "end": 8079}, {"begin": 8082, "end": 8180}, {"begin": 8182, "end": 32000}]}, "type": "RANGES"}, {"scalar": {"value": 2}, "name": "gpus", "type": "SCALAR"}, {"scalar": {"value": 428201}, "name": "disk", "type": "SCALAR", "role": "*"}]" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos/slave" {code} So --sandbox_directory has its default value.
But *mesos-docker-executor* is executed with the following options: {noformat} mesos-docker-executor --container=mesos-195fbdc8-6720-443b-b036-7fa5608b27cc-S21.4bbf7f29-3467-4583-8ca1-94539d698911 --docker=docker --docker_socket=/var/run/docker.sock --help=false --launcher_dir=/opt/mesosphere/packages/mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos --mapped_directory=/mnt/mesos/sandbox --sandbox_directory=/var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S21/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0000/executors/aurora_aurora-executor.d8e82d61-ad8c-11e6-879b-70b3d5800003/runs/4bbf7f29-3467-4583-8ca1-94539d698911 --stop_timeout=20secs {noformat} Note that --launcher_dir=/opt/mesosphere/packages/mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos points to the Mesos package inside the DC/OS installation. I've tried configuring thermos_executor accordingly: {noformat} thermos_executor --announcer-ensemble 127.0.0.1:2181 --mesos-containerizer-path=/opt/mesosphere/packages/mesos--55e36b7783f1549d26b7567b11090ff93b89487a/libexec/mesos {noformat} but the issue is still there.
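For context on the failure mode: "[Errno 2] No such file or directory" is ENOENT, the OSError Python raises when a process tries to spawn or open a path that does not exist on the agent, e.g. when a binary path baked into the executor does not match the host's actual installation prefix. A minimal sketch (plain Python, not Aurora code; the path below is made up) that reproduces the message:

```python
import errno
import subprocess

def launch(binary_path):
    """Try to run a binary; report the classic ENOENT message if it's missing."""
    try:
        subprocess.check_call([binary_path, "--version"])
        return "ok"
    except OSError as e:
        if e.errno == errno.ENOENT:
            # This is the same "[Errno 2]" text seen in aurora_executor.py's log.
            return "[Errno 2] No such file or directory"
        raise

print(launch("/no/such/path/mesos-containerizer"))
```

This is why a mismatch between the path the executor expects and the DC/OS package layout under /opt/mesosphere would surface as exactly this error.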
> Unknown exception initializing sandbox > -------------------------------------- > > Key: AURORA-1830 > URL: https://issues.apache.org/jira/browse/AURORA-1830 > Project: Aurora > Issue Type: Bug > Components: Executor > Affects Versions: 0.16.0 > Reporter: Kostiantyn Bokhan > > When launching a job using the Mesos containerizer and a docker image, the > sandbox setup fails with the following error: > {quote} > FAILED • Unknown exception initializing sandbox: [Errno 2] No such file or > directory > {quote} > Aurora file: > {code} > # run the script > python = Process( > name = 'python', > cmdline = 'python --version') > # describe the task > python_task = Task( > processes = [python], > resources = Resources(cpu = 1, ram = 1*GB, disk=8*GB)) > jobs = [ > Service(cluster = 'MY Cluster', > environment = 'devel', > role = 'root', > name = 'python', > task = python_task, > container = Mesos( image = DockerImage (name = 'python', tag = > '2'))) > ] > {code} > *__main__.log*: > {noformat} > Log file created at: 2016/11/24 14:45:44 > Running on machine: gnode1 > [DIWEF]mmdd hh:mm:ss.uuuuuu pid file:line] msg > Command line: > /var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S24/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0014/executors/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8/runs/e25e2e98-0b65-4e9f-a86d-13a18dff01bc/thermos_executor > --announcer-ensemble 127.0.0.1:2181 > I1124 14:45:44.041621 25610 executor_base.py:45] Executor [None]: > registered() called with: > I1124 14:45:44.042294 25610 executor_base.py:45] Executor [None]: > ExecutorInfo: executor_id { > value: "thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8" > } > resources { > name: "cpus" > type: SCALAR > scalar { > value: 0.25 > } > role: "*" > } > resources { > name: "mem" > type: SCALAR > scalar { > value: 128.0 > } > role: "*" > } > command { > uris { > value: "/usr/bin/thermos_executor" > executable: true > } > value: 
"${MESOS_SANDBOX=.}/thermos_executor --announcer-ensemble > 127.0.0.1:2181" > } > framework_id { > value: "195fbdc8-6720-443b-b036-7fa5608b27cc-0014" > } > name: "AuroraExecutor" > source: "root.devel.python.0" > container { > type: MESOS > volumes { > container_path: "taskfs" > mode: RO > image { > type: DOCKER > docker { > name: python:2" > } > } > } > mesos { > } > } > labels { > labels { > key: "source" > value: "root.devel.python.0" > } > } > I1124 14:45:44.042458 25610 executor_base.py:45] Executor [None]: > FrameworkInfo: user: "root" > name: "Aurora" > id { > value: "195fbdc8-6720-443b-b036-7fa5608b27cc-0014" > } > failover_timeout: 1814400.0 > checkpoint: true > hostname: "vnode7" > capabilities { > type: GPU_RESOURCES > } > I1124 14:45:44.043046 25610 executor_base.py:45] Executor [None]: > SlaveInfo: hostname: "000.000.00.001" > resources { > name: "gpus" > type: SCALAR > scalar { > value: 2.0 > } > role: "*" > } > resources { > name: "ports" > type: RANGES > ranges { > range { > begin: 1025 > end: 2180 > } > range { > begin: 2182 > end: 3887 > } > range { > begin: 3889 > end: 5049 > } > range { > begin: 5052 > end: 8079 > } > range { > begin: 8082 > end: 8180 > } > range { > begin: 8182 > end: 32000 > } > } > role: "*" > } > resources { > name: "disk" > type: SCALAR > scalar { > value: 428201.0 > } > role: "*" > } > resources { > name: "cpus" > type: SCALAR > scalar { > value: 8.0 > } > role: "*" > } > resources { > name: "mem" > type: SCALAR > scalar { > value: 14957.0 > } > role: "*" > } > attributes { > name: "hostname" > type: TEXT > text { > value: "gnode1" > } > } > attributes { > name: "ip" > type: TEXT > text { > value: "000.000.00.001" > } > } > attributes { > name: "rack" > type: TEXT > text { > value: "gpu" > } > } > attributes { > name: "gputype" > type: TEXT > text { > value: "titanz" > } > } > id { > value: "195fbdc8-6720-443b-b036-7fa5608b27cc-S24" > } > checkpoint: true > port: 5051 > I1124 14:45:44.043673 25610 executor_base.py:45] 
Executor [None]: launchTask > got task: > root/devel/python:root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8 > I1124 14:45:44.044601 25610 executor_base.py:45] Executor > [195fbdc8-6720-443b-b036-7fa5608b27cc-S24]: Updating > root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8 => STARTING > I1124 14:45:44.044718 25610 executor_base.py:45] Executor > [195fbdc8-6720-443b-b036-7fa5608b27cc-S24]: Reason: Initializing sandbox. > F1124 14:45:44.049196 25610 aurora_executor.py:85] Unknown exception > initializing sandbox: [Errno 2] No such file or directory > I1124 14:45:44.049439 25610 executor_base.py:45] Executor > [195fbdc8-6720-443b-b036-7fa5608b27cc-S24]: Updating > root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8 => FAILED > I1124 14:45:44.049519 25610 executor_base.py:45] Executor > [195fbdc8-6720-443b-b036-7fa5608b27cc-S24]: Reason: Unknown exception > initializing sandbox: [Errno 2] No such file or directory > I1124 14:45:49.152787 25610 thermos_executor_main.py:299] > MesosExecutorDriver.run() has finished. 
> {noformat} > *stderr* > {noformat} > I1124 14:45:43.559283 25614 fetcher.cpp:498] Fetcher Info: > {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/195fbdc8-6720-443b-b036-7fa5608b27cc-S24\/root","items":[{"action":"BYPASS_CACHE","uri":{"executable":true,"extract":true,"value":"\/usr\/bin\/thermos_executor"}}],"sandbox_directory":"\/var\/lib\/mesos\/slave\/slaves\/195fbdc8-6720-443b-b036-7fa5608b27cc-S24\/frameworks\/195fbdc8-6720-443b-b036-7fa5608b27cc-0014\/executors\/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8\/runs\/e25e2e98-0b65-4e9f-a86d-13a18dff01bc","user":"root"} > I1124 14:45:43.561226 25614 fetcher.cpp:409] Fetching URI > '/usr/bin/thermos_executor' > I1124 14:45:43.561242 25614 fetcher.cpp:250] Fetching directly into the > sandbox directory > I1124 14:45:43.561266 25614 fetcher.cpp:187] Fetching URI > '/usr/bin/thermos_executor' > I1124 14:45:43.561285 25614 fetcher.cpp:167] Copying resource with command:cp > '/usr/bin/thermos_executor' > '/var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S24/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0014/executors/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8/runs/e25e2e98-0b65-4e9f-a86d-13a18dff01bc/thermos_executor' > I1124 14:45:43.569787 25614 fetcher.cpp:547] Fetched > '/usr/bin/thermos_executor' to > '/var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S24/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0014/executors/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8/runs/e25e2e98-0b65-4e9f-a86d-13a18dff01bc/thermos_executor' > twitter.common.app debug: Initializing: twitter.common.log (Logging > subsystem.) 
> Writing log files to disk in > /var/lib/mesos/slave/slaves/195fbdc8-6720-443b-b036-7fa5608b27cc-S24/frameworks/195fbdc8-6720-443b-b036-7fa5608b27cc-0014/executors/thermos-root-devel-python-0-e33ad106-90dd-481a-8d45-c320990b67d8/runs/e25e2e98-0b65-4e9f-a86d-13a18dff01bc > I1124 14:45:44.033974 25610 exec.cpp:161] Version: 1.0.0 > I1124 14:45:44.040127 25639 exec.cpp:236] Executor registered on agent > 195fbdc8-6720-443b-b036-7fa5608b27cc-S24 > FATAL] Unknown exception initializing sandbox: [Errno 2] No such file or > directory > twitter.common.app debug: Shutting application down. > twitter.common.app debug: Running exit function for twitter.common.log > (Logging subsystem.) > twitter.common.app debug: Finishing up module teardown. > twitter.common.app debug: Active thread: <_MainThread(MainThread, started > 139772146038592)> > twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-2, > started daemon 139771946940160)> > twitter.common.app debug: Exiting cleanly. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)