[ https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870440#comment-15870440 ]
Pierre Cheynier commented on MESOS-7130:
----------------------------------------

Update: I tried to test with the Intel interface & driver instead of vif (docs.aws.amazon.com/en_en/AWSEC2/latest/UserGuide/sriov-networking.html), but I now have issues related to networking: my box is just not able to fetch its config, SSH keys, etc. I probably have to check the Intel ixgbevf driver...
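A rough way to confirm whether the ixgbevf driver is actually bound to the instance's primary interface (assuming it is eth0, as in the logs below) would be something like:

{noformat}
# Which kernel driver backs eth0? With enhanced networking active this should report "ixgbevf"
ethtool -i eth0
# Is the module loaded, and did it log any probe errors at boot?
lsmod | grep ixgbevf
dmesg | grep -i ixgbevf
{noformat}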
> port_mapping isolator: executor hangs when running on EC2
> ----------------------------------------------------------
>
>                 Key: MESOS-7130
>                 URL: https://issues.apache.org/jira/browse/MESOS-7130
>             Project: Mesos
>          Issue Type: Bug
>          Components: ec2, executor
>            Reporter: Pierre Cheynier
>
> Hi,
> I'm experiencing a weird issue: I'm using a CI to do testing on infrastructure automation.
> I recently activated the {{network/port_mapping}} isolator.
> I'm able to make the changes work and pass the tests for bare-metal servers and VirtualBox VMs using this configuration.
> But when I try on EC2 (on which my CI pipeline relies) it systematically fails to run any container.
> It appears that the sandbox is created and the port_mapping isolator seems to be OK according to the logs in stdout and stderr and the {{tc}} output:
> {noformat}
> + mount --make-rslave /run/netns
> + test -f /proc/sys/net/ipv6/conf/all/disable_ipv6
> + echo 1
> + ip link set lo address 02:44:20:bb:42:cf mtu 9001 up
> + ethtool -K eth0 rx off
> (...)
> + tc filter show dev eth0 parent ffff:0
> + tc filter show dev lo parent ffff:0
> I0215 16:01:13.941375     1 exec.cpp:161] Version: 1.0.2
> {noformat}
> Then the executor never comes back to the REGISTERED state and hangs indefinitely.
> {{GLOG_v=3}} doesn't help here.
> My skills in this area are limited, but after loading the symbols and attaching gdb to the mesos-executor process, I'm able to print this stack:
> {noformat}
> #0  0x00007feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
> #1  0x00007feffbed69ec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
> #2  0x00007ff0003dd8ec in void synchronized_wait<std::condition_variable, std::mutex>(std::condition_variable*, std::mutex*) () from /usr/lib64/libmesos-1.0.2.so
> #3  0x00007ff0017d595d in Gate::arrive(long) () from /usr/lib64/libmesos-1.0.2.so
> #4  0x00007ff0017c00ed in process::ProcessManager::wait(process::UPID const&) () from /usr/lib64/libmesos-1.0.2.so
> #5  0x00007ff0017c5c05 in process::wait(process::UPID const&, Duration const&) () from /usr/lib64/libmesos-1.0.2.so
> #6  0x00000000004ab26f in process::wait(process::ProcessBase const*, Duration const&) ()
> #7  0x00000000004a3903 in main ()
> {noformat}
> I concluded that the underlying shell script launched by the isolator, or the task itself, is just blocked, but I don't understand why.
> Here is a process tree showing that no task is running but the executor is:
> {noformat}
> root     28420  0.8  3.0 1061420 124940 ?  Ssl  17:56  0:25 /usr/sbin/mesos-slave --advertise_ip=127.0.0.1 --attributes=platform:centos;platform_major_version:7;type:base --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup --cgroups_net_cls_primary_handle=0xC370 --container_logger=org_apache_mesos_LogrotateContainerLogger --containerizers=mesos,docker --credential=file:///etc/mesos-chef/slave-credential --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]} --default_role=default --docker_registry=/usr/share/mesos/users --docker_store_dir=/var/opt/mesos/store/docker --egress_unique_flow_per_container --enforce_container_disk_quota --ephemeral_ports_per_container=128 --executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"} --image_providers=docker --image_provisioner_backend=copy --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping --logging_level=INFO --master=zk://mesos:test@localhost.localdomain:2181/mesos --modules=file:///etc/mesos-chef/slave-modules.json --port=5051 --recover=reconnect --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] --strict --work_dir=/var/opt/mesos
> root     28484  0.0  2.3  433676  95016 ?  Ssl  17:56  0:00  \_ mesos-logrotate-logger --help=false --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> root     28485  0.0  2.3  499212  94724 ?  Ssl  17:56  0:00  \_ mesos-logrotate-logger --help=false --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-0000/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stderr --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> marathon 28487  0.0  2.4  635780  97388 ?  Ssl  17:56  0:00  \_ mesos-executor --launcher_dir=/usr/libexec/mesos
> {noformat}
> If someone has a clue about the issue I'm experiencing on EC2, I would be interested to talk...
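A fuller thread dump of the hung executor might help narrow down what it is waiting on; a minimal sketch, assuming gdb and the matching mesos debuginfo packages are installed on the agent, using the executor PID from the process tree above (28487):

{noformat}
# Attach non-interactively, dump the backtrace of every thread, then detach
gdb -p 28487 -batch -ex 'set pagination off' -ex 'thread apply all bt'
{noformat}

The single stack above only shows main() parked in process::wait(), which on its own may just be the executor waiting on its driver; backtraces of the other threads would show whether the libprocess worker threads are blocked as well.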